Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cl_ext_buffer_device_address #1159

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

pjaaskel
Copy link
Contributor

@pjaaskel pjaaskel commented May 7, 2024

The basic cl_mem buffer API doesn't enable access to the underlying raw pointers in the device memory, preventing its use in host side data structures that need pointer references to objects. This API adds a minimal increment on top of cl_mem that provides such capabilities.

The version 0.1.0 is implemented in PoCL and rusticl for prototyping, but everything's still up for discussion. chipStar is the first client that uses the API.

@karolherbst
Copy link
Contributor

any motivation to get this merged? Or anything else needed to discuss before merging this? Could also try to bring it up at the CL WG if needed.

@pjaaskel
Copy link
Contributor Author

any motivation to get this merged? Or anything else needed to discuss before merging this? Could also try to bring it up at the CL WG if needed.

Yep, this is still being discussed in the WG. I personally think it's useful as is and shouldn't harm anything if merged as it even has 2 implementations now.

xml/cl.xml Outdated Show resolved Hide resolved
xml/cl.xml Outdated Show resolved Hide resolved
xml/cl.xml Outdated Show resolved Hide resolved
@pjaaskel
Copy link
Contributor Author

Thanks @SunSerega

@SunSerega
Copy link
Contributor

Alright, and now the problem I found in #1171 is visible here because the Presubmit workflow has been properly rerun.

@karolherbst

This comment was marked as resolved.

@pjaaskel
Copy link
Contributor Author

pjaaskel commented Sep 4, 2024

One thing I'm wondering about is how should clCreateSubBuffer behave when being executed on a bda memory object? Should it fail or succeed? I'm fine with either, just wondering if some of the behavior needs to be formalized.

CL_MEM_DEVICE_PTR_EXT and CL_MEM_DEVICE_PTRS_EXT comes to my mind, which should probably return the address + the sub buffer offset, at least that's how I have implemented it so far.

Yes, this was the idea. I'll add a mention in the next update.

@pjaaskel
Copy link
Contributor Author

pjaaskel commented Sep 9, 2024

Updated according to @karolherbst comments.

Copy link
Contributor

@karolherbst karolherbst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've already implemented the extension fully in rusticl/mesa (including sharing the same address across devices) and I think it's fine, however I'd still urge to address the concerns I have for layered implementations implementing it on top of Vulkan. I've already considered the constraints when implementing it, however I think it's better to provide clients to query if address sharing across multiple devices is supported or not.

extensions/cl_ext_buffer_device_address.asciidoc Outdated Show resolved Hide resolved
@pjaaskel pjaaskel force-pushed the cl_ext_buffer_device_address branch from b8df46b to 1931416 Compare September 24, 2024 12:26
xml/cl.xml Outdated Show resolved Hide resolved
xml/cl.xml Outdated Show resolved Hide resolved
@pjaaskel
Copy link
Contributor Author

@SunSerega thanks!

@pjaaskel
Copy link
Contributor Author

@karolherbst I asked about this in the CL/memory WG. We need to submit CTS tests and this might be good to go then with this one. Do you have good tests in Rusticl side we could use? The test in PoCL is quite basic (and needs to be updated), but can be used as a starting point also.

@karolherbst
Copy link
Contributor

@karolherbst I asked about this in the CL/memory WG. We need to submit CTS tests and this might be good to go then with this one. Do you have good tests in Rusticl side we could use? The test in PoCL is quite basic (and needs to be updated), but can be used as a starting point also.

I haven't written any more tests and only used the one in pocl. But I can probably help out with writing tests here.

Copy link
Contributor

@kpet kpet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this proposal!

extensions/cl_ext_buffer_device_address.asciidoc Outdated Show resolved Hide resolved
extensions/cl_ext_buffer_device_address.asciidoc Outdated Show resolved Hide resolved
extensions/cl_ext_buffer_device_address.asciidoc Outdated Show resolved Hide resolved
extensions/cl_ext_buffer_device_address.asciidoc Outdated Show resolved Hide resolved
extensions/cl_ext_buffer_device_address.asciidoc Outdated Show resolved Hide resolved
extensions/cl_ext_buffer_device_address.asciidoc Outdated Show resolved Hide resolved
extensions/cl_ext_buffer_device_address.asciidoc Outdated Show resolved Hide resolved
extensions/cl_ext_buffer_device_address.asciidoc Outdated Show resolved Hide resolved
extensions/cl_ext_buffer_device_address.asciidoc Outdated Show resolved Hide resolved
extensions/cl_ext_buffer_device_address.asciidoc Outdated Show resolved Hide resolved
xml/cl.xml Outdated Show resolved Hide resolved
xml/cl.xml Outdated Show resolved Hide resolved
@pjaaskel pjaaskel force-pushed the cl_ext_buffer_device_address branch from 9ad04fb to edcff73 Compare December 6, 2024 15:53
@pjaaskel
Copy link
Contributor Author

pjaaskel commented Dec 6, 2024

Thanks @kpet for the feedback. Implemented most of it.

@pjaaskel pjaaskel force-pushed the cl_ext_buffer_device_address branch from edcff73 to c07ff8f Compare December 6, 2024 16:06
@pjaaskel pjaaskel requested review from karolherbst and kpet December 6, 2024 16:06
@pjaaskel
Copy link
Contributor Author

pjaaskel commented Dec 6, 2024

After @karolherbst and @kpet are happy with this, we'll implement in PoCL and @franz will add a CTS pull request. Then we can mark this 1.0.0 and merge, I think.

@karolherbst
Copy link
Contributor

@pjaaskel do you have local changes, because locally I had to change a few things to get things to compile:

diff --git a/xml/cl.xml b/xml/cl.xml
index d7bdb1a..47f52b0 100644
--- a/xml/cl.xml
+++ b/xml/cl.xml
@@ -3736,7 +3736,7 @@ server's OpenCL/api-docs repository.
             <proto><type>cl_int</type>                                  <name>clSetKernelArgDevicePointerEXT</name></proto>
             <param><type>cl_kernel</type>                               <name>kernel</name></param>
             <param><type>cl_uint</type>                                 <name>arg_index</name></param>
-            <param>const <type>void</type>*                             <name>arg_value</name></param>
+            <param>const <type>cl_mem_device_address_ext</type>         <name>arg_value</name></param>
         </command>
         <command suffix="CL_API_SUFFIX__VERSION_2_0">
             <proto><type>cl_int</type>                                  <name>clSetKernelExecInfo</name></proto>
@@ -7200,6 +7200,9 @@ server's OpenCL/api-docs repository.
             </require>
         </extension>
         <extension name="cl_ext_buffer_device_address" revision="0.9.1" supported="opencl" depends="CL_VERSION_3_0" provisional="true">
+            <require>
+                <type name="cl_mem_device_address_ext"/>
+            </require>
             <require>
                 <command name="clSetKernelArgDevicePointerEXT"/>
             </require>
@@ -7212,9 +7215,6 @@ server's OpenCL/api-docs repository.
             <require comment="cl_kernel_exec_info">
                 <enum name="CL_KERNEL_EXEC_INFO_DEVICE_PTRS_EXT"/>
             </require>
-            <require>
-                <type name="cl_mem_device_address_ext"/>
-            </require>
         </extension>
         <extension name="cl_khr_command_buffer" revision="0.9.5" supported="opencl" depends="CL_VERSION_1_2" ratified="opencl" provisional="true">
             <require>

and the CTS PR still uses cl_mem_device_address_EXT not cl_mem_device_address_ext

@pjaaskel
Copy link
Contributor Author

@pjaaskel do you have local changes, because locally I had to change a few things to get things to compile:

I hadn't local changes, but I didn't try to generate the header (still yet find how), just built the docs. Fixed. Pinging @franz about the CTS for when he comes back to office.

@bashbaug
Copy link
Contributor

I didn't try to generate the header (still yet find how)

The header generation scripts are in the headers repo, specifically:

https://github.com/KhronosGroup/OpenCL-Headers/tree/main/scripts

The CMake file for the headers defines two targets related to header generation: headers_generate, and headers_copy. headers_generate generates the headers to an output temporary directory and headers_copy copies the genreated headers into the right place. You will need to tell the header generation scripts where to find the spec XML file, which you can do using the CMake variable OPENCL_HEADERS_XML_PATH.

If you want to do everything in one go, a good workflow is something like:

make headers_generate headers_copy all test -j

This generates the headers, copies them to the right place, builds all tests, and runs all tests.

@karolherbst
Copy link
Contributor

karolherbst commented Dec 18, 2024

Now with the CL_MEM_DEVICE_SHARED_ADDRESS option remove, I don't think the current interface of clSetKernelArgDevicePointerEXT makes much sense.

The good thing about binding buffer objects via clSetKernelArg or setting SVM pointers via clSetKernelArgSVMPointer is, that you don't have to change the argument to enqueue the same kernel on a different device.

Now that each device has its own address, one would have to synchronize the set address with the queues launching those kernels, potentially involing a lot of flushing when it wouldn't be necessary.

Though using sub-buffers if one doesn't want to use the base address is also quite the annoying interface. Maybe we need a clSetKernelArgBuffer, which comes with some useful features, like an offset argument? Might also want to add a couple of more things while at it.

Or we just stick with sub-buffers after all...

EDIT: same issue with CL_KERNEL_EXEC_INFO_DEVICE_PTRS_EXT

@pjaaskel
Copy link
Contributor Author

Now with the CL_MEM_DEVICE_SHARED_ADDRESS option remove, I don't think the current interface of clSetKernelArgDevicePointerEXT makes much sense.

I noticed the same that it's a bit cumbersome, but it maps to the HIP/CUDA programming model where one device is targeted at the time with the hipSetDevice() API.

The good thing about binding buffer objects via clSetKernelArg or setting SVM pointers via clSetKernelArgSVMPointer is, that you don't have to change the argument to enqueue the same kernel on a different device.

Yes, I agree.

Now that each device has its own address, one would have to synchronize the set address with the queues launching those kernels, potentially involving a lot of flushing when it wouldn't be necessary.

In practice it means that if you submit the kernel to multiple devices' command queues, you have to set the arg for each submission. The client can still pass the cl_mem handle or subbuffers using the core clSetKernelArg() API if they don't want to touch the per-device pointers or set them one at a time.

Though using sub-buffers if one doesn't want to use the base address is also quite the annoying interface. Maybe we need a clSetKernelArgBuffer, which comes with some useful features, like an offset argument? Might also want to add a couple of more things while at it.

Or we just stick with sub-buffers after all...

If you recall, we discussed and considered this option in the beginning, but it was not good enough for HIP/CUDA due to some subbuffer restrictions (alignment at least, but perhaps something else I forgot) and for the fact that some APIs prefer to use the raw pointer address. It might be doable if we lift those restrictions with the extension and then the runtime has to keep book of what raw address maps to which (sub)buffer when interfacing from APIs with raw pointers, but is that a much better option than this one?

@karolherbst
Copy link
Contributor

Now that each device has its own address, one would have to synchronize the set address with the queues launching those kernels, potentially involving a lot of flushing when it wouldn't be necessary.

In practice it means that if you submit the kernel to multiple devices' command queues, you have to set the arg for each submission. The client can still pass the cl_mem handle or subbuffers using the core clSetKernelArg() API if they don't want to touch the per-device pointers or set them one at a time.

Oh right, implementations have to look at the kernel arguments set on clEnqueueNDRangeKernel call time, not sure why I thought it's any different.

Though using sub-buffers if one doesn't want to use the base address is also quite the annoying interface. Maybe we need a clSetKernelArgBuffer, which comes with some useful features, like an offset argument? Might also want to add a couple of more things while at it.
Or we just stick with sub-buffers after all...

If you recall, we discussed and considered this option in the beginning, but it was not good enough for HIP/CUDA due to some subbuffer restrictions (alignment at least, but perhaps something else I forgot) and for the fact that some APIs prefer to use the raw pointer address. It might be doable if we lift those restrictions with the extension and then the runtime has to keep book of what raw address maps to which (sub)buffer when interfacing from APIs with raw pointers, but is that a much better option than this one?

I think for alignment it would be much better if we could simply specify the alignment at buffer creation time. Not sure if anything is being worked on in this regard though.

clSetKernelArgBuffer could also get a device parameter added and then the value is specific to the device. That might be feasible and usable? I think that also allows for better debugability and error checking.

@pjaaskel
Copy link
Contributor Author

I think for alignment it would be much better if we could simply specify the alignment at buffer creation time. Not sure if anything is being worked on in this regard though.

I'm not aware. If you want to extend the buffer API, I think it should be a separate extension.

Another problem in addition to the fixed alignment I recall with the approach of (ab)using sub-buffers for implementing the raw pointer passing is that sub-buffers have a size. We discussed using size 0 sub-buffers for this and it started to seem hacky, therefore we went with this raw pointer API plan, IIRC. I can try to dig up the old notes if you want to go back on the drawing board with this extension.

clSetKernelArgBuffer could also get a device parameter added and then the value is specific to the device. That might be feasible and usable? I think that also allows for better debugability and error checking.

Do you mean that we could call clSetKernelArgBufferAddress() multiple times for each device holding the buffer? We could, but then again it doesn't differ much from calling it multiple times for each NDRange enqueue of that kernel (to different devices' queues), in which case the device is implied from the queue it is being pushed to. Perhaps I can just add a sentence that highlights the fact that the raw pointer's device is implied from the command queue the kernel is enqueued to?

@karolherbst
Copy link
Contributor

I think for alignment it would be much better if we could simply specify the alignment at buffer creation time. Not sure if anything is being worked on in this regard though.

I'm not aware. If you want to extend the buffer API, I think it should be a separate extension.

Another problem in addition to the fixed alignment I recall with the approach of (ab)using sub-buffers for implementing the raw pointer passing is that sub-buffers have a size. We discussed using size 0 sub-buffers for this and it started to seem hacky, therefore we went with this raw pointer API plan, IIRC. I can try to dig up the old notes if you want to go back on the drawing board with this extension.

Oh right yeah, that's impossible to solve with sub-buffers properly without changing semantics.

clSetKernelArgBuffer could also get a device parameter added and then the value is specific to the device. That might be feasible and usable? I think that also allows for better debugability and error checking.

Do you mean that we could call clSetKernelArgBufferAddress() multiple times for each device holding the buffer? We could, but then again it doesn't differ much from calling it multiple times for each NDRange enqueue of that kernel (to different devices' queues), in which case the device is implied from the queue it is being pushed to. Perhaps I can just add a sentence that highlights the fact that the raw pointer's device is implied from the command queue the kernel is enqueued to?

It could allow you to skip calling clSetKernelArgBufferAddress when and argument won't have to change, but that might be niche enough to not matter.

However it kinda matters more with this API, because it's probably not intended to be thread-safe as clSetKernelArg isn't. And this extension might want to modify the "Multiple Host Threads" section as well. So if you have arguments which would be the same pointer for a given device you could skip those calls if you set the pointer once for each device.

Like I can see the pattern where an application simply binds all arguments once and then uses commands to update the content of those arguments and never the kernel objects itself. And they can do it from multiple threads concurrently and safely as long as they don't modify the kernel object, like calling clSetKernelArg*.

If you require to call clSetKernelArgBufferAddress before each enqueue, application will have to synchronize calls to it from multiple threads. Unless of course clSetKernelArgBufferAddress will be considered thread-safe, which then might make the problem a much smaller one.

@pjaaskel
Copy link
Contributor Author

Unless of course clSetKernelArgBufferAddress will be considered thread-safe, which then might make the problem a much smaller one.

I see. What if we simply declare the call thread safe here and add a note of the problem in the spec? Somehow I'm reluctant to add the explicit device argument to the argument setter function as it's not needed in the basic single device use case.

@karolherbst
Copy link
Contributor

I see. What if we simply declare the call thread safe here and add a note of the problem in the spec? Somehow I'm reluctant to add the explicit device argument to the argument setter function as it's not needed in the basic single device use case.

My concern is that most implementations aren't set up to make it thread-safe easily, because it will require reworking how setting arguments is implemented, meaning it's a higher bar to implement this extension. In rusticl it's not an issue, because at the moment it is implemented in a thread-safe way, though I did consider dropping the thread-safety once I get to improve performance more and if it's a significant overhead showing up in applications.

Though one could argue, that all the other clSetKernelArg* calls are device independent and the reason they don't have a device arg is, because it was never needed, just clSetKernelArgBufferAddress being the exception here. Though I don't think it's a huge issue to add it. Nor is it a huge issue not to.

I think the one benefit of adding is, that the runtime could verify if the passed in handle is indeed a valid one. And it could also catch use-after-free use cases when you set a handle, but the buffer gets deallocated later one and the same address becomes valid in a new allocation hiding the bug pretty well.

But if adding it is too much of an issue for chipstar (and other users) then we could also not.

I think it makes sense to wait what others think about this problem. I don't have any strong opinions here either way, just wanted to bring this issue up before the extension is finalized.

@pjaaskel
Copy link
Contributor Author

I suspect the "others" do not have strong opinions here either since the need for this extension originates from HIP/CUDA/chipStar. Somehow adding the device argument seems wrong since so far cl_kernel has been device independent. I suggest we just go with the current API as it fulfills the original need and way of usage, and focus on USVM.

@bashbaug
Copy link
Contributor

think for alignment it would be much better if we could simply specify the alignment at buffer creation time. Not sure if anything is being worked on in this regard though.

FYI, there is some work ongoing to specify an alignment at buffer creation time, see internal MR 198.

clSetKernelArgBuffer could also get a device parameter added and then the value is specific to the device. That might be feasible and usable? I think that also allows for better debugability and error checking.

Somehow I'm reluctant to add the explicit device argument to the argument setter function as it's not needed in the basic single device use case.

Perhaps consider doing something like clBuildProgram, which accepts a device list to specify which devices the program should be built for, but the device list can be NULL, in which case the program is built for all devices? I don't know if you'd need a device list or if a single argument would be sufficient.

@pjaaskel
Copy link
Contributor Author

think for alignment it would be much better if we could simply specify the alignment at buffer creation time. Not sure if anything is being worked on in this regard though.

FYI, there is some work ongoing to specify an alignment at buffer creation time, see internal MR 198.

Good to know!

clSetKernelArgBuffer could also get a device parameter added and then the value is specific to the device. That might be feasible and usable? I think that also allows for better debugability and error checking.

Somehow I'm reluctant to add the explicit device argument to the argument setter function as it's not needed in the basic single device use case.

Perhaps consider doing something like clBuildProgram, which accepts a device list to specify which devices the program should be built for, but the device list can be NULL, in which case the program is built for all devices? I don't know if you'd need a device list or if a single argument would be sufficient.

Hmm, if we change the clSetKernelArgDevicePointerEXT() pointer arg to a list of pointers with a pointer per device in the context it would match the buffer info query's return value meaning we could pass its return value directly to this API. The annoying part here is the potential discrepancy with the clBuildProgram's device list: The kernel might not have been built for all kernels in the context whereas the buffers are by default associated with all devices. If we make the list passed to the arg setter to match with the list of devices passed to clBuildProgram, the client has to filter the pointers from the context superset. Likely not a big deal, I suspect the device list in build program is typically not a subset of the context's devices - a rarely if never used feature.

In this sense, Karol's proposal of setting the pointer separately for each device might be actually easier from the client perspective. But then again I don't see the "thread safety" issue of the current proposal a major one as cl_kernels can be cloned etc. In the end I don't have strong preferences and would just go with the current simple option, but can also change it to whatever you prefer.

FYI, Michal implemented the current 0.9.1 version in PoCL.

@karolherbst
Copy link
Contributor

I forgot that clCloneKernel existed... yeah that's definitely the cheapest option to get around the thread-safety problem.

@karolherbst
Copy link
Contributor

karolherbst commented Dec 21, 2024

Anyway, updated my implementation to 0.9.1 as well, though I still treat CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT as a flag, because that's also what the current version of the CTS test is doing. Well.. it should also work as a property, but that's untested.

@karolherbst
Copy link
Contributor

Another thing I was wondering about today. This extension interacts with SVM in a weird way. One can clSVMAlloc memory and then pass it via a host_ptr into clCreateBuffer or clCreateBufferWithProperties. If CL_MEM_USE_HOST_PTR is also specified as a flag when doing so, the cl_mem object basically wraps the SVM allocation (the CTS has a couple of tests for this).

Now with this extension one can do a CL_MEM_USE_HOST_PTR buffer allocation with an SVM pointer and CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT set to CL_TRUE. That should mean that the addresses returned by CL_MEM_DEVICE_ADDRESS_EXT also match the SVM pointer, no? I wonder if this should explicitly be stated or not. And if the CTS should test for this.

This extension could also just disallow this behavior if this sounds too much of an edge case nobody is going to care about.

Just wanted to point out that this is a real possibility, which implementations might have to face supporting both (as I'm now).

@pjaaskel
Copy link
Contributor Author

pjaaskel commented Jan 2, 2025

This extension could also just disallow this behavior if this sounds too much of an edge case nobody is going to care about.

Just wanted to point out that this is a real possibility, which implementations might have to face supporting both (as I'm now).

This is a good point. I'd just disallow this as if (CG) SVM is supported by the implementation, then this extension is rather pointless as this extension is supposed to be a simplification to CG SVM.

@pjaaskel
Copy link
Contributor Author

pjaaskel commented Jan 2, 2025

...but having said that, if the implementation does support CG SVM, it might still support BDA for legacy/compatibility reasons and in that case the other behavior (the "device ptr" = SVM pointer) would make sense. Other opinions?

@karolherbst
Copy link
Contributor

...but having said that, if the implementation does support CG SVM, it might still support BDA for legacy/compatibility reasons and in that case the other behavior (the "device ptr" = SVM pointer) would make sense. Other opinions?

yeah.. I mean it shouldn't be hard for the impl to simply return the SVM pointer for those BDA allocations, because the cl_mem object wrapping an SVM allocation is probably 100x the amount of work compared to handling this edge case.

The normal host_ptr path can have a different address on the GPU side (e.g. if the host memory couldn't be mapped into the GPUs VM), which I think this extension will also have to clarify, but this guarantee also doesn't exist in the core spec (unless it's an SVM allocation).

@pjaaskel
Copy link
Contributor Author

pjaaskel commented Jan 3, 2025

Right. I'll add the other option, returning the SVM pointer in this case, in the next specification revision.

@pjaaskel
Copy link
Contributor Author

When adding the sentence about SVM, is this also ready to be marked 1.0.0 and merged in (wondering should I do it with the same commit)?

@karolherbst
Copy link
Contributor

No concerns in regards to that from my side. I think from a technical perspective it's in a good shape to land, though I don't want to rule out that more clarifications might be needed once others implement it as well.

The basic cl_mem buffer API doesn't enable access to the underlying raw
pointers in the device memory, preventing its use in host side
data structures that need pointer references to objects. This API
adds a minimal increment on top of cl_mem that provides such
capabilities.
@pjaaskel pjaaskel force-pushed the cl_ext_buffer_device_address branch from f9d2828 to 6e8cbe4 Compare January 15, 2025 09:01
@pjaaskel
Copy link
Contributor Author

I cleaned up the commit history and the history description in the docs and upped it to 1.0.0. The headers generated OK. IMHO we could merge this in and update the CTS next.

Copy link
Contributor

@karolherbst karolherbst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read through it again as I'm updating my implementation. I left a nit and a comment, but the later shouldn't block this.

api/opencl_runtime_layer.asciidoc Outdated Show resolved Hide resolved
_kernel_ ({clEnqueueNDRangeKernel} and {clEnqueueTask}) until the argument
value is changed by a call to {clSetKernelArgDevicePointerEXT} for _kernel_.
The device pointer can only be used for arguments that are declared to be a
pointer to `global` memory allocated with {clCreateBufferWithProperties} with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it say global and constant or is this all restricted to global memory? I'm fine either way, though it might be good to point it out if it's not supported for constant memory.

For some hardware/implementations it's more or less the same, so might be better to be more explicit about it before one implementation supports it for constant and another doesn't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Is it even possible to allocate buffers from the constant space and assign them as arguments?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit complicated. Nvidia GPUs have a hardware path for constant memory which doesn't really allow global addresses, however in command submission you'd still just upload from a global address, so it might be better to not allow it.

However modern nvidia GPUs can do the same with bindless constant buffers where the global address can be used (though the size of the entire access would be needed, but the runtime could handle it internally).

I don't think other hardware has a similar restriction as it's often simply a global load instruction with a special caching mode.

Though I think I'm leaning towards not allowing it for now, because it might make it a performance trade of for some implementation.

Copy link
Contributor Author

@pjaaskel pjaaskel Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant from the OpenCL API perspective. Surely the constant AS can be mapped to whatever memory physically in HW if wanted.

Copy link
Contributor

@karolherbst karolherbst Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely the constant AS can be mapped to whatever memory physically in HW if wanted.

It depends. If you make use of Nvidia's const buffers, then the answer is no. Or you add a lot of compiler smartness to make it somehow happen, but it's quite a bit of work. Modern nvidia GPUs (last 5 years) can deal a bit better, but it's still a performance trade-off nonetheless. It's quite a different thing on the ISA level and it's a huge perf gain to not use virtual addresses for constant memory at all as the instruction pulling from constant memory behave more like pulling data from registers instead of VRAM.

There are also push constants in other vendors hardware which could be used to implement the constant AS, and for those similar restrictions apply.

From an API perspective it doesn't matter much however and you assign the same type of buffers to kernel arguments being pointers to the global or constant AS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is nothing special about global and constant there as both simply use cl_mem buffers. Constant in the kernels are a bit different, as you can also have constant pointers point to them, but to set the value of e.g. a __constant float *tmpF kernel argument, you simply call clSetKernelArg(kernel, 0, sizeof(cl_mem), &cl_mem); with a ordinary cl_mem created through clCreateBuffer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. This is interesting and a bit weird also: Doesn't it mean that constant and global address spaces cannot be disjoint but actually map to the same address space, otherwise how the device can arbitrate between these address spaces? Some ISAs might even have different instructions for accessing either and constant could be actually a read-only memory in HW.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't it mean that constant and global address spaces cannot be disjoint but actually map to the same address space, otherwise how the device can arbitrate between these address spaces?

it depends on a lot of things. As explained above, there are hardware paths a compiler could use to make the constant address space faster, but at least from a GPU programming perspective those things are initialized either from constant data the host sent or from a buffer in the global address space.

But the semantics in the hardware shaders/kernels are entirely disjoint, so if the hw paths are chosen, you can't access the constant AS with global operations directly (would need to pull the address from somewhere else).

On nvidia it's e.g. 16 or more buffers of 64kb size, and the index is a 0 based vec2 (index + offset) and it's all bindful (meaning you program the individual slots when launching the shader/kernel, so you don't guarantee that the constant address remains stable across invocations at all).

So most of the constant weirdness is just part of the command submission when launching the kernel and can even happen on the GPU independently from the host, e.g. you can write to a global buffer and use it as a hw constant buffer in the next kernel, without the host having to do anything to update the contents of the constant buffer slots as it happens all on the GPU.

OpenCL implementations might also use those hardware buffers for in source constants, especially if they are indirectly accessed or huge tables.

So from an ISA perspective it can look entirely different, while from an API perspective it looks almost the same. However, runtimes could make the global backing storage of a hardware constant buffer available to the kernel, if the compiler needs to access constant data through the global AS (e.g. through an internal driver buffer mapping from constant buffer index to global AS).

There are also instructions with can be hybrid (e.g. on nvidia there is ld.constant or bindless constant buffers needing the global address + the size of the buffer, but that's a relatively new feature), like using global addressing, but making use of the constant buffer hardware and aggressively cache data.

Of course an implementation can also simply use the hw global AS for both API global and API constant if they don't care about the performance benefits (rusticl atm doesn't use the hw constant buffers for kernel arguments, because I haven't gotten to it yet).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see now what you mean, thanks. The runtime could optimize in this case and move/allocate the buffer to a constant memory if there is a separate one, the kernel arg qualifier is constant and then ISA in the kernel would always access the constant space if the HW has a disjoint memory for constant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah.

With this extension it's easy to use a bda enabledcl_mem object for a constant AS kernel argument as it would simply use whatever constant address the runtime would come up.

Problematic would be if this constant address needs to be the same as the global one or needs to be stable across invocations, because that might be impossible to guarantee if using more specialized hardware paths.

So with the current wording I don't think there is any issue, it might just be better to explicitly state the promises this extension gives here and maybe even make sure in the CTS that implementation doesn't allow more than this extension adds.

The only content addition since the previous version is

"If the device supports SVM and {clCreateBufferWithProperties} is
 called with a pointer returned by {clSVMAlloc} as its _host_ptr_
 argument, and {CL_MEM_USE_HOST_PTR} is set in its _flags_ argument,
 the device-side address is guaranteed to match the _host_ptr."
@pjaaskel pjaaskel force-pushed the cl_ext_buffer_device_address branch from 6e8cbe4 to 676312c Compare January 20, 2025 11:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants