cl_ext_buffer_device_address #1159
base: main
Any motivation to get this merged? Or is there anything else that needs to be discussed before merging? I could also bring it up at the CL WG if needed. |
Yep, this is still being discussed in the WG. I personally think it's useful as is and shouldn't harm anything if merged as it even has 2 implementations now. |
Thanks @SunSerega |
Alright, and now the problem I found in #1171 is visible here because the |
Yes, this was the idea. I'll add a mention in the next update. |
Updated according to @karolherbst comments. |
I've already implemented the extension fully in rusticl/mesa (including sharing the same address across devices) and I think it's fine. However, I'd still urge addressing the concerns I have for layered implementations implementing it on top of Vulkan. I've already considered the constraints when implementing it, but I think it's better to give clients a way to query whether address sharing across multiple devices is supported.
b8df46b to 1931416
@SunSerega thanks! |
@karolherbst I asked about this in the CL/memory WG. We need to submit CTS tests, and then this might be good to go. Do you have good tests on the Rusticl side we could use? The test in PoCL is quite basic (and needs to be updated), but it can also be used as a starting point. |
I haven't written any more tests and only used the one in pocl. But I can probably help out with writing tests here. |
Thanks for this proposal!
9ad04fb to edcff73
Thanks @kpet for the feedback. Implemented most of it. |
edcff73 to c07ff8f
After @karolherbst and @kpet are happy with this, we'll implement in PoCL and @franz will add a CTS pull request. Then we can mark this 1.0.0 and merge, I think. |
@pjaaskel do you have local changes? Locally I had to change a few things to get things to compile:
diff --git a/xml/cl.xml b/xml/cl.xml
index d7bdb1a..47f52b0 100644
--- a/xml/cl.xml
+++ b/xml/cl.xml
@@ -3736,7 +3736,7 @@ server's OpenCL/api-docs repository.
<proto><type>cl_int</type> <name>clSetKernelArgDevicePointerEXT</name></proto>
<param><type>cl_kernel</type> <name>kernel</name></param>
<param><type>cl_uint</type> <name>arg_index</name></param>
- <param>const <type>void</type>* <name>arg_value</name></param>
+ <param>const <type>cl_mem_device_address_ext</type> <name>arg_value</name></param>
</command>
<command suffix="CL_API_SUFFIX__VERSION_2_0">
<proto><type>cl_int</type> <name>clSetKernelExecInfo</name></proto>
@@ -7200,6 +7200,9 @@ server's OpenCL/api-docs repository.
</require>
</extension>
<extension name="cl_ext_buffer_device_address" revision="0.9.1" supported="opencl" depends="CL_VERSION_3_0" provisional="true">
+ <require>
+ <type name="cl_mem_device_address_ext"/>
+ </require>
<require>
<command name="clSetKernelArgDevicePointerEXT"/>
</require>
@@ -7212,9 +7215,6 @@ server's OpenCL/api-docs repository.
<require comment="cl_kernel_exec_info">
<enum name="CL_KERNEL_EXEC_INFO_DEVICE_PTRS_EXT"/>
</require>
- <require>
- <type name="cl_mem_device_address_ext"/>
- </require>
</extension>
<extension name="cl_khr_command_buffer" revision="0.9.5" supported="opencl" depends="CL_VERSION_1_2" ratified="opencl" provisional="true">
<require>
and the CTS PR still uses |
The header generation scripts are in the headers repo, specifically: https://github.com/KhronosGroup/OpenCL-Headers/tree/main/scripts The CMake file for the headers defines two targets related to header generation: headers_generate and headers_copy. headers_generate generates the headers into a temporary output directory and headers_copy copies the generated headers into the right place. You will need to tell the header generation scripts where to find the spec XML file, which you can do using the CMake variable OPENCL_HEADERS_XML_PATH. If you want to do everything in one go, a good workflow is something like:
This generates the headers, copies them to the right place, builds all tests, and runs all tests. |
Now with the The good thing about binding buffer objects via Now that each device has its own address, one would have to synchronize the set address with the queues launching those kernels, potentially involving a lot of flushing when it wouldn't be necessary. Though using sub-buffers if one doesn't want to use the base address is also quite the annoying interface. Maybe we need a Or we just stick with sub-buffers after all... EDIT: same issue with |
I noticed the same: it's a bit cumbersome, but it maps to the HIP/CUDA programming model where one device is targeted at a time with the hipSetDevice() API.
Yes, I agree.
In practice it means that if you submit the kernel to multiple devices' command queues, you have to set the arg for each submission. The client can still pass the cl_mem handle or subbuffers using the core clSetKernelArg() API if they don't want to touch the per-device pointers or set them one at a time.
If you recall, we discussed and considered this option in the beginning, but it was not good enough for HIP/CUDA due to some subbuffer restrictions (alignment at least, but perhaps something else I forgot) and because some APIs prefer to use the raw pointer address. It might be doable if we lift those restrictions with the extension, and then the runtime has to keep track of which raw address maps to which (sub)buffer when interfacing from APIs with raw pointers, but is that a much better option than this one? |
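For illustration only, the bookkeeping mentioned in the comment above (mapping a raw address back to the buffer that contains it) might look roughly like the sketch below. It is plain C with no OpenCL dependency; the `buffer_range` record, the `buffer_id` field standing in for a `cl_mem` handle, and the `find_buffer` helper are all hypothetical names, not part of the extension or of any runtime discussed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: one entry per live buffer that has a device address. */
typedef struct {
    uint64_t base;      /* device address of the buffer's first byte */
    size_t   size;      /* buffer size in bytes */
    int      buffer_id; /* stand-in for the owning cl_mem handle */
} buffer_range;

/*
 * Linear search for the entry whose [base, base + size) range contains addr.
 * Returns the entry index, or -1 if the address belongs to no known buffer.
 * If offset is non-NULL, it receives the byte offset of addr inside the buffer.
 */
int find_buffer(const buffer_range *ranges, size_t count,
                uint64_t addr, size_t *offset)
{
    assert(ranges != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (addr >= ranges[i].base && addr - ranges[i].base < ranges[i].size) {
            if (offset != NULL)
                *offset = (size_t)(addr - ranges[i].base);
            return (int)i;
        }
    }
    return -1;
}
```

A runtime layering raw-pointer APIs on top of (sub)buffers would presumably keep such a table sorted and use a binary search; the linear scan is only meant to keep the sketch short.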
Oh right, implementations have to look at the kernel arguments set on
I think for alignment it would be much better if we could simply specify the alignment at buffer creation time. Not sure if anything is being worked on in this regard though.
|
I'm not aware. If you want to extend the buffer API, I think it should be a separate extension. Another problem, in addition to the fixed alignment, that I recall with the approach of (ab)using sub-buffers for implementing the raw pointer passing is that sub-buffers have a size. We discussed using size 0 sub-buffers for this and it started to seem hacky, therefore we went with this raw pointer API plan, IIRC. I can try to dig up the old notes if you want to go back to the drawing board with this extension.
Do you mean that we could call clSetKernelArgBufferAddress() multiple times for each device holding the buffer? We could, but then again it doesn't differ much from calling it multiple times for each NDRange enqueue of that kernel (to different devices' queues), in which case the device is implied from the queue it is being pushed to. Perhaps I can just add a sentence that highlights the fact that the raw pointer's device is implied from the command queue the kernel is enqueued to? |
Oh right yeah, that's impossible to solve with sub-buffers properly without changing semantics.
It could allow you to skip calling However it kinda matters more with this API, because it's probably not intended to be thread-safe as Like I can see the pattern where an application simply binds all arguments once and then uses commands to update the content of those arguments and never the kernel objects itself. And they can do it from multiple threads concurrently and safely as long as they don't modify the kernel object, like calling If you require to call |
I see. What if we simply declare the call thread safe here and add a note of the problem in the spec? Somehow I'm reluctant to add the explicit device argument to the argument setter function as it's not needed in the basic single device use case. |
My concern is that most implementations aren't set up to make it thread-safe easily, because it will require reworking how setting arguments is implemented, meaning it's a higher bar to implement this extension. In rusticl it's not an issue, because at the moment it is implemented in a thread-safe way, though I did consider dropping the thread-safety once I get to improve performance more and if it's a significant overhead showing up in applications. Though one could argue, that all the other I think the one benefit of adding is, that the runtime could verify if the passed in handle is indeed a valid one. And it could also catch use-after-free cases where you set a handle, but the buffer gets deallocated later on and the same address becomes valid in a new allocation, hiding the bug pretty well. But if adding it is too much of an issue for chipstar (and other users) then we could also not. I think it makes sense to wait and see what others think about this problem. I don't have any strong opinions here either way, just wanted to bring this issue up before the extension is finalized. |
I suspect the "others" do not have strong opinions here either since the need for this extension originates from HIP/CUDA/chipStar. Somehow adding the device argument seems wrong since so far cl_kernel has been device independent. I suggest we just go with the current API as it fulfills the original need and way of usage, and focus on USVM. |
FYI, there is some work ongoing to specify an alignment at buffer creation time, see internal MR 198.
Perhaps consider doing something like |
Good to know!
Hmm, if we change the clSetKernelArgDevicePointerEXT() pointer arg to a list of pointers with a pointer per device in the context, it would match the buffer info query's return value, meaning we could pass its return value directly to this API. The annoying part here is the potential discrepancy with the clBuildProgram's device list: The kernel might not have been built for all devices in the context whereas the buffers are by default associated with all devices. If we make the list passed to the arg setter match the list of devices passed to clBuildProgram, the client has to filter the pointers from the context superset. Likely not a big deal; I suspect the device list in build program is typically not a subset of the context's devices - a rarely, if ever, used feature. In this sense, Karol's proposal of setting the pointer separately for each device might actually be easier from the client perspective. But then again I don't see the "thread safety" issue of the current proposal as a major one, as cl_kernels can be cloned etc. In the end I don't have strong preferences and would just go with the current simple option, but can also change it to whatever you prefer. FYI, Michal implemented the current 0.9.1 version in PoCL. |
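To make the filtering step from the comment above concrete, here is a minimal sketch of what a client might have to do if the arg setter took one pointer per build device. It is plain C with no OpenCL dependency: `device_id` and `device_address` are hypothetical stand-ins for `cl_device_id` and `cl_mem_device_address_ext`, and `filter_addresses` is an invented helper, not an API proposed in this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins so the sketch builds without OpenCL headers. */
typedef int      device_id;      /* plays the role of cl_device_id */
typedef uint64_t device_address; /* plays the role of cl_mem_device_address_ext */

/*
 * ctx_addrs[i] is the buffer's address on ctx_devices[i] (one per context device).
 * For every device in build_devices, copy its address into out[], in build order.
 * Returns the number of addresses written, or -1 if a build device is not in
 * the context device list.
 */
int filter_addresses(const device_id *ctx_devices,
                     const device_address *ctx_addrs, size_t ctx_count,
                     const device_id *build_devices, size_t build_count,
                     device_address *out)
{
    assert(ctx_devices && ctx_addrs && build_devices && out);
    for (size_t b = 0; b < build_count; ++b) {
        size_t c = 0;
        while (c < ctx_count && ctx_devices[c] != build_devices[b])
            ++c;
        if (c == ctx_count)
            return -1; /* build device is unknown to the context */
        out[b] = ctx_addrs[c];
    }
    return (int)build_count;
}
```

With the per-enqueue setter in the current proposal, none of this is needed, since the device is implied by the command queue.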
I forgot that |
Anyway, updated my implementation to 0.9.1 as well, though I still treat |
Another thing I was wondering about today. This extension interacts with SVM in a weird way. One can Now with this extension one can do a This extension could also just disallow this behavior if this sounds too much of an edge case nobody is going to care about. Just wanted to point out that this is a real possibility, which implementations might have to face supporting both (as I'm now). |
This is a good point. I'd just disallow this as if (CG) SVM is supported by the implementation, then this extension is rather pointless as this extension is supposed to be a simplification to CG SVM. |
...but having said that, if the implementation does support CG SVM, it might still support BDA for legacy/compatibility reasons and in that case the other behavior (the "device ptr" = SVM pointer) would make sense. Other opinions? |
Yeah, I mean it shouldn't be hard for the impl to simply return the SVM pointer for those BDA allocations, because the cl_mem object wrapping an SVM allocation is probably 100x the amount of work compared to handling this edge case. The normal host_ptr path can have a different address on the GPU side (e.g. if the host memory couldn't be mapped into the GPU's VM), which I think this extension will also have to clarify, but this guarantee also doesn't exist in the core spec (unless it's an SVM allocation). |
Right. I'll add the other option, returning the SVM pointer in this case, in the next specification revision. |
When adding the sentence about SVM, is this also ready to be marked 1.0.0 and merged in (wondering should I do it with the same commit)? |
No concerns in regards to that from my side. I think from a technical perspective it's in a good shape to land, though I don't want to rule out that more clarifications might be needed once others implement it as well. |
The basic cl_mem buffer API doesn't enable access to the underlying raw pointers in the device memory, preventing its use in host side data structures that need pointer references to objects. This API adds a minimal increment on top of cl_mem that provides such capabilities.
Also made the enums globally unique.
f9d2828 to 6e8cbe4
I cleaned up the commit history and the history description in the docs and upped it to 1.0.0. The headers generated OK. IMHO we could merge this in and update the CTS next. |
Read through it again as I'm updating my implementation. I left a nit and a comment, but the latter shouldn't block this.
_kernel_ ({clEnqueueNDRangeKernel} and {clEnqueueTask}) until the argument
value is changed by a call to {clSetKernelArgDevicePointerEXT} for _kernel_.
The device pointer can only be used for arguments that are declared to be a
pointer to `global` memory allocated with {clCreateBufferWithProperties} with
Should it say `global` and `constant`, or is this all restricted to `global` memory? I'm fine either way, though it might be good to point it out if it's not supported for `constant` memory.
For some hardware/implementations it's more or less the same, so it might be better to be more explicit about it before one implementation supports it for `constant` and another doesn't.
Hmm. Is it even possible to allocate buffers from the constant space and assign them as arguments?
It's a bit complicated. Nvidia GPUs have a hardware path for constant memory which doesn't really allow global addresses, however in command submission you'd still just upload from a global address, so it might be better to not allow it.
However modern nvidia GPUs can do the same with bindless constant buffers where the global address can be used (though the size of the entire access would be needed, but the runtime could handle it internally).
I don't think other hardware has a similar restriction as it's often simply a global load instruction with a special caching mode.
Though I think I'm leaning towards not allowing it for now, because it might make it a performance trade-off for some implementations.
I meant from the OpenCL API perspective. Surely the constant AS can be mapped to whatever memory physically in HW if wanted.
Surely the constant AS can be mapped to whatever memory physically in HW if wanted.
It depends. If you make use of Nvidia's const buffers, then the answer is no. Or you add a lot of compiler smartness to make it somehow happen, but it's quite a bit of work. Modern nvidia GPUs (last 5 years) can deal a bit better, but it's still a performance trade-off nonetheless. It's quite a different thing on the ISA level and it's a huge perf gain to not use virtual addresses for constant memory at all as the instruction pulling from constant memory behave more like pulling data from registers instead of VRAM.
There are also push constants in other vendors' hardware which could be used to implement the constant AS, and similar restrictions apply to those.
From an API perspective it doesn't matter much however and you assign the same type of buffers to kernel arguments being pointers to the global or constant AS.
There is nothing special about `global` and `constant` there, as both simply use cl_mem buffers. Constants in the kernels are a bit different, as you can also have constant pointers point to them, but to set the value of e.g. a `__constant float *tmpF` kernel argument, you simply call `clSetKernelArg(kernel, 0, sizeof(cl_mem), &cl_mem);` with an ordinary `cl_mem` created through `clCreateBuffer`.
Oh. This is interesting and a bit weird also: Doesn't it mean that constant and global address spaces cannot be disjoint but actually map to the same address space, otherwise how the device can arbitrate between these address spaces? Some ISAs might even have different instructions for accessing either and constant could be actually a read-only memory in HW.
Doesn't it mean that constant and global address spaces cannot be disjoint but actually map to the same address space, otherwise how the device can arbitrate between these address spaces?
it depends on a lot of things. As explained above, there are hardware paths a compiler could use to make the constant address space faster, but at least from a GPU programming perspective those things are initialized either from constant data the host sent or from a buffer in the global address space.
But the semantics in the hardware shaders/kernels are entirely disjoint, so if the hw paths are chosen, you can't access the constant AS with global operations directly (would need to pull the address from somewhere else).
On nvidia it's e.g. 16 or more buffers of 64kb size, and the index is a 0 based vec2 (index + offset) and it's all bindful (meaning you program the individual slots when launching the shader/kernel, so you don't guarantee that the constant address remains stable across invocations at all).
So most of the constant weirdness is just part of the command submission when launching the kernel and can even happen on the GPU independently from the host, e.g. you can write to a global buffer and use it as a hw constant buffer in the next kernel, without the host having to do anything to update the contents of the constant buffer slots as it happens all on the GPU.
OpenCL implementations might also use those hardware buffers for in source constants, especially if they are indirectly accessed or huge tables.
So from an ISA perspective it can look entirely different, while from an API perspective it looks almost the same. However, runtimes could make the global backing storage of a hardware constant buffer available to the kernel, if the compiler needs to access constant data through the global AS (e.g. through an internal driver buffer mapping from constant buffer index to global AS).
There are also instructions which can be hybrid (e.g. on nvidia there is `ld.constant` or bindless constant buffers needing the global address + the size of the buffer, but that's a relatively new feature), i.e. using global addressing, but making use of the constant buffer hardware and aggressively caching data.
Of course an implementation can also simply use the hw global AS for both API global and API constant if they don't care about the performance benefits (rusticl atm doesn't use the hw constant buffers for kernel arguments, because I haven't gotten to it yet).
OK, I see now what you mean, thanks. The runtime could optimize in this case and move/allocate the buffer to a constant memory if there is a separate one, the kernel arg qualifier is constant and then ISA in the kernel would always access the constant space if the HW has a disjoint memory for constant.
Yeah.
With this extension it's easy to use a BDA-enabled `cl_mem` object for a constant AS kernel argument, as it would simply use whatever constant address the runtime comes up with.
Problematic would be if this constant address needs to be the same as the global one or needs to be stable across invocations, because that might be impossible to guarantee if using more specialized hardware paths.
So with the current wording I don't think there is any issue. It might just be better to explicitly state the promises this extension gives here, and maybe even make sure in the CTS that an implementation doesn't allow more than this extension adds.
The only content addition since the previous version is "If the device supports SVM and {clCreateBufferWithProperties} is called with a pointer returned by {clSVMAlloc} as its _host_ptr_ argument, and {CL_MEM_USE_HOST_PTR} is set in its _flags_ argument, the device-side address is guaranteed to match the _host_ptr_."
6e8cbe4 to 676312c
The basic cl_mem buffer API doesn't enable access to the underlying raw pointers in the device memory, preventing its use in host side data structures that need pointer references to objects. This API adds a minimal increment on top of cl_mem that provides such capabilities.
The version 0.1.0 is implemented in PoCL and rusticl for prototyping, but everything's still up for discussion. chipStar is the first client that uses the API.