[Question] How does caching work in CUDA? #262
Comments
Can you please tell me how many kernels you are seeing? You should try Apart from JIT kernels, the kernels of most functions are compiled and cached on the first run. It shouldn't need 100 iterations to run fine; a single iteration at the beginning is all it needs to compile and cache the functions/JIT code. If you are noticing such a slowdown, then it must be something else. I will run the code you shared and check. Ideally it should generate a single kernel, since both statements in the while loop are JITed and eval should be triggered only by memory pressure heuristics or when explicitly called by the user. |
I figured it out: the print function (arrayfire::print_gen) generates 100 different kernels. After generating 100 kernels, it doesn't generate any more and runs faster.
|
Wait a minute. Why does the print function generate kernels? |
I am able to reproduce the behavior, although I don't think your conclusions are entirely correct, because the print functionality is not a kernel. The log lines appear when you call print because ArrayFire functions are asynchronous by default: until print is called or eval is auto-triggered, the JIT nodes are not evaluated and thus the kernel is not looked up for actual execution. As far as the trace log goes, the very first message says it loaded an already cached kernel, so it is not compiling anything, likely because it did so once in the past on the system you are running this program on. I am looking into it and will update my findings here as soon as I can. Thanks for sharing the trace output of the program. |
GPU to CPU data transfer also generates 100 different kernels. Maybe it is the host() function that the cache lookup fails on? Also, why does data transfer generate CUDA kernels?
|
As I pointed out earlier, it is not host/print that is generating any kernels; they trigger the JIT evaluations, which just makes it appear as though the host/print call is doing it. You can add do My guess is that somehow each iteration is triggering a separate JIT evaluation although it shouldn't, perhaps the very calls Edited: Nevertheless, I think avoiding sync/eval/host calls inside such a loop should definitely avoid triggering JIT evaluation. You can add an eval call after the loop, which will ensure the JIT evaluates only once for this logic. But if you have to fetch results to the host every few iterations inside the loop, you can wrap the host call in that condition so that the JIT evaluates only when the condition is met. |
I am using the host() function inside the while loop to dump data from the GPU's RAM to my SSD because the data can't fit into the GPU's RAM. It would be nice if the saveArray function was implemented in Rust to write data to the filesystem. |
@BA8F0D39 I have raised a feature request to track progress on a disk-saving API: #263. Although those functions aren't available yet, I believe you can use the serde feature I added recently to serialize and deserialize arrays. It is not in the current stable release (crate) yet, but you can use it directly from the master branch of the GitHub repository. I was able to reproduce the behavior of more than one kernel being generated in upstream. I don't have any updates as of now. If the size of your data is the concern, then you should be using some kind of condition to dump the data to disk rather than doing it in every iteration, which is very inefficient. When the host call is wrapped in such a condition, kernels aren't evaluated in each iteration. |
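The effect of wrapping the host call in a condition can be sketched with a toy model. This is only an illustration in plain Rust, not ArrayFire code: `LazyValue`, `run_loop`, and `dump_every` are hypothetical names, and the "evaluation" counter merely stands in for a JIT evaluation forced by a host-style read.

```rust
/// Toy stand-in for a lazily evaluated array (assumption: not ArrayFire's API).
/// Additions are queued and only applied when the value is read back.
struct LazyValue {
    value: f32,
    pending: Vec<f32>,  // queued scalar additions, not yet applied
    evaluations: usize, // how many reads actually forced an evaluation
}

impl LazyValue {
    fn new(value: f32) -> Self {
        LazyValue { value, pending: Vec::new(), evaluations: 0 }
    }

    /// Queue an addition without evaluating anything.
    fn add(&mut self, scalar: f32) {
        self.pending.push(scalar);
    }

    /// Stand-in for a host()/print style call: applies all queued work.
    fn read(&mut self) -> f32 {
        if !self.pending.is_empty() {
            let delta: f32 = self.pending.drain(..).sum();
            self.value += delta;
            self.evaluations += 1;
        }
        self.value
    }
}

/// Runs `iters` iterations, reading back only every `dump_every` iterations,
/// then once more after the loop. Returns (forced evaluations, final value).
fn run_loop(iters: usize, dump_every: usize) -> (usize, f32) {
    let dump_every = dump_every.max(1); // guard against a zero interval
    let mut b = LazyValue::new(0.0);
    for i in 0..iters {
        b.add(1.0);
        if (i + 1) % dump_every == 0 {
            let _host_copy = b.read(); // placeholder for writing to disk
        }
    }
    let final_value = b.read();
    (b.evaluations, final_value)
}
```

In this model, `run_loop(100, 1)` forces an evaluation in every iteration, while `run_loop(100, 10)` forces only ten; the final value is the same either way.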
@BA8F0D39 Sorry about the delay. I have figured out why so many kernels are being generated. It is not a bug per se, but a side effect of how our JIT workflow works right now. Let's take the following code (please look at the comments for info):
dim4 dims(4, 4);
const array a = randu(dims);
array b = randu(dims);
array c = a;
af::sync(); // I have added this sync for a clear boundary between
            // JIT from before the loop and within the loop; it does nothing otherwise
for (int i = 0; i < 10; i++) {
    b = b + 0.022f; // A JIT operation; in each iteration JIT_NODE = JIT_NODE + SCALAR_NODE
    b.eval();       // because b is eval'ed, the above iterative tree is transformed into a buffer node
    c = b + a;      // this is a single JIT node that adds two buffer nodes
    af_print(c);
}
Here's what happens from, let's say, the Nth iteration to the (N+1)th iteration and so on:
N+1 Iteration
Now if I remove the eval on b, c also becomes an iterative JIT tree that builds upon the previous iteration, so each iteration essentially produces a different JIT kernel.
for (int i = 0; i < 10; i++) {
    b = b + 0.022f; // This is still: JIT_NODE = JIT_NODE + SCALAR_NODE
    c = b + a;      // This also becomes a different tree each time: JIT_NODE = JIT_NODE + JIT_NODE
    af_print(c);
}
The number 100 is just due to our implementation: we limit the JIT depth to a maximum of 100, at which point eval is auto-triggered. We will have an internal discussion on whether this implementation of the JIT workflow can be further improved. But, rest assured that if Thank you for using ArrayFire! and Happy New Year :) |
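The explanation above can be mimicked with a small toy model in plain Rust. It is a sketch under loud assumptions, not ArrayFire's internal representation: `Node`, `signature`, `distinct_kernels`, and the exact trigger condition (`height >= MAX_HEIGHT`) are all hypothetical, and a structural string stands in for the generated kernel source that a cache would key on.

```rust
use std::collections::HashSet;

/// Toy expression tree (assumption: NOT ArrayFire's real JIT node type).
#[derive(Clone)]
enum Node {
    Buffer,                    // data that has already been evaluated
    Scalar,                    // a constant operand
    Add(Box<Node>, Box<Node>), // a lazy, not yet evaluated addition
}

impl Node {
    /// Height of the tree; leaves have height 0.
    fn height(&self) -> usize {
        match self {
            Node::Buffer | Node::Scalar => 0,
            Node::Add(l, r) => 1 + l.height().max(r.height()),
        }
    }

    /// Structural signature standing in for the generated kernel source.
    fn signature(&self) -> String {
        match self {
            Node::Buffer => "B".to_string(),
            Node::Scalar => "S".to_string(),
            Node::Add(l, r) => format!("({}+{})", l.signature(), r.signature()),
        }
    }
}

/// Height limit quoted in the thread; the exact comparison used is an assumption.
const MAX_HEIGHT: usize = 100;

/// Simulates `b = b + s; [b.eval();] c = b + a; print(c);` for `iters`
/// iterations and returns how many distinct kernel signatures were needed.
fn distinct_kernels(iters: usize, eval_b_each_iter: bool) -> usize {
    let mut seen = HashSet::new();
    let mut b = Node::Buffer;
    for _ in 0..iters {
        b = Node::Add(Box::new(b), Box::new(Node::Scalar));
        if eval_b_each_iter || b.height() >= MAX_HEIGHT {
            seen.insert(b.signature()); // evaluating b needs a kernel for its tree
            b = Node::Buffer;           // after evaluation b is plain data again
        }
        let c = Node::Add(Box::new(b.clone()), Box::new(Node::Buffer));
        seen.insert(c.signature()); // printing c forces evaluation of c's tree
    }
    seen.len()
}
```

With these assumptions, `distinct_kernels(300, true)` returns 2, because every iteration reuses the same two tree shapes, while `distinct_kernels(300, false)` returns 101: each iteration produces a new shape until the height limit resets the tree, after which the shapes repeat. The exact count depends on the toy's trigger condition and should not be read as ArrayFire's precise behavior.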
@BA8F0D39 I will try moving this to GitHub Discussions since it is not a bug in either the wrapper or upstream. Update: Apparently, this can't be done yet due to community/community#2924 (comment) |
Thanks for the hard work. For arrayfire 3.7.3. In my debug code, I use In my release code, I need to dump matrices from GPU to CPU and the You said that the JIT depth is set to 100, but in my code the JIT can generate more than 100,000 kernels for a single matrix operation. Is the detection of previously generated kernels failing? |
Maybe I wasn't clear, and that added to the confusion. When I say JIT depth, I mean the height of the JIT tree; you may think of it like a code AST. This tree's height/depth is limited to 100. When the height goes beyond the limit, the corresponding Array gets automatically evaluated, generating one or more kernels depending on the code you have written. For example, if all the lines in a given section of code are JIT operations, then only a single kernel is generated, not one kernel per operation. Hope this clears up the confusion. If there are 100k kernels cached on your system, they could come from any of the following: 1) Old invalidated cache from previous versions 2) Regular functions (non-JIT operations), which also cache their kernels.
From what I gathered, for your use case you actually need to dump matrices on every iteration to the disk. In that case, I think you need a different mechanism in your application for such matrix dumping. Let me explain why. The main purpose of
I think you should maintain a separate queue that is handled by a different thread. Whenever data from the main thread is ready, it would push the corresponding Another approach would be to use ArrayFire events. You can mark an event after the target operation. Now move this event to the other thread along with the source array. In the other thread you can block on the event until the required data is ready. This avoids any queues, but there might be some extra performance cost from using events. Which approach fares better needs to be tested. |
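The queue-plus-thread suggestion could look roughly like the following sketch, which uses only the Rust standard library. It is an assumption-laden illustration: `dump_in_background` is a hypothetical name, the queue carries plain `Vec<f32>` host buffers (what the original comment intended to push is not recoverable from the text), and the disk write is replaced by a counter.

```rust
use std::sync::mpsc;
use std::thread;

/// Spawns a writer thread and feeds it `batches` buffers of `len` elements.
/// Returns the total number of elements the writer thread consumed.
fn dump_in_background(batches: usize, len: usize) -> usize {
    let (tx, rx) = mpsc::channel::<Vec<f32>>();

    // Writer thread: a real program would write each buffer to disk here.
    let writer = thread::spawn(move || {
        let mut written = 0usize;
        for buf in rx {
            written += buf.len(); // placeholder for the actual file write
        }
        written
    });

    for i in 0..batches {
        // Placeholder for the compute step that produces data to be dumped.
        let data = vec![i as f32; len];
        tx.send(data).expect("writer thread hung up");
    }

    drop(tx); // closing the channel lets the writer's loop finish
    writer.join().expect("writer thread panicked")
}
```

The main loop only pays for the `send`, so the slow disk write overlaps with further computation. Whether this actually beats the event-based approach would still have to be measured, as the comment above says.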
@9prady9 Is it possible to force the JIT to generate a single kernel for each Rust function? Is it possible to mark sections of the code such that the JIT generates a single kernel for each section?
Does the JIT generate a new kernel when a branch statement is encountered? |
An illustrative example (not C++ code, just the algorithm)
Lines 1, 2 & 3 are all combined into a single kernel; the output of that kernel is then fed into the erode function. Line 5 is again an arithmetic operation, which is JIT. This creates a separate kernel. Basically, most domain-specific functions (computer vision, image processing, statistics, ML, signal processing, linear algebra) are not JIT. Such functions are essentially asynchronous barriers: they won't block the thread, but they will cause the JIT-ed inputs of the function in question to be evaluated automatically so that the relevant buffer pointers are ready for these functions to operate on.
Calling the method
There is no such
let c = arrayfire::add(&b, &a, false);
let e = arrayfire::mul(&d, &c, false);
let q = arrayfire::add(&v, &e, false);
q.eval();
Not sure I understand the question. What kind of branch statement are you asking about? Vectorized operations don't usually have any branch instructions. |
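The fusion rule described in this comment (consecutive JIT operations collapse into one kernel, while non-JIT functions act as barriers) can be sketched as a simple counting model. This is a toy in plain Rust, not ArrayFire code: `Op` and `count_kernels` are hypothetical, each barrier function is assumed to launch exactly one kernel of its own, and any trailing JIT work is assumed to be evaluated eventually.

```rust
/// Toy classification of operations (names are illustrative only).
#[derive(Clone, Copy)]
enum Op {
    Jit,     // element-wise arithmetic that is fused lazily
    Barrier, // a non-JIT function such as erode
}

/// Counts kernels under the toy rule: every maximal run of JIT operations
/// becomes one fused kernel, and every barrier runs as its own kernel after
/// forcing evaluation of the JIT work queued before it.
fn count_kernels(ops: &[Op]) -> usize {
    let mut kernels = 0;
    let mut pending_jit = false;
    for &op in ops {
        match op {
            Op::Jit => pending_jit = true,
            Op::Barrier => {
                if pending_jit {
                    kernels += 1; // flush the fused JIT kernel first
                    pending_jit = false;
                }
                kernels += 1; // the barrier function itself
            }
        }
    }
    if pending_jit {
        kernels += 1; // trailing JIT work, assumed to be evaluated later
    }
    kernels
}
```

For the five-line example discussed above (three arithmetic lines, erode, one more arithmetic line), `count_kernels(&[Op::Jit, Op::Jit, Op::Jit, Op::Barrier, Op::Jit])` gives 3, and three JIT operations followed by a single eval give 1.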
How does caching work for a simple kernel such as adding two vectors?
On arrayfire-rust 3.7.2 CUDA backend.
Running the code generates 100 cubins in ~/.arrayfire/.
How come arrayfire generates many different kernels just for adding two vectors?
Why does the code run much slower during the first 100 iterations?