Hi, thanks for the package! I am browsing https://docs.rapids.ai/api/cuml/stable/api/#random-forest and am wondering when one might expect speedups for the RF model over the one in, say, scikit-learn.
I am trying to understand where the GPU parallelization comes into play.
Is it that each core in the GPU trains a separate tree?
Is it that the determination of best split at each node is parallelized on the GPU?
Is there other parallelization that occurs on the GPU?
When, and what, data is transferred from GPU to CPU? I imagine that if this is done per split node in the tree, it would incur communication overhead.
I also noticed that max_depth is constrained. Is this in part due to the GPU implementation?
I am also trying to understand whether there are known limitations and possible performance bottlenecks. Are there any docs or links to benchmark experiments that can help a user understand this better?
My specialty is more on the inference side of this question than on training, but let me see how much I can answer for you until the folks who wrote more of the training code are back in the office.
Each tree is trained on a separate CUDA stream taken from a pool of fixed size. That doesn't necessarily map to "cores," but it does mean that work on those trees can be handled in parallel. This is the highest level of parallelism in the training algorithm, but further parallelism is achieved at a more granular level.
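To make that concrete, here is a rough CuPy sketch of the stream-pool idea. This is purely illustrative and not cuML's actual training code; the pool size, tree count, and per-tree work are placeholders:

```python
import cupy as cp

N_STREAMS = 4                                        # fixed-size pool (size is a placeholder)
N_TREES = 16
streams = [cp.cuda.Stream(non_blocking=True) for _ in range(N_STREAMS)]

X = cp.random.rand(100_000, 32, dtype=cp.float32)    # training data already on device

def train_one_tree_stub(X):
    # Stand-in for the real per-tree training kernels: a bootstrap sample plus a
    # reduction, just enough to enqueue some asynchronous GPU work.
    rows = cp.random.randint(0, X.shape[0], size=X.shape[0])
    return X[rows].sum(axis=0)

partial_results = []
for tree_id in range(N_TREES):
    with streams[tree_id % N_STREAMS]:               # round-robin trees over the pool
        partial_results.append(train_one_tree_stub(X))

for s in streams:                                    # wait for all per-tree work to finish
    s.synchronize()
```

The point is just that each tree's GPU work is enqueued on one of a fixed number of streams, so independent trees can overlap on the device.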
Which brings us to your second question. Yes, for each tree, we parallelize the computation of the splits at each node over CUDA threads. Each thread handles multiple samples from the training data, up to some maximum count, and works on those samples in parallel with other threads.
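Roughly, each thread does something like a grid-stride loop over its share of the node's samples. Here is a hedged Numba sketch of that pattern; it is not the real cuML kernels, and the binning and histogram layout are assumptions made only for illustration:

```python
import numpy as np
from numba import cuda

@cuda.jit
def bin_label_histogram(binned_feature, labels, hist):
    # hist has shape (n_bins, n_classes); each thread grid-strides over samples
    # and accumulates class counts per feature bin with atomics.
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start, binned_feature.size, stride):
        cuda.atomic.add(hist, (binned_feature[i], labels[i]), 1)

n_samples, n_bins, n_classes = 1_000_000, 32, 2
binned = np.random.randint(0, n_bins, n_samples).astype(np.int32)
labels = np.random.randint(0, n_classes, n_samples).astype(np.int32)
hist = np.zeros((n_bins, n_classes), dtype=np.int32)

threads_per_block = 256
blocks = 1024                                        # grid-stride loop covers the remainder
bin_label_histogram[blocks, threads_per_block](binned, labels, hist)
# Split gains (Gini, entropy, ...) for each candidate threshold can then be scored from hist.
```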
Yes, there is additional parallelism at multiple steps of the training algorithm. For a detailed understanding, I'd recommend starting here and checking out the kernels launched in that method.
Mostly, training data is transferred from CPU to GPU at the beginning of the process and then accessed from global device memory. In principle, training could be batched so that we transfer a batch to the device, perform all necessary accesses, and then move on to the next batch, but we don't currently implement that. There are some additional details around where we use host memory internally that others can answer better than I can.
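From the user's side, the practical takeaway is that you can hand fit() data that already lives on the device (a CuPy array or cuDF DataFrame) so the one-time host-to-device copy happens up front, on your terms. A small usage sketch with arbitrary sizes and dtypes:

```python
import cupy as cp
import numpy as np
from cuml.ensemble import RandomForestClassifier

X_host = np.random.rand(200_000, 32).astype(np.float32)
y_host = np.random.randint(0, 2, 200_000).astype(np.int32)

X_dev = cp.asarray(X_host)            # the one host-to-device transfer, done up front
y_dev = cp.asarray(y_host)

clf = RandomForestClassifier(n_estimators=100, max_depth=16)
clf.fit(X_dev, y_dev)                 # training then reads from device global memory
```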
You may need to wait for other folks to give you a more complete answer here. My understanding is that this is because we allocate space for the maximum potential number of nodes and do not want to have to bring the entire parallel training process to a halt if we run out of room and need a reallocation.
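To see why that allocation matters, note that a complete binary tree of depth d has 2^(d+1) - 1 nodes, so a buffer reserved for the worst case grows exponentially with max_depth. A quick back-of-the-envelope calculation (the 32 bytes per node is a made-up figure, only there to show the scaling):

```python
BYTES_PER_NODE = 32                              # hypothetical per-node record size
for depth in (8, 16, 24, 32):
    max_nodes = 2 ** (depth + 1) - 1             # complete binary tree of this depth
    mib = max_nodes * BYTES_PER_NODE / 2**20
    print(f"max_depth={depth:2d}: up to {max_nodes:,} nodes (~{mib:,.1f} MiB per tree)")
```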
In general, RandomForest follows most other ML algorithms in terms of its GPU acceleration characteristics: the larger the dataset or the model, the greater the benefit GPUs tend to offer. The exact cutoff is hardware dependent, but you can see some example benchmarks here.
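If you want to measure the crossover on your own hardware, something like the following sketch works; the numbers you get are hardware dependent, the dataset is synthetic, and this is not an official benchmark:

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier as SkRF
from cuml.ensemble import RandomForestClassifier as CuRF

X, y = make_classification(n_samples=500_000, n_features=32, random_state=0)
X, y = X.astype(np.float32), y.astype(np.int32)

for name, model in [("sklearn (all CPU cores)", SkRF(n_estimators=100, n_jobs=-1)),
                    ("cuml (GPU)", CuRF(n_estimators=100))]:
    start = time.perf_counter()
    model.fit(X, y)
    print(f"{name}: fit took {time.perf_counter() - start:.2f} s")
```

Varying n_samples and n_estimators is the easiest way to see where the GPU starts to win for your workload.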
After the holidays, the folks primarily responsible for the training code can give you much more detailed answers, and if you have questions about inference in the meantime, I can answer that in as much detail as you like. Hope this at least gives you a start on what you need!