Adds better Affine support for GPUs when using CUDA 11. Introduces a new bias addition kernel for CUDA < 11 #778
Conversation
…y RELU for inference. When upgrading to CUDA 11, the bias and relu can be fused into the matrix multiply with cublasLt.
I will stop submitting PRs for now and focus on addressing your feedback on the individual changes as it comes. The PRs submitted so far have the majority of the perf benefits of #743 for FP32.
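For readers following along, a minimal sketch of what a standalone bias-add + relu kernel for the CUDA < 11 path could look like; the kernel name `biasAddRelu`, the row-major layout, and the launch configuration are illustrative assumptions, not the PR's actual code:

```cuda
// Hypothetical sketch of a bias-add + ReLU kernel for the CUDA < 11 path,
// applied after a plain cuBLAS GEMM. Assumes row-major C of shape
// rows x cols and a bias vector of length cols.
__global__ void biasAddRelu(float* C, const float* bias, int rows, int cols) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= rows * cols) return;
  int col = idx % cols;            // bias is broadcast along the rows
  float v = C[idx] + bias[col];
  C[idx] = v > 0.f ? v : 0.f;      // ReLU
}

// Launch example:
//   int n = rows * cols;
//   biasAddRelu<<<(n + 255) / 256, 256>>>(C, bias, rows, cols);
```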
Looking forward to this one!
FYI: looking into this now.
@rhenry-nv can you make me a contributor in your fork? I am looking at the merge conflicts here and would like to change a few things on the way.
From what I am seeing here, we could in theory create an affine node that takes any activation function and runs a kernel after the multiply, is that correct? That would be more general than relu?
I resolved the merge conflicts on each of these branches a week ago. Yea, we could make an affine node that takes in any activation. The reason it is RELU-specific is that cublasLt only allows relu and bias addition to be fused into the gemm kernel. For other activation functions, it may make sense to use CUTLASS to get the fusion for a generic affine node. Without the fusion, I don't expect a performance benefit, since with multiple kernels we would still see launch overheads and pay to go to DRAM to do a small amount of work.
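For context, a minimal sketch of how bias + relu can be requested as a cuBLASLt epilogue on CUDA 11; the helper name `setReluBiasEpilogue` is hypothetical, and error handling, matrix layouts, and the final `cublasLtMatmul` call are omitted:

```cuda
#include <cublasLt.h>

// Sketch: configure a cuBLASLt matmul descriptor so that bias addition and
// ReLU are fused into the GEMM epilogue (CUDA 11+). Error checking omitted.
void setReluBiasEpilogue(cublasLtMatmulDesc_t opDesc, const float* bias) {
  cublasLtEpilogue_t epi = CUBLASLT_EPILOGUE_RELU_BIAS;  // bias add + ReLU
  cublasLtMatmulDescSetAttribute(
      opDesc, CUBLASLT_MATMUL_DESC_EPILOGUE, &epi, sizeof(epi));
  cublasLtMatmulDescSetAttribute(
      opDesc, CUBLASLT_MATMUL_DESC_BIAS_POINTER, &bias, sizeof(bias));
  // The descriptor is then passed to cublasLtMatmul() together with the
  // usual A/B/C layouts; without the epilogue, bias and ReLU would need
  // separate kernels after the GEMM.
}
```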
FYI - I have sent an invite. I also have updates to the other branches locally merged with a more recent version of master that I will push. I will take care of any merge conflicts with the current master later this week.
Alright, got the invite and accepted. So relu is in fact special in this case and it makes sense to single it out. I will take a look into this. Might refactor a few things.
Yes, relu is special in this case. Something more general is definitely a separate PR.
Yes, I am actually partial to doing the opposite. Since affine+relu is special, it deserves to be handled as its own thing rather than trying to cram everything into the original Affine node. Gimme some time to think about that.
Do we know if other frameworks like PyTorch have that fusion? And how are they handling it?
I am not sure about PyTorch's ability to fuse into GEMMs. From quick googling, it looks like it can JIT-fuse several pointwise ops. I'm not sure if TorchScript is advanced enough to fuse into gemms. When I get some more time, I can run a simple PyTorch program with nvprof to see if any fusion is happening. For TensorRT, we do these fusions manually. AFAIK, the way one would handle fusions with gemms in general is using CUTLASS.
OK, don't worry about it too much. I was mostly curious whether anyone else has explicitly created an affine+relu operator.
That would also be justified because, if I am not wrong, one can actually do a relatively simple gradient computation for that fused operator because of the properties of relu specifically.
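A hedged sketch of that observation: since relu(x) is positive exactly where its derivative is 1, the stored forward output is enough to mask the incoming gradient, so the pre-activation never has to be kept around (names are illustrative, not the PR's code):

```cuda
// Sketch: backward of y = relu(x*W + b). Because drelu/dz is 1 where the
// forward output y is > 0 and 0 elsewhere, the stored output y alone can
// mask the incoming gradient; the masked gradient then feeds the usual
// GEMM-based gradients for x, W and b.
__global__ void reluAffineBackwardMask(float* grad, const float* y, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) grad[idx] = y[idx] > 0.f ? grad[idx] : 0.f;
}
```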
Submitted a PR to your PR in your fork. Didn't touch the CUDA code itself, just some reorganization of the API code.
Sounds good. I should have some time tomorrow to check performance.
There seems to be a slowdown in the GPU backend compared to the first time I ran it. I am not sure if it is coming from a change in master or from this PR. I don't think it's this PR, since I checked perf some time ago. I will profile the most recent master branch next week.
OK, thanks. There were a ton of changes, so it's possible.
Looks like the slowdown was introduced in the current master. At least for batch 64, on a proxy model, an older version of master (commit 467b15e) takes 11.44s to translate an entire dataset. The current commit takes 14.8s on the same dataset. Trying to figure this out now.
Small refactoring of cuda_11
@emjotde Your PR looks good compared to the current master, so I merged it into this PR to unblock it for now. I profiled the GPU backend, and the time spent in forward looks similar (kernel time + CPU time to launch kernels, etc.). However, the time between calls to forward in beam search looks to have doubled. The beam search code has not changed much from the 'good' commit. Do you know if the way the graph is constructed has changed much recently? This is my first guess, though unlikely.
Not that I know of, but we have turned off TCMALLOC recently. Could that be it? Are you linking against TCMALLOC?
Yes, reverted PR #840. Will take another look at that particular problem later this week.
Merged. Thanks for your patience. Which one next?
Great, over to #768 then.
Hi @rhenry-nv, it complains about a misaligned address, as seen below:
Is there some sort of slicing going on in this model? Which CUDA version was this using? Is this half or float precision? What are the GEMM m/n/k dims, and is this doing both relu and bias addition? Also, if possible, could you print the following? I came across a similar issue in 11.1 (I think), which is why the code path launches a separate kernel if the bias pointer is not a multiple of 8. However, I only checked this fix on Volta and Turing. A more recent version of cuBLAS also fixes the bug on Turing and Volta. I will check Ampere when I get some time.
This will be CUDA 11.1 and fp32. Currently that's all I can say until I manage to find out how to run a debug session on the new cluster. Will update here, hopefully soon. Also, this is the training path, so it should not really run the combined matmul + relu (we only run that during inference). I would guess this is somehow caused by the matmul alone.
We use this matmul + bias fusion (no relu) inside of training (I think), which is where I initially saw the issue. As a quick check, you could set … I have a couple of guesses about what else could be wrong apart from this. I will take a look later this week to see if I can reproduce on an A100.
So, currently your guess would be that the bias pointer might not be a multiple of 16? That might be the case for the output layer.
Yep, that is currently my best guess.
Seems to do it. Are there any larger drawbacks?
The drawback is that we will only get fusion when the bias is 16-byte aligned. Otherwise, we fall back to the unfused code path, which would be slightly slower. If I remember right, the Marian allocator always gives 16-byte-aligned pointers for the GPU backend (correct me if I'm wrong). If this is true, we will only take the slower path when we slice biases. I will check a more recent CUDA version and file a bug internally if it still exists.
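A minimal sketch of the dispatch this implies, using a hypothetical helper `isAligned16`; it illustrates the fallback logic described above, not the PR's actual code:

```cuda
#include <cstdint>

// Sketch: take the fused cuBLASLt epilogue only when the bias pointer is
// 16-byte aligned; otherwise fall back to GEMM plus a separate bias/ReLU
// kernel, which is slightly slower but always safe.
inline bool isAligned16(const void* ptr) {
  return reinterpret_cast<std::uintptr_t>(ptr) % 16 == 0;
}

// if (isAligned16(bias)) { /* fused cublasLt path */ }
// else                   { /* plain GEMM + biasAddRelu kernel */ }
```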
256-byte aligned actually. Some time ago it turned out that matmul is a lot faster if that kind of alignment is enforced. Not sure if that is still the case. OK, I will change things to 16 then. Thanks.
Description
This is related to issue #680. It adds better affine support for GPUs.
List of changes:
Times from a proxy model with and without FP16. This model has a factored vocabulary.
This PR includes features not present in #743. I expect that this will only be more performant than #743 for FP16 using CUDA 11.
PS: I believe the time for batch 1 is lower with FP16 due to a performance issue with cuBLAS 11. It causes a high start-up cost for stridedBatchedGemms, which impacts batch 1 more than larger batch sizes. A workaround is to benchmark and specify algorithms to cuBLAS for different architectures. I have done this for Turing (T4) and Volta (Titan V and V100) but did not include it in this PR to keep it smaller. I can add this change if requested (a hedged sketch of this workaround follows the timing captions below).
Times with one stream
Times with two streams
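Regarding the algorithm-pinning workaround mentioned in the PS above, a hedged sketch of what it could look like with `cublasGemmStridedBatchedEx`; the wrapper name, the FP32 types, and the idea of passing a per-architecture benchmarked `cublasGemmAlgo_t` are assumptions, not code from this PR:

```cuda
#include <cublas_v2.h>

// Sketch: pin a benchmarked algorithm per GPU architecture instead of letting
// cuBLAS choose, to avoid the start-up cost seen for strided batched GEMMs.
// Column-major, no transposes, FP32 in/out; 'algo' would come from offline
// benchmarking (e.g. a table keyed by compute capability).
cublasStatus_t stridedBatchedGemmFP32(cublasHandle_t handle,
                                      int m, int n, int k,
                                      const float* A, long long strideA,
                                      const float* B, long long strideB,
                                      float* C, long long strideC,
                                      int batchCount, cublasGemmAlgo_t algo) {
  const float alpha = 1.f, beta = 0.f;
  return cublasGemmStridedBatchedEx(
      handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
      &alpha,
      A, CUDA_R_32F, /*lda=*/m, strideA,
      B, CUDA_R_32F, /*ldb=*/k, strideB,
      &beta,
      C, CUDA_R_32F, /*ldc=*/m, strideC,
      batchCount,
      CUBLAS_COMPUTE_32F,   // compute type (cudaDataType_t CUDA_R_32F on CUDA 10)
      algo);                // e.g. CUBLAS_GEMM_DEFAULT as a baseline
}
```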
Added dependencies: none
How to test
I ran the regression tests and most passed with CUDA 11. With CUDA 10, the regression tests pass as expected. I also manually tested on a proxy model with fp16 and the results seem sensible.
CMake command: cmake .. -DCOMPILE_CPU=on -DCOMPILE_CUDA=on -DUSE_SENTENCEPIECE=on -DUSE_STATIC_LIBS=off -DCOMPILE_SERVER=off -DUSE_FBGEMM=on -DCOMPILE_CUDA_SM35=off -DCOMPILE_CUDA_SM50=off -DCOMPILE_CUDA_SM60=off -DCOMPILE_CUDA_SM70=on -DCOMPILE_CUDA_SM75=off -DCOMPILE_TESTS=on
Ubuntu - 18.04.3 LTS
nvcc - 10.1.243
gcc - 7.5.0
Checklist