Faster Cuda Decoder #4811
Open: galv wants to merge 8 commits into kaldi-asr:master from galv:dgalvez/faster-cudadecoder-upstream
Commits on Dec 13, 2022
-
[src] Several cuda decoder fixes.
These affect both correctness and performance.

- Add missing cudaStreamSynchronize(). This was not caught before because we were running at smaller batch sizes, which allowed the init decoding kernels to run in parallel with the nnet3 kernels, and thus have completed at this point. At large enough batch sizes, no such parallelization is possible (all blocks of the GPU are occupied).
- Faster host paged to pinned memory copy via multithreading.
- Disable timing in cuda events for increased performance.

Before (on A100 PCIe): Overall: Aggregate Total Time: 26.6364 Total Audio: 194525 RealTimeX: 7302.96
After (on A100 PCIe): Overall: Aggregate Total Time: 26.0323 Total Audio: 194525 RealTimeX: 7472.43

- In online decoder, create writers before initializing cuda. CUDA initialization creates a lot of virtual memory (for unified virtual memory, if I understand correctly) that can cause errors if memory oversubscription is not set high enough when using the fork() syscall. The issue is further described here: https://groups.google.com/g/kaldi-help/c/3hc0xsRpqqY?pli=1
- Add cudaProfilerStart/Stop to online binary.
- Name H2H copy threads in NSight Systems.
Commit 31d61c0
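To illustrate the "host paged to pinned memory copy via multithreading" item in the commit message above, here is a minimal sketch in standard C++. It is not the code from this pull request: `ParallelMemcpy` is a hypothetical helper name, and the destination is treated as an ordinary buffer, whereas in the decoder it would presumably be a pinned allocation. The idea is only that a large host-to-host copy is split into contiguous chunks, each copied by its own thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical helper (not from the PR): copies `bytes` bytes from `src` to
// `dst`, splitting the range into contiguous chunks handled by separate threads.
// The two buffers must not overlap.
void ParallelMemcpy(void* dst, const void* src, std::size_t bytes,
                    unsigned num_threads) {
  assert(num_threads > 0);
  // Round up so that num_threads chunks always cover the whole range.
  const std::size_t chunk = (bytes + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < num_threads; ++t) {
    const std::size_t begin = std::min(bytes, t * chunk);
    const std::size_t end = std::min(bytes, begin + chunk);
    if (begin == end) break;  // nothing left to copy
    workers.emplace_back([=] {
      std::memcpy(static_cast<char*>(dst) + begin,
                  static_cast<const char*>(src) + begin, end - begin);
    });
  }
  for (auto& w : workers) w.join();
}
```

Spawning threads on every call is the simplest form of the idea; a long-lived set of copy threads would avoid the thread start-up cost on each copy.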
Commit 2c247a1
Decrease latency by doing partial hypothesis work on host at the same
time as cuda calls on the device.
Commit abbaa6e
Commit 0d31458
Add RTFx calculation to online decoder.
Note that the max RTFx in online mode is necessarily --num-parallel-streaming-channels
Commit dc95de1
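As a worked illustration of the RTFx figure mentioned above, the sketch below assumes that RTFx is total audio duration divided by wall-clock decoding time, with both in seconds. That assumption is consistent with the numbers quoted in the first commit message (194525 / 26.6364 is roughly 7303). The function names are hypothetical, and the online bound rests on the further assumption that each streaming channel is fed audio no faster than real time.

```cpp
#include <cassert>

// Hypothetical helper: real-time factor (RTFx), assumed here to be
// seconds of audio decoded per second of wall-clock time.
double ComputeRtfx(double audio_seconds, double wall_seconds) {
  assert(wall_seconds > 0.0);
  return audio_seconds / wall_seconds;
}

// Assumption: in online mode each channel receives audio in real time, so a
// single channel contributes at most 1x and the total is bounded by the
// number of parallel streaming channels.
double MaxOnlineRtfx(int num_parallel_streaming_channels) {
  assert(num_parallel_streaming_channels > 0);
  return static_cast<double>(num_parallel_streaming_channels);
}
```

Under these assumptions, an online run reporting an RTFx close to the value of --num-parallel-streaming-channels is already near its ceiling, and a higher figure would require more channels.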
Improve online performance of cuda decoder.
Use a thread pool that sleeps when there is no data to retrieve. Sort data at the right point to improve cache performance. Remove spin locks implemented with atomics. These cause slowdowns compared to condition variables, in particular because we cannot sleep accurately for 200 microseconds or less (a 200 microsecond sleep turns out to take 250 microseconds). These delays cause unnecessary slowdown.
Commit 501b9d9
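The commit message above describes a thread pool whose workers sleep until data is available, using condition variables instead of spin locks with short sleeps. A minimal sketch of that pattern in standard C++ follows. `SleepingThreadPool` and its methods are hypothetical names chosen for illustration; this is not the pool implemented in the pull request.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical illustration: workers block on a condition variable while the
// queue is empty, instead of polling an atomic flag with short sleeps.
class SleepingThreadPool {
 public:
  explicit SleepingThreadPool(unsigned num_threads) {
    assert(num_threads > 0);
    for (unsigned i = 0; i < num_threads; ++i)
      workers_.emplace_back([this] { WorkerLoop(); });
  }

  // Drains the remaining tasks, then joins all workers.
  ~SleepingThreadPool() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stopping_ = true;
    }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }

  void Push(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();  // wake exactly one sleeping worker
  }

 private:
  void WorkerLoop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        // Sleeps without consuming CPU until there is work or shutdown begins.
        cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // stopping and nothing left to run
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // run outside the lock so other workers can proceed
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool stopping_ = false;
  std::vector<std::thread> workers_;
};
```

With this design the wake-up time depends on the scheduler responding to `notify_one` rather than on the granularity of a timed sleep, which is the problem the commit message attributes to the spin-lock approach.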
This is to fix a CI error. It appears that this is from using "ubuntu-latest" in the CI workflow. It got upgraded to ubuntu 22.04 automatically, and this doesn't have python2.7 by default.
Commit 6d94122
Commit 87d577f