Faster Cuda Decoder #4811
These affect both correctness and performance.

- Add missing cudaStreamSynchronize(). This was not caught before because we were running at smaller batch sizes, which allowed the init decoding kernels to run in parallel with the nnet3 kernels, and thus have completed at this point. At large enough batch sizes, no such parallelization is possible (all blocks of the GPU are occupied).
- Faster host paged to pinned memory copy via multithreading.
- Disable timing in cuda events for increased performance.

Before (on A100 PCIe): Overall: Aggregate Total Time: 26.6364 Total Audio: 194525 RealTimeX: 7302.96
After (on A100 PCIe): Overall: Aggregate Total Time: 26.0323 Total Audio: 194525 RealTimeX: 7472.43

- In online decoder, create writers before initializing CUDA. CUDA initialization creates a lot of virtual memory (for unified virtual memory, if I understand correctly) that can cause errors if memory oversubscription is not set high enough when using the fork() syscall. The issue is further described here: https://groups.google.com/g/kaldi-help/c/3hc0xsRpqqY?pli=1
- Add cudaProfilerStart/Stop to online binary.
- Name H2H copy threads in NSight Systems.
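To illustrate the "host paged to pinned memory copy via multithreading" item, here is a minimal sketch of a chunked host-to-host copy. It is not the code from this PR: `ParallelHostCopy` and its parameters are hypothetical names, and the destination is treated as an ordinary host buffer. In the decoder it would presumably be page-locked memory, which is omitted here so the sketch builds without the CUDA toolkit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical helper (not from the PR): copy `bytes` bytes from `src` to
// `dst`, splitting the range into contiguous chunks that are copied by up to
// `num_threads` threads. The buffers must not overlap.
void ParallelHostCopy(void *dst, const void *src, std::size_t bytes,
                      std::size_t num_threads) {
  assert(num_threads > 0);
  char *d = static_cast<char *>(dst);
  const char *s = static_cast<const char *>(src);
  // Round up so that the chunks together cover the whole range.
  const std::size_t chunk = (bytes + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < num_threads; ++t) {
    const std::size_t begin = t * chunk;
    if (begin >= bytes) break;  // nothing left for this thread
    const std::size_t len = std::min(chunk, bytes - begin);
    workers.emplace_back([=] { std::memcpy(d + begin, s + begin, len); });
  }
  // The copy is complete only once every worker has finished.
  for (std::thread &w : workers) w.join();
}
```

A call such as `ParallelHostCopy(pinned, paged, n, 4)` splits the copy four ways. Spawning threads per call is only for brevity; a long-running service would more likely reuse worker threads.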
time as cuda calls on the device.
Note that the max RTFx in online mode is necessarily --num-parallel-streaming-channels
Use a thread pool that sleeps when there is no data to retrieve. Sort data at the right point to improve cache performance. Remove the spin locks implemented with atomics. These cause slowdowns compared to condition variables, in particular because we cannot sleep accurately for 200 microseconds or less. (A 200 microsecond sleep turns out to take 250 microseconds.) These delays cause unnecessary slowdown.
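The idea of a pool whose workers sleep on a condition variable instead of spinning can be sketched as follows. This is a generic illustration, not the thread pool added in this PR; the class name `SleepingThreadPool` and its interface are assumptions made for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical pool (not the PR's implementation): workers block on a
// condition variable while the queue is empty, so an idle pool does not
// burn CPU time and wakes up as soon as work is submitted.
class SleepingThreadPool {
 public:
  explicit SleepingThreadPool(std::size_t num_threads) {
    for (std::size_t i = 0; i < num_threads; ++i)
      workers_.emplace_back([this] { WorkerLoop(); });
  }

  // Lets the workers drain the remaining tasks, then joins them.
  ~SleepingThreadPool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    cv_.notify_all();
    for (std::thread &w : workers_) w.join();
  }

  void Submit(std::function<void()> task) {
    assert(task);  // reject empty std::function objects
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();  // wake one sleeping worker
  }

 private:
  void WorkerLoop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        // Sleep until there is work or the pool is shutting down.
        cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // shutting down and queue drained
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // run outside the lock
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool stopping_ = false;
  std::vector<std::thread> workers_;  // declared last: starts after the rest
};
```

The predicate form of `wait` handles spurious wakeups, and the wake-up is driven by `notify_one` rather than by a timed sleep, which avoids the sleep-overshoot problem described above.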
FYI, CI is failing with:
This is to fix a CI error. It appears that this is from using "ubuntu-latest" in the CI workflow. It got upgraded to ubuntu 22.04 automatically, and this doesn't have python2.7 by default.
Fixed CI (so far)
FYI @danpovey you might find these very low latencies exciting. I'm going to be incorporating this into https://github.com/nvidia-riva/riva-asrlib-decoder (via the kaldi submodule within that project) so that CTC models (and hopefully something like your FSA-based RNN-T decoder) can benefit as well.
namespace kaldi {

class join_threads {
It seems the class name does not follow the Kaldi naming convention.
Yes, it was adapted from a book that uses a different style: https://github.com/kaldi-asr/kaldi/pull/4811/files#diff-d472827499864b67e7925a3ea6f3b95d7b7cba4d0bc745a786c0bac6258fffc5R18-R19
@galv thread-pool-light.h is deleted from the latest commit, but batched-threaded-nnet3-cuda-pipeline.h still calls it.
@ravi-shanker-m that's been deprecated for a few years now: kaldi/src/cudadecoder/batched-threaded-nnet3-cuda-pipeline.h Lines 35 to 36 in be22248
I'm happy to go ahead and remove that code. Were you using it for some reason?
For this question, I've tried both cuda decoder v1 and v2, and v1 gives better RTF in my case, so my old service uses this implementation. Maybe nvidia provides the kaldi docker with parameters optimized for their computing resources and I have not tried enough.
@trunglebka, I'm happy to provide advice if you give more detail. I would sincerely doubt that you reach anywhere near 8000 RTFx on the v1 cuda decoder on an A100 (or whatever GPU you are using). The nvidia kaldi container is not anything special. It's just a pre-built kaldi from open source with some CI to make sure that nothing has broken. You can reproduce my work by running the librispeech model I linked in the first comment on librispeech test-clean, using the command line flags I specify.
I've retired from my old company. In my case, after some experiments tuning parameters using the nvidia kaldi docker with a T4, v1 gave me about 500 RTFx but v2 only about 350 RTFx. Due to a deadline I did not have enough time to experiment more, so I just picked v1. So I think the problem may be in the choice of parameters.
@trunglebka Okay. I found several performance problems with the v2 decoder during my work on making this PR and this is very close to the "speed of light", so I'm not concerned about the v1 decoder being any better than this one.
Yeah, I just wanted to give you context on where v1 is being used.
Yes, that's cool! Thanks!
@galv plz merge once you feel it's complete
This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.
@galv good to merge?
Hi, could you tell me whether ivectors were used for the 7800 RTFx result? The config file for ivectors is not passed in the script above. Also, for chunk size = 30, batched-wav-nnet3-cuda-online gives Assertion failed: ("Please set --frames-per-chunk at least as large as the neural net " "right context" && input_frames_per_chunk_ >= total_nnet_right_context_)
@zulkarneev does the same issue happen with the previous version of the decoder?
Dan, what version do you mean?
There were several issues recently discovered with the cuda decoder in both offline and online mode.
After my fixes, I can achieve 7800 RTFx throughput on librispeech test-clean and the model https://kaldi-asr.org/models/m13 with an A100-80GB PCIe card in the offline mode of computation. Previously, because of some unnoticed software regressions, this number was as low as 4000 RTFx, which isn't bad, admittedly.
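For readers unfamiliar with the metric, RTFx is the amount of audio decoded divided by the wall-clock time it took. The sketch below reproduces the "RealTimeX" values from the benchmark log quoted in the commit message, assuming (this is not stated in the log) that "Total Audio" and "Aggregate Total Time" are both in seconds. `RealTimeFactorX` is a hypothetical helper, not a Kaldi function.

```cpp
#include <cassert>

// Hypothetical helper: RTFx = seconds of audio decoded per second of
// wall-clock time. Both arguments are assumed to be in seconds.
double RealTimeFactorX(double total_audio_seconds, double wall_clock_seconds) {
  assert(wall_clock_seconds > 0.0);
  return total_audio_seconds / wall_clock_seconds;
}
```

With the logged figures, `RealTimeFactorX(194525, 26.6364)` is roughly 7303 and `RealTimeFactorX(194525, 26.0323)` is roughly 7472, matching the reported values up to rounding of the logged times.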
Latency is more complicated, but here is a preliminary result with this model https://kaldi-asr.org/models/m13 on librispeech test-clean:
This was achieved via the following hyperparameter sweep:
Do note that better results can be achieved sometimes by setting maximum batch size lower than the number of channels. Average latency is, of course, much smaller. This means users can do real-time decoding at 3000-4000 audio streams concurrently.
This is the "compute" latency. It doesn't include the time spent waiting for the right hand context (21 frames, or 210 ms in this case). The point is that it is incredibly fast.
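As a small worked example of the distinction above: a right context of 21 frames corresponding to 210 ms implies a 10 ms frame shift, and the latency a user observes is the compute latency plus that wait. The function names below are hypothetical, and any compute latency passed in is a placeholder rather than a measured result from this PR.

```cpp
#include <cassert>

// Hypothetical helper: time (ms) spent waiting for the right-hand context
// before the most recent frame can be processed.
double RightContextWaitMs(int right_context_frames, double frame_shift_ms) {
  assert(right_context_frames >= 0 && frame_shift_ms > 0.0);
  return right_context_frames * frame_shift_ms;
}

// Hypothetical helper: compute latency plus the right-context wait.
double EndToEndLatencyMs(double compute_latency_ms, int right_context_frames,
                         double frame_shift_ms) {
  return compute_latency_ms +
         RightContextWaitMs(right_context_frames, frame_shift_ms);
}
```

For this model, `RightContextWaitMs(21, 10.0)` gives the 210 ms mentioned above.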