Faster Cuda Decoder #4811

galv · 2022-12-13T17:26:41Z

There were several issues recently discovered with the cuda decoder in both offline and online mode.

After my fixes, I can achieve 7800 RTFx throughput on librispeech test-clean and the model https://kaldi-asr.org/models/m13 with an A100-80GB PCIe card in the offline mode of computation. Previously, because of some unnoticed software regressions, this number was as low as 4000 RTFx, which isn't bad, admittedly.

Latency is more complicated, but here is a preliminary result with this model https://kaldi-asr.org/models/m13 on librispeech test-clean:

This was achieved via the following hyperparameter sweep:

for chunk_size in 21 30 40 50; do
    for num_streaming_channels in 1000 2000 3000 4000 5000 6000; do
        max_batch_size=$((num_streaming_channels>4000 ? 4000 : num_streaming_channels))
        # The decoder logs to stderr; capture it so the latency lines can be parsed below.
        /home/dgalvez/scratch/code/asr/kaldi-a100-perf//src/cudadecoderbin/batched-wav-nnet3-cuda-online \
            --num-channels=$((num_streaming_channels * 2)) --cuda-use-tensor-cores=true \
            --main-q-capacity=30000 --aux-q-capacity=400000 --cuda-memory-proportion=0.5 \
            --max-batch-size=$max_batch_size --cuda-worker-threads=12 --file-limit=-1 \
            --cuda-decoder-copy-threads=4 --batching-copy-threads=8 --frame-subsampling-factor=3 \
            --frames-per-chunk=$chunk_size --max-mem=100000000 --beam=10 --lattice-beam=7 \
            --acoustic-scale=1.0 --determinize-lattice=true --max-active=10000 --iterations=10 \
            --config=/home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//models/LibriSpeech//conf/online.conf \
            --num-parallel-streaming-channels=$num_streaming_channels \
            --word-symbol-table=/home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//models/LibriSpeech//words.txt \
            /home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//models/LibriSpeech//final.mdl \
            /home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//models/LibriSpeech//HCLG.fst \
            scp:/home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//datasets/LibriSpeech/test_clean//wav_conv.scp \
            'ark:|gzip -c > /tmp/results/LibriSpeech/52/0/lat.gz' 2> output.log
        # Append one CSV row per run: four latency columns, then the swept parameters.
        grep -A 1 "Latencies" output.log | grep -v "Latencies" | awk 'BEGIN { OFS = ","; ORS = ""} {print $3,$4,$5,$6}' >> "$result_file"
        echo ",${chunk_size},${num_streaming_channels},${max_batch_size}" >> "$result_file"
    done
done
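
The channel-related flags derived inside the loop can be checked without a GPU. The following is a minimal sketch, not part of the PR: `derive_flags` is a hypothetical helper name, and the cap of 4000 and the factor of 2 are copied from the sweep above.

```shell
# Hypothetical helper (not part of Kaldi): prints the three channel-related
# flags that the sweep derives from one num_streaming_channels value.
derive_flags() {
    num_streaming_channels=$1
    # Cap the batch size at 4000, as in the sweep.
    max_batch_size=$((num_streaming_channels > 4000 ? 4000 : num_streaming_channels))
    # The sweep allocates twice as many channels as parallel streams.
    num_channels=$((num_streaming_channels * 2))
    echo "--num-channels=$num_channels --max-batch-size=$max_batch_size --num-parallel-streaming-channels=$num_streaming_channels"
}

# Dry run over the same channel counts as the sweep.
for n in 1000 2000 3000 4000 5000 6000; do
    derive_flags "$n"
done
```

For 5000 and 6000 streams the batch size stays at 4000, which is the case where the batch size is lower than the number of channels.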

Do note that better results can be achieved sometimes by setting maximum batch size lower than the number of channels. Average latency is, of course, much smaller. This means users can do real-time decoding at 3000-4000 audio streams concurrently.

This is the "compute" latency. It doesn't include the time spent waiting for the right hand context (21 frames, or 210 ms in this case). The point is that it is incredibly fast.
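
The conversion from frames to milliseconds implies a 10 ms frame shift. A small sketch of that arithmetic follows; `context_wait_ms` is a hypothetical helper, and the 10 ms default is an assumption inferred from the figures in the paragraph above.

```shell
# Hypothetical helper: time spent waiting for right context, in milliseconds.
# $1 = right-context frames, $2 = frame shift in ms (assumed 10 if omitted).
context_wait_ms() {
    frames=$1
    shift_ms=${2:-10}
    echo $((frames * shift_ms))
}

context_wait_ms 21   # prints 210
```

An end-to-end estimate would add this wait to the compute latency reported by the decoder.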

These affect both correctness and performance.

- Add missing cudaStreamSynchronize(). This was not caught before because we were running at smaller batch sizes, which allowed the init decoding kernels to run in parallel with the nnet3 kernels, and thus have completed at this point. At large enough batch sizes, no such parallelization is possible (all blocks of the GPU are occupied).
- Faster host paged to pinned memory copy via multithreading.
- Disable timing in cuda events for increased performance.

  Before (on A100 PCIe): Overall: Aggregate Total Time: 26.6364 Total Audio: 194525 RealTimeX: 7302.96

  After (on A100 PCIe): Overall: Aggregate Total Time: 26.0323 Total Audio: 194525 RealTimeX: 7472.43
- In online decoder, create writers before initializing cuda. CUDA initialization creates a lot of virtual memory (for unified virtual memory, if I understand correctly) that can cause errors if memory oversubscription is not set high enough when using the fork() syscall. The issue is further described here: https://groups.google.com/g/kaldi-help/c/3hc0xsRpqqY?pli=1
- Add cudaProfilerStart/Stop to online binary.
- Name H2H copy threads in NSight Systems.
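
The RealTimeX figures above are consistent with total audio duration divided by aggregate run time, assuming both totals are in seconds. A rough sketch of that calculation with awk follows; `rtfx` and `speedup_pct` are hypothetical helper names, and because the logged totals are rounded, the result may differ from the logged RealTimeX in the last digits.

```shell
# Hypothetical helper: real-time factor = seconds of audio / seconds of run time.
rtfx() {
    awk -v audio="$1" -v elapsed="$2" 'BEGIN { printf "%.0f\n", audio / elapsed }'
}

# Hypothetical helper: percentage speedup from an old run time to a new one.
speedup_pct() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", (before / after - 1) * 100 }'
}

rtfx 194525 26.6364          # before: prints 7303
rtfx 194525 26.0323          # after:  prints 7472
speedup_pct 26.6364 26.0323  # prints 2.3
```

By this estimate, the before/after pair corresponds to a gain of roughly 2.3% in offline throughput.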

Decrease latency by doing partial hypothesis work on host at the same time as cuda calls on the device.

Note that the max RTFx in online mode is necessarily --num-parallel-streaming-channels

Use a thread pool that sleeps when there is no data to retrieve. Sort data at the right point to improve cache performance. Remove the spin locks built on atomics. They are slower than condition variables, in particular because we cannot sleep accurately for 200 microseconds or less (a 200 microsecond sleep turns out to take 250 microseconds). These delays cause unnecessary slowdown.

galv · 2022-12-13T18:47:27Z

FYI, CI is failing with:

extras/check_dependencies.sh: python2.7 is not installed
extras/check_dependencies.sh: Some prerequisites are missing; install them using the command:
  sudo apt-get install python2.7
make: *** [Makefile:39: check_required_programs] Error 1

This is to fix a CI error. It appears that this is from using "ubuntu-latest" in the CI workflow. It got upgraded to ubuntu 22.04 automatically, and this doesn't have python2.7 by default.

galv · 2022-12-13T19:17:04Z

Fixed CI (so far)

galv · 2022-12-13T22:57:25Z

FYI @danpovey you might find these very low latencies exciting. I'm going to be incorporating this into https://github.com/nvidia-riva/riva-asrlib-decoder (via the kaldi submodule within that project) so that CTC models (and hopefully something like your FSA-based RNN-T decoder) can benefit as well.

trunglebka · 2022-12-14T01:51:35Z

src/cudadecoder/thread-pool-cia.h

+
+namespace kaldi {
+
+class join_threads {


It seems the class name does not follow the Kaldi naming convention.

Yes, it was adapted from a book that uses a different style: https://github.com/kaldi-asr/kaldi/pull/4811/files#diff-d472827499864b67e7925a3ea6f3b95d7b7cba4d0bc745a786c0bac6258fffc5R18-R19

ravi-shanker-m · 2022-12-14T07:11:23Z

@galv thread-pool-light.h is deleted from the latest commit, but batched-threaded-nnet3-cuda-pipeline.h still calls it.

galv · 2022-12-14T16:47:48Z

@ravi-shanker-m that's been deprecated for a few years now:

kaldi/src/cudadecoder/batched-threaded-nnet3-cuda-pipeline.h

Lines 35 to 36 in be22248

    
    // This pipeline is deprecated and will be removed. Please switch to
    // batched-threaded-nnet3-cuda-pipeline2

I'm happy to go ahead and remove that code.

Were you using it for some reason?

trunglebka · 2022-12-14T16:55:46Z

To answer this question: I tried both cuda decoder v1 and v2, and v1 gave better RTF in my case, so my old service used that implementation. Maybe the nvidia-provided kaldi docker image has parameters optimized for their computing resources, and I did not experiment enough.

galv · 2022-12-14T16:59:24Z

@trunglebka, I'm happy to provide advice if you give more detail. I would sincerely doubt that you reach anywhere near 8000 RTFx on the v1 cuda decoder on an A100 (or whatever GPU you are using).

The nvidia kaldi container is not anything special. It's just a pre-built kaldi from open source with some CI to make sure that nothing has broken. You can reproduce my work by running the librispeech model I linked in the first comment on librispeech test-clean, using the command line flags I specify.

trunglebka · 2022-12-14T17:15:12Z

I've retired from my old company. In my case, after some experiments tuning parameters with the nvidia kaldi docker image on a T4, v1 gave me about 500 RTFx but v2 only about 350 RTFx. Because of a deadline I did not have enough time to experiment more, so I just picked v1. So I think the problem may be my choice of parameters.

galv · 2022-12-14T17:28:04Z

@trunglebka Okay. I found several performance problems with the v2 decoder during my work on making this PR and this is very close to the "speed of light", so I'm not concerned about the v1 decoder being any better than this one.

trunglebka · 2022-12-14T17:38:02Z

Yeah, I just wanted to give you context on where v1 is being used.

danpovey · 2022-12-15T14:03:56Z

> FYI @danpovey you might find these very low latencies exciting. I'm going to be incorporating this into https://github.com/nvidia-riva/riva-asrlib-decoder (via the kaldi submodule within that project) so that CTC models (and hopefully something like your FSA-based RNN-T decoder) can benefit as well.

Yes, that's cool! Thanks!

jtrmal · 2023-02-13T21:50:16Z

@galv plz merge once you feel it's complete

stale · 2023-04-26T01:52:23Z

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

jtrmal · 2023-04-26T08:27:45Z

@galv good to merge?

zulkarneev · 2023-05-30T08:34:12Z

Hi, could you tell me whether ivectors were used for the 7800 RTFx result? A config file for ivectors is not passed in the script above. Also, for chunk size = 30, batched-wav-nnet3-cuda-online gives: Assertion failed: ("Please set --frames-per-chunk at least as large as the neural net " "right context" && input_frames_per_chunk_ >= total_nnet_right_context_)

danpovey · 2023-05-30T10:41:14Z

@zulkarneev does the same issue happen with the previous version of the decoder?

zulkarneev · 2023-05-30T13:16:44Z

Dan, what version do you mean?

stale · 2023-08-10T04:28:48Z

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

galv added 6 commits December 13, 2022 09:05

- 2c247a1: async memory copy to speed up online decoding.
- abbaa6e: Decrease latency by doing partial hypothesis work on host at the same time as cuda calls on the device.
- 0d31458: New Thread pool implementation.
- dc95de1: Add RTFx calculation to online decoder. Note that the max RTFx in online mode is necessarily --num-parallel-streaming-channels

galv added 2 commits December 13, 2022 11:03

- 6d94122: [misc] Install python2.7. This is to fix a CI error. It appears that this is from using "ubuntu-latest" in the CI workflow. It got upgraded to ubuntu 22.04 automatically, and this doesn't have python2.7 by default.
- 87d577f: Make codefactor changes.

trunglebka reviewed Dec 14, 2022

stale bot added the stale label Apr 26, 2023

stale bot removed the stale label Apr 26, 2023

stale bot added the stale label Aug 10, 2023
