Faster Cuda Decoder #4811

Open
wants to merge 8 commits into
base: master

Contributor

@galv galv commented Dec 13, 2022

Several issues were recently discovered in the CUDA decoder, in both offline and online modes.

After my fixes, I can achieve 7800 RTFx throughput on LibriSpeech test-clean with the model https://kaldi-asr.org/models/m13 on an A100-80GB PCIe card in offline mode. Previously, because of some unnoticed software regressions, this number was as low as 4000 RTFx, which admittedly isn't bad.

Latency is more complicated, but here is a preliminary result with this model https://kaldi-asr.org/models/m13 on librispeech test-clean:

image

This was achieved via the following hyperparameter sweep:

root=/home/dgalvez/scratch/code/asr/kaldi-a100-perf
model_dir=$root/workspace/models/LibriSpeech
for chunk_size in 21 30 40 50; do
    for num_streaming_channels in 1000 2000 3000 4000 5000 6000; do
        # Cap the batch size at 4000 once there are more than 4000 channels.
        max_batch_size=$((num_streaming_channels>4000 ? 4000 : num_streaming_channels))
        # Kaldi logs to stderr, so capture it for the parsing step below.
        $root/src/cudadecoderbin/batched-wav-nnet3-cuda-online \
            --num-channels=$((num_streaming_channels * 2)) --cuda-use-tensor-cores=true \
            --main-q-capacity=30000 --aux-q-capacity=400000 --cuda-memory-proportion=0.5 \
            --max-batch-size=$max_batch_size --cuda-worker-threads=12 --file-limit=-1 \
            --cuda-decoder-copy-threads=4 --batching-copy-threads=8 \
            --frame-subsampling-factor=3 --frames-per-chunk=$chunk_size --max-mem=100000000 \
            --beam=10 --lattice-beam=7 --acoustic-scale=1.0 --determinize-lattice=true \
            --max-active=10000 --iterations=10 --config=$model_dir/conf/online.conf \
            --num-parallel-streaming-channels=$num_streaming_channels \
            --word-symbol-table=$model_dir/words.txt \
            $model_dir/final.mdl $model_dir/HCLG.fst \
            scp:$root/workspace/datasets/LibriSpeech/test_clean/wav_conv.scp \
            'ark:|gzip -c > /tmp/results/LibriSpeech/52/0/lat.gz' 2> output.log
        # Append fields 3-6 of the line after "Latencies", then the swept parameters, as one CSV row.
        cat output.log | grep -A 1 "Latencies" | grep -v "Latencies" | awk 'BEGIN { OFS = ","; ORS = ""} {print $3,$4,$5,$6}' >> $result_file
        echo ",${chunk_size},${num_streaming_channels},${max_batch_size}" >> $result_file
    done
done

Note that setting the maximum batch size lower than the number of channels sometimes gives better results. Average latency is, of course, much smaller. This means users can do real-time decoding of 3000-4000 concurrent audio streams.

This is the "compute" latency. It doesn't include the time spent waiting for the right hand context (21 frames, or 210 ms in this case). The point is that it is incredibly fast.

The changes in this PR, listed below, affect both correctness and performance.

- Add missing cudaStreamSynchronize()

This was not caught before because we were running at smaller batch
sizes, which allowed the init decoding kernels to run in parallel with
the nnet3 kernels, and thus have completed at this point. At large
enough batch sizes, no such parallelization is possible (all blocks of
the GPU are occupied).

- Faster host paged to pinned memory copy via multithreading.
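
As a rough illustration of the idea (not the code in this PR), a pageable-to-pinned copy can be split into contiguous slices that are copied by separate threads. `ParallelCopy` is a hypothetical name; in the decoder the destination would be pinned (page-locked) host memory, but the sketch below uses ordinary host buffers so it runs without a GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical helper: copy `num_bytes` from `src` to `dst` using up to
// `num_threads` threads, each handling one contiguous slice.
// Assumption: `dst` stands in for pinned host memory; any host buffer works.
void ParallelCopy(void *dst, const void *src, std::size_t num_bytes,
                  std::size_t num_threads) {
  assert(num_bytes == 0 || (dst != nullptr && src != nullptr));
  if (num_threads == 0) num_threads = 1;
  // Round up so the slices together cover the whole buffer.
  const std::size_t slice = (num_bytes + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < num_threads; ++t) {
    const std::size_t begin = t * slice;
    if (begin >= num_bytes) break;  // nothing left for this thread
    const std::size_t len = std::min(slice, num_bytes - begin);
    workers.emplace_back([=] {
      std::memcpy(static_cast<char *>(dst) + begin,
                  static_cast<const char *>(src) + begin, len);
    });
  }
  for (std::thread &w : workers) w.join();
}
```

Whether this helps in practice depends on memory bandwidth and on how many copy threads are available; the sweep above uses --batching-copy-threads=8.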

- Disable timing in cuda events for increased performance.

Before (on A100 PCIe):

Overall:  Aggregate Total Time: 26.6364 Total Audio: 194525
RealTimeX: 7302.96

After (on A100 PCIe):

Overall:  Aggregate Total Time: 26.0323 Total Audio: 194525
RealTimeX: 7472.43
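
The logged RealTimeX values are consistent with total audio seconds divided by total wall-clock seconds, up to rounding of the printed numbers. A small sketch of that arithmetic follows; the function names are hypothetical and the formula is inferred from the log lines above, not taken from the Kaldi source.

```cpp
#include <cassert>

// Real-time factor (RTFx): seconds of audio decoded per second of wall time.
double RealTimeX(double total_audio_seconds, double total_time_seconds) {
  assert(total_time_seconds > 0.0);
  return total_audio_seconds / total_time_seconds;
}

// Relative throughput gain of `rtfx_after` over `rtfx_before`, as a fraction.
double RelativeGain(double rtfx_before, double rtfx_after) {
  assert(rtfx_before > 0.0);
  return rtfx_after / rtfx_before - 1.0;
}
```

With the numbers above, RealTimeX(194525, 26.6364) is about 7303 and RealTimeX(194525, 26.0323) is about 7472, a gain of roughly 2.3% from disabling event timing.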

- In the online decoder, create writers before initializing CUDA.

CUDA initialization creates a lot of virtual memory (for unified
virtual memory, if I understand correctly) that can cause errors if
memory oversubscription is not set high enough when using the fork()
syscall.

The issue is further described here:

https://groups.google.com/g/kaldi-help/c/3hc0xsRpqqY?pli=1

- Add cudaProfilerStart/Stop to online binary

- Name H2H copy threads in NSight Systems.
Note that the maximum RTFx in online mode is necessarily the value of --num-parallel-streaming-channels.

- Use a thread pool that sleeps when there is no data to retrieve.

- Sort data at the right point to improve cache performance.

- Remove the spin locks built on atomics. These cause slowdowns compared to
condition variables, in particular because we cannot sleep accurately
for 200 microseconds or less (a 200-microsecond sleep turns out to take
250 microseconds). These delays cause unnecessary slowdown.
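
To make the condition-variable approach concrete, here is a minimal sketch (not the thread pool added in this PR) of a work queue whose consumers sleep until data arrives instead of spinning or polling with short sleeps. `SleepingQueue` and `SumWithWorkers` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical queue: consumers block on a condition variable, so an idle
// worker costs no CPU and wakes as soon as an item is pushed.
class SleepingQueue {
 public:
  void Push(int item) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      items_.push(item);
    }
    cv_.notify_one();  // wake one sleeping consumer
  }

  // Blocks until an item is available or the queue is closed.
  // Returns false once the queue is closed and fully drained.
  bool Pop(int *item) {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return closed_ || !items_.empty(); });
    if (items_.empty()) return false;
    *item = items_.front();
    items_.pop();
    return true;
  }

  void Close() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      closed_ = true;
    }
    cv_.notify_all();  // release every waiting consumer
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<int> items_;
  bool closed_ = false;
};

// Pushes 1..num_items and sums them with `num_workers` sleeping threads.
long SumWithWorkers(SleepingQueue &queue, int num_workers, int num_items) {
  assert(num_workers > 0);
  std::mutex sum_mutex;
  long sum = 0;
  std::vector<std::thread> workers;
  for (int w = 0; w < num_workers; ++w) {
    workers.emplace_back([&] {
      int item = 0;
      while (queue.Pop(&item)) {
        std::lock_guard<std::mutex> lock(sum_mutex);
        sum += item;
      }
    });
  }
  for (int i = 1; i <= num_items; ++i) queue.Push(i);
  queue.Close();
  for (std::thread &t : workers) t.join();
  return sum;
}
```

The wake-up latency here is determined by the notification rather than by a sleep interval, which is the property the spin-lock version lacked.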
@galv
Contributor Author

galv commented Dec 13, 2022

FYI, CI is failing with:

extras/check_dependencies.sh: python2.7 is not installed
extras/check_dependencies.sh: Some prerequisites are missing; install them using the command:
  sudo apt-get install python2.7
make: *** [Makefile:39: check_required_programs] Error 1

This is to fix a CI error.

It appears that this is from using "ubuntu-latest" in the CI
workflow. It got upgraded to ubuntu 22.04 automatically, and this
doesn't have python2.7 by default.
@galv
Contributor Author

galv commented Dec 13, 2022

Fixed CI (so far)

@galv
Contributor Author

galv commented Dec 13, 2022

FYI @danpovey you might find these very low latencies exciting. I'm going to be incorporating this into https://github.com/nvidia-riva/riva-asrlib-decoder (via the kaldi submodule within that project) so that CTC models (and hopefully something like your FSA-based RNN-T decoder) can benefit as well.


namespace kaldi {

class join_threads {

It seems the class name does not follow the Kaldi naming convention.

@ravi-shanker-m

ravi-shanker-m commented Dec 14, 2022

@galv thread-pool-light.h is deleted from the latest commit, but batched-threaded-nnet3-cuda-pipeline.h still calls it.

@galv
Contributor Author

galv commented Dec 14, 2022

@ravi-shanker-m that's been deprecated for a few years now:

// This pipeline is deprecated and will be removed. Please switch to
// batched-threaded-nnet3-cuda-pipeline2

I'm happy to go ahead and remove that code.

Were you using it for some reason?

@trunglebka

To answer this question: I tried both CUDA decoder v1 and v2, and v1 gave better RTF in my case, so my old service uses that implementation. Maybe the NVIDIA-provided Kaldi Docker image has parameters optimized for their computing resources, and I did not experiment enough.

@galv
Contributor Author

galv commented Dec 14, 2022

@trunglebka, I'm happy to provide advice if you give more detail. I would sincerely doubt that you reach anywhere near 8000 RTFx on the v1 cuda decoder on an A100 (or whatever GPU you are using).

The nvidia kaldi container is not anything special. It's just a pre-built kaldi from open source with some CI to make sure that nothing has broken. You can reproduce my work by running the librispeech model I linked in the first comment on librispeech test-clean, using the command line flags I specify.

@trunglebka

trunglebka commented Dec 14, 2022

I've retired from my old company. In my case, after some parameter-tuning experiments using the NVIDIA Kaldi Docker image on a T4, v1 gave me about 500 RTFx but v2 only about 350 RTFx. Because of a deadline I did not have enough time to experiment more, so I just picked v1. So I think the problem may be in how I chose the parameters.

@galv
Contributor Author

galv commented Dec 14, 2022

@trunglebka Okay. I found several performance problems with the v2 decoder during my work on making this PR and this is very close to the "speed of light", so I'm not concerned about the v1 decoder being any better than this one.

@trunglebka

Yeah, I just wanted to give you context on where v1 is being used.

@danpovey
Contributor

> FYI @danpovey you might find these very low latencies exciting. I'm going to be incorporating this into https://github.com/nvidia-riva/riva-asrlib-decoder (via the kaldi submodule within that project) so that CTC models (and hopefully something like your FSA-based RNN-T decoder) can benefit as well.

Yes, that's cool! Thanks!

@jtrmal
Contributor

jtrmal commented Feb 13, 2023

@galv plz merge once you feel it's complete

@stale

stale bot commented Apr 26, 2023

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

@stale stale bot added the stale Stale bot on the loose label Apr 26, 2023
@jtrmal
Contributor

jtrmal commented Apr 26, 2023

@galv good to merge?

@stale stale bot removed the stale Stale bot on the loose label Apr 26, 2023
@zulkarneev

Hi, could you tell me whether ivectors were used for the 7800 RTFx result? No config file for ivectors is passed in the script above. Also, with chunk size = 30, batched-wav-nnet3-cuda-online gives: Assertion failed: ("Please set --frames-per-chunk at least as large as the neural net " "right context" && input_frames_per_chunk_ >= total_nnet_right_context_)

@danpovey
Contributor

@zulkarneev does the same issue happen with the previous version of the decoder?

@zulkarneev

Dan, what version do you mean?

@stale

stale bot commented Aug 10, 2023

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

@stale stale bot added the stale Stale bot on the loose label Aug 10, 2023
Labels
stale Stale bot on the loose
6 participants