Faster Cuda Decoder #4811

These affect both correctness and performance.

- Add missing cudaStreamSynchronize()

  This was not caught before because we were running at smaller batch sizes, which allowed the init decoding kernels to run in parallel with the nnet3 kernels, and thus have completed at this point. At large enough batch sizes, no such parallelization is possible (all blocks of the GPU are occupied).

- Faster host paged to pinned memory copy via multithreading.

- Disable timing in cuda events for increased performance.

  Before (on A100 PCIe):
  Overall: Aggregate Total Time: 26.6364 Total Audio: 194525 RealTimeX: 7302.96

  After (on A100 PCIe):
  Overall: Aggregate Total Time: 26.0323 Total Audio: 194525 RealTimeX: 7472.43

- In online decoder, Create writers before initializing cuda.

  CUDA initialization creates a lot of virtual memory (for unified virtual memory, if I understand correctly) that can cause errors if memory oversubscription is not set high enough when using the fork() syscall. The issue is further described here: https://groups.google.com/g/kaldi-help/c/3hc0xsRpqqY?pli=1

- Add cudaProfilerStart/Stop to online binary

- Name H2H copy threads in NSight Systems.
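As an illustration of the multithreaded host copy mentioned above, here is a minimal sketch of splitting one large host-to-host copy across several threads. It is not the code from this pull request: the function name `ParallelHostCopy` and its parameters are hypothetical, and the destination is an ordinary buffer rather than pinned memory so that the sketch runs without a GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical helper (not the Kaldi API): copy `bytes` bytes from `src` to
// `dst` using up to `num_threads` threads, each copying one contiguous slice.
// In a real decoder `dst` would be pinned (page-locked) host memory; that is
// an assumption here and is not required for the sketch to run.
void ParallelHostCopy(void* dst, const void* src, std::size_t bytes,
                      std::size_t num_threads) {
  assert(num_threads > 0);
  // Slice size, rounded up so that all slices together cover every byte.
  const std::size_t chunk = (bytes + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < num_threads; ++t) {
    const std::size_t begin = t * chunk;
    if (begin >= bytes) break;  // nothing left for this thread
    const std::size_t len = std::min(chunk, bytes - begin);
    workers.emplace_back([=] {
      std::memcpy(static_cast<char*>(dst) + begin,
                  static_cast<const char*>(src) + begin, len);
    });
  }
  // Wait until every slice has been copied before returning.
  for (std::thread& w : workers) w.join();
}
```

The slices do not overlap, so the threads need no synchronization beyond the final join. Whether this is faster than a single memcpy depends on the memory bandwidth available to one core versus several, which is why the thread count is left as a parameter.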

time as cuda calls on the device.

Note that the maximum RTFx in online mode is necessarily equal to --num-parallel-streaming-channels.

Use a thread pool that sleeps when there is no data to retrieve. Sort data at the right point to improve cache performance. Remove the spin locks implemented with atomics. They cause slowdowns compared to condition variables, in particular because we cannot sleep accurately for 200 microseconds or less (a 200-microsecond sleep turns out to take 250 microseconds). These delays cause unnecessary slowdown.
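To make the condition-variable approach concrete, here is a minimal sketch of a thread pool whose workers block on a condition variable while the queue is empty, rather than spinning on an atomic flag with short sleeps. The class name `SleepingThreadPool` and its interface are hypothetical and are not taken from this pull request.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool; names are illustrative, not Kaldi's.
class SleepingThreadPool {
 public:
  explicit SleepingThreadPool(std::size_t num_threads) {
    assert(num_threads > 0);
    for (std::size_t i = 0; i < num_threads; ++i)
      workers_.emplace_back([this] { WorkerLoop(); });
  }

  // Finishes all queued tasks, then joins the workers.
  ~SleepingThreadPool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    cv_.notify_all();
    for (std::thread& w : workers_) w.join();
  }

  void Push(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();  // wake one sleeping worker
  }

 private:
  void WorkerLoop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        // Sleep until there is work or the pool is shutting down.
        cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // shutting down and queue drained
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // run outside the lock so other workers can proceed
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool stopping_ = false;
  std::vector<std::thread> workers_;  // declared last: starts after the rest
};
```

A worker in this design is woken by the notification itself, so its wake-up latency does not depend on how accurately the operating system can time a sub-millisecond sleep.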

This is to fix a CI error. It appears that this is from using "ubuntu-latest" in the CI workflow. It got upgraded to ubuntu 22.04 automatically, and this doesn't have python2.7 by default.

Commits on Dec 13, 2022
