More efficient batch inference resulting in large-v2 with *60-70x REAL TIME speed (now in custom v3 branch, see comment for details) #159
For ease of use, we decided to just import openai's whisper implementation for the transcription stage, which doesn't support batching. The one in the previous commit has some accuracy issues which I don't have time to debug right now. The 70x real time described in the paper was achieved with a custom implementation of whisper with batching that I won't be open-sourcing for the time being. Note that others have had success using faster-whisper as a drop-in replacement for whisper in this repo. There are quite a lot of different use-cases and trade-offs which are a bit hard to support entirely in this repo (faster-whisper, real-time transcription, low GPU memory requirements, etc.). For large-scale / business use-cases I will be providing an API soon (~1/3 of the price of openai's API), and I am also available to consult.
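For anyone wondering what the drop-in swap looks like in practice, here is a minimal sketch using stock faster-whisper for the transcription stage (the model name and audio path are placeholders):

```python
from faster_whisper import WhisperModel

# Load the CTranslate2-converted Whisper model (FP16 on GPU).
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# Transcribe a single file; segments are generated lazily.
segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```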
David doesn't mention it, but this text is a change from what your readme said earlier. You had previously announced that this code would be open-sourced and would be coming to the repo soon. Extremely disappointing that others now need to duplicate this effort.
@mezaros although it may be a disappointment to you, this repo is intended for research purposes, and all the algorithms and pipelines in the paper have been open-sourced. But thank you for the feedback.
I'm looking forward to when you feel comfortable open-sourcing the batch processing. I rely on whisperX for transcribing YouTube videos for (better) captions and past broadcasts on livestreaming platforms, and later translating them. The speed-up would be nice for the past broadcasts because they can span hours in length, so for me transcription takes almost as long as the videos themselves. Also, did you end up publishing an updated or final version of the paper? I'm not seeing where the number for up to a 70x speed-up is coming from in 2303.00747.
Thanks, have you tried the faster-whisper drop-in mentioned above? This should give you a ~4-5x speed-up.
The number in the table was normalized over openai's large-v2 inference speed -- which was already running at 6x real time on our V100 GPU with the VAD filter (so roughly 12x that, i.e. ~70x real time, with ours).
Not as of yet. I figured I'd wait for it to eventually be merged upstream from the existing PR, but I guess that won't be the case anymore given v2. I've actually been backlogging some videos because of this, but I suppose I'll finally give it a try now that it won't be implemented officially. Thanks again for all your work.
I see, I can look into adding faster-whisper as an optional import when I have some time (I just don't want to force it since it needs specific CUDA/cuDNN versions).
Update: I did some speed benchmarking on GPU. faster-whisper is good it seems, and pretty fast all things considered.
Model details:
- whisper_arch: large-v2
- beam_size: 5
Speed benchmark:
- File name: DanielKahneman_2010.wav
- File duration: 20min 37secs
- GPU: NVIDIA RTX 8000
- Batch size: 16 (for whisperX)
The API you'll provide later would be nice for those of us on personal computers that can't utilize batched whisperX (when/if open-sourced) due to GPU limitations. I would have expected a higher WER for faster-whisper, but the difference seems negligible. Just to confirm, when testing faster-whisper did you still use VAD? They added support using Silero VAD a few days ago IIRC.
Hi, is this FP16 precision for faster-whisper here? Thanks
FP16, without VAD.
Thanks. Looking forward to your batch inference in a future WhisperX. Actually, I am trying to combine it with pyannote diarization. The batch inference removed from WhisperX (due to the error-rate problem, I think) was about twice as fast as this FP16 faster-whisper in my tests.
How do I go about actually dropping in the drop-in replacement?
When will you be releasing the API service you mentioned, @m-bain? I'm really looking forward to it!
Will whisperX take any advantage of this? https://twitter.com/sanchitgandhi99/status/1649046650793648128?s=46&t=ApbND8sYhhD91NQ3JEdDbA Whisper JAX ⚡️ is a highly optimised Whisper implementation for both GPU and TPU.
I found Whisper JAX to use crazy amounts of GPU memory (48GB?), and it also led to worse transcription quality. Anyway, I am now open-sourcing WhisperX v3 (see the prerelease branch here), which includes **the 70x realtime batched inference** with <16GB GPU memory, using faster-whisper as a backend. The transcription quality is just as good as the original method. If you want to try it out, check out the v3 branch and let me know if you run into any issues (still testing). I am postponing building the API because it was taking up too much time from my PhD. I will return to it once I have improved the diarization -- a lot of work is needed on that front. @DavidFarago @RaulKite @dustinjoe @mrmachine @Infinitay @mezaros
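If the v3 branch exposes the same interface as the current README, batched inference should look roughly like this (a sketch; the model size, path, and batch size are placeholders):

```python
import whisperx

device = "cuda"

# Load the faster-whisper backend and the audio, then transcribe in batches
# over the VAD-cut segments.
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio("audio.wav")
result = model.transcribe(audio, batch_size=16)

for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```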
I'm both surprised and very thankful that you've decided to open-source your batch improvements. I look forward to using it, but I hope I won't exceed 10GB since I'm limited by my 3080. On the other hand, sorry to hear that you won't be able to monetize your work with an API due to your research. I hope it all works out well for you. Looking forward to the future of whisperX, as the short-term changes have been incredible so far.
I haven't benchmarked it, but you should be able to get memory requirements down below 10GB with any of the following (see the sketch below):
1. reduce the batch size, e.g. batch_size=4
2. use a smaller ASR model, e.g. base
3. use a lighter compute type, e.g. int8

2 & 3 might reduce transcription quality, but it's worth playing around to see.
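A minimal sketch of those knobs, assuming the current whisperx API (the exact values are just examples):

```python
import whisperx

device = "cuda"

# 1. smaller batch size, 2. smaller model, 3. int8 compute type.
# Options 2 and 3 trade some accuracy for lower memory use.
model = whisperx.load_model("base", device, compute_type="int8")
audio = whisperx.load_audio("audio.wav")
result = model.transcribe(audio, batch_size=4)
```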
Thanks for your kind words, I am glad it has helped you. I will try to figure out over the next few months how best to keep whisperX improving sustainably.
Really, thank you for your efforts on this great work! I had a trial of v3 and the batch inference is working properly. Yeah, I totally agree with your opinion on the difficulty of adding diarization to this efficiently. As I can see, making 30-second chunks can often mix different speakers' sentences, which makes them really difficult to differentiate later. So without batch inference, the GPU is not utilized efficiently when diarization is running. A little question: would multiprocessing help somehow as a temporary solution for combining ASR and diarization? Thanks
Hello @m-bain, it's great to know that you are trying to use faster-whisper for batch execution. It should work well overall, but there is currently one limitation regarding the prompt tokens. The implementation currently requires that each prompt has the same number of "previous text tokens" (or put differently, the "start of transcript" token must be at the same position for each item in the batch). I don't know if you have already faced this limitation or if you are able to effectively work around it. Let me know if there are other issues.
@guillaumekln thanks for faster-whisper! I was previously using a custom implementation, but yours really speeds up beam_size>1 and reduces GPU memory requirements 👌🏽 Yes, there are a few limitations / assumptions when doing batched transcription, but transcription quality remained high. These assumptions are: (i) transcribing without timestamps, and (ii) identical prompt tokens -- like you say, I find (ii) is not an issue since the text is not conditioned on the previous segment. Of course (i) can be quite limiting due to the need for timestamped transcripts, but in WhisperX timestamps are sourced from VAD & wav2vec2 alignment -- from my research findings, Whisper timestamps were just too unreliable.
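For reference, both assumptions can be expressed with stock faster-whisper options on a single file -- this only illustrates the flags, not the batched pipeline itself:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# (i) no Whisper-internal timestamps, (ii) no conditioning on previous text,
# so every chunk sees the same prompt tokens.
segments, _ = model.transcribe(
    "audio.wav",
    without_timestamps=True,
    condition_on_previous_text=False,
    beam_size=5,
)
for segment in segments:
    print(segment.text)
```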
Very surprised and happy to hear you're open-sourcing batch inference. I managed to get whisper.cpp to work with whisperX v2. It was not really a drop-in, but not too many changes had to be made either. Now that you've enabled batch inference, there is no need for any kind of PR for this; instead, this might be the correct thread to share it (if you think this doesn't fit in here, please give me a hint).
Optionally you can add padding by replacing
I am using the Python bindings by aarnphm here. When using finetuned models, the whole segment part could be broken; I'd advise using single-segment mode in this case.
Not sure if this is the right thread, but is it possible to reduce the pyannote diarization time too, by using logic similar to that of faster-whisper? I.e., using CTranslate2, reducing floating-point precision, some sort of batching, etc.? Currently diarization takes more time than the transcription itself. @guillaumekln @m-bain
The quicker fix would certainly be exporting the pyannote model to ONNX. Should speed it up too. |
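A rough sketch of what that export might look like, assuming the pyannote segmentation model can be traced directly with torch.onnx.export (the model name, auth token, chunk length, and tensor shapes here are assumptions, not tested):

```python
import torch
from pyannote.audio import Model

# Hypothetical export of pyannote's segmentation model to ONNX.
model = Model.from_pretrained("pyannote/segmentation", use_auth_token="HF_TOKEN")
model.eval()

# Assumed input: a mono 16 kHz chunk of fixed duration, shape (batch, channel, samples).
dummy_waveform = torch.randn(1, 1, 10 * 16000)
torch.onnx.export(
    model,
    dummy_waveform,
    "segmentation.onnx",
    input_names=["waveform"],
    output_names=["scores"],
    opset_version=13,
)
```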
Hi, thanks. Any pointers to a minimal amount of code required to wrap faster-whisper to add support for this? Also, is this batching VAD segments of a given audio file and disabling
@ozancaglayan The main branch does exactly this
@DigilConfianz yes, pyannote is pretty slow. It's a lot faster when constraining the diarization to sentence segments, and we found that effective for dialogue in movie scenes. See Appendix Section A (page 13) of https://www.robots.ox.ac.uk/~vgg/publications/2023/Han23/han23.pdf. Will add support for this diarization module at some point.
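For anyone curious what "constraining diarization to sentence segments" could look like in code, here is a simplified overlap-based sketch (not the pipeline from the paper):

```python
def assign_speakers(transcript_segments, diarization_turns):
    """Label each transcript segment with the speaker whose diarization turns
    overlap it the most.

    transcript_segments: list of dicts with "start" and "end" in seconds.
    diarization_turns: list of (start, end, speaker) tuples.
    """
    for seg in transcript_segments:
        overlaps = {}
        for start, end, speaker in diarization_turns:
            overlap = min(seg["end"], end) - max(seg["start"], start)
            if overlap > 0:
                overlaps[speaker] = overlaps.get(speaker, 0.0) + overlap
        # Pick the speaker with the largest total overlap, if any.
        seg["speaker"] = max(overlaps, key=overlaps.get) if overlaps else None
    return transcript_segments
```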
@m-bain is v3 still open-sourced? The link is giving a 404.
I'm not sure, but I think the link to the v3 branch above has been merged into the main branch. |
Would batching be able to support multiple audio files? Such as multiple user requests from Triton? |
I'm looking to transcribe multiple audio files at once with WhisperX - purely batch inference. Can anyone point me in the right direction? |
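Not an official answer, but one simple pattern is to load the model once and reuse it across files (a sketch assuming the README-style whisperx API; the file names are placeholders):

```python
import whisperx

device = "cuda"
model = whisperx.load_model("large-v2", device, compute_type="float16")

for path in ["meeting1.wav", "meeting2.wav", "meeting3.wav"]:
    # Each file is still batched internally over its VAD-cut segments.
    audio = whisperx.load_audio(path)
    result = model.transcribe(audio, batch_size=16)
    print(path, len(result["segments"]), "segments")
```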
cool, I will try |
hello, have you implemented this? I'm trying to do the same thing. |
The README.md says "more efficient batch inference resulting in large-v2 with *60-70x REAL TIME speed (not provided in this repo)". Will this eventually be integrated into this repo, too? That would be really awesome. If so, is there a rough time estimate for when it will be integrated?
Is this related to #57?