Whisperx_sub is a subtitle generation tool for video transcription and translation, built on WhisperX.

- Using the faster-whisper speech model, it generates English subtitles up to 60x faster than the original Whisper; a 30-minute video takes only 1-3 minutes to produce complete subtitles.
- The author of WhisperX addressed Whisper's hallucination and inaccurate word-timing issues by leveraging the wav2vec2 model, achieving second-level timestamp precision.
- By codifying English sentence-segmentation rules and using natural-language tools such as nltk, it automatically splits long English sentences without affecting translation. It handles about 70% of long sentences correctly, greatly reducing the work of later timing adjustments.
- Using a local Ollama large language model, it translates English subtitles with high accuracy, automatically generating the corresponding Chinese subtitles. Remote large language models are also supported and enable faster, more accurate batch translation, which is strongly recommended. In batch mode the translator retains context across sentences, producing more accurate results.
- Currently only English is supported as the source language for audio or video (Japanese support may be attempted in the future), while multiple target languages are supported; translation quality depends on your LLM.
| Source Language | Target Language |
|---|---|
| English | Chinese |
| | Japanese |
| | German |
| | French |
| | ... |
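The rule-based long-sentence splitting described above can be illustrated with a simplified, stdlib-only sketch. This is not the project's actual implementation (which uses nltk and richer rules); the regex, conjunction list, and length limit here are assumptions for illustration:

```python
import re

# Simplified illustration: split a long subtitle at clause boundaries,
# i.e. a comma followed by a coordinating or relative conjunction.
SPLIT_RE = re.compile(r",\s+(?=(?:and|but|so|which|because)\b)", re.IGNORECASE)

def split_long_line(line, max_chars=42):
    """Return the line unchanged if short enough, otherwise split at clause boundaries."""
    if len(line) <= max_chars:
        return [line]
    return [part.strip() for part in SPLIT_RE.split(line)]
```

Splitting at clause boundaries rather than at arbitrary character positions keeps each fragment translatable on its own, which is why this kind of segmentation does not hurt translation quality.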
The entire transcription step generally takes under two minutes; longer videos may take more time.
Translation time:
- Batch translation mode takes about one-fifth of the video's duration.
- Sentence-by-sentence mode takes about one-third of the video's duration. Actual performance depends on the model itself.
Results for long sentences that cannot be decomposed:
Video Reference:
What is RAG Retrieval Augmented Generation [What_is_Retrieval_Augmented_Generation_RAG]
An NVIDIA graphics card capable of running CUDA, with the same specific requirements as Whisper.
Cards with 2-10 GB of VRAM should work (cards with less than 2 GB have not been tested).
Whisper models come in five sizes; larger sizes offer higher transcription accuracy. Choose a model appropriate for your graphics card.
| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|---|---|---|---|---|---|
| tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
| base | 74 M | base.en | base | ~1 GB | ~16x |
| small | 244 M | small.en | small | ~2 GB | ~6x |
| medium | 769 M | medium.en | medium | ~5 GB | ~2x |
| large | 1550 M | N/A | large | ~10 GB | 1x |
The default model is the corresponding faster-whisper large-v2 model.
For graphics cards with less VRAM, the distilled models from Hugging Face are also recommended: they retain most of the accuracy of the large-v3 model while reducing VRAM usage from about 10 GB to around 5 GB.
Model | Params / M | Rel. Latency ↑ | Short-Form WER ↓ | Long-Form WER ↓ |
---|---|---|---|---|
large-v3 | 1550 | 1.0 | 8.4 | 11.0 |
large-v2 | 1550 | 1.0 | 9.1 | 11.7 |
distil-large-v3 | 756 | 6.3 | 9.7 | 10.8 |
distil-large-v2 | 756 | 5.8 | 10.1 | 11.6 |
distil-medium.en | 394 | 6.8 | 11.1 | 12.4 |
distil-small.en | 166 | 5.6 | 12.1 | 12.8 |
The project provides support for the Ollama model, with translation accuracy and speed depending on GPU and model performance.
Note
To ensure translation quality, your machine must be able to run a local model of at least 14B parameters.
Local models are recommended for sentence-by-sentence translation; batch translation requires models larger than 110B.
Author's recommendation: the Alibaba Qwen 1.5 chat series, specifically the quantized Qwen 1.5 32B model, which runs well on a 3090 and translates individual sentences accurately.
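As an illustration of sentence-by-sentence translation through a local model (this is not the project's own code), a single subtitle line can be sent to Ollama's `/api/chat` endpoint roughly like this; the system prompt wording is an assumption, while the endpoint and model name match the config example later in this README:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"        # default local Ollama endpoint
MODEL = "qwen:32b-chat-v1.5-q5_K_M"                   # example model from the config below

def build_payload(line, target_lang="Chinese"):
    """Build a non-streaming chat request asking the model to translate one subtitle line."""
    return {
        "model": MODEL,
        "stream": False,  # ask for a single JSON response instead of a token stream
        "messages": [
            {"role": "system",
             "content": f"Translate the following English subtitle line into {target_lang}. "
                        "Reply with the translation only."},
            {"role": "user", "content": line},
        ],
    }

def translate_line(line):
    """Send one line to the local Ollama server and return the model's reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(line)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With `"stream": False`, Ollama returns one JSON object whose `message.content` field holds the assistant's full reply, which is why no stream handling is needed here.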
Note
For users requiring high accuracy and batch translation, it is strongly recommended to use a large language model API, which significantly outperforms local models.
Warning
Batch translation requires the large language model to keep strict control over its output format; models that cannot do so should be used with caution. Batch translation is 3-5x faster than sentence-by-sentence translation and uses tokens more efficiently.
Author's recommendation: Use Alibaba's qwen plus model, which achieves a perfect balance in translation quality, API pricing, and translation speed.
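To make the format-control requirement concrete, here is a hypothetical sketch (the helper names and prompt wording are invented for illustration, not taken from the project) of how a batch prompt can be built and the model's numbered reply validated:

```python
import re

def build_batch_prompt(lines):
    """Number the subtitle lines so the model's reply can be matched back to them."""
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(lines))
    return ("Translate each numbered English subtitle line into Chinese. "
            "Reply with the same numbering, one translation per line, and nothing else.\n\n"
            + numbered)

def parse_batch_reply(reply, expected):
    """Map 'N. translation' reply lines back to indices; fail loudly if the format broke."""
    out = {}
    for line in reply.strip().splitlines():
        m = re.match(r"\s*(\d+)[.)]\s*(.+)", line)
        if m:
            out[int(m.group(1))] = m.group(2).strip()
    if set(out) != set(range(1, expected + 1)):
        raise ValueError("model did not keep the numbered format; "
                         "fall back to sentence-by-sentence translation")
    return [out[i] for i in range(1, expected + 1)]
```

A model that drops, merges, or renumbers lines makes the reply impossible to map back onto subtitle timestamps, which is exactly why weaker models should stick to sentence-by-sentence mode.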
Clone the source code with git:

```shell
git clone https://github.com/hiddenblue/whisperx_Sub.git
```

Enter the source directory:

```shell
cd whisperx_sub
```

Create a conda environment and install the dependencies from requirements.txt (you also need the NVIDIA driver for your GPU installed):

```shell
conda create -n whisperx_sub python=3.10
conda activate whisperx_sub
pip install -r requirements.txt
```
Before running, fill in the necessary information in config.py: the path to the target audio file, the task type, and the API details of the large language model used for translation.

```python
# task type
# task can be "transcribe" or "all"
task = "transcribe"

# transcribe parameters
# variables you need to configure
# path to the audio or video file
audio_file = "./openai_sample.mp3"
```

The most important setting is the path to the audio file to be processed.
There are two task types; the default is transcribe only:
- transcribe - only transcribes the audio file.
- all - transcribes first, then translates with the LLM; this requires the additional translation parameters.
On top of the transcribe parameters, you need to configure a large language model that can be called, also in config.py:
```python
# subtitle translation parameters
is_using_local_model = False  # whether to use a local large language model; default is False
base_url = "http://localhost:11434/api/chat"  # API endpoint of the large language model, local or remote
translation_model_name = "qwen:32b-chat-v1.5-q5_K_M"  # LLM model name
translation_prompt = ""  # prompt used for subtitle translation; can be left empty, default prompts are built in
srt_file_name = ""  # path to the srt file, required when using translation mode
```
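For a full transcribe-and-translate run with a local Ollama model, the same config.py fields might look like this (values are taken from the examples in this README; adjust them to your own setup):

```python
# Example config.py for a full run: transcribe, then translate locally via Ollama
task = "all"
audio_file = "./openai_sample.mp3"

is_using_local_model = True
base_url = "http://localhost:11434/api/chat"
translation_model_name = "qwen:32b-chat-v1.5-q5_K_M"
translation_prompt = ""   # empty: use the built-in default prompt
srt_file_name = ""
```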
Run whisperx_sub.py directly from a terminal or command line, or run it in an IDE:

```shell
python whisperx_sub.py
```
Once a stream of messages appears in the command line, the program has started.
Depending on the length of the video, wait 30 seconds to 3 minutes for the transcribed subtitle files.
Subtitle files without long-sentence segmentation are located in the output folder.
Subtitle files with long-sentence segmentation are in the output/cut directory.
Translated subtitle files are named after the audio file with a +CN-ZH.srt suffix and are also placed in the output directory.
```
├── output
│   ├── cut
│   │   ├── openai_sample.json
│   │   ├── openai_sample.srt
│   │   ├── openai_sample.tsv
│   │   ├── openai_sample.txt
│   │   └── openai_sample.vtt
│   ├── openai_sample.json
│   ├── openai_sample.srt
│   ├── openai_sample.tsv
│   ├── openai_sample.txt
│   └── openai_sample.vtt
```
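If you want to post-process the generated .srt files, a minimal stdlib parser is enough. This is an illustrative sketch, not part of the project; it assumes the standard SRT timestamp format `HH:MM:SS,mmm`:

```python
import re

# Standard SRT timing line: "00:00:01,000 --> 00:00:02,500"
TIME_RE = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})")

def parse_srt(text):
    """Split an SRT document into (start, end, text) cue tuples."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):   # cues are blank-line separated
        lines = block.splitlines()
        if len(lines) < 3:                             # index, timing, at least one text line
            continue
        m = TIME_RE.match(lines[1])
        if m:
            cues.append((m.group(1), m.group(2), " ".join(lines[2:])))
    return cues
```

The same loop works for both the output/ and output/cut/ variants, since segmentation only changes how many cues there are, not the file format.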
You may need to install the ffmpeg tool to support various audio and video formats. Installation methods can be found at: https://github.com/openai/whisper#setup
- Support more languages; currently only English is supported (the author only knows English)
- Improve long-sentence segmentation; the current manually derived rules resolve only about 70% of cases
- Add a graphical user interface (GUI) to lower the barrier to entry and make the tool friendlier for a wide audience
- Fix some issues in WhisperX itself and improve transcription quality
This project is licensed under the GPL-3.0 license - see the LICENSE file for details. 📄