- `2025-01-06`: Reduced package dependencies.
- `2023-07-26`: Released the real-time av-asr training code.
This repository is an open-source framework for speech recognition, with a primary focus on visual speech (lip-reading). It is designed for end-to-end training, aiming to deliver state-of-the-art models and enable reproducibility on audio-visual speech benchmarks.
With this repository, you can achieve a word error rate (WER) of 20.3% for visual speech recognition (VSR) and 1.0% for audio-only speech recognition (ASR) on LRS3. The repository also provides API and pipeline tutorials.
- Install PyTorch (`torch`, `torchvision`, `torchaudio`) and other necessary packages:

```
pip install torch torchvision torchaudio pytorch-lightning sentencepiece av
```
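To verify the installation, you can optionally run a quick import check; the module names below are exactly the packages installed above:

```
python -c "import torch, torchvision, torchaudio, pytorch_lightning, sentencepiece, av; print(torch.__version__)"
```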
- Prepare the dataset. Please refer to preparation.
```
python train.py --exp-dir=[exp_dir] \
                --exp-name=[exp_name] \
                --modality=[modality] \
                --root-dir=[root_dir] \
                --train-file=[train_file] \
                --num-nodes=[num_nodes]
```
Required arguments
- `exp-dir`: Directory to save checkpoints and logs to. (Default: `./exp`)
- `exp-name`: Experiment name. Checkpoints are saved in `[exp_dir]/[exp_name]`.
- `modality`: Type of input modality. Valid values: `video` and `audio`.
- `root-dir`: Root directory of the preprocessed dataset.
- `train-file`: Filename of the training label list.
- `num-nodes`: Number of machines used. (Default: `4`)
Optional arguments
- `group-name`: Group name of the task (wandb API).
- `val-file`: Filename of the validation label list. (Default: `lrs3_test_transcript_lengths_seg16s.csv`)
- `test-file`: Filename of the testing label list. (Default: `lrs3_test_transcript_lengths_seg16s.csv`)
- `gpus`: Number of GPUs per machine. (Default: `8`)
- `pretrained-model-path`: Path to the pre-trained model.
- `transfer-frontend`: Flag to load only the front-end; works with `pretrained-model-path`.
- `transfer-encoder`: Flag to load the encoder weights; works with `pretrained-model-path`.
- `lr`: Learning rate. (Default: `1e-3`)
- `warmup-epochs`: Number of warm-up epochs. (Default: `5`)
- `max-epochs`: Number of training epochs. (Default: `75`)
- `max-frames`: Maximal number of frames in a batch. (Default: `1600`)
- `weight-decay`: Weight decay. (Default: `0.05`)
- `ctc-weight`: Weight of the CTC loss. (Default: `0.1`)
- `train-num-buckets`: Bucket size for the training set. (Default: `400`)
- `ckpt-path`: Path of the checkpoint from which training is resumed.
- `slurm-job-id`: Slurm job id. (Default: `0`)
- `debug`: Flag to use debug-level logging.
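As a concrete example, a from-scratch VSR run on a single machine with the default 8 GPUs might look like the following. Every path and file name here is a placeholder chosen for illustration, not a value shipped with the repository:

```
python train.py --exp-dir=./exp \
                --exp-name=vsr_scratch \
                --modality=video \
                --root-dir=/path/to/preprocessed/lrs3 \
                --train-file=lrs3_train_transcript_lengths_seg16s.csv \
                --num-nodes=1
```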
Note
- For LRS3, you can fine-tune from a pre-trained LRW model at a learning rate of 0.001, or first train from scratch on a 23-hour subset (max duration = 4 seconds) at 0.0002 (this subset-trained model is provided in the model zoo) and then fine-tune on the full set at 0.001; a sketch of the fine-tuning stage follows this list. The script for subset creation is available here. For training on new datasets, please refer to instruction.
- You can customise logging in the Lightning `Trainer` for experiment tracking as needed.
- You can set `max-frames` to the largest value that fits into your GPU memory.
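A minimal sketch of that fine-tuning stage, assuming the 23h-subset checkpoint from the model zoo; the experiment name and paths are placeholders, and the flags are the ones documented above:

```
# Fine-tune on the full LRS3 set at lr=0.001, starting from the
# 23h-subset model that was trained from scratch at lr=0.0002.
python train.py --exp-dir=./exp \
                --exp-name=vsr_lrs3_ft \
                --modality=video \
                --root-dir=/path/to/preprocessed/lrs3 \
                --train-file=lrs3_train_transcript_lengths_seg16s.csv \
                --num-nodes=1 \
                --lr=0.001 \
                --pretrained-model-path=./exp/vsr_trlrs3_23h_base.pth
```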
```
python eval.py --modality=[modality] \
               --root-dir=[root_dir] \
               --test-file=[test_file] \
               --pretrained-model-path=[pretrained_model_path]
```
Required arguments
- `modality`: Type of input modality. Valid values: `video` and `audio`.
- `root-dir`: Root directory of the preprocessed dataset.
- `test-file`: Filename of the testing label list. (Default: `lrs3_test_transcript_lengths_seg16s.csv`)
- `pretrained-model-path`: Path to the pre-trained model, set to `[exp_dir]/[exp_name]/model_avg_10.pth`. (Default: `null`)
Optional arguments
- `decode-snr-target`: Level of signal-to-noise ratio (SNR). (Default: `999999`)
- `debug`: Flag to use debug-level logging.
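For example, evaluating the model trained above on the default LRS3 test list (paths are the same placeholders as before):

```
python eval.py --modality=video \
               --root-dir=/path/to/preprocessed/lrs3 \
               --test-file=lrs3_test_transcript_lengths_seg16s.csv \
               --pretrained-model-path=./exp/vsr_scratch/model_avg_10.pth
```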
LRS3
| Model | Training data (h) | WER [%] | Params (M) | MD5 |
|---|---|---|---|---|
| `vsr_trlrs3_23h_base.pth` | 438 | 93.0 | 250 | fc8db |
| `vsr_trlrs3_base.pth` | 438 | 36.0 | 250 | c00a7 |
| `vsr_trlrs3vox2_base.pth` | 1759 | 24.6 | 250 | 774a6 |
| `vsr_trlrs2lrs3vox2avsp_base.pth` | 3291 | 20.3 | 250 | 49f77 |
| `asr_trlrs3_base.pth` | 438 | 2.0 | 243 | 8af72 |
| `asr_trlrs3vox2_base.pth` | 1759 | 1.0 | 243 | f0c5c |
Some results are slightly better than those reported in the paper due to hyper-parameter optimisation. The av-asr code and checkpoint can be found in the released version.
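To confirm that a downloaded checkpoint matches the table, you can compare its MD5 digest against the listed prefix; the file name below stands in for whichever model you downloaded:

```
md5sum vsr_trlrs3vox2_base.pth   # the digest should start with 774a6
```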
We provide the following tutorials and will include more:
If you find this repository helpful, please consider citing our work:
```
@inproceedings{ma2023auto,
  author={Ma, Pingchuan and Haliassos, Alexandros and Fernandez-Lopez, Adriana and Chen, Honglie and Petridis, Stavros and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels},
  year={2023},
  pages={1-5},
  doi={10.1109/ICASSP49357.2023.10096889}
}
```
This repository is built using the torchaudio, espnet, raven and avhubert repositories.
Code is Apache 2.0 licensed. The pre-trained models provided in this repository may have their own licenses or terms and conditions derived from the dataset used for training.
Contributions are welcome; feel free to create a PR or email me:
[Pingchuan Ma](mapingchuan0420[at]gmail.com)