
Auto-AVSR: Lip-reading Sentences Project

Update

2025-01-06: Reduced package dependencies.

2023-07-26: Released real-time av-asr training code.

Introduction

This repository is an open-sourced framework for speech recognition, with a primary focus on visual speech (lip-reading). It is designed for end-to-end training, aiming to deliver state-of-the-art models and enable reproducibility on audio-visual speech benchmarks.

By using this repository, you can achieve a word error rate (WER) of 20.3% for visual speech recognition (VSR) and 1.0% for audio speech recognition (ASR) on LRS3. This repository also provides API and pipeline tutorials.

Setup

  1. Install PyTorch (torch, torchvision, torchaudio) and the other required packages:

     pip install torch torchvision torchaudio pytorch-lightning sentencepiece av

  2. Prepare the dataset. Please refer to preparation.
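Before preparing the dataset, a quick import check can confirm the environment is usable. This snippet is only a suggestion and is not part of the repository:

# Optional sanity check: confirm the installed packages import correctly
# and print the PyTorch and Lightning versions.
python -c "import torch, torchvision, torchaudio, pytorch_lightning, sentencepiece, av; print('torch', torch.__version__, 'lightning', pytorch_lightning.__version__)"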

Training

python train.py --exp-dir=[exp_dir] \
                --exp-name=[exp_name] \
                --modality=[modality] \
                --root-dir=[root_dir] \
                --train-file=[train_file] \
                --num-nodes=[num_nodes]
Required arguments
  • exp-dir: Directory in which to save checkpoints and logs, default: ./exp.
  • exp-name: Experiment name. Location of checkpoints is [exp_dir]/[exp_name].
  • modality: Type of input modality, valid values: video and audio.
  • root-dir: Root directory of preprocessed dataset.
  • train-file: Filename of training label list.
  • num-nodes: Number of machines used, default: 4.
Optional arguments
  • group-name: Group name of the task (wandb API).
  • val-file: Filename of validation label list, default: lrs3_test_transcript_lengths_seg16s.csv.
  • test-file: Filename of testing label list, default: lrs3_test_transcript_lengths_seg16s.csv.
  • gpus: Number of gpus in each machine, default: 8.
  • pretrained-model-path: Path to the pre-trained model.
  • transfer-frontend: Flag to load only the front-end; works with pretrained-model-path.
  • transfer-encoder: Flag to load the encoder weights; works with pretrained-model-path.
  • lr: Learning rate, default: 1e-3.
  • warmup-epochs: Number of epochs for warmup, default: 5.
  • max-epochs: Number of epochs, default: 75.
  • max-frames: Maximal number of frames in a batch, default: 1600.
  • weight-decay: Weight decay, default: 0.05.
  • ctc-weight: Weight of CTC loss, default: 0.1.
  • train-num-buckets: Bucket size for the training set, default: 400.
  • ckpt-path: Path of the checkpoint from which training is resumed.
  • slurm-job-id: Slurm job id, default: 0.
  • debug: Flag to use debug-level logging.
Note
  • For LRS3, you can either fine-tune from a pre-trained LRW model at a learning rate of 0.001, or first train from scratch on the 23-hour subset (max duration 4 s) at a learning rate of 0.0002 (a model trained on this subset is provided in the model zoo) and then fine-tune on the full set at 0.001. The script for creating the subset is available here. For training on new datasets, please refer to the instructions.
  • You can customise logging in the Lightning Trainer for experiment tracking as needed.
  • You can set max-frames to the largest value that fits into your GPU memory.
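As a concrete illustration of the template above, a single-machine video-modality run could look like the following. The directory paths are placeholders for your own preprocessed data, and the training list filename is only assumed to mirror the naming of the val/test defaults:

# Hypothetical single-node VSR training run; replace the paths and the
# label-list filename with the ones produced by your data preparation step.
python train.py --exp-dir=./exp \
                --exp-name=vsr_lrs3_base \
                --modality=video \
                --root-dir=/path/to/preprocessed/lrs3 \
                --train-file=lrs3_train_transcript_lengths_seg16s.csv \
                --num-nodes=1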

Testing

python eval.py --modality=[modality] \
               --root-dir=[root_dir] \
               --test-file=[test_file] \
               --pretrained-model-path=[pretrained_model_path]
Required arguments
  • modality: Type of input modality, valid values: video and audio.
  • root-dir: Root directory of preprocessed dataset.
  • test-file: Filename of testing label list, default: lrs3_test_transcript_lengths_seg16s.csv.
  • pretrained-model-path: Path to the pre-trained model, e.g. [exp_dir]/[exp_name]/model_avg_10.pth for a model trained with this repository, default: null.
Optional arguments
  • decode-snr-target: Level of signal-to-noise ratio (SNR), default: 999999.
  • debug: Flag to use debug-level logging.
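For example, evaluating the LRS3+VoxCeleb2 VSR checkpoint from the model zoo below on the default LRS3 test list might look like this; the dataset and checkpoint paths are placeholders:

# Hypothetical evaluation run with a model-zoo checkpoint; replace the paths
# with your preprocessed dataset root and the downloaded checkpoint location.
python eval.py --modality=video \
               --root-dir=/path/to/preprocessed/lrs3 \
               --test-file=lrs3_test_transcript_lengths_seg16s.csv \
               --pretrained-model-path=/path/to/vsr_trlrs3vox2_base.pth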

Model zoo

LRS3

Model                            Training data (h)  WER [%]  Params (M)  MD5
vsr_trlrs3_23h_base.pth                        438     93.0         250  fc8db
vsr_trlrs3_base.pth                            438     36.0         250  c00a7
vsr_trlrs3vox2_base.pth                       1759     24.6         250  774a6
vsr_trlrs2lrs3vox2avsp_base.pth               3291     20.3         250  49f77
asr_trlrs3_base.pth                            438      2.0         243  8af72
asr_trlrs3vox2_base.pth                       1759      1.0         243  f0c5c

Some results are slightly better than those reported in the paper owing to hyper-parameter optimisation. The av-asr code and checkpoint can be found in the released version.
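If you want to check that a download is intact, the MD5 prefix from the table can be compared against the local file; the filename below is just an example of a downloaded checkpoint:

# Optional integrity check: the printed digest should start with the MD5
# prefix listed in the model zoo table (774a6 for this checkpoint).
md5sum vsr_trlrs3vox2_base.pth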

Tutorials

We provide API and pipeline tutorials and will include more.

Citation

If you find this repository helpful, please consider citing our work:

@inproceedings{ma2023auto,
  author={Ma, Pingchuan and Haliassos, Alexandros and Fernandez-Lopez, Adriana and Chen, Honglie and Petridis, Stavros and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels},
  year={2023},
  pages={1-5},
  doi={10.1109/ICASSP49357.2023.10096889}
}

Acknowledgement

This repository is built using the torchaudio, espnet, raven and avhubert repositories.

License

Code is Apache 2.0 licensed. The pre-trained models provided in this repository may have their own licenses or terms and conditions derived from the dataset used for training.

Contact

Contributions are welcome; feel free to create a PR or email me:

Pingchuan Ma (mapingchuan0420[at]gmail.com)