MuST-C-clean

This is the repository for the paper "On the Impact of Noises in Crowd-Sourced Data for Speech Translation", published at IWSLT 2022.

The detector is adapted from the code in the torchaudio forced alignment tutorial: https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html#sphx-glr-intermediate-forced-alignment-with-torchaudio-tutorial-py.

Prepare Environment

conda create python=3.8 -n must-c-clean
conda activate must-c-clean

conda install -y pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
conda install -y tqdm pandas
conda install -y spacy -c conda-forge
python -m spacy download en_core_web_trf

pip install editdistance num2words pyyaml
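
Before running the detector, it can help to confirm that the packages installed above are visible to the active environment. The following is a minimal sketch, not part of this repo: the helper name find_missing and the module list are illustrative, and the list assumes the usual import names (for example, pyyaml is imported as yaml).

```python
import importlib.util

# Import names for the packages installed above.
# Assumption: pyyaml is imported as "yaml"; the spaCy model is importable by name.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio", "tqdm", "pandas",
    "spacy", "en_core_web_trf", "editdistance", "num2words", "yaml",
]


def find_missing(modules):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

Run it inside the activated must-c-clean environment; anything listed as missing needs to be reinstalled with the matching command above.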

Run Detection

You can run the detection as follows:

python detect.py \
    --device {cpu/cuda} \
    --mustc-root {your must-c root directory} \
    --tgt-lang {de/other languages} \
    --split {train/dev/tst-COMMON/tst-HE}
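
To process every split in one go, the command above can be wrapped in a small driver. This is a sketch only: build_command and run_all are hypothetical helpers that are not part of this repo, and they assume detect.py is in the current working directory and accepts exactly the four flags shown above.

```python
import subprocess
import sys

# The split names listed in the command above.
SPLITS = ["train", "dev", "tst-COMMON", "tst-HE"]


def build_command(mustc_root, tgt_lang, split, device="cpu"):
    """Build the detect.py command line for one split as an argument list."""
    return [
        sys.executable, "detect.py",
        "--device", device,
        "--mustc-root", mustc_root,
        "--tgt-lang", tgt_lang,
        "--split", split,
    ]


def run_all(mustc_root, tgt_lang, device="cpu", dry_run=True):
    """Run (or, with dry_run=True, only print) the command for each split."""
    for split in SPLITS:
        cmd = build_command(mustc_root, tgt_lang, split, device)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first split whose run fails.
            subprocess.run(cmd, check=True)
```

For example, run_all("/path/to/must-c", "de", device="cuda", dry_run=False) would run the four splits in sequence; the path here is a placeholder.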

The results are saved in results/{split}. The TSV file mismatch.tsv describes each detected audio-transcript mismatch, and the HTML file mismatch.html lets you listen to the speech and compare it with the given transcript.
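
The TSV output can also be inspected programmatically. The sketch below makes no assumption about the column layout of mismatch.tsv, which this README does not document; it simply reads the rows as lists of strings. The helper name load_mismatches is hypothetical, and only the results/{split}/mismatch.tsv path comes from the description above.

```python
import csv
from pathlib import Path


def load_mismatches(split, results_dir="results"):
    """Read results/{split}/mismatch.tsv and return its rows as lists of strings."""
    path = Path(results_dir) / split / "mismatch.tsv"
    with path.open(newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
        return list(csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE))


# Example usage, after detect.py has been run on the dev split:
# rows = load_mismatches("dev")
# print(len(rows), "rows in mismatch.tsv")
```

Whether the first row is a header depends on how detect.py writes the file, so check it before counting rows as mismatch cases.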
