We open-sourced our two simulated datasets, VCTK-TTS and VCTK-Stutter. The download links are as follows:
| Dataset | URL |
| --- | --- |
| VCTK-TTS | link |
| VCTK-Stutter | link |
${DATASET}
├── disfluent_audio/ # simulated audio (.wav)
├── disfluent_labels/ # simulated labels (.json)
└── gt_text/ # ground truth text (.txt)
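For orientation, here is a minimal loading sketch. It assumes that audio, label, and transcript files share the same file stem and that labels are plain JSON; adjust the pairing and parsing to the actual files.

```python
import json
from pathlib import Path

import soundfile as sf  # third-party: pip install soundfile

DATASET = Path("VCTK-TTS")  # or Path("VCTK-Stutter")

# Pair each simulated clip with its label and ground-truth text by file stem
# (an assumed naming convention; adjust if the stems differ).
for wav_path in sorted((DATASET / "disfluent_audio").glob("*.wav")):
    label_path = DATASET / "disfluent_labels" / f"{wav_path.stem}.json"
    text_path = DATASET / "gt_text" / f"{wav_path.stem}.txt"

    audio, sample_rate = sf.read(str(wav_path))
    labels = json.loads(label_path.read_text())
    gt_text = text_path.read_text().strip()

    print(wav_path.name, sample_rate, audio.shape, gt_text, labels)
```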
Please refer to environment.yml for the required dependencies.
If you have Miniconda/Anaconda installed, you can directly use the command: conda env create -f environment.yml
We open-sourced our inference code and checkpoints. Here are the steps to perform inference:
- Clone this repository.
- Download the VITS pretrained model; here we use pretrained_ljs.pth.
- Download the Yolo-Stutter checkpoints, create a folder named saved_models under yolo-stutter, and put all downloaded models into it.
- We also provide testing datasets for quick inference; you can download them here.
- Build Monotonic Alignment Search (a sketch of what this step compiles is given after this list):
  cd yolo-stutter/monotonic_align
  python setup.py build_ext --inplace
- Run yolo-stutter/etc/inference.ipynb to perform inference step by step.
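The monotonic_align step compiles a Cython implementation of monotonic alignment search used by VITS. As a rough, slow illustration of what it computes (not the repository's code), the NumPy sketch below finds, for a [tokens x frames] score matrix, the highest-scoring alignment in which every frame maps to one token and the token index never decreases over time.

```python
import numpy as np

def maximum_path_numpy(value: np.ndarray) -> np.ndarray:
    """Monotonic alignment search over a [num_tokens, num_frames] score matrix.

    Assumes num_tokens <= num_frames. Returns a 0/1 matrix in which every frame
    is assigned to exactly one token and the token index never decreases.
    """
    t_x, t_y = value.shape
    neg_inf = -1e9
    dp = np.full((t_x, t_y), neg_inf)
    dp[0, 0] = value[0, 0]

    # Forward pass: the best path ending at (token x, frame y) either stays on
    # the same token or advances by exactly one token between frames.
    for y in range(1, t_y):
        for x in range(min(t_x, y + 1)):
            stay = dp[x, y - 1]
            step = dp[x - 1, y - 1] if x > 0 else neg_inf
            best = max(stay, step)
            if best > neg_inf / 2:  # only extend reachable cells
                dp[x, y] = value[x, y] + best

    # Backtrack from the last token at the last frame.
    path = np.zeros((t_x, t_y), dtype=np.int64)
    x = t_x - 1
    for y in range(t_y - 1, -1, -1):
        path[x, y] = 1
        if y > 0 and x > 0 and dp[x - 1, y - 1] >= dp[x, y - 1]:
            x -= 1
    return path

# Tiny usage example: align 3 tokens to 5 frames.
scores = np.random.randn(3, 5)
print(maximum_path_numpy(scores))
```

The compiled Cython extension implements the same dynamic program batched and much faster; the sketch is only meant to demystify the build step.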
We use VITS as our TTS model.
- Clone this repository.
- Download the VITS pretrained models; here we need pretrained_vctk.pth for multi-speaker synthesis. Create a folder dysfluency_simulation/path/to and put the downloaded model into it.
- Build Monotonic Alignment Search:
  cd dysfluency_simulation/monotonic_align
  python setup.py build_ext --inplace
- Generate simulated speech (see the illustrative sketch after this list):
  # Phoneme level
  python generate_phn.py
  # Word level
  python generate_word.py
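These scripts synthesize dysfluent speech by editing the input sequence before passing it to the TTS model. As a loose illustration of the idea only (the function, label name, and phoneme set below are made up, not the repository's API), this sketch injects a phoneme-level repetition and records the token region a detector should flag:

```python
import random
from typing import List, Tuple

def inject_repetition(phonemes: List[str], n_repeats: int = 2,
                      seed: int = 0) -> Tuple[List[str], Tuple[int, int], str]:
    """Repeat one randomly chosen phoneme to simulate a sound repetition.

    Returns the edited sequence, the (start, end) token indices of the
    dysfluent region, and a label string. Purely illustrative.
    """
    rng = random.Random(seed)
    i = rng.randrange(len(phonemes))
    edited = phonemes[:i] + [phonemes[i]] * n_repeats + phonemes[i:]
    return edited, (i, i + n_repeats), "rep"

phonemes = ["P", "L", "IY1", "Z", "K", "AO1", "L"]  # "please call"
edited, region, label = inject_repetition(phonemes)
print(edited, region, label)
```

In the actual pipeline, the edited sequence would be synthesized with VITS and the token region mapped to time stamps through the alignment, which is how the region-wise labels in disfluent_labels/ are produced.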
If you find our paper helpful, please cite it as follows:
@inproceedings{zhou24e_interspeech,
title = {YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection},
author = {Xuanru Zhou and Anshul Kashyap and Steve Li and Ayati Sharma and Brittany Morin and David Baquirin and Jet Vonk and Zoe Ezzes and Zachary Miller and Maria Tempini and Jiachen Lian and Gopala Anumanchipalli},
year = {2024},
booktitle = {Interspeech 2024},
pages = {937--941},
doi = {10.21437/Interspeech.2024-1855},
}