End-to-End Simultaneous Speech Translation with Differentiable Segmentation

Shaolei Zhang, Yang Feng

Source code for our ACL 2023 paper "End-to-End Simultaneous Speech Translation with Differentiable Segmentation". Differentiable Segmentation (DiSeg) is a technique that adaptively segments speech into word-level segments. DiSeg learns to segment from the underlying model in an unsupervised manner.

Overview

Installation
Quick Start
Results
Citation

Installation

DiSeg is built on the open-source toolkit Fairseq. To install DiSeg:

git clone https://github.com/ictnlp/DiSeg.git
cd DiSeg
pip install --editable ./

Quick Start

Data Pre-processing

We use the MuST-C dataset, which is built from English TED talks. Download MUSTC_v1.0_en-${LANG}.tar.gz to ${MUSTC_ROOT}, then preprocess it with shell_scripts/prep.sh:

bash shell_scripts/prep.sh

Finally, the directory ${MUSTC_ROOT} should look like:

.
├── en-de/
│   ├── config_raw.yaml
│   ├── spm_unigram10000_raw.model
│   ├── spm_unigram10000_raw.txt
│   ├── spm_unigram10000_raw.vocab
│   ├── dev_raw_st.tsv
│   ├── tst-COMMON_raw_st.tsv
│   ├── train_raw.tsv
│   ├── tst-COMMON_raw.tsv
│   ├── tst-HE_raw.tsv
│   ├── docs/
│   ├── data/
├── en-de-text/
│   ├── train.spm.en
│   ├── train.spm.de
│   ├── dev.spm.en
│   ├── dev.spm.de
│   ├── tst-COMMON.spm.en
│   ├── tst-COMMON.spm.de
├── data-bin/
│   ├── mustc_en_de_text/
│   │   ├── dict.en.txt
│   │   ├── dict.de.txt
│   │   ├── preprocess.log
│   │   ├── ***.bin
│   │   ├── ***.idx
├── en-de-simuleval/
│   ├── tst-COMMON/
│   │   ├── tst-COMMON.de
│   │   ├── tst-COMMON.wav_list
│   │   ├── ted_****_**.wav
│   │   ├── ...
│   ├── dev/
│   │   ├── dev.de
│   │   ├── dev.wav_list
│   │   ├── ted_****_**.wav
│   │   ├── ...
└── MUSTC_v1.0_en-de.tar.gz
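
Before training, it can help to confirm that preprocessing produced the files listed above. The following is a minimal sketch, not part of the repository: the helper name missing_files and the subset of paths it checks are assumptions chosen for illustration, taken from the layout shown here.

```python
from pathlib import Path

# A subset of the files from the layout above; extend it as needed.
EXPECTED = [
    "en-{lang}/config_raw.yaml",
    "en-{lang}/spm_unigram10000_raw.model",
    "en-{lang}/train_raw.tsv",
    "en-{lang}/tst-COMMON_raw.tsv",
    "en-{lang}-text/train.spm.en",
    "en-{lang}-text/train.spm.{lang}",
    "data-bin/mustc_en_{lang}_text/dict.en.txt",
    "data-bin/mustc_en_{lang}_text/dict.{lang}.txt",
    "en-{lang}-simuleval/tst-COMMON/tst-COMMON.{lang}",
    "en-{lang}-simuleval/tst-COMMON/tst-COMMON.wav_list",
]


def missing_files(mustc_root, lang="de"):
    """Return the expected relative paths that do not exist under mustc_root."""
    root = Path(mustc_root)
    relative_paths = [p.format(lang=lang) for p in EXPECTED]
    return [p for p in relative_paths if not (root / p).exists()]
```

Calling missing_files("path_to_mustc_data") returns an empty list when every listed file is present.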

The config file config_raw.yaml should look like this:

bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: ABS_PATH_TO_SENTENCEPIECE_MODEL
input_channels: 1
prepend_tgt_lang_tag: true
use_audio_input: true
vocab_filename: spm_unigram10000_raw.txt
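
The sentencepiece_model entry must be an absolute path. If the config needs to be regenerated by hand, a small sketch like the one below fills that path in. The function name render_config and the assumption that the model file sits in the same directory as the config are illustrative only.

```python
from pathlib import Path

# Same keys and values as the config shown above; only the model path is filled in.
CONFIG_TEMPLATE = """\
bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: {spm_model}
input_channels: 1
prepend_tgt_lang_tag: true
use_audio_input: true
vocab_filename: spm_unigram10000_raw.txt
"""


def render_config(data_dir):
    """Return the config text with sentencepiece_model as an absolute path.

    Assumes spm_unigram10000_raw.model lives directly inside data_dir.
    """
    spm_model = Path(data_dir).resolve() / "spm_unigram10000_raw.model"
    return CONFIG_TEMPLATE.format(spm_model=spm_model)
```

For example, Path("en-de/config_raw.yaml").write_text(render_config("en-de")) would write the file for the en-de pair.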

The training manifest train_raw.tsv is tab-separated and should look like this:

id      audio   n_frames        src_text        tgt_text        speaker src_lang        tgt_lang
ted_1_0 /data/zhangshaolei/datasets/MuSTC_new/en-de/data/train/wav/ted_1.wav:98720:460800       460800  And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful. I have been blown away by this conference, and I want to thank all of you for the many nice comments about what I had to say the other night.   Vielen Dank, Chris. Es ist mir wirklich eine Ehre, zweimal auf dieser Bühne stehen zu dürfen. Tausend Dank dafür. Ich bin wirklich begeistert von dieser Konferenz, und ich danke Ihnen allen für die vielen netten Kommentare zu meiner Rede vorgestern Abend.   spk.1   en      de
ted_1_1 /data/zhangshaolei/datasets/MuSTC_new/en-de/data/train/wav/ted_1.wav:560160:219040      219040  And I say that sincerely, partly because (Mock sob) I need that. (Laughter)     Das meine ich ernst, teilweise deshalb — weil ich es wirklich brauchen kann! (Lachen) Versetzen Sie sich mal in meine Lage! (Lachen) (Applaus) Ich bin bin acht Jahre lang mit der Air Force Two geflogen.        spk.1   en      de
ted_1_2 /data/zhangshaolei/datasets/MuSTC_new/en-de/data/train/wav/ted_1.wav:779200:367200      367200  Now I have to take off my shoes or boots to get on an airplane! (Laughter) (Applause)   Jetzt muss ich meine Schuhe ausziehen, um überhaupt an Bord zu kommen! (Applaus)  spk.1   en      de
ted_1_3 /data/zhangshaolei/datasets/MuSTC_new/en-de/data/train/wav/ted_1.wav:1161600:65920      65920   I'll tell you one quick story to illustrate what that's been like for me.       Ich erzähle Ihnen mal eine Geschichte, dann verstehen Sie mich vielleicht besser. spk.1   en      de
ted_1_4 /data/zhangshaolei/datasets/MuSTC_new/en-de/data/train/wav/ted_1.wav:1235520:128320     128320  It's a true story — every bit of this is true. Soon after Tipper and I left the — (Mock sob) White House —      Eine wahre Geschichte — kein Wort daran ist erfunden.     spk.1   en      de
......
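
In the rows above, the audio column has the form path:offset:n_frames, and its last component matches the n_frames column. The sketch below parses a manifest and checks that consistency. The function names are hypothetical, and the 16 kHz sample rate used to convert frames to seconds is an assumption about the audio, not something the manifest states.

```python
import csv
import io

SAMPLE_RATE = 16000  # assumption: audio is stored at 16 kHz


def parse_audio_field(audio):
    """Split 'path:offset:n_frames' into (path, offset, n_frames)."""
    path, offset, n_frames = audio.rsplit(":", 2)
    return path, int(offset), int(n_frames)


def check_manifest(tsv_text):
    """Yield (id, duration_in_seconds) for each row of a manifest.

    Raises ValueError if the audio field and the n_frames column disagree.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        _, _, n_frames = parse_audio_field(row["audio"])
        if n_frames != int(row["n_frames"]):
            raise ValueError(f"n_frames mismatch in row {row['id']}")
        yield row["id"], n_frames / SAMPLE_RATE
```

Under the 16 kHz assumption, the first row above (460800 frames) corresponds to 28.8 seconds of audio.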

Training

0. (optional) Pre-training on MT Data

Pre-training on MT data can speed up the convergence of DiSeg. MT pre-training is optional; you can skip to the next step and train DiSeg directly.

To pre-train on MuST-C MT data, follow shell_scripts/pretrain.sh:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

MUSTC_ROOT=path_to_mustc_data
LANG=de

PRETRAIN_DIR=path_to_save_pretrained_checkpoints
W2V_MODEL=path_to_wav2vec_model


python train.py ${MUSTC_ROOT}/en-${LANG}  --text-data ${MUSTC_ROOT}/data-bin/mustc_en_${LANG}_text --tgt-lang ${LANG} --ddp-backend=legacy_ddp \
  --config-yaml config_raw.yaml \
  --train-subset train \
  --valid-subset dev \
  --save-dir ${PRETRAIN_DIR} \
  --max-tokens 2000000  --max-tokens-text 8192 \
  --update-freq 1 \
  --task speech_to_text_multitask \
  --criterion speech_to_text_multitask \
  --label-smoothing 0.1 \
  --arch convtransformer_espnet_base_wav2vec \
  --w2v2-model-path ${W2V_MODEL} \
  --optimizer adam \
  --lr 2e-3 \
  --lr-scheduler inverse_sqrt \
  --warmup-updates 8000 \
  --clip-norm 10.0 \
  --seed 1 \
  --ext-mt-training \
  --eval-task ext_mt \
  --eval-bleu \
  --eval-bleu-args '{"beam": 1,"prefix_size":1}' \
  --eval-bleu-print-samples \
  --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
  --keep-best-checkpoints 10 \
  --save-interval-updates 1000 \
  --keep-interval-updates 15 \
  --max-source-positions 800000 \
  --skip-invalid-size-inputs-valid-test \
  --dropout 0.1 --activation-dropout 0.1 --attention-dropout 0.1 --layernorm-embedding \
  --empty-cache-freq 1000 \
  --ignore-prefix-size 1 \
  --patience 10 \
  --fp16

Average the 10 best checkpoints:

python scripts/average_checkpoints.py \
    --inputs ${PRETRAIN_DIR} \
    --num-update-checkpoints 10 \
    --output ${PRETRAIN_DIR}/mt_pretrain_model.pt \
    --best True
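
Conceptually, checkpoint averaging takes the element-wise mean of each parameter across the selected checkpoints. The sketch below illustrates the idea with plain Python lists standing in for tensors. It is not the code in scripts/average_checkpoints.py, and the function name is hypothetical.

```python
def average_state_dicts(state_dicts):
    """Element-wise mean of parameters across checkpoints.

    Each state dict maps a parameter name to a flat list of floats
    (a stand-in for a tensor). All dicts must share the same keys and shapes.
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    count = len(state_dicts)
    averaged = {}
    for key in state_dicts[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(sd[key] for sd in state_dicts))
        averaged[key] = [sum(column) / count for column in columns]
    return averaged
```

For instance, averaging {"w": [1.0, 3.0]} and {"w": [3.0, 5.0]} gives {"w": [2.0, 4.0]}.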

1. Training DiSeg

Download the pre-trained Wav2Vec2.0 model to ${W2V_MODEL}, then train DiSeg with shell_scripts/train.sh. The main options are:

Multi-task learning: --st-training, --mt-training, --asr-training
Segment speech inputs: --seg-speech
Apply token-level contrastive learning: --add-speech-seg-text-ctr

Note: we find that first training an offline ST model (without --seg-speech) and then fine-tuning it with --seg-speech into a DiSeg model achieves better results.

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

MUSTC_ROOT=path_to_mustc_data
LANG=de

SAVE_DIR=path_to_save_checkpoints
W2V_MODEL=path_to_wav2vec_model

mean=0
var=3

# (optional) pre-train an MT encoder/decoder and load it with --load-pretrained-mt-encoder-decoder-from ${PRETRAIN_DIR}/mt_pretrain_model.pt
python train.py ${MUSTC_ROOT}/en-${LANG}  --tgt-lang ${LANG} --ddp-backend=legacy_ddp \
  --config-yaml config_raw.yaml \
  --train-subset train_raw \
  --valid-subset dev_raw \
  --save-dir ${SAVE_DIR} \
  --max-tokens 1500000  --batch-size 32 --max-tokens-text 4096 \
  --update-freq 1 \
  --num-workers 8 \
  --task speech_to_text_multitask \
  --criterion speech_to_text_multitask_with_seg \
  --report-accuracy \
  --arch convtransformer_espnet_base_wav2vec_seg \
  --w2v2-model-path ${W2V_MODEL} \
  --optimizer adam \
  --lr 0.0001 \
  --lr-scheduler inverse_sqrt \
  --weight-decay 0.0001 \
  --label-smoothing 0.1 \
  --warmup-updates 4000 \
  --clip-norm 10.0 \
  --seed 1 \
  --seg-encoder-layers 6 \
  --noise-mean ${mean} --noise-var ${var} \
  --st-training --mt-training --asr-training \
  --seg-speech --add-speech-seg-text-ctr \
  --eval-task st \
  --eval-bleu \
  --eval-bleu-args '{"beam": 1,"prefix_size":1}' \
  --eval-bleu-print-samples \
  --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
  --keep-best-checkpoints 20 \
  --save-interval-updates 1000 \
  --keep-interval-updates 30 \
  --max-source-positions 800000 \
  --skip-invalid-size-inputs-valid-test \
  --dropout 0.1 --activation-dropout 0.1 --attention-dropout 0.1 --layernorm-embedding \
  --empty-cache-freq 1000 \
  --ignore-prefix-size 1 \
  --fp16
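
The command above combines --lr 0.0001, --lr-scheduler inverse_sqrt and --warmup-updates 4000. As a rough guide to what that implies, the sketch below computes the learning rate at a given update: a linear warm-up to the peak value, followed by decay proportional to the inverse square root of the update number. The function name is hypothetical, and the warm-up start of 0 is an assumption based on Fairseq's default behaviour when no initial warm-up rate is given.

```python
def inverse_sqrt_lr(step, peak_lr=1e-4, warmup_updates=4000, warmup_init_lr=0.0):
    """Learning rate at update `step` under an inverse-square-root schedule."""
    if step < warmup_updates:
        # Linear warm-up from warmup_init_lr to peak_lr.
        slope = (peak_lr - warmup_init_lr) / warmup_updates
        return warmup_init_lr + step * slope
    # After warm-up, decay as 1 / sqrt(step), anchored at peak_lr.
    return peak_lr * (warmup_updates / step) ** 0.5
```

With these values, the rate reaches 1e-4 at update 4000 and has halved to about 5e-5 by update 16000.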

Inference

1. Offline Speech Translation with DiSeg

Perform offline speech translation with shell_scripts/test.offline.sh:

export CUDA_VISIBLE_DEVICES=0

MUSTC_ROOT=path_to_mustc_data
LANG=de

SAVE_DIR=path_to_save_checkpoints

python scripts/average_checkpoints.py \
    --inputs ${SAVE_DIR} \
    --num-update-checkpoints 5 \
    --output ${SAVE_DIR}/average-model.pt \
    --best True


python fairseq_cli/generate.py ${MUSTC_ROOT}/en-${LANG} --tgt-lang ${LANG} \
    --config-yaml config_raw.yaml \
    --gen-subset tst-COMMON_raw \
    --task speech_to_text_multitask \
    --path ${SAVE_DIR}/average-model.pt \
    --max-tokens 1000000 \
    --batch-size 250 \
    --beam 1 \
    --scoring sacrebleu \
    --prefix-size 1 \
    --max-source-positions 1000000 \
    --eval-task st

2. Simultaneous Speech Translation with DiSeg

Perform simultaneous speech translation with SimulEval, following shell_scripts/test.simuleval.sh.

Install SimulEval at commit 2db1a59. From a local checkout of the SimulEval repository, run:

cd SimulEval
pip install -e .

Simultaneous speech translation with agent diseg_agent.py:

export CUDA_VISIBLE_DEVICES=0

MUSTC_ROOT=path_to_mustc_data
LANG=de
EVAL_ROOT=path_to_save_simuleval_data
SAVE_DIR=path_to_save_checkpoints
OUTPUT_DIR=path_to_save_simuleval_results

lagging_seg=5 # lagging segment in DiSeg

simuleval --agent diseg_agent.py \
    --source ${EVAL_ROOT}/tst-COMMON/tst-COMMON.wav_list \
    --target ${EVAL_ROOT}/tst-COMMON/tst-COMMON.${LANG} \
    --data-bin ${MUSTC_ROOT}/en-${LANG} \
    --config config_raw.yaml \
    --model-path ${SAVE_DIR}/average-model.pt \
    --output ${OUTPUT_DIR} \
    --lagging-segment ${lagging_seg}  \
    --lang ${LANG} \
    --scores --gpu --fp16 \
    --port 12345
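
The --lagging-segment value controls how far translation trails the speech. The sketch below illustrates one reading of it, assuming a wait-k-style policy over detected segments: wait for k segments, then emit one target token for each newly detected segment. The function names are hypothetical, and diseg_agent.py remains the authoritative implementation.

```python
def segments_needed(target_index, lagging_segment, total_segments):
    """Segments that must be detected before emitting the target token
    at 0-based position `target_index`, capped at the end of the speech."""
    return min(lagging_segment + target_index, total_segments)


def schedule(num_target_tokens, lagging_segment, total_segments):
    """Segments required before each target token, in output order."""
    return [
        segments_needed(i, lagging_segment, total_segments)
        for i in range(num_target_tokens)
    ]
```

For example, schedule(6, 3, 5) returns [3, 4, 5, 5, 5, 5]: the first token waits for 3 segments, and once all 5 segments have arrived the remaining tokens are emitted without further waiting.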

3. Segment Speech with DiSeg

You can segment any speech with a trained DiSeg model, following shell_scripts/seg.sh:

export CUDA_VISIBLE_DEVICES=0

MUSTC_ROOT=path_to_mustc_data
LANG=de
SAVE_DIR=path_to_save_checkpoints
OUTPUT_SEG=path_to_save_segment

WAV=path_to_wav_file

python segment.py ${MUSTC_ROOT}/en-${LANG} \
    --task speech_to_text_multitask  \
    --config-yaml config_raw.yaml \
    --ckpt ${SAVE_DIR}/average-model.pt \
    --save-root ${OUTPUT_SEG} \
    --wav ${WAV}

Results

DiSeg's performance on MuST-C English-to-German:

k	CW	AP	AL	DAL	BLEU	TER	chrF	chrF++
1	462	0.67	1102	1518	18.85	73.13	44.29	42.31
3	553	0.76	1514	1967	20.74	69.95	49.34	47.09
5	666	0.82	1928	2338	22.11	66.90	50.13	47.94
7	850	0.86	2370	2732	22.98	65.42	50.36	48.23
9	1084	0.90	2785	3115	23.01	65.48	50.24	48.13
11	1354	0.92	3168	3464	23.13	65.04	50.42	48.31
13	1632	0.94	3575	3846	23.05	64.85	50.53	48.41
15	1935	0.96	3801	4040	23.12	64.92	50.47	48.36

DiSeg's performance on MuST-C English-to-Spanish:

k	CW	AP	AL	DAL	BLEU	TER	chrF	chrF++
1	530	0.67	1144	1625	22.03	71.34	45.69	43.82
3	563	0.76	1504	2107	24.49	66.63	53.09	50.85
5	632	0.81	1810	2364	26.58	63.35	54.55	52.39
7	788	0.85	2249	2764	27.81	61.87	55.28	53.16
9	1010	0.89	2694	3164	28.33	60.98	55.51	53.40
11	1257	0.92	3108	3530	28.59	60.63	55.64	53.55
13	1534	0.94	3479	3855	28.72	60.49	55.61	53.53
15	1835	0.95	3819	4160	28.92	60.22	55.80	53.71

Citation

If you have any questions, feel free to contact me at zhangshaolei20z@ict.ac.cn.

If you find this repository useful, please cite:

@inproceedings{DiSeg,
    title = "End-to-End Simultaneous Speech Translation with Differentiable Segmentation",
    author = "Zhang, Shaolei  and
      Feng, Yang",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.485",
    pages = "7659--7680",
}
