# Textless speech emotion conversion using decomposed and discrete representations
[Felix Kreuk](https://felixkreuk.github.io), Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, [Yossi Adi](https://adiyoss.github.io)

_abstract_: Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion. First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows us to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc. We demonstrate objectively and subjectively that the proposed method is superior to the baselines in terms of perceived emotion and audio quality. We rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method. Samples and code will be publicly available under the following link: https://speechbot.github.io/emotion.

## Installation
First, create a conda virtual environment and activate it:
```
conda create -n emotion python=3.8 -y
conda activate emotion
```

Then, clone this repository:
```
git clone https://github.com/facebookresearch/fairseq.git
cd fairseq/examples/emotion_conversion
git clone https://github.com/felixkreuk/speech-resynthesis
```

Next, download the EmoV discrete tokens:
```
wget https://dl.fbaipublicfiles.com/textless_nlp/emotion_conversion/data.tar.gz # (still in fairseq/examples/emotion_conversion)
tar -xzvf data.tar.gz
```

Your `fairseq/examples/emotion_conversion` directory should look like this:
```
drwxrwxr-x 3 felixkreuk felixkreuk   0 Feb  6  2022 data
drwxrwxr-x 3 felixkreuk felixkreuk   0 Sep 28 10:41 emotion_models
drwxr-xr-x 3 felixkreuk felixkreuk   0 Jun 29 05:43 fairseq_models
drwxr-xr-x 3 felixkreuk felixkreuk   0 Sep 28 10:41 preprocess
-rw-rw-r-- 1 felixkreuk felixkreuk 11K Dec  5 09:00 README.md
-rw-rw-r-- 1 felixkreuk felixkreuk  88 Mar  6  2022 requirements.txt
-rw-rw-r-- 1 felixkreuk felixkreuk 13K Jun 29 06:26 synthesize.py
```

Lastly, return to the fairseq root directory and install fairseq along with the other packages:
```
pip install --editable ./
pip install -r examples/emotion_conversion/requirements.txt
```

## Data preprocessing

### Convert your audio to discrete representations
Please follow the steps described [here](https://github.com/pytorch/fairseq/tree/main/examples/hubert/simple_kmeans).
To generate the same discrete representations, please use the following (a minimal quantization sketch is shown after this list):
1. [HuBERT checkpoint](https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt)
2. k-means model at `data/hubert_base_ls960_layer9_clusters200/data_hubert_base_ls960_layer9_clusters200.bin`
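
For reference, below is a minimal sketch of what this quantization step does for a single file, assuming the HuBERT checkpoint and k-means model listed above; the wav and checkpoint paths are placeholders, and the `simple_kmeans` scripts linked above remain the recommended way to process a full dataset.
```
# Sketch: quantize one waveform into discrete HuBERT units (layer 9, 200 clusters).
# Paths below are placeholders; see examples/hubert/simple_kmeans for the full pipeline.
import joblib
import soundfile as sf
import torch
from fairseq import checkpoint_utils

# Load HuBERT and the released k-means model (a scikit-learn object serialized with joblib).
models, _, _ = checkpoint_utils.load_model_ensemble_and_task(["hubert_base_ls960.pt"])
hubert = models[0].eval()
kmeans = joblib.load("data/hubert_base_ls960_layer9_clusters200/data_hubert_base_ls960_layer9_clusters200.bin")

wav, sr = sf.read("example.wav")  # EmoV audio, expected at 16 kHz
source = torch.from_numpy(wav).float().unsqueeze(0)

with torch.no_grad():
    # Layer-9 features, no masking -- the configuration the k-means model was trained on.
    feats, _ = hubert.extract_features(source, padding_mask=None, mask=False, output_layer=9)

units = kmeans.predict(feats.squeeze(0).numpy())  # one cluster id per ~20 ms frame
print(" ".join(map(str, units)))
```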

### Construct data splits
This step will use the discrete representations from the previous step and split them into train/valid/test sets for 3 tasks:
1. Translation model pre-training (BART language denoising)
2. Translation model training (content units emotion translation mechanism)
3. HiFiGAN model training (for synthesizing audio from discrete representations)

Your processed data should be at `data/`:
1. `hubert_base_ls960_layer9_clusters200` - discrete representations extracted using HuBERT layer 9, clustered into 200 clusters.
2. `data.tsv` - a tsv file pointing to the EmoV dataset in your environment (please edit the first line of this file according to your path; an illustrative example of the manifest layout is shown below).
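
For orientation, a fairseq-style audio manifest typically has the dataset root on the first line, followed by one tab-separated `relative-path<TAB>number-of-samples` entry per utterance. The entries below are purely illustrative placeholders; check the released `data.tsv` for the exact columns used here.
```
/path/to/EmoV-DB
speaker1/amused/utterance_0001.wav	53760
speaker1/neutral/utterance_0001.wav	48000
```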

The following command will create the above splits:
```
python examples/emotion_conversion/preprocess/create_core_manifest.py \
    --tsv data/data.tsv \
    --emov-km data/hubert_base_ls960_layer9_clusters200/data.km \
    --km data/hubert_base_ls960_layer9_clusters200/vctk.km \
    --dict data/hubert_base_ls960_layer9_clusters200/dict.txt \
    --manifests-dir $DATA
```
* Set `$DATA` as the directory that will contain the processed data.

### Extract F0
To train the HiFiGAN vocoder, we first need to extract the F0 curves:
```
python examples/emotion_conversion/preprocess/extract_f0.py \
    --tsv data/data.tsv \
    --extractor pyaapt
```
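
The `pyaapt` extractor refers to the YAAPT pitch tracker, typically the `amfm_decompy` implementation. As a rough, self-contained illustration of what it computes for a single file (the wav path is a placeholder; `extract_f0.py` above remains the script to use for the whole dataset):
```
# Sketch: F0 extraction for one file with pYAAPT (the `--extractor pyaapt` backend).
import amfm_decompy.basic_tools as basic
import amfm_decompy.pYAAPT as pYAAPT

signal = basic.SignalObj("example.wav")  # loads the waveform and its sample rate
pitch = pYAAPT.yaapt(signal)             # default YAAPT parameters; extract_f0.py may configure these differently

f0 = pitch.samp_values        # per-frame F0 in Hz (0 for unvoiced frames)
f0_interp = pitch.samp_interp # contour with unvoiced regions interpolated
print(f0.shape, f0.mean())
```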

## HiFiGAN training
Now we are all set to train the HiFiGAN vocoder:
```
python examples/emotion_conversion/speech-resynthesis/train.py \
    --checkpoint_path <hifigan-checkpoint-dir> \
    --config examples/emotion_conversion/speech-resynthesis/configs/EmoV/emov_hubert-layer9-cluster200_fixed-spkr-embedder_f0-raw_gst.json
```

## Translation Pre-training
Before translating emotions, we first need to pre-train the translation model as a denoising autoencoder (similarly to BART).
```
python train.py \
    $DATA/fairseq-data/emov_multilingual_denoising_cross-speaker_dedup_nonzeroshot/tokenized \
    --save-dir <your-save-dir> \
    --tensorboard-logdir <your-tb-dir> \
    --langs neutral,amused,angry,sleepy,disgusted,vctk.km \
    --dataset-impl mmap \
    --task multilingual_denoising \
    --arch transformer_small --criterion cross_entropy \
    --multilang-sampling-alpha 1.0 --sample-break-mode eos --max-tokens 16384 \
    --update-freq 1 --max-update 3000000 \
    --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.0 \
    --optimizer adam --weight-decay 0.01 --adam-eps 1e-06 \
    --clip-norm 0.1 --lr-scheduler polynomial_decay --lr 0.0003 \
    --total-num-update 3000000 --warmup-updates 10000 --fp16 \
    --poisson-lambda 3.5 --mask 0.3 --mask-length span-poisson --replace-length 1 --rotate 0 --mask-random 0.1 --insert 0 --permute-sentences 1.0 \
    --skip-invalid-size-inputs-valid-test \
    --user-dir examples/emotion_conversion/fairseq_models
```

## Translation Training
Now we are ready to train our emotion translation model:
```
python train.py \
    --distributed-world-size 1 \
    $DATA/fairseq-data/emov_multilingual_translation_cross-speaker_dedup/tokenized/ \
    --save-dir <your-save-dir> \
    --tensorboard-logdir <your-tb-dir> \
    --arch multilingual_small --task multilingual_translation \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
    --lang-pairs neutral-amused,neutral-sleepy,neutral-disgusted,neutral-angry,amused-sleepy,amused-disgusted,amused-neutral,amused-angry,angry-amused,angry-sleepy,angry-disgusted,angry-neutral,disgusted-amused,disgusted-sleepy,disgusted-neutral,disgusted-angry,sleepy-amused,sleepy-neutral,sleepy-disgusted,sleepy-angry \
    --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
    --lr 1e-05 --clip-norm 0 --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --warmup-updates 2000 --lr-scheduler inverse_sqrt \
    --max-tokens 4096 --update-freq 1 --max-update 100000 \
    --required-batch-size-multiple 8 --fp16 --num-workers 4 \
    --seed 2 --log-format json --log-interval 25 --save-interval-updates 1000 \
    --no-epoch-checkpoints --keep-best-checkpoints 1 --keep-interval-updates 1 \
    --finetune-from-model <path-to-model-from-previous-step> \
    --user-dir examples/emotion_conversion/fairseq_models
```
* To share encoders/decoders use the `--share-encoders` and `--share-decoders` flags.
* To add source/target emotion tokens use the `--encoder-langtok {'src'|'tgt'}` and `--decoder-langtok` flags.

## F0-predictor Training
The following command trains the F0 prediction module:
```
cd examples/emotion_conversion
python -m emotion_models.pitch_predictor n_tokens=200 \
    train_tsv="$DATA/denoising/emov/train.tsv" \
    train_km="$DATA/denoising/emov/train.km" \
    valid_tsv="$DATA/denoising/emov/valid.tsv" \
    valid_km="$DATA/denoising/emov/valid.km"
```
* See `hydra.run.dir` to configure the directory for saving models.

## Duration-predictor Training
The following command trains the duration prediction modules:
```
cd examples/emotion_conversion
for emotion in "neutral" "amused" "angry" "disgusted" "sleepy"; do
    python -m emotion_models.duration_predictor n_tokens=200 substring=$emotion \
        train_tsv="$DATA/denoising/emov/train.tsv" \
        train_km="$DATA/denoising/emov/train.km" \
        valid_tsv="$DATA/denoising/emov/valid.tsv" \
        valid_km="$DATA/denoising/emov/valid.km"
done
```
* See `hydra.run.dir` to configure the directory for saving models.
* After running the above command you should have 5 duration models in your checkpoint directory:
```
❯ ll duration_predictor/
total 21M
-rw-rw-r-- 1 felixkreuk felixkreuk 4.1M Nov 15  2021 amused.ckpt
-rw-rw-r-- 1 felixkreuk felixkreuk 4.1M Nov 15  2021 angry.ckpt
-rw-rw-r-- 1 felixkreuk felixkreuk 4.1M Nov 15  2021 disgusted.ckpt
-rw-rw-r-- 1 felixkreuk felixkreuk 4.1M Nov 15  2021 neutral.ckpt
-rw-rw-r-- 1 felixkreuk felixkreuk 4.1M Nov 15  2021 sleepy.ckpt
```

## Token Generation
The following command uses `fairseq-generate` to generate the token sequences based on the source and target emotions.
```
fairseq-generate \
    $DATA/fairseq-data/emov_multilingual_translation_cross-speaker_dedup/tokenized/ \
    --task multilingual_translation \
    --gen-subset test \
    --path <your-saved-translation-checkpoint> \
    --beam 5 \
    --batch-size 4 --max-len-a 1.8 --max-len-b 10 --lenpen 1 --min-len 1 \
    --skip-invalid-size-inputs-valid-test --distributed-world-size 1 \
    --source-lang neutral --target-lang amused \
    --lang-pairs neutral-amused,neutral-sleepy,neutral-disgusted,neutral-angry,amused-sleepy,amused-disgusted,amused-neutral,amused-angry,angry-amused,angry-sleepy,angry-disgusted,angry-neutral,disgusted-amused,disgusted-sleepy,disgusted-neutral,disgusted-angry,sleepy-amused,sleepy-neutral,sleepy-disgusted,sleepy-angry \
    --results-path <token-output-path> \
    --user-dir examples/emotion_conversion/fairseq_models
```
* Modify `--source-lang` and `--target-lang` to control the source and target emotions.
* See [fairseq documentation](https://fairseq.readthedocs.io/en/latest/command_line_tools.html#fairseq-generate) for a full overview of generation parameters (e.g., top-k/top-p sampling).
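
The results path will contain a `generate-test.txt` file in the standard `fairseq-generate` output format (lines prefixed `S-`, `T-`, `H-`, `D-`). `synthesize.py` below consumes it directly, but if you want to inspect the translated unit sequences yourself, a minimal sketch along these lines should work (assuming that standard format; the path is the placeholder used above):
```
# Sketch: collect the best translated unit sequence per utterance from the
# fairseq-generate output. Assumes the standard S-/T-/H-/D- line prefixes.
from pathlib import Path

hypotheses = {}
for line in Path("<token-output-path>/generate-test.txt").read_text().splitlines():
    if line.startswith("D-"):                         # detokenized best hypothesis
        prefix, score, tokens = line.split("\t")
        hypotheses[int(prefix[2:])] = tokens.split()  # discrete content units
print(len(hypotheses), "utterances translated")
```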

## Waveform Synthesis
Using the output of the above command, the HiFiGAN vocoder, and the prosody prediction modules (F0 and duration), we can now generate the output waveforms:
```
python examples/emotion_conversion/synthesize.py \
    --result-path <token-output-path>/generate-test.txt \
    --data $DATA/fairseq-data/emov_multilingual_translation_cross-speaker_dedup/neutral-amused \
    --orig-tsv examples/emotion_conversion/data/data.tsv \
    --orig-km examples/emotion_conversion/data/hubert_base_ls960_layer9_clusters200/data.km \
    --checkpoint-file <hifigan-checkpoint-dir>/g_00400000 \
    --dur-model duration_predictor/ \
    --f0-model pitch_predictor/pitch_predictor.ckpt \
    -s neutral -t amused \
    --outdir ~/tmp/emotion_results/wavs/neutral-amused
```
* Please make sure the source and target emotions here match those of the previous command.

# Citation
If you find this useful in your research, please use the following BibTeX entry for citation.
```
@article{kreuk2021textless,
  title={Textless speech emotion conversion using decomposed and discrete representations},
  author={Kreuk, Felix and Polyak, Adam and Copet, Jade and Kharitonov, Eugene and Nguyen, Tu-Anh and Rivi{\`e}re, Morgane and Hsu, Wei-Ning and Mohamed, Abdelrahman and Dupoux, Emmanuel and Adi, Yossi},
  journal={Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2022}
}
```