This is a fork of NVIDIA's tacotron2 repository, modified to run on NVIDIA K80 GPUs instead of the V100 GPUs used originally.
Tacotron 2: a PyTorch implementation of Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.
This implementation includes distributed and FP16 support and uses the LJSpeech dataset.
Distributed and FP16 support relies on work by Christian Sarofeen and NVIDIA's Apex Library.
(Image: results from Tensorboard while training)

Pre-requisites:
- NVIDIA GPU + CUDA cuDNN
- Download and extract the LJ Speech dataset
Setup:
- Clone the repo:
git clone https://github.com/RiccardoGrin/NVIDIA-tacotron2.git
- cd into the repo:
cd NVIDIA-tacotron2
- Update the .wav paths in the filelists (an example line is shown after this list):
sed -i -- 's,DUMMY,/home/ubuntu/LJSpeech-1.1/wavs,g' filelists/*.txt
- Alternatively, set load_mel_from_disk=True in hparams.py and update the filelist paths to point at precomputed mel-spectrograms (see the examples after this list)
- Install PyTorch 0.4
- Install python requirements:
pip install -r requirements.txt
- Change dist_url in hparams.py to a file:// URL under the repo directory, pointing at a test.dpt file that does not yet exist (see the examples after this list)
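Each line in filelists/*.txt pairs an audio path with its transcript, separated by a pipe character, so the sed command above only rewrites the DUMMY prefix. An illustrative line before and after (the transcript is abbreviated):

```
DUMMY/LJ001-0001.wav|Printing, in the only sense with which we are at present concerned
/home/ubuntu/LJSpeech-1.1/wavs/LJ001-0001.wav|Printing, in the only sense with which we are at present concerned
```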
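For the load_mel_from_disk alternative, you can either edit hparams.py directly or pass an override on the command line, since train.py accepts comma-separated hparams overrides through its --hparams flag (the same mechanism used by the multi-GPU command further below):

```
python train.py --output_directory=outdir --log_directory=logdir --hparams=load_mel_from_disk=True
```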
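For the dist_url change, a sketch assuming the repo was cloned to /home/ubuntu/NVIDIA-tacotron2 (the path is illustrative; keep whatever form hparams.py uses to define dist_url). PyTorch's distributed init accepts a file:// URL, and the test.dpt file it names must not already exist:

```python
# In hparams.py -- point dist_url at a not-yet-existing file inside the repo.
dist_url = "file:///home/ubuntu/NVIDIA-tacotron2/test.dpt"
```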
Train the model:
python train.py --output_directory=outdir --log_directory=logdir
- (OPTIONAL) Monitor training progress with Tensorboard:
tensorboard --logdir=outdir/logdir
For multi-GPU (distributed) and FP16 training, run:
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True
This trains much faster and better than normal single-GPU training; however, it may start by overflowing for a few steps, with messages like the following, before training proceeds correctly (the dynamic loss scaler starts at a large scale and halves it after each overflow until it finds a stable value):
'OVERFLOW! Skipping step. Attempted loss scale: 4294967296'
For inference:
- Start a Jupyter Notebook server and open it in the browser
- Open inference.ipynb
- Follow the instructions in the notebook and run it (a sketch of what it does follows below)
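For reference, here is a minimal sketch of what the notebook does, assuming this repo's standard module layout (create_hparams in hparams.py, load_model in train.py, text_to_sequence in the text package); the checkpoint path is illustrative:

```python
import numpy as np
import torch

from hparams import create_hparams
from train import load_model
from text import text_to_sequence

hparams = create_hparams()

# Load a trained Tacotron 2 checkpoint (path is illustrative).
checkpoint_path = "outdir/checkpoint_15750"
model = load_model(hparams)
model.load_state_dict(torch.load(checkpoint_path)['state_dict'])
model.eval()

# Encode the input text as a sequence of symbol IDs and run inference.
text = "You stay in Wonderland and I show you how deep the rabbit hole goes."
sequence = np.array(text_to_sequence(text, ['english_cleaners']))[None, :]
sequence = torch.from_numpy(sequence).cuda().long()

mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)
# mel_outputs_postnet is the predicted mel spectrogram; a separate vocoder
# (e.g. nv-wavenet) is needed to turn it into an audible waveform.
```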
Below are the inference results after 15750 and 4750 steps respectively, for the input text: "You stay in Wonderland and I show you how deep the rabbit hole goes." - Morpheus, The Matrix
You can download 'inference_test_15750.wav' and 'inference_test_4750.wav' to listen to the audio generated at the respective steps. Around step 4750 is when the network started to construct a proper alignment graph and produce understandable sounds.
Related repos:
- nv-wavenet: Faster than real-time WaveNet inference
Acknowledgements:
This implementation uses code from the following repos: Keith Ito and Prem Seetharaman, as described in our code.
We are inspired by Ryuichi Yamamoto's Tacotron PyTorch implementation.
We are thankful to the Tacotron 2 paper authors, especially Jonathan Shen, Yuxuan Wang and Zongheng Yang.