carankt/FastSpeech2

FastSpeech 2 ✈️

A novice's PyTorch implementation of FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, based on the Deepest-Project implementation of FastSpeech. The quality of the voice samples generated by this repo is not up to the mark, mainly because training used batch_size = 8 due to limited GPU memory and processing power; with batch_size > 8, CUDA ran out of memory. I would be glad if anyone reading this repo could take up training with the batch_size given in the paper and/or suggest ways of improving the results. 😇

Demo

Download the checkpoint, trained on the LJSpeech dataset, from here. Place it in the training_log folder, then run inference.ipynb. For mel-to-audio generation I have used MelGAN from 🔦 torch hub.
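As a rough illustration of the mel-to-audio step, here is a minimal sketch of loading a MelGAN vocoder through torch hub. It assumes the `seungwonpark/melgan` hub entry point and its `inference` method; the notebook in this repo may load a different MelGAN, and the helper name `mel_to_audio` is hypothetical. The mel-spectrogram must be computed with the same settings the vocoder was trained on.

```python
def mel_to_audio(mel):
    """Convert a mel-spectrogram tensor of shape (1, n_mels, frames) to a waveform."""
    import torch  # imported lazily so this module loads even without PyTorch

    # Assumption: the 'seungwonpark/melgan' hub entry point.
    # The first call downloads the model code and pretrained weights.
    vocoder = torch.hub.load("seungwonpark/melgan", "melgan")
    vocoder.eval()

    # Move both the model and the input to the GPU when one is available.
    if torch.cuda.is_available():
        vocoder, mel = vocoder.cuda(), mel.cuda()

    # No gradients are needed at inference time.
    with torch.no_grad():
        return vocoder.inference(mel)
```

Calling `mel_to_audio(mel)` with a mel-spectrogram predicted by the model should return the corresponding audio samples, provided the network connection and hub repository are available.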

Requirements

All code is written in Python 3.6.10.
requirements.txt contains the list of all packages required to run this repo.

pip install -r requirements.txt

For smooth operation, download the latest torch and a suitable CUDA version from here. This repo works with PyTorch >= 1.4. Not sure about lower versions; let me know if they work.
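To check the version requirement before running anything, a small helper like the one below can compare the installed version string against the minimum. This is only a sketch: `version_tuple` and `meets_minimum` are hypothetical names, not part of this repo.

```python
import re


def version_tuple(version):
    """Turn a string such as '1.13.1+cu117' into (1, 13, 1)."""
    # Drop local or build suffixes such as '+cu117'.
    core = re.split(r"[+\-]", version, maxsplit=1)[0]
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum="1.4"):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


# Typical use, assuming PyTorch is installed:
#   import torch
#   assert meets_minimum(torch.__version__), "PyTorch >= 1.4 is required"
```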

Before moving to the next step, update the hparams.py file as per your requirements.

Pre-processing

The folder MFA_filelist contains pre-extracted alignments produced by running the Montreal Forced Aligner (MFA) on the LJSpeech dataset. For more information on using MFA, visit here.

python preprocess.py -d /root_path/to/wavs/
python compute_statistics.py

Update the hparams.py file with the appropriate information about pitch and energy.
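For orientation, the kind of pitch and energy information that typically goes into a config can be sketched as below. This is not the code of compute_statistics.py; the function name `summarize` and the sample values are made up for illustration, and the exact fields hparams.py expects should be checked in the file itself.

```python
import math


def summarize(contours):
    """Return min, max, mean and std over all frames of all utterances."""
    # Flatten the per-utterance contours into one list of frame values.
    values = [v for contour in contours for v in contour]
    if not values:
        raise ValueError("no frames to summarize")
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "std": math.sqrt(variance),
    }


# Two toy pitch contours in Hz (illustrative values only).
pitch = [[110.0, 120.0, 130.0], [200.0, 240.0]]
stats = summarize(pitch)
print(stats)
```

The same function can be applied to energy contours; the resulting minimum and maximum are the sort of values a pitch or energy quantizer needs.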

Train

Make sure the training_log folder exists in the repo before running the command below.

python train.py
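The training_log folder required above can also be created from Python before training starts. A minimal sketch, where `ensure_log_dir` is a hypothetical helper and not part of train.py:

```python
from pathlib import Path


def ensure_log_dir(repo_root=".", name="training_log"):
    """Create the log folder if it is missing and return its path."""
    log_dir = Path(repo_root) / name
    # exist_ok=True makes the call safe to repeat across runs.
    log_dir.mkdir(parents=True, exist_ok=True)
    return log_dir
```

Running `ensure_log_dir()` from the repository root has the same effect as creating the folder by hand.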

Tensorboard images


Generated vs Original Mel

Note

  • The output of the present checkpoint is not good because of insufficient training. I will update it with the best checkpoint as soon as I can.
  • There are outliers in the dataset that need to be taken care of. Hopefully that will make training leaner.
  • Using a lower batch size does not work well with this model.
  • Normalizing pitch and energy may also help with faster training or better convergence.
  • This model was trained on an Nvidia GeForce GTX 960M with 4 GB of memory, which is well below what this model requires.
  • Feel free to share interesting insights.
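The notes on outliers and normalization can be combined into one preprocessing step. The sketch below uses z-score normalization with clipping, which is one common choice and not something this repo currently does; the function name `normalize` and the clip threshold of 3 standard deviations are assumptions.

```python
def normalize(contour, mean, std, clip=3.0):
    """Z-score a pitch or energy contour and clip extreme values."""
    if std <= 0:
        raise ValueError("std must be positive")
    normalized = []
    for value in contour:
        z = (value - mean) / std
        # Limit outliers to the range [-clip, clip].
        normalized.append(max(-clip, min(clip, z)))
    return normalized


# Illustrative values: the last frame is an outlier and gets clipped.
print(normalize([160.0, 210.0, 1000.0], mean=160.0, std=50.0))
```

The mean and standard deviation should come from the training set only, and the same values must be reused at inference time to undo or match the scaling.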

References

Happy Learning! 😄
