v2.0.0: Complete refactor, brand-new variance models, universal dictionary compatibility, AMP/DDP support and many more improvements

@yqzhishen yqzhishen released this 16 Jul 17:30

Backwards Incompatible Changes

Dataset making pipelines

Dataset making pipelines (based on MFA) have been moved to their own repository, MakeDiffSinger. The original Jupyter Notebook has been removed and replaced with command-line scripts.

Old functionality removal

The following functionalities have been removed and are no longer supported:

  • MIDI-A/B training and inference
  • PitchExtractor (xiaoma_pe) training and inference
  • Old 24 kHz vocoder (HiFi-GAN & PWG) training and inference

Environment & dependencies

Dependencies have been refactored and need to be re-installed. The ONNX exporting dependency has been updated from PyTorch 1.8 to PyTorch 1.13.

Model loading

Old acoustic model checkpoints should be migrated via the following script before loading:

python scripts/migrate.py ckpt <INPUT_CKPT> <OUTPUT_CKPT>

Before resuming training from old checkpoints, the following line should be added to the configuration file:

num_pad_tokens: 3

Datasets

Old datasets should be re-binarized before training.

Old data labels (transcriptions.txt) should be migrated to the new transcriptions.csv via the following script before loading:

python scripts/migrate.py txt <INPUT_TXT>

Configuration files

The following configuration keys have been renamed:

  • g2p_dictionary => dictionary
  • max_tokens => max_batch_frames
  • max_sentences => max_batch_size
  • max_eval_tokens => max_val_batch_frames
  • max_eval_sentences => max_val_batch_size
  • lr => optimizer_args.lr
  • optimizer_adam_beta1 => optimizer_args.beta1
  • optimizer_adam_beta2 => optimizer_args.beta2
  • weight_decay => optimizer_args.weight_decay
  • warmup_updates => lr_scheduler_args.warmup_steps
  • decay_steps => lr_scheduler_args.step_size
  • gamma => lr_scheduler_args.gamma
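As a rough illustration of the table above, the renames could be applied to an old-style configuration dict with a helper like the following. The helper itself is hypothetical (it is not part of scripts/migrate.py); only the key names come from the release notes:

```python
# Hypothetical helper: apply the v2.0.0 configuration key renames.
# Targets containing a dot are placed into nested dicts
# (e.g. "lr" -> config["optimizer_args"]["lr"]).
RENAMES = {
    "g2p_dictionary": "dictionary",
    "max_tokens": "max_batch_frames",
    "max_sentences": "max_batch_size",
    "max_eval_tokens": "max_val_batch_frames",
    "max_eval_sentences": "max_val_batch_size",
    "lr": "optimizer_args.lr",
    "optimizer_adam_beta1": "optimizer_args.beta1",
    "optimizer_adam_beta2": "optimizer_args.beta2",
    "weight_decay": "optimizer_args.weight_decay",
    "warmup_updates": "lr_scheduler_args.warmup_steps",
    "decay_steps": "lr_scheduler_args.step_size",
    "gamma": "lr_scheduler_args.gamma",
}

def migrate_config(old: dict) -> dict:
    """Return a new config dict with old keys renamed or nested."""
    new = {}
    for key, value in old.items():
        target = RENAMES.get(key, key)
        if "." in target:
            parent, child = target.split(".", 1)
            new.setdefault(parent, {})[child] = value
        else:
            new[target] = value
    return new
```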

DS files

DS files in the v1.x format are no longer supported. Please re-export them with the latest version of OpenUTAU for DiffSinger before inference.

The new variance models, parameters and mechanisms

Variance models

Training, inference and deployment of the new variance models are supported.

Functionalities included:

  • Automatic prediction of phoneme durations (Duration Predictor)
  • Automatic prediction of the pitch curve (Pitch Diffusion)
  • Joint automatic prediction of other variance parameters (Multi-Variance Diffusion)

Before training variance models, the current data transcriptions should be migrated. The required operations may vary according to the functionalities chosen and the dictionaries used (manual labeling is sometimes required). See details at: variance-temp-solution.

Phoneme durations

Acoustic models require an input duration for every phoneme, so they rely on external phoneme duration prediction. The phoneme duration prediction module in variance models can predict a duration for every phoneme given the phoneme sequence, word division, word durations and an approximate MIDI sequence.

Pitch curve

Acoustic models require explicit pitch input from outside. The pitch prediction module in variance models can predict the pitch curve given phoneme information and a smoothed MIDI curve. The specially designed labeling system can correct bad data containing many out-of-tune errors and produce accurate models.

Variance parameters

Variance parameters bring higher expressiveness and controllability beyond phoneme durations and pitch. They are predicted by the variance model given phoneme information and the pitch curve, then passed to the acoustic model for control.

NOTE: Variance parameters are represented by absolute values instead of relative values (offsets), so no default value curves are defined. For this reason, new acoustic models that accept these parameters as input should be trained alongside their corresponding variance models.

Energy

Energy is defined as the RMS curve of the singing, in dB, which can control the strength of the voice to a certain extent.

In DS files, energy and energy_timestep are used to control energy.
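Following the definition above (energy as the RMS curve of the singing, in dB), a frame-wise energy curve can be sketched with NumPy as below. This is a minimal illustration only; the frame size, hop size and dB floor are assumed values, not the project's actual preprocessing parameters:

```python
import numpy as np

def energy_db(waveform: np.ndarray, frame_size: int = 2048,
              hop_size: int = 512, floor_db: float = -96.0) -> np.ndarray:
    """Frame-wise RMS of the signal, converted to dB (illustrative sketch)."""
    n_frames = 1 + max(0, len(waveform) - frame_size) // hop_size
    rms = np.empty(n_frames)
    for i in range(n_frames):
        frame = waveform[i * hop_size : i * hop_size + frame_size]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    # 20 * log10(rms), clamped so silent frames do not produce -inf
    return np.maximum(20.0 * np.log10(np.maximum(rms, 1e-12)), floor_db)
```

A full-scale sine wave (RMS ≈ 0.707) yields about -3 dB per frame, and silence is clamped to the floor value. The breathiness curve described below has the same dB/RMS form, but is computed on the aperiodic component of the signal, which requires a harmonic/aperiodic separation step not shown here.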

Breathiness

Breathiness is defined as the RMS curve of the aperiodic part of the singing, in dB, which can control the power of the air and unvoiced consonants in the voice.

In DS files, breathiness and breathiness_timestep are used to control breathiness.

Style fusion mechanism

All parameters supported by variance models can be dynamically style-mixed. Phoneme durations are mixed at the phoneme level, while the other parameters are mixed at the frame level. Style fusion of different parameters is independent for each parameter, and style fusion is also independent from timbre fusion.

Style fusion controls are similar to the timbre fusion of acoustic models:

  • ph_spk_mix is used to control the fusion of phoneme durations.
  • spk_mix and spk_mix_timestep are used to control the fusion of other parameters.
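Conceptually, frame-level style fusion amounts to a per-frame weighted mix of each speaker's predicted curve. The sketch below illustrates that idea only; the function name and array shapes are assumptions, not the project's actual API:

```python
import numpy as np

def mix_styles(curves: dict, spk_mix: dict) -> np.ndarray:
    """Frame-level style fusion sketch: mix per-speaker parameter curves
    with per-frame weights, normalized so the weights sum to 1 per frame."""
    names = list(curves)
    weights = np.stack([spk_mix[n] for n in names])   # (n_spk, n_frames)
    weights = weights / weights.sum(axis=0, keepdims=True)
    stacked = np.stack([curves[n] for n in names])    # (n_spk, n_frames)
    return (weights * stacked).sum(axis=0)
```

Because the weights can vary per frame, a style can glide from one speaker to another over time, which is what `spk_mix` together with `spk_mix_timestep` expresses.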

Local retaking mechanism

Pitch and all other variance parameters support local retaking, i.e. re-generating the curve on a continuous sub-region based on given curve segments. This mechanism also ensures that the retaken curve connects smoothly to the given curve.

To retake pitch, the complete phoneme information, the position of the region to be retaken and the pitch curve on the non-retaking regions should be given.

To retake other variance parameters, the complete phoneme information, the complete pitch curve, the names of the variance parameters to retake (retaking multiple parameters in one go is supported), the positions of the regions to be retaken (retaking different parameters on different regions is supported) and the parameter curves on the non-retaking regions should be given.
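One way to picture the smooth-connection requirement is a short crossfade between the given curve and the retaken segment at each region boundary. The sketch below is only an illustration of that idea, not the model's actual mechanism (which blends during diffusion sampling rather than as post-processing):

```python
import numpy as np

def splice_retaken(original: np.ndarray, retaken: np.ndarray,
                   start: int, end: int, fade: int = 4) -> np.ndarray:
    """Replace original[start:end] with a retaken segment, crossfading
    `fade` frames at each boundary so the result connects smoothly."""
    out = original.copy()
    out[start:end] = retaken
    ramp = np.linspace(0.0, 1.0, fade)
    # fade in: the region's start stays anchored to the surrounding curve
    out[start:start + fade] = (1 - ramp) * original[start:start + fade] \
        + ramp * retaken[:fade]
    # fade out: the region's end returns to the surrounding curve
    out[end - fade:end] = ramp[::-1] * retaken[-fade:] \
        + (1 - ramp[::-1]) * original[end - fade:end]
    return out
```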

Parameter cascading mechanism

Overall cascading logic

The general cascading order of the variance model is: music scores => phoneme durations => pitch => other variance parameters.
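The cascade can be read as a chain of predictors, each consuming the previous stage's output. The stub below only mirrors the data flow described above; the predictor functions are stand-ins, not the project's module names:

```python
def infer(music_score, predict_durations, predict_pitch, predict_variances):
    """Cascade sketch: score -> durations -> pitch -> other variances.
    The three predictor arguments are stand-ins for the model's modules."""
    durations = predict_durations(music_score)
    pitch = predict_pitch(music_score, durations)
    variances = predict_variances(music_score, durations, pitch)
    return durations, pitch, variances
```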

Customized variance cascading

By utilizing the local retaking mechanism, the cascading order of all variance parameters except pitch can be customized at inference time. The following shows some example use cases:

  • Jointly predict parameters A, B and C in one go so that they are not coupled with each other; changing one parameter will not cause the others to change.
  • Place parameter C after parameters A and B, i.e., jointly predict A and B in one go, then use A and B to predict C. Parameter C will change when A or B is modified, but A and B do not influence each other.
  • Freeze parameter A and couple parameters B and C. Use A to predict B and C; modifying either B or C will cause the other to change.

Universal dictionary and phoneme system support

The brand-new variance models and phoneme labeling system support any dictionary with any phoneme system. See the variance labels migration guidelines (variance-temp-solution) and the custom dictionary guidelines for more details.

Automatic mixed precision, multi-GPU and gradient accumulation

This project has been adapted to the latest version of Lightning and supports automatic mixed precision (FP16/BF16 AMP), multi-GPU training (DDP) and gradient accumulation for accelerating training and saving GPU memory. See performance tuning guidelines for more details.

Other new contents and changes

  • Documentation of this project has been refactored. The README lists all important documents and links.
  • Code structure and dependencies are significantly refactored. Some dependencies are updated.
  • Scripts for preprocessing, training, inference and deployment have been refactored and moved under scripts/.
  • A new script for deleting specific speaker embeddings from model checkpoints has been added.
  • PYTHONPATH and CUDA_VISIBLE_DEVICES no longer need to be exported when running preprocessing and training.
  • The speaker IDs assigned to each dataset can now be customized via the spk_ids configuration key. Giving the same ID to multiple datasets is also supported.
  • Multiprocessing binarization is now supported. The number of workers can be customized.
  • Dataset binary format has been changed to HDF5. Redundant contents are removed.
  • The learning rate and the optimizer can now be customized more freely via lr_scheduler_args and optimizer_args.
  • DDIM, DPM-Solver++ (replacement of DPM-Solver) and UniPC algorithms are supported for diffusion sampling acceleration.
  • The diffusion accelerator integrated in ONNX models has been changed to DDIM.
  • When exporting multi-speaker models, all speakers will be exported by default if the --export_spk option is unset.
  • Version of the operator set of exported ONNX models has been upgraded to 15.

Some changes may not be listed above. See the repository README for more details.

Bug fixes

  • Fixed a bug causing the epoch count in the terminal logging to be only 1/1000 of the actual epoch.
  • Fixed potential file handle racing when reading/seeking dataset.
  • Fixed a bug causing inconsistency between joint augmentation formula and implementation.
  • Fixed hyper-parameters failing to render colors in some terminals and some Python versions.
  • Fixed the messed-up code backup directory structure.

License

The license of this project has been changed from the MIT License to the Apache License 2.0.