v2.0.0: Complete refactor, brand-new variance models, universal dictionary compatibility, AMP/DDP support and many more improvements

@yqzhishen yqzhishen released this 16 Jul 17:30

Backwards Incompatible Changes

Dataset making pipelines

Dataset making pipelines (based on MFA) have been moved to their own repository, MakeDiffSinger. The original Jupyter Notebook has been removed and replaced with command-line scripts.

Old functionality removal

The following functionalities have been removed and are no longer supported:

  • MIDI-A/B training and inference
  • PitchExtractor (xiaoma_pe) training and inference
  • Old 24 kHz vocoder (HiFi-GAN & PWG) training and inference

Environment & dependencies

Dependencies have been refactored and need to be re-installed. The ONNX exporting dependency has been updated from PyTorch 1.8 to PyTorch 1.13.

Model loading

Old acoustic model checkpoints should be migrated via the following script before loading:

python scripts/migrate.py ckpt <INPUT_CKPT> <OUTPUT_CKPT>

Before resuming training from old checkpoints, the following line should be added to the configuration file:

num_pad_tokens: 3

Datasets

Old datasets should be re-binarized before training.

Old data labels (transcriptions.txt) should be migrated to the new transcriptions.csv via the following script before loading:

python scripts/migrate.py txt <INPUT_TXT>

Configuration files

The following configuration keys have been renamed:

  • g2p_dictionary => dictionary
  • max_tokens => max_batch_frames
  • max_sentences => max_batch_size
  • max_eval_tokens => max_val_batch_frames
  • max_eval_sentences => max_val_batch_size
  • lr => optimizer_args.lr
  • optimizer_adam_beta1 => optimizer_args.beta1
  • optimizer_adam_beta2 => optimizer_args.beta2
  • weight_decay => optimizer_args.weight_decay
  • warmup_updates => lr_scheduler_args.warmup_steps
  • decay_steps => lr_scheduler_args.step_size
  • gamma => lr_scheduler_args.gamma
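As a rough illustration of the table above, the renames could be applied to an old-style configuration dict with a helper like the following. The helper itself is hypothetical (it is not part of scripts/migrate.py); only the key names come from the release notes:

```python
# Hypothetical helper: apply the v2.0.0 configuration key renames.
# Targets containing a dot are placed into nested dicts
# (e.g. "lr" -> config["optimizer_args"]["lr"]).
RENAMES = {
    "g2p_dictionary": "dictionary",
    "max_tokens": "max_batch_frames",
    "max_sentences": "max_batch_size",
    "max_eval_tokens": "max_val_batch_frames",
    "max_eval_sentences": "max_val_batch_size",
    "lr": "optimizer_args.lr",
    "optimizer_adam_beta1": "optimizer_args.beta1",
    "optimizer_adam_beta2": "optimizer_args.beta2",
    "weight_decay": "optimizer_args.weight_decay",
    "warmup_updates": "lr_scheduler_args.warmup_steps",
    "decay_steps": "lr_scheduler_args.step_size",
    "gamma": "lr_scheduler_args.gamma",
}

def migrate_config(old: dict) -> dict:
    """Return a new config dict with old keys renamed or nested."""
    new = {}
    for key, value in old.items():
        target = RENAMES.get(key, key)
        if "." in target:
            parent, child = target.split(".", 1)
            new.setdefault(parent, {})[child] = value
        else:
            new[target] = value
    return new
```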

DS files

DS files in the v1.x format are no longer supported. Please re-export them with the latest version of OpenUTAU for DiffSinger before inference.

The new variance models, parameters and mechanisms

Variance models

Training, inference and deployment of the new variance models are supported.

Functionalities included:

  • Automatic prediction of phoneme durations (Duration Predictor)
  • Automatic prediction of the pitch curve (Pitch Diffusion)
  • Joint automatic prediction of other variance parameters (Multi-Variance Diffusion)

Before training variance models, the current data transcriptions should be migrated. The required operations may vary according to the functionalities chosen and the dictionaries used (manual labeling is sometimes required). See details at: variance-temp-solution.

Phoneme durations

Acoustic models require an input duration for every phoneme, so they rely on external phoneme duration prediction. The phoneme duration prediction module in variance models can predict a duration for every phoneme given the phoneme sequence, word division, word durations and an approximate MIDI sequence.

Pitch curve

Acoustic models require explicit pitch input from outside. The pitch prediction module in variance models can predict the pitch curve given phoneme information and a smoothed MIDI curve. The specially designed labeling system can correct bad data containing many out-of-tune errors and produce accurate models.

Variance parameters

Variance parameters bring higher expressiveness and controllability beyond phoneme durations and pitch. They are predicted by the variance model given phoneme information and the pitch curve, then passed to the acoustic model for control.

NOTE: Variance parameters are represented by absolute values instead of relative values (offsets), so no default value curves are defined. For this reason, new acoustic models that accept these parameters as input should be trained alongside their corresponding variance models.

Energy

Energy is defined as the RMS curve of the singing, in dB, which can control the strength of the voice to a certain extent.

In DS files, energy and energy_timestep are used to control energy.
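Following the definition above (energy as the RMS curve of the singing, in dB), a frame-wise energy curve can be sketched with NumPy as below. This is a minimal illustration only; the frame size, hop size and dB floor are assumed values, not the project's actual preprocessing parameters:

```python
import numpy as np

def energy_db(waveform: np.ndarray, frame_size: int = 2048,
              hop_size: int = 512, floor_db: float = -96.0) -> np.ndarray:
    """Frame-wise RMS of the signal, converted to dB (illustrative sketch)."""
    n_frames = 1 + max(0, len(waveform) - frame_size) // hop_size
    rms = np.empty(n_frames)
    for i in range(n_frames):
        frame = waveform[i * hop_size : i * hop_size + frame_size]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    # 20 * log10(rms), clamped so silent frames do not produce -inf
    return np.maximum(20.0 * np.log10(np.maximum(rms, 1e-12)), floor_db)
```

A full-scale sine wave (RMS ≈ 0.707) yields about -3 dB per frame, and silence is clamped to the floor value. The breathiness curve described below has the same dB/RMS form, but is computed on the aperiodic component of the signal, which requires a harmonic/aperiodic separation step not shown here.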

Breathiness

Breathiness is defined as the RMS curve of the aperiodic part of the singing, in dB, which can control the power of the air and unvoiced consonants in the voice.

In DS files, breathiness and breathiness_timestep are used to control breathiness.

Style fusion mechanism

All parameters supported by variance models can be dynamically style-mixed. Phoneme durations are mixed at the phoneme level, while the other parameters are mixed at the frame level. Style fusion of different parameters is independent for each parameter, and style fusion is also independent from timbre fusion.

Style fusion controls are similar to the timbre fusion of acoustic models:

  • ph_spk_mix is used to control the fusion of phoneme durations.
  • spk_mix and spk_mix_timestep are used to control the fusion of other parameters.
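Conceptually, frame-level style fusion amounts to a per-frame weighted mix of each speaker's predicted curve. The sketch below illustrates that idea only; the function name and array shapes are assumptions, not the project's actual API:

```python
import numpy as np

def mix_styles(curves: dict, spk_mix: dict) -> np.ndarray:
    """Frame-level style fusion sketch: mix per-speaker parameter curves
    with per-frame weights, normalized so the weights sum to 1 per frame."""
    names = list(curves)
    weights = np.stack([spk_mix[n] for n in names])   # (n_spk, n_frames)
    weights = weights / weights.sum(axis=0, keepdims=True)
    stacked = np.stack([curves[n] for n in names])    # (n_spk, n_frames)
    return (weights * stacked).sum(axis=0)
```

Because the weights can vary per frame, a style can glide from one speaker to another over time, which is what `spk_mix` together with `spk_mix_timestep` expresses.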

Local retaking mechanism

Pitch and all other variance parameters support local retaking, i.e. re-generating the curve on a continuous sub-region based on given curve segments. This mechanism also ensures that the retaken curve connects smoothly to the given curve.

To retake pitch, the complete phoneme information, the position of the region to be retaken and the pitch curve on the non-retaking regions should be given.

To retake other variance parameters, the complete phoneme information, the complete pitch curve, the names of the variance parameters to retake (retaking multiple parameters in one go is supported), the positions of the regions to be retaken (retaking different parameters on different regions is supported) and the parameter curves on the non-retaking regions should be given.
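One way to picture the smooth-connection requirement is a short crossfade between the given curve and the retaken segment at each region boundary. The sketch below is only an illustration of that idea, not the model's actual mechanism (which blends during diffusion sampling rather than as post-processing):

```python
import numpy as np

def splice_retaken(original: np.ndarray, retaken: np.ndarray,
                   start: int, end: int, fade: int = 4) -> np.ndarray:
    """Replace original[start:end] with a retaken segment, crossfading
    `fade` frames at each boundary so the result connects smoothly."""
    out = original.copy()
    out[start:end] = retaken
    ramp = np.linspace(0.0, 1.0, fade)
    # fade in: the region's start stays anchored to the surrounding curve
    out[start:start + fade] = (1 - ramp) * original[start:start + fade] \
        + ramp * retaken[:fade]
    # fade out: the region's end returns to the surrounding curve
    out[end - fade:end] = ramp[::-1] * retaken[-fade:] \
        + (1 - ramp[::-1]) * original[end - fade:end]
    return out
```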

Parameter cascading mechanism

Overall cascading logic

The general cascading order of the variance model is: music scores => phoneme durations => pitch => other variance parameters.
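The cascade can be read as a chain of predictors, each consuming the previous stage's output. The stub below only mirrors the data flow described above; the predictor functions are stand-ins, not the project's module names:

```python
def infer(music_score, predict_durations, predict_pitch, predict_variances):
    """Cascade sketch: score -> durations -> pitch -> other variances.
    The three predictor arguments are stand-ins for the model's modules."""
    durations = predict_durations(music_score)
    pitch = predict_pitch(music_score, durations)
    variances = predict_variances(music_score, durations, pitch)
    return durations, pitch, variances
```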

Customized variance cascading

By utilizing the local retaking mechanism, the cascading order of all variance parameters except pitch can be customized at inference time. The following shows some example use cases:

  • Jointly predict parameters A, B and C in one go so that they are not coupled with each other; changing one parameter will not cause the others to change.
  • Place parameter C after parameters A and B, i.e., jointly predict A and B in one go, then use A and B to predict C. Parameter C will change when A or B is modified, but A and B do not influence each other.
  • Freeze parameter A and couple parameters B and C. Use A to predict B and C; modifying either B or C will cause the other to change.

Universal dictionary and phoneme system support

The brand-new variance models and phoneme labeling system support any dictionary with any phoneme system. See the variance labels migration guidelines (variance-temp-solution) and the custom dictionary guidelines for more details.

Automatic mixed precision, multi-GPU and gradient accumulation

This project has been adapted to the latest version of Lightning and supports automatic mixed precision (FP16/BF16 AMP), multi-GPU training (DDP) and gradient accumulation for accelerating training and saving GPU memory. See performance tuning guidelines for more details.

Other new contents and changes

  • Documentation of this project has been refactored. The README lists all important documents and links.
  • Code structure and dependencies are significantly refactored. Some dependencies are updated.
  • Scripts for preprocessing, training, inference and deployment have been refactored and moved under scripts/.
  • A new script for deleting specific speaker embeddings from model checkpoints has been added.
  • PYTHONPATH and CUDA_VISIBLE_DEVICES no longer need to be exported when running preprocessing and training.
  • The speaker IDs assigned to each dataset can now be customized via the spk_ids configuration key. Giving the same ID to multiple datasets is also supported.
  • Multiprocessing binarization is now supported. The number of workers can be customized.
  • Dataset binary format has been changed to HDF5. Redundant contents are removed.
  • The learning rate and the optimizer can now be customized more freely via lr_scheduler_args and optimizer_args.
  • DDIM, DPM-Solver++ (replacement of DPM-Solver) and UniPC algorithms are supported for diffusion sampling acceleration.
  • The diffusion accelerator integrated in ONNX models has been changed to DDIM.
  • When exporting multi-speaker models, all speakers will be exported by default if the --export_spk option is unset.
  • Version of the operator set of exported ONNX models has been upgraded to 15.

Some changes may not be listed above. See the repository README for more details.

Bug fixes

  • Fixed a bug causing the epoch count in the terminal logging to be only 1/1000 of the actual epoch.
  • Fixed potential file handle racing when reading/seeking dataset.
  • Fixed a bug causing inconsistency between joint augmentation formula and implementation.
  • Fixed hyper-parameters failing to render colors in some terminals and some Python versions.
  • Fixed the messed-up code backup directory structure.

License

The license of this project has been changed from the MIT License to the Apache License 2.0.