v2.0.0: Complete refactor, brand-new variance models, universal dictionary compatibility, AMP/DDP support and many more improvements
Backwards Incompatible Changes
Dataset making pipelines
The dataset making pipelines (based on MFA) have been moved to their own repository, MakeDiffSinger. The original Jupyter Notebook has been removed and replaced with command-line scripts.
Old functionality removal
The following functionalities have been removed and are no longer supported:
- MIDI-A/B training and inference
- PitchExtracter (xiaoma_pe) training and inference
- Old 24 kHz vocoder (HiFi-GAN & PWG) training and inference
Environment & dependencies
Dependencies have been refactored and require re-installing. The ONNX exporting dependency has been updated to PyTorch 1.13 from PyTorch 1.8.
Model loading
Old acoustic model checkpoints should be migrated via the following script before loading:
python scripts/migrate.py ckpt <INPUT_CKPT> <OUTPUT_CKPT>
Before resuming training from old checkpoints, the following line should be added to the configuration file:
num_pad_tokens: 3
Datasets
Old datasets should be re-binarized before training.
Old data labels (transcriptions.txt) should be migrated to new transcriptions.csv via the following script before loading:
python scripts/migrate.py txt <INPUT_TXT>
Configuration files
The following configuration keys have been renamed:
- g2p_dictionary => dictionary
- max_tokens => max_batch_frames
- max_sentences => max_batch_size
- max_eval_tokens => max_val_batch_frames
- max_eval_sentences => max_val_batch_size
- lr => optimizer_args.lr
- optimizer_adam_beta1 => optimizer_args.beta1
- optimizer_adam_beta2 => optimizer_args.beta2
- weight_decay => optimizer_args.weight_decay
- warmup_updates => lr_scheduler_args.warmup_steps
- decay_steps => lr_scheduler_args.step_size
- gamma => lr_scheduler_args.gamma
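The renames above can be applied mechanically to an old configuration that has already been loaded into a dictionary. The following is a minimal sketch, not a script shipped with this project: `RENAMED_KEYS` and `migrate_config` are hypothetical names, and the sketch assumes that each value carries over unchanged and that a dotted name such as `optimizer_args.lr` denotes a nested key.

```python
import copy
from typing import Any, Dict

# Old flat key -> new (possibly nested, dot-separated) key, copied from the rename list.
RENAMED_KEYS = {
    "g2p_dictionary": "dictionary",
    "max_tokens": "max_batch_frames",
    "max_sentences": "max_batch_size",
    "max_eval_tokens": "max_val_batch_frames",
    "max_eval_sentences": "max_val_batch_size",
    "lr": "optimizer_args.lr",
    "optimizer_adam_beta1": "optimizer_args.beta1",
    "optimizer_adam_beta2": "optimizer_args.beta2",
    "weight_decay": "optimizer_args.weight_decay",
    "warmup_updates": "lr_scheduler_args.warmup_steps",
    "decay_steps": "lr_scheduler_args.step_size",
    "gamma": "lr_scheduler_args.gamma",
}


def migrate_config(old: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which every renamed key lives at its new location."""
    # Keys that were not renamed are copied as they are (deep copy, so `old` is never mutated).
    new: Dict[str, Any] = {
        key: copy.deepcopy(value)
        for key, value in old.items()
        if key not in RENAMED_KEYS
    }
    for old_key, new_path in RENAMED_KEYS.items():
        if old_key not in old:
            continue
        *parents, leaf = new_path.split(".")
        node = new
        for parent in parents:
            # Create the nested section (e.g. optimizer_args) on first use.
            node = node.setdefault(parent, {})
        node[leaf] = old[old_key]
    return new


# Hypothetical old-style configuration, for illustration only.
old_config = {"lr": 0.0004, "max_tokens": 80000, "gamma": 0.5, "hidden_size": 256}
new_config = migrate_config(old_config)
```

Keys that do not appear in the rename list, such as `hidden_size` here, are left untouched, so the helper can be run on a full configuration without listing every key.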
DS files
DS files in v1.x format are no longer supported. To use them for inference, please export them again with the latest version of OpenUTAU for DiffSinger.
The new variance models, parameters and mechanisms
Variance models
Training, inference and deployment of the new variance models are supported.
Functionalities included:
- Automatically predicts phoneme durations (Duration Predictor)
- Automatically predicts the pitch curve (Pitch Diffusion)
- Automatically predicts other variance parameters jointly (Multi-Variance Diffusion)
Before training variance models, the current data transcriptions should be migrated. The required operations vary with the functionalities chosen and the dictionaries used, and sometimes include manual labeling. See variance-temp-solution for details.
Phoneme durations
Acoustic models require a duration for every phoneme as input, so they rely on a phoneme duration predictor. The phoneme duration prediction module in variance models predicts the duration of every phoneme from the phoneme sequence, word division, word durations and an approximate MIDI sequence.
Pitch curve
Acoustic models require explicit pitch input. The pitch prediction module in variance models predicts the pitch curve from the phoneme information and a smoothed MIDI curve. The specially designed labeling system can correct bad data with many out-of-tune errors and produce accurate models.
Variance parameters
Beyond phoneme durations and pitch, variance parameters bring higher expressiveness and controllability. The variance model predicts them from the phoneme information and the pitch curve, then passes them to the acoustic model as control inputs.
NOTE: Variance parameters are represented by absolute values instead of relative values (offsets), so no default value curves are defined. For this reason, new acoustic models that accept these parameters as input should be trained alongside their corresponding variance models.
Energy
Energy is defined as the RMS curve of the singing, in dB, which can control the strength of voice to a certain extent.
In DS files, energy and energy_timestep are used to control energy.
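To make the definition of energy concrete, the sketch below computes a frame-wise RMS curve in dB from a list of waveform samples. It only illustrates the definition and is not the project's extraction code: the function name `rms_energy_db`, the frame length, the hop size, the lack of windowing and padding, and the -96 dB floor are all assumptions chosen for the example.

```python
import math
from typing import List, Sequence


def rms_energy_db(
    samples: Sequence[float],
    frame_length: int,
    hop_size: int,
    floor_db: float = -96.0,
) -> List[float]:
    """Frame-wise RMS of a waveform, expressed in dB relative to amplitude 1.0."""
    curve: List[float] = []
    # Slide a window over the signal; only complete frames are used.
    for start in range(0, len(samples) - frame_length + 1, hop_size):
        frame = samples[start:start + frame_length]
        rms = math.sqrt(sum(x * x for x in frame) / frame_length)
        # log10(0) is undefined, so silent frames are clamped to the floor.
        db = 20.0 * math.log10(rms) if rms > 0.0 else floor_db
        curve.append(max(db, floor_db))
    return curve


# A constant signal at half of full scale sits at roughly -6.02 dB in every frame.
curve = rms_energy_db([0.5] * 8, frame_length=4, hop_size=2)
```

Because the curve is an absolute level rather than an offset, scaling the waveform by a constant factor shifts the whole curve by a constant number of dB.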
Breathiness
Breathiness is defined as the RMS curve of the aperiodic part of the singing, in dB, which can control the power of the air and unvoiced consonants in the voice.
In DS files, breathiness and breathiness_timestep are used to control breathiness.
Style fusion mechanism
All parameters that variance models support can be dynamically style-mixed. Phoneme durations are mixed at the phoneme level, while the other parameters are mixed at the frame level. The style fusion of each parameter is independent of the others, and style fusion is independent of timbre fusion.
Style fusion controls are similar to the timbre fusion of acoustic models:
- ph_spk_mix is used to control the fusion of phoneme durations.
- spk_mix and spk_mix_timestep are used to control the fusion of other parameters.
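As a conceptual illustration of mixing at the frame level, the sketch below blends two speaker embeddings with time-varying weights. It is a generic weighted average written for this note, not the project's implementation: the function names, the speaker names `a` and `b`, the two-dimensional embeddings, and the assumption that weights are normalized to sum to 1 at every step are all hypothetical. Mixing at the phoneme level would look the same with one weight per phoneme instead of one per frame.

```python
from typing import Dict, List


def normalize_mix(weights: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Rescale the weights so that they sum to 1 at every step."""
    names = list(weights)
    num_steps = len(weights[names[0]])
    normalized: Dict[str, List[float]] = {name: [] for name in names}
    for t in range(num_steps):
        total = sum(weights[name][t] for name in names)
        if total <= 0:
            raise ValueError(f"weights at step {t} must sum to a positive value")
        for name in names:
            normalized[name].append(weights[name][t] / total)
    return normalized


def mix_embeddings(
    embeddings: Dict[str, List[float]],
    weights: Dict[str, List[float]],
) -> List[List[float]]:
    """Return one blended embedding per step (weighted average over speakers)."""
    mix = normalize_mix(weights)
    num_steps = len(next(iter(mix.values())))
    dim = len(next(iter(embeddings.values())))
    return [
        [sum(mix[name][t] * embeddings[name][d] for name in mix) for d in range(dim)]
        for t in range(num_steps)
    ]


# Hypothetical speakers: the mix moves from pure "a" to pure "b" over three frames.
embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
weights = {"a": [1.0, 1.0, 0.0], "b": [0.0, 1.0, 2.0]}
mixed = mix_embeddings(embeddings, weights)
```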
Local retaking mechanism
Pitch and all other variance parameters support local retaking, i.e. regenerating the curve on a continuous sub-region based on the given curve segments. This mechanism also ensures that the retaken curve connects smoothly to the given curve.
To retake pitch, the complete phoneme information, the position of the region to be retaken and the pitch curve on the non-retaking regions should be given.
To retake other variance parameters, the complete phoneme information, the complete pitch curve, the names of the variance parameters to retake (retaking multiple parameters in one go is supported), the positions of the regions to be retaken (retaking different parameters on different regions is supported) and the parameter curves on the non-retaking regions should be given.
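The bookkeeping around a retake can be sketched as follows: a time region is converted into a per-frame mask, and the final curve keeps the given values outside the region. This covers only the input and output handling. The smooth connection described above comes from the model itself, which is not reproduced here. The function names, the half-open `[start, end)` convention and the fixed timestep are assumptions made for the illustration.

```python
from typing import List, Tuple


def retake_mask(num_frames: int, timestep: float, region: Tuple[float, float]) -> List[bool]:
    """Mark the frames whose start time (in seconds) lies inside [start, end)."""
    start, end = region
    return [start <= i * timestep < end for i in range(num_frames)]


def merge_retaken(given: List[float], regenerated: List[float], mask: List[bool]) -> List[float]:
    """Keep the given curve outside the region and use the regenerated values inside."""
    if not (len(given) == len(regenerated) == len(mask)):
        raise ValueError("all curves must have the same number of frames")
    return [new if retake else old for old, new, retake in zip(given, regenerated, mask)]


# Six frames spaced 0.5 s apart; frames starting at 1.0 s and 1.5 s are retaken.
mask = retake_mask(num_frames=6, timestep=0.5, region=(1.0, 2.0))
given = [60.0, 60.0, 0.0, 0.0, 62.0, 62.0]        # values inside the region are placeholders
regenerated = [0.0, 0.0, 60.5, 61.5, 0.0, 0.0]    # stand-in for the model output
curve = merge_retaken(given, regenerated, mask)
```

Retaking several parameters on different regions would use one mask per parameter, built the same way.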
Parameter cascading mechanism
Overall cascading logic
The cascading order of the variance model in general is: music scores => phoneme durations => pitch => other variance parameters.
Customized variance cascading
By utilizing the local retaking mechanism, the cascading order of all variance parameters except pitch can be customized at inference time. The following shows some example use cases:
- Jointly predict parameters A, B and C in one go without coupling them, so changing one parameter does not cause the others to change.
- Place parameter C after parameters A and B, i.e. jointly predict A and B in one go, then use A and B to predict C. Parameter C changes when A or B is modified, but A and B do not influence each other.
- Freeze parameter A and couple parameters B and C. Use A to predict B and C; modifying either B or C causes the other to change.
Universal dictionary and phoneme system support
The brand-new variance model and phoneme labeling system can support any dictionary of any phoneme system. See the variance labels migration guidelines (variance-temp-solution) and custom dictionary guidelines for more details.
Automatic mixed precision, multi-GPU and gradient accumulation
This project has been adapted to the latest version of Lightning and supports automatic mixed precision (FP16/BF16 AMP), multi-GPU training (DDP) and gradient accumulation for accelerating training and saving GPU memory. See performance tuning guidelines for more details.
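When combining multi-GPU training with gradient accumulation, it helps to keep track of how many samples contribute to one optimizer step. The sketch below does that arithmetic. The function name is hypothetical, and the result is an upper bound, since a batch may also be limited by max_batch_frames. The dictionary shows arguments that Lightning's `Trainer` accepts in its 2.x releases; how this project maps its own configuration keys onto them is not shown here, so refer to the performance tuning guidelines for the actual keys.

```python
def effective_batch_size(max_batch_size: int, num_gpus: int, accumulate_grad_batches: int) -> int:
    """Upper bound on the number of samples contributing to one optimizer step."""
    return max_batch_size * num_gpus * accumulate_grad_batches


# Illustrative Lightning 2.x Trainer arguments (not this project's configuration keys).
trainer_kwargs = {
    "accelerator": "gpu",
    "devices": 2,                  # two GPUs
    "strategy": "ddp",             # distributed data parallel
    "precision": "16-mixed",       # FP16 AMP; "bf16-mixed" selects BF16 AMP
    "accumulate_grad_batches": 4,  # one optimizer step every 4 batches
}

samples_per_step = effective_batch_size(
    max_batch_size=8,
    num_gpus=trainer_kwargs["devices"],
    accumulate_grad_batches=trainer_kwargs["accumulate_grad_batches"],
)
```

Doubling the accumulation steps while halving the per-GPU batch size keeps this bound unchanged, which is the usual way to trade training speed for GPU memory.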
Other new contents and changes
- Documentation of this project has been refactored. The README lists all important documents and links.
- Code structure and dependencies are significantly refactored. Some dependencies are updated.
- Scripts for preprocessing, training, inference and deployment have been refactored and moved under scripts/.
- A new script for deleting specific speaker embeddings from model checkpoints has been added.
- PYTHONPATH and CUDA_VISIBLE_DEVICES no longer need to be exported when running preprocessing and training.
- The speaker IDs distributed to each dataset can now be customized via the spk_ids configuration key. Giving the same ID to multiple datasets is also supported.
- Multiprocessing binarization is now supported. The number of workers can be customized.
- Dataset binary format has been changed to HDF5. Redundant contents are removed.
- The learning rate and the optimizer can now be customized more freely via lr_scheduler_args and optimizer_args.
- DDIM, DPM-Solver++ (replacement of DPM-Solver) and UniPC algorithms are supported for diffusion sampling acceleration.
- The diffusion accelerator integrated in ONNX models has been changed to DDIM.
- When exporting multi-speaker models, all speakers will be exported by default if the --export_spk option is unset.
- Version of the operator set of exported ONNX models has been upgraded to 15.
Some changes may not be listed above. See the repository README for more details.
Bug fixes
- Fixed a bug causing the epoch count in the terminal logging to be only 1/1000 of the actual epoch.
- Fixed a potential file handle race condition when reading/seeking datasets.
- Fixed a bug causing inconsistency between joint augmentation formula and implementation.
- Fixed hyperparameters failing to render in color in some terminals and some Python versions.
- Fixed a messed-up code backup directory structure.
License
The license of this project has been changed from the MIT License to the Apache License 2.0.