DiffSinger uses a cascading configuration system based on YAML files. All configuration files ultimately inherit from and override configs/base.yaml, and each file directly overrides another file by setting the base_config attribute. The overriding rules are:
- Configuration keys with the same path and the same name will be replaced. Other paths and names will be merged.
- All configurations in the inheritance chain will be squashed (via the rule above) as the final configuration.
- The trainer will save the final configuration in the experiment directory, which is detached from the chain and made independent from other configuration files.
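The override rules above can be sketched as a recursive dictionary merge. This is a minimal illustration that assumes the YAML files are already parsed into nested Python dicts; the function name `merge_configs` and the `lr` values are hypothetical, and the actual DiffSinger loader may be implemented differently.

```python
from copy import deepcopy


def merge_configs(base: dict, override: dict) -> dict:
    """Return a new dict in which `override` is squashed on top of `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Same path and both sides are mappings: merge their children.
            merged[key] = merge_configs(merged[key], value)
        else:
            # Same path and same name: the overriding value replaces the old one.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical two-file chain: a base file and one file that overrides it.
base = {
    "hidden_size": 256,
    "optimizer_args": {"optimizer_cls": "torch.optim.AdamW", "lr": 0.0004},
}
child = {"optimizer_args": {"lr": 0.0006}}

final = merge_configs(base, child)
print(final)
```

Only `optimizer_args.lr` is replaced here; `optimizer_args.optimizer_cls` and `hidden_size` are carried over from the base because their paths are not overridden.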
The following are the meanings and usages of all editable keys in a configuration file.
Each configuration key (including nested keys) is described with a brief explanation and several attributes listed as follows:
Attribute | Explanation |
---|---|
visibility | Represents what kind(s) of models and tasks this configuration belongs to. |
scope | The scope of effects of the configuration, indicating what it can influence within the whole pipeline. Possible values are: nn - This configuration is related to how the neural networks are formed and initialized. Modifying it will result in failure when loading or resuming from checkpoints. preprocessing - This configuration controls how raw data pieces or inference inputs are converted to inputs of neural networks. Binarizers should be re-run if this configuration is modified. training - This configuration describes the training procedures. Most training configurations can affect training performance, memory consumption, device utilization and loss calculation. Modifying training-only configurations will not cause severe inconsistency or errors in most situations. inference - This configuration describes the calculation logic through the model graph. Changing it can lead to inconsistent or wrong outputs of inference or validation. others - Other configurations not discussed above. Will have different effects according to the descriptions. |
customizability | The level of customizability of the configuration. Possible values are: required - This configuration must be set or modified according to the actual situation or condition, otherwise errors can be raised. recommended - It is recommended to adjust this configuration according to the dataset, requirements, environment and hardware. Most functionality-related and feature-related configurations are at this level, and all configurations in this level are widely tested with different values. However, leaving it unchanged will not cause problems. normal - There is no need to modify it as the default value is carefully tuned and widely validated. However, one can still use another value if there are some special requirements or situations. not recommended - No other values except the default one of this configuration are tested. Modifying it will not cause errors, but may cause unpredictable or significant impacts to the pipelines. reserved - This configuration must not be modified. It appears in the configuration file only for future scalability, and currently changing it will result in errors. |
type | Value type of the configuration. Follows the syntax of Python type hints. |
constraints | Value constraints of the configuration. |
default | Default value of the configuration. Uses YAML value syntax. |
Number of training steps whose gradients are accumulated before each optimizer.step() call. 1 means no gradient accumulation.
visibility | all |
scope | training |
customizability | recommended |
type | int |
default | 1 |
Number of mel channels for the mel-spectrogram.
visibility | acoustic |
scope | nn, preprocessing, inference |
customizability | reserved |
type | int |
default | 128 |
Sampling rate of waveforms.
visibility | acoustic, variance |
scope | preprocessing |
customizability | reserved |
type | int |
default | 44100 |
Arguments for data augmentation.
type | dict |
Arguments for fixed pitch shifting augmentation.
type | dict |
Whether to apply fixed pitch shifting augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | recommended |
type | bool |
default | false |
constraints | Must be false if augmentation_args.random_pitch_shifting.enabled is set to true. |
Scale ratio of each target in fixed pitch shifting augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | recommended |
type | float |
default | 0.5 |
Targets (in semitones) of fixed pitch shifting augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | not recommended |
type | tuple |
default | [-5.0, 5.0] |
Arguments for random pitch shifting augmentation.
type | dict |
Whether to apply random pitch shifting augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | recommended |
type | bool |
default | true |
constraints | Must be false if augmentation_args.fixed_pitch_shifting.enabled is set to true. |
Range of the random pitch shifting (in semitones).
visibility | acoustic |
scope | preprocessing |
customizability | not recommended |
type | tuple |
default | [-5.0, 5.0] |
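Pitch shifting targets and ranges are expressed in semitones. As a small worked example, in twelve-tone equal temperament a shift of n semitones multiplies the frequency by 2^(n/12). The helper name `semitones_to_ratio` is hypothetical and only illustrates the unit; it is not part of DiffSinger.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio of a pitch shift given in semitones (equal temperament)."""
    return 2.0 ** (semitones / 12.0)


# The default range [-5.0, 5.0] expressed as frequency ratios.
for shift in (-5.0, 0.0, 5.0):
    print(f"{shift:+.1f} semitones -> x{semitones_to_ratio(shift):.4f}")
```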
Scale ratio of the random pitch shifting augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | recommended |
type | float |
default | 0.75 |
Arguments for random time stretching augmentation.
type | dict |
Whether to apply random time stretching augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | recommended |
type | bool |
default | true |
Range of random time stretching factors.
visibility | acoustic |
scope | preprocessing |
customizability | not recommended |
type | tuple |
default | [0.5, 2] |
Scale ratio of random time stretching augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | recommended |
type | float |
default | 0.75 |
Keyword arguments for the backbone of main decoder module.
visibility | acoustic, variance |
scope | nn |
type | dict |
Some available arguments are listed below.
argument name | for backbone type | description |
---|---|---|
num_layers | wavenet/lynxnet | Number of layer blocks, or depth of the network |
num_channels | wavenet/lynxnet | Number of channels, or width of the network |
dilation_cycle_length | wavenet | Length k of the cycle |
Backbone type of the main decoder/predictor module.
visibility | acoustic, variance |
scope | nn |
customizability | normal |
type | str |
default | lynxnet |
constraints | Choose from 'wavenet', 'lynxnet'. |
Path(s) of other config files that the current config is based on and will override.
scope | others |
type | Union[str, list] |
Arguments for binarizers.
type | dict |
Number of worker subprocesses when running binarizers. More workers can speed up the preprocessing but will consume more memory. 0 means the main process does everything.
visibility | all |
scope | preprocessing |
customizability | recommended |
type | int |
default | 1 |
Whether to prefer loading attributes and parameters from DS files.
visibility | variance |
scope | preprocessing |
customizability | recommended |
type | bool |
default | False |
Whether binarized dataset will be shuffled or not.
visibility | all |
scope | preprocessing |
customizability | normal |
type | bool |
default | true |
Binarizer class name.
visibility | all |
scope | preprocessing |
customizability | reserved |
type | str |
Path to the binarized dataset.
visibility | all |
scope | preprocessing, training |
customizability | required |
type | str |
Maximum breathiness value in dB used for normalization to [-1, 1].
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | -20.0 |
Minimum breathiness value in dB used for normalization to [-1, 1].
visibility | acoustic, variance |
scope | inference |
customizability | recommended |
type | float |
default | -96.0 |
Length of sinusoidal smoothing convolution kernel (in seconds) on extracted breathiness curve.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | float |
default | 0.12 |
The value at which to clip gradients. Equivalent to gradient_clip_val in lightning.pytorch.Trainer.
visibility | all |
scope | training |
customizability | not recommended |
type | float |
default | 1 |
Number of batches loaded in advance by each torch.utils.data.DataLoader worker.
visibility | all |
scope | training |
customizability | normal |
type | int |
default | 2 |
The key that indexes the binarized metadata to be used as the sizes when batching by size.
visibility | all |
scope | training |
customizability | not recommended |
type | str |
default | lengths |
Path to the word-phoneme mapping dictionary file. Training data must fully cover phonemes in the dictionary.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | str |
DDPM sampling acceleration method. The following methods are currently available:
- DDIM: the DDIM method from Denoising Diffusion Implicit Models
- PNDM: the PLMS method from Pseudo Numerical Methods for Diffusion Models on Manifolds
- DPM-Solver++ adapted from DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps
- UniPC adapted from UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models
visibility | acoustic, variance |
scope | inference |
customizability | normal |
type | str |
default | dpm-solver |
constraints | Choose from 'ddim', 'pndm', 'dpm-solver', 'unipc'. |
DDPM sampling speed-up ratio. 1 means no speeding up.
visibility | acoustic, variance |
scope | inference |
customizability | normal |
type | int |
default | 10 |
constraints | Must be a factor of K_step. |
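The constraint above can be checked before inference. The sketch below assumes that a speed-up ratio r means roughly every r-th step of the K_step schedule is visited, so the number of sampling steps is K_step divided by r; the function name `sampling_steps` is hypothetical.

```python
def sampling_steps(k_step: int, speedup: int) -> int:
    """Number of sampling steps under the assumed 'every r-th step' rule."""
    if speedup < 1:
        raise ValueError("speed-up ratio must be a positive integer")
    if k_step % speedup != 0:
        # The documented constraint: the ratio must be a factor of K_step.
        raise ValueError(f"{speedup} is not a factor of K_step={k_step}")
    return k_step // speedup


print(sampling_steps(400, 10))
```

A ratio of 1 leaves the schedule untouched, which matches "1 means no speeding up".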
The type of ODE-based generative model algorithm. The following models are currently available:
- Denoising Diffusion Probabilistic Models (DDPM) from Denoising Diffusion Probabilistic Models
- Rectified Flow from Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
visibility | acoustic, variance |
scope | nn |
customizability | normal |
type | str |
default | reflow |
constraints | Choose from 'ddpm', 'reflow'. |
Dropout rate in some FastSpeech2 modules.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | float |
default | 0.1 |
Number of workers of torch.utils.data.DataLoader.
visibility | all |
scope | training |
customizability | normal |
type | int |
default | 4 |
Arguments for phoneme duration prediction.
type | dict |
Architecture of duration predictor.
visibility | variance |
scope | nn |
customizability | reserved |
type | str |
default | fs2 |
constraints | Choose from 'fs2'. |
Dropout rate in duration predictor of FastSpeech2.
visibility | variance |
scope | nn |
customizability | not recommended |
type | float |
default | 0.1 |
dur_prediction_args.hidden_size
Dimensions of hidden layers in duration predictor of FastSpeech2.
visibility | variance |
scope | nn |
customizability | normal |
type | int |
default | 512 |
Kernel size of convolution layers of duration predictor of FastSpeech2.
visibility | variance |
scope | nn |
customizability | normal |
type | int |
default | 3 |
Coefficient of single phone duration loss when calculating joint duration loss.
visibility | variance |
scope | training |
customizability | normal |
type | float |
default | 0.3 |
Coefficient of sentence duration loss when calculating joint duration loss.
visibility | variance |
scope | training |
customizability | normal |
type | float |
default | 3.0 |
Coefficient of word duration loss when calculating joint duration loss.
visibility | variance |
scope | training |
customizability | normal |
type | float |
default | 1.0 |
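As a rough illustration of how the three coefficients interact, the sketch below assumes the joint duration loss is a plain weighted sum of the phone, word and sentence terms. The function and argument names are hypothetical, and the defaults are the values listed in this document.

```python
def joint_duration_loss(
    phone_loss: float,
    word_loss: float,
    sentence_loss: float,
    lambda_phone: float = 0.3,
    lambda_word: float = 1.0,
    lambda_sentence: float = 3.0,
) -> float:
    """Weighted sum of the three duration loss terms (assumed form)."""
    return (
        lambda_phone * phone_loss
        + lambda_word * word_loss
        + lambda_sentence * sentence_loss
    )


# With equal raw losses, the sentence-level term dominates the total.
print(joint_duration_loss(1.0, 1.0, 1.0))
```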
Offset for log domain duration loss calculation, where the following transformation is applied:
$$
D' = \ln{(D+d)}
$$
with the offset value $d$.
visibility | variance |
scope | training |
customizability | not recommended |
type | float |
default | 1.0 |
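The log-domain transformation can be tried out directly. This is a minimal sketch of the formula only; the function name `log_duration` is hypothetical, and treating durations as plain floats is an assumption made for illustration.

```python
import math


def log_duration(duration: float, offset: float = 1.0) -> float:
    """Apply D' = ln(D + d) with offset d."""
    return math.log(duration + offset)


# With the default offset of 1.0, a zero duration maps to exactly 0.
for d in (0.0, 1.0, 10.0):
    print(d, log_duration(d))
```

An offset of 1.0 keeps the transformed value finite and non-negative for every non-negative duration.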
Underlying loss type of duration loss.
visibility | variance |
scope | training |
customizability | normal |
type | str |
default | mse |
constraints | Choose from 'mse', 'huber'. |
Number of duration predictor layers.
visibility | variance |
scope | nn |
customizability | normal |
type | int |
default | 5 |
Convolution kernel size of TransformerFFNLayer in FastSpeech2 encoder.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | int |
default | 9 |
Number of FastSpeech2 encoder layers.
visibility | acoustic, variance |
scope | nn |
customizability | normal |
type | int |
default | 4 |
Maximum energy value in dB used for normalization to [-1, 1].
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | -12.0 |
Minimum energy value in dB used for normalization to [-1, 1].
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | -96.0 |
Length of sinusoidal smoothing convolution kernel (in seconds) on extracted energy curve.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | float |
default | 0.12 |
Maximum fundamental frequency (F0) in Hz for pitch extraction.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | int |
default | 1100 |
Minimum fundamental frequency (F0) in Hz for pitch extraction.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | int |
default | 65 |
Activation function of TransformerFFNLayer in FastSpeech2 encoder:
- torch.nn.ReLU if 'relu'
- torch.nn.GELU if 'gelu'
- torch.nn.SiLU if 'swish'
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | str |
default | gelu |
constraints | Choose from 'relu', 'gelu', 'swish'. |
Fast Fourier Transform (FFT) size for mel extraction.
visibility | acoustic, variance |
scope | preprocessing |
customizability | reserved |
type | int |
default | 2048 |
Whether to finetune from a pretrained model.
visibility | all |
scope | training |
customizability | normal |
type | bool |
default | False |
Path to the pretrained model for finetuning.
visibility | all |
scope | training |
customizability | normal |
type | str |
default | null |
Prefixes of parameter key names in the state dict of the pretrained model that need to be dropped before finetuning.
visibility | all |
scope | training |
customizability | normal |
type | list |
Whether to raise an error if the tensor shapes of any parameter of the pretrained model and the target model mismatch. If set to False, parameters with mismatching shapes will be skipped.
visibility | all |
scope | training |
customizability | normal |
type | bool |
default | True |
Maximum frequency of mel extraction.
visibility | acoustic |
scope | preprocessing |
customizability | reserved |
type | int |
default | 16000 |
Minimum frequency of mel extraction.
visibility | acoustic |
scope | preprocessing |
customizability | reserved |
type | int |
default | 40 |
Whether to enable parameter freezing during training.
visibility | all |
scope | training |
customizability | normal |
type | bool |
default | False |
Parameter name prefixes to freeze during training.
visibility | all |
scope | training |
customizability | normal |
type | list |
default | [] |
The scale factor to be multiplied on the glide embedding values for melody encoder.
visibility | variance |
scope | nn |
customizability | not recommended |
type | float |
default | 11.313708498984760 |
Type names of glide notes.
visibility | variance |
scope | preprocessing |
customizability | normal |
type | list |
default | [up, down] |
hidden_size
Dimension of hidden layers of FastSpeech2, token and parameter embeddings, and diffusion condition.
visibility | acoustic, variance |
scope | nn |
customizability | normal |
type | int |
default | 256 |
Harmonic-noise separation algorithm type.
visibility | all |
scope | preprocessing |
customizability | normal |
type | str |
default | world |
constraints | Choose from 'world', 'vr'. |
Checkpoint or model path of NN-based harmonic-noise separator.
visibility | all |
scope | preprocessing |
customizability | normal |
type | str |
Hop size or step length (in number of waveform samples) of mel and feature extraction.
visibility | acoustic, variance |
scope | preprocessing |
customizability | reserved |
type | int |
default | 512 |
Coefficient of aux mel loss when calculating total loss of acoustic model with shallow diffusion.
visibility | acoustic |
scope | training |
customizability | normal |
type | float |
default | 0.2 |
Coefficient of duration loss when calculating total loss of variance model.
visibility | variance |
scope | training |
customizability | normal |
type | float |
default | 1.0 |
Coefficient of pitch loss when calculating total loss of variance model.
visibility | variance |
scope | training |
customizability | normal |
type | float |
default | 1.0 |
Coefficient of variance loss (all variance parameters other than pitch, like energy, breathiness, etc.) when calculating total loss of variance model.
visibility | variance |
scope | training |
customizability | normal |
type | float |
default | 1.0 |
Maximum number of DDPM steps used by shallow diffusion.
visibility | acoustic |
scope | training |
customizability | recommended |
type | int |
default | 400 |
Number of DDPM steps used during shallow diffusion inference. Normally set to the same value as K_step.
visibility | acoustic |
scope | inference |
customizability | recommended |
type | int |
default | 400 |
constraints | Should be no larger than K_step. |
Controls how often to log within training steps. Equivalent to log_every_n_steps in lightning.pytorch.Trainer.
visibility | all |
scope | training |
customizability | normal |
type | int |
default | 100 |
Arguments of learning rate scheduler. Keys will be used as keyword arguments of the __init__() method of lr_scheduler_args.scheduler_cls.
type | dict |
Learning rate scheduler class name.
visibility | all |
scope | training |
customizability | not recommended |
type | str |
default | torch.optim.lr_scheduler.StepLR |
Whether to use log-normalized weight for the main loss. This is similar to the method in the Stable Diffusion 3 paper Scaling Rectified Flow Transformers for High-Resolution Image Synthesis.
visibility | acoustic, variance |
scope | training |
customizability | normal |
type | bool |
Loss type of the main decoder/predictor.
visibility | acoustic, variance |
scope | training |
customizability | not recommended |
type | str |
default | l2 |
constraints | Choose from 'l1', 'l2'. |
Maximum number of data frames in each training batch. Used to dynamically control the batch size.
visibility | acoustic, variance |
scope | training |
customizability | recommended |
type | int |
default | 80000 |
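To get a feel for the frame budget, a frame count can be converted to audio duration. The sketch assumes one frame per hop and uses the defaults listed in this document (hop size 512, sampling rate 44100); the function name `frames_to_seconds` is hypothetical.

```python
def frames_to_seconds(
    num_frames: int, hop_size: int = 512, sample_rate: int = 44100
) -> float:
    """Duration in seconds covered by `num_frames` frames, one frame per hop."""
    return num_frames * hop_size / sample_rate


# Audio duration covered by the default budget of 80000 frames per batch.
print(f"{frames_to_seconds(80000):.1f} s")
```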
The maximum training batch size.
visibility | all |
scope | training |
customizability | recommended |
type | int |
default | 48 |
Max beta of the DDPM noise schedule.
visibility | acoustic, variance |
scope | nn, inference |
customizability | normal |
type | float |
default | 0.02 |
Stop training after this number of steps. Equivalent to max_steps in lightning.pytorch.Trainer.
visibility | all |
scope | training |
customizability | recommended |
type | int |
default | 320000 |
Maximum number of data frames in each validation batch.
visibility | acoustic, variance |
scope | training |
customizability | normal |
type | int |
default | 60000 |
The maximum validation batch size.
visibility | all |
scope | training |
customizability | normal |
type | int |
default | 1 |
The logarithmic base of mel spectrogram calculation.
WARNING: Since v2.4.0 release, this value is no longer configurable for preprocessing new datasets.
visibility | acoustic |
scope | preprocessing |
customizability | reserved |
type | str |
default | e |
Maximum mel spectrogram heatmap value for TensorBoard plotting.
visibility | acoustic |
scope | training |
customizability | not recommended |
type | float |
default | 1.5 |
Minimum mel spectrogram heatmap value for TensorBoard plotting.
visibility | acoustic |
scope | training |
customizability | not recommended |
type | float |
default | -6.0 |
Arguments for melody encoder. Available sub-keys: hidden_size, enc_layers, enc_ffn_kernel_size, ffn_act, dropout, num_heads, use_pos_embed, rel_pos. If any of these parameters does not exist in this configuration key, it is inherited from the linguistic encoder.
type | dict |
Length of sinusoidal smoothing convolution kernel (in seconds) on the step function representing MIDI sequence for base pitch calculation.
visibility | variance |
scope | preprocessing |
customizability | normal |
type | float |
default | 0.06 |
Whether to enable P2P when using NCCL as the backend. Set it to false if the training process gets stuck at the beginning.
visibility | all |
scope | training |
customizability | normal |
type | bool |
default | true |
Number of newest checkpoints kept during training.
visibility | all |
scope | training |
customizability | normal |
type | int |
default | 5 |
The number of attention heads of torch.nn.MultiheadAttention in FastSpeech2 encoder.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | int |
default | 2 |
Number of sanity validation steps at the beginning.
visibility | all |
scope | training |
customizability | reserved |
type | int |
default | 1 |
Maximum number of speakers in multi-speaker models.
visibility | acoustic, variance |
scope | nn |
customizability | required |
type | int |
default | 1 |
Number of validation plots in each validation. Plots will be chosen from the start of the validation set.
visibility | acoustic, variance |
scope | training |
customizability | recommended |
type | int |
default | 10 |
Arguments of optimizer. Keys will be used as keyword arguments of the __init__() method of optimizer_args.optimizer_cls.
type | dict |
Optimizer class name.
visibility | all |
scope | training |
customizability | reserved |
type | str |
default | torch.optim.AdamW |
Pitch extraction algorithm type.
visibility | all |
scope | preprocessing |
customizability | normal |
type | str |
default | parselmouth |
constraints | Choose from 'parselmouth', 'rmvpe', 'harvest'. |
Checkpoint or model path of NN-based pitch extractor.
visibility | all |
scope | preprocessing |
customizability | normal |
type | str |
The interval (in number of training steps) of permanent checkpoints. Permanent checkpoints will not be removed even if they are not the newest ones.
visibility | all |
scope | training |
type | int |
default | 40000 |
Checkpoints will be marked as permanent every permanent_ckpt_interval training steps after this number of training steps.
visibility | all |
scope | training |
type | int |
default | 120000 |
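The interplay between the checkpoint interval and the starting step can be sketched as below. It assumes the interval is counted from the starting step, which is an assumption of this example rather than a documented detail; the function name `is_permanent` is hypothetical and the defaults are the values listed in this document.

```python
def is_permanent(step: int, start: int = 120000, interval: int = 40000) -> bool:
    """Whether a checkpoint saved at `step` would be kept permanently."""
    return step >= start and (step - start) % interval == 0


# Steps that would produce permanent checkpoints in a 320000-step run.
print([s for s in range(0, 320001, 20000) if is_permanent(s)])
```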
Arguments for pitch prediction.
type | dict |
Equivalent to backbone_args but only for the pitch predictor model. If not set, use the root backbone type.
visibility | variance |
Equivalent to backbone_type but only for the pitch predictor model.
visibility | variance |
default | wavenet |
Maximum clipping value (in semitones) of pitch delta between actual pitch and base pitch.
visibility | variance |
scope | inference |
type | float |
default | 12.0 |
Minimum clipping value (in semitones) of pitch delta between actual pitch and base pitch.
visibility | variance |
scope | inference |
type | float |
default | -12.0 |
Maximum pitch delta value in semitones used for normalization to [-1, 1].
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | 8.0 |
Minimum pitch delta value in semitones used for normalization to [-1, 1].
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | -8.0 |
Number of repeating bins in the pitch predictor.
visibility | variance |
scope | nn, inference |
customizability | recommended |
type | int |
default | 64 |
Type of Lightning trainer hardware accelerator.
visibility | all |
scope | training |
customizability | not recommended |
type | str |
default | auto |
constraints | See Accelerator — PyTorch Lightning 2.X.X documentation for available values. |
Determines on which device(s) the model should be trained. 'auto' will utilize all visible devices defined with the CUDA_VISIBLE_DEVICES environment variable, or utilize all available devices if that variable is not set. Otherwise, it behaves like CUDA_VISIBLE_DEVICES, which can filter out visible devices.
visibility | all |
scope | training |
customizability | not recommended |
type | str |
default | auto |
The computation precision of training.
visibility | all |
scope | training |
customizability | normal |
type | str |
default | 16-mixed |
constraints | Choose from '32-true', 'bf16-mixed', '16-mixed'. See more possible values at Trainer — PyTorch Lightning 2.X.X documentation. |
Number of nodes in the training cluster of Lightning trainer.
visibility | all |
scope | training |
customizability | reserved |
type | int |
default | 1 |
Arguments of Lightning Strategy. Values will be used as keyword arguments when constructing the Strategy object.
type | dict |
Strategy name for the Lightning trainer.
visibility | all |
scope | training |
customizability | reserved |
type | str |
default | auto |
Whether to enable breathiness prediction.
visibility | variance |
scope | nn, preprocessing, training, inference |
customizability | recommended |
type | bool |
default | false |
Whether to enable phoneme duration prediction.
visibility | variance |
scope | nn, preprocessing, training, inference |
customizability | recommended |
type | bool |
default | true |
Whether to enable energy prediction.
visibility | variance |
scope | nn, preprocessing, training, inference |
customizability | recommended |
type | bool |
default | false |
Whether to enable pitch prediction.
visibility | variance |
scope | nn, preprocessing, training, inference |
customizability | recommended |
type | bool |
default | true |
Whether to enable tension prediction.
visibility | variance |
scope | nn, preprocessing, training, inference |
customizability | recommended |
type | bool |
default | true |
Whether to enable voicing prediction.
visibility | variance |
scope | nn, preprocessing, training, inference |
customizability | recommended |
type | bool |
default | true |
Path(s) to the raw dataset including wave files, transcriptions, etc.
visibility | all |
scope | preprocessing |
customizability | required |
type | Union[str, List[str]] |
Whether to use relative positional encoding in FastSpeech2 module.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | bool |
default | true |
The batch sampler applies an algorithm called sorting by similar length when collecting batches. Data samples are first grouped by their approximate lengths before they get shuffled within each group. Assume this value is set to $L_{grid}$; the approximate length of a data sample with length $L_{real}$ is then calculated as:
$$
L_{approx} = \lfloor\frac{L_{real}}{L_{grid}}\rfloor\cdot L_{grid}
$$
Training performance on some datasets may be very sensitive to this value. Change it to 1 (completely sorted by length without shuffling) to get the best performance in theory.
visibility | acoustic, variance |
scope | training |
customizability | normal |
type | int |
default | 6 |
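The grouping step of sorting by similar length can be sketched as follows. This example assumes that the approximate length is the real length rounded down to a multiple of the grid value; that rule and all function names are assumptions of this sketch, not a description of the actual batch sampler.

```python
import random
from collections import defaultdict


def approximate_length(real_length: int, grid: int) -> int:
    """Round a length down to the nearest multiple of `grid` (assumed rule)."""
    return (real_length // grid) * grid


def sort_by_similar_length(lengths: list, grid: int, seed: int = 0) -> list:
    """Return sample indices ordered by approximate length, shuffled within groups."""
    groups = defaultdict(list)
    for index, length in enumerate(lengths):
        groups[approximate_length(length, grid)].append(index)
    rng = random.Random(seed)
    ordered = []
    for key in sorted(groups):
        members = groups[key]
        rng.shuffle(members)  # shuffling only happens inside a group
        ordered.extend(members)
    return ordered


lengths = [13, 7, 12, 30, 8, 31]
print(sort_by_similar_length(lengths, grid=6))
```

With a grid of 1 every distinct length forms its own group, so the result is ordered by length with shuffling only among samples of identical length.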
The algorithm to solve the ODE of Rectified Flow. The following methods are currently available:
- Euler: The Euler method.
- Runge-Kutta (order 2): The 2nd-order Runge-Kutta method.
- Runge-Kutta (order 4): The 4th-order Runge-Kutta method.
- Runge-Kutta (order 5): The 5th-order Runge-Kutta method.
visibility | acoustic, variance |
scope | inference |
customizability | normal |
type | str |
default | euler |
constraints | Choose from 'euler', 'rk2', 'rk4', 'rk5'. |
The total number of sampling steps used to solve the ODE of Rectified Flow. Note that this value may not be equal to the NFE (Number of Function Evaluations) because some methods may require more than one function evaluation per step.
visibility | acoustic, variance |
scope | inference |
customizability | normal |
type | int |
default | 20 |
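The difference between sampling steps and NFE can be made concrete with a small calculation. The per-step evaluation counts below are those of the textbook Euler, midpoint (RK2) and classical RK4 schemes and are assumed here; the solver variants actually used may differ, and the names `EVALS_PER_STEP` and `nfe` are hypothetical.

```python
# Function evaluations per step for textbook forms of each method (assumed).
EVALS_PER_STEP = {"euler": 1, "rk2": 2, "rk4": 4}


def nfe(sampling_steps: int, method: str = "euler") -> int:
    """Number of function evaluations for a given step count and method."""
    return sampling_steps * EVALS_PER_STEP[method]


for method in EVALS_PER_STEP:
    print(method, nfe(20, method))
```

Under these assumptions, halving the step count while moving from Euler to RK2 keeps the NFE, and therefore roughly the inference cost, unchanged.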
The DDPM schedule type.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | str |
default | linear |
constraints | Choose from 'linear', 'cosine'. |
Arguments for shallow diffusion.
type | dict |
Architecture type of the auxiliary decoder.
visibility | acoustic |
scope | nn |
customizability | reserved |
type | str |
default | convnext |
constraints | Choose from 'convnext'. |
Keyword arguments for dynamically constructing the auxiliary decoder.
visibility | acoustic |
scope | nn |
type | dict |
Scale factor of the gradients from the auxiliary decoder to the encoder.
visibility | acoustic |
scope | training |
customizability | normal |
type | float |
default | 0.1 |
Whether to forward and backward the auxiliary decoder during training. If set to false, the auxiliary decoder stays in memory and does not get any updates.
visibility | acoustic |
scope | training |
customizability | normal |
type | bool |
default | true |
Whether to forward and backward the diffusion (main) decoder during training. If set to false, the diffusion decoder stays in memory and does not get any updates.
visibility | acoustic |
scope | training |
customizability | normal |
type | bool |
default | true |
Whether to use the ground truth as x_start in the shallow diffusion validation process. If set to true, Gaussian noise is added to the ground truth before shallow diffusion is performed; otherwise the noise is added to the output of the auxiliary decoder. This option is useful when the auxiliary decoder has not been trained yet.
visibility | acoustic |
scope | training |
customizability | normal |
type | bool |
default | false |
Whether to apply the sorting by similar length algorithm described in sampler_frame_count_grid. Turning off this option may slow down training because sorting by length can better utilize the computing resources.
visibility | acoustic, variance |
scope | training |
customizability | not recommended |
type | bool |
default | true |
The names of speakers in a multi-speaker model. Speaker names are mapped to speaker indexes and stored into spk_map.json when preprocessing.
visibility | acoustic, variance |
scope | preprocessing |
customizability | required |
type | list |
The IDs of speakers in a multi-speaker model. If an empty list is given, speaker IDs will be generated automatically.
visibility | acoustic, variance |
scope | preprocessing |
customizability | required |
type | List[int] |
default | [] |
Minimum mel spectrogram value used for normalization to [-1, 1]. Different mel bins can have different minimum values.
visibility | acoustic |
scope | inference |
customizability | not recommended |
type | List[float] |
default | [-5.0] |
Maximum mel spectrogram value used for normalization to [-1, 1]. Different mel bins can have different maximum values.
visibility | acoustic |
scope | inference |
customizability | not recommended |
type | List[float] |
default | [0.0] |
The starting value of time $t$ in the Rectified Flow ODE during training.
visibility | acoustic |
scope | training |
customizability | recommended |
type | float |
default | 0.4 |
The starting value of time $t$ in the Rectified Flow ODE during inference.
visibility | acoustic |
scope | inference |
customizability | recommended |
type | float |
default | 0.4 |
constraints | Should be no less than T_start. |
Task trainer class name.
visibility | all |
scope | training |
customizability | reserved |
type | str |
Maximum tension logit value used for normalization to [-1, 1]. Logit is the inverse function of Sigmoid:
$$
f(x) = \ln\frac{x}{1-x}
$$
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | 10.0 |
Minimum tension logit value used for normalization to [-1, 1]. Logit is the inverse function of Sigmoid:
$$
f(x) = \ln\frac{x}{1-x}
$$
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | -10.0 |
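The logit function and the normalization bounds can be illustrated with a short sketch. It assumes a linear mapping from [minimum, maximum] to [-1, 1] with out-of-range values clipped; both the mapping and the function names are assumptions of this example, and the defaults are the values listed in this document.

```python
import math


def logit(p: float) -> float:
    """Inverse of the sigmoid function, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def normalize(value: float, v_min: float = -10.0, v_max: float = 10.0) -> float:
    """Clip to [v_min, v_max], then map linearly to [-1, 1] (assumed form)."""
    clipped = min(max(value, v_min), v_max)
    return (clipped - v_min) / (v_max - v_min) * 2.0 - 1.0


for p in (0.1, 0.5, 0.9):
    print(p, logit(p), normalize(logit(p)))
```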
Length of sinusoidal smoothing convolution kernel (in seconds) on extracted tension curve.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | float |
default | 0.12 |
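A kernel length given in seconds has to be converted to a number of frames before it can be applied. The sketch below shows one way this could work: it builds a half-sine window normalized to sum to 1 and convolves it with a curve, repeating the edge samples. The `sample_rate` and `hop_size` parameters, the exact window shape, and the edge handling are all assumptions made for illustration and may differ from the actual implementation.

```python
import math


def sinusoidal_kernel(width_seconds, sample_rate, hop_size):
    """Build a normalized half-sine smoothing kernel measured in frames."""
    # One frame covers hop_size / sample_rate seconds.
    kernel_size = max(1, round(width_seconds * sample_rate / hop_size))
    # Sample half a sine period, skipping the zero-valued endpoints.
    raw = [math.sin(math.pi * (i + 1) / (kernel_size + 1)) for i in range(kernel_size)]
    total = sum(raw)
    return [v / total for v in raw]


def smooth(curve, kernel):
    """Convolve a curve with the kernel, repeating the edge samples."""
    half = len(kernel) // 2
    out = []
    for t in range(len(curve)):
        acc = 0.0
        for j, w in enumerate(kernel):
            # Clamp the index so positions past the ends reuse the edge value.
            idx = min(max(t + j - half, 0), len(curve) - 1)
            acc += w * curve[idx]
        out.append(acc)
    return out


kernel = sinusoidal_kernel(0.12, sample_rate=44100, hop_size=512)
print(len(kernel))
print(smooth([0.0, 0.0, 1.0, 0.0, 0.0], kernel))
```

Since the weights sum to 1, a constant curve passes through unchanged, while isolated spikes are spread across neighbouring frames.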
List of data item names or name prefixes for the validation set. For each string s in the list:
- If s equals an actual item name, add that item to the validation set.
- If s does not equal any item name, add all items whose names start with s to the validation set.
For multi-speaker combined datasets, "ds_id:name_prefix" can be used to apply the rules above within data from a specific sub-dataset, where ds_id represents the dataset index.
visibility | all |
scope | preprocessing |
customizability | required |
type | list |
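The two matching rules can be sketched in a few lines of Python. The function name and arguments below are hypothetical, and the sketch deliberately leaves out the "ds_id:name_prefix" form used for multi-speaker combined datasets.

```python
def select_validation_items(item_names, prefixes):
    """Pick validation items by exact name first, then by name prefix."""
    selected = []
    for s in prefixes:
        if s in item_names:
            # Rule 1: an exact match selects only that item.
            matches = [s]
        else:
            # Rule 2: otherwise, select every item whose name starts with s.
            matches = [name for name in item_names if name.startswith(s)]
        for name in matches:
            if name not in selected:
                selected.append(name)
    return selected


items = ["song1", "song1_001", "song1_002", "song2_001"]
print(select_validation_items(items, ["song1"]))
print(select_validation_items(items, ["song1_", "song2"]))
```

Note that an exact match takes priority: because an item named song1 exists, the string song1 selects only that item and not song1_001 or song1_002.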
The scale factor that will be multiplied on the time
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | float |
default | 1000 |
Total number of DDPM steps.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | int |
default | 1000 |
Whether to accept and embed breathiness values into the model.
visibility | acoustic |
scope | nn, preprocessing, inference |
customizability | recommended |
type | boolean |
default | false |
Whether to accept and embed energy values into the model.
visibility | acoustic |
scope | nn, preprocessing, inference |
customizability | recommended |
type | boolean |
default | false |
Whether to accept and embed glide types in the melody encoder.
visibility | variance |
scope | nn, preprocessing, inference |
customizability | recommended |
type | boolean |
default | false |
constraints | Only takes effect when the melody encoder is enabled. |
Whether to embed key shifting values introduced by random pitch shifting augmentation.
visibility | acoustic |
scope | nn, preprocessing, inference |
customizability | recommended |
type | boolean |
default | false |
constraints | Must be true if random pitch shifting is enabled. |
Whether to enable the melody encoder for the pitch predictor.
visibility | variance |
scope | nn |
customizability | recommended |
type | boolean |
default | false |
Whether to use SinusoidalPositionalEmbedding in FastSpeech2 encoder.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | boolean |
default | true |
Whether to use shallow diffusion.
visibility | acoustic |
scope | nn, inference |
customizability | recommended |
type | boolean |
default | false |
Whether to embed speed values introduced by random time stretching augmentation.
visibility | acoustic |
scope | nn, preprocessing, inference |
type | boolean |
default | false |
constraints | Must be true if random time stretching is enabled. |
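The constraints on the key shift and speed embeddings can be checked before training starts. The following is a minimal sketch of such a check. All parameter names are hypothetical placeholders for the corresponding configuration values, not the actual key names.

```python
def check_augmentation_embeds(
    key_shift_embed_enabled,
    speed_embed_enabled,
    pitch_shifting_enabled,
    time_stretching_enabled,
):
    """Return a list of human-readable constraint violations (empty if none)."""
    problems = []
    # Random pitch shifting produces key shift values the model must embed.
    if pitch_shifting_enabled and not key_shift_embed_enabled:
        problems.append("random pitch shifting requires key shift embedding")
    # Random time stretching produces speed values the model must embed.
    if time_stretching_enabled and not speed_embed_enabled:
        problems.append("random time stretching requires speed embedding")
    return problems


for problem in check_augmentation_embeds(False, True, True, True):
    print("config error:", problem)
```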
Whether to embed the speaker ID from a multi-speaker dataset.
visibility | acoustic, variance |
scope | nn, preprocessing, inference |
customizability | recommended |
type | bool |
default | false |
Whether to accept and embed tension values into the model.
visibility | acoustic |
scope | nn, preprocessing, inference |
customizability | recommended |
type | boolean |
default | false |
Whether to accept and embed voicing values into the model.
visibility | acoustic |
scope | nn, preprocessing, inference |
customizability | recommended |
type | boolean |
default | false |
Interval (in number of training steps) between validation checks.
visibility | all |
scope | training |
customizability | recommended |
type | int |
default | 2000 |
Whether to load and use the vocoder to generate audio during validation. Validation audio will not be available if this option is disabled.
visibility | acoustic |
scope | training |
customizability | normal |
type | bool |
default | true |
Arguments for prediction of variance parameters other than pitch, like energy, breathiness, etc.
type | dict |
Equivalent to backbone_args but only for the multi-variance predictor.
visibility | variance |
Equivalent to backbone_type but only for the multi-variance predictor model. If not set, use the root backbone type.
visibility | variance |
default | wavenet |
Total number of repeating bins in the multi-variance predictor. Repeating bins are distributed evenly to each variance parameter.
visibility | variance |
scope | nn, inference |
customizability | recommended |
type | int |
default | 48 |
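As a small worked example of the even distribution, the sketch below splits the total number of repeating bins across a list of variance parameters. The function name is hypothetical, and using floor division when the total is not divisible by the number of parameters is an assumption of this sketch.

```python
def repeat_bins_per_parameter(total_repeat_bins, variance_names):
    """Split the total repeating bins evenly across variance parameters."""
    if not variance_names:
        raise ValueError("at least one variance parameter is required")
    # Assumption: any remainder is dropped by floor division.
    bins_each = total_repeat_bins // len(variance_names)
    return {name: bins_each for name in variance_names}


print(repeat_bins_per_parameter(48, ["energy", "breathiness", "voicing"]))
```

With the default of 48 bins, predicting three parameters gives 16 bins each, while predicting two gives 24 each, so the per-parameter resolution depends on how many parameters are enabled.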
The vocoder class name.
visibility | acoustic |
scope | preprocessing, training, inference |
customizability | normal |
type | str |
default | NsfHifiGAN |
Path of the vocoder model.
visibility | acoustic |
scope | preprocessing, training, inference |
customizability | normal |
type | str |
default | checkpoints/nsf_hifigan/model |
Maximum voicing value in dB used for normalization to [-1, 1].
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | -20.0 |
Minimum voicing value in dB used for normalization to [-1, 1].
visibility | acoustic, variance |
scope | inference |
customizability | recommended |
type | float |
default | -96.0 |
Length (in seconds) of the sinusoidal smoothing convolution kernel applied to the extracted voicing curve.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | float |
default | 0.12 |
Window size for mel or feature extraction.
visibility | acoustic, variance |
scope | preprocessing |
customizability | reserved |
type | int |
default | 2048 |