KSS dataset for 490 epochs but the quality is not as good as I expected #4
@ggpid Please let me know the outcome of the debug test, because I am interested in KSS dataset performance as well. Thanks.
In addition to p0p4k's suggestion, I found that reducing the sampling rate to 16000 Hz results in slower durations. There are also repos that specifically target the 44100 Hz sampling rate, and they have in common a doubled segment size, FFT sizes, etc. (changes stated below)
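One way to see why 44100 Hz configs tend to double these sizes is to look at how much time one frame covers, which is `hop_length / sampling_rate`. The sketch below is only an illustration of that arithmetic: the 22050 Hz baseline, the assumption that the hop length is among the doubled values, and the helper name `frame_shift_ms` are not taken from this thread.

```python
def frame_shift_ms(hop_length: int, sampling_rate: int) -> float:
    """Time covered by one spectrogram frame, in milliseconds."""
    return 1000.0 * hop_length / sampling_rate


# Assumed baseline: hop_length 256 at 22050 Hz (not stated in this thread).
# Keeping hop_length 256 at 44100 Hz halves the time per frame;
# doubling it to 512 restores the baseline frame shift.
for sampling_rate, hop_length in [(22050, 256), (44100, 256), (44100, 512)]:
    shift = frame_shift_ms(hop_length, sampling_rate)
    print(f"sr={sampling_rate} hop={hop_length} -> {shift:.2f} ms per frame")
```

Under these assumptions, a config that raises the sampling rate without scaling the frame-related sizes models half as much audio per frame, which may or may not be what you want.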
I've bumped into this issue a couple of times while training 44.1kHz TTS models. To fix your issue, you have to modify the config such that the product of [...]. At the moment, you have [...]. This 44.1kHz config here is correct; it has [...].
So there are a couple of ways to fix your issue. The easiest option would be to just follow the config linked above, but if you still want to use a smaller hop length of 256, then I'd suggest you use:
which is the default setup from here. Hope this helps!
@w11wo Thank you for the detailed explanation. But I still wonder whether it is okay to go with those settings from the original VITS, because [...]
To be honest, I'm not too sure which of the changes you've suggested will work in the end, though they seem reasonable. It's probably a good idea to have assertion checks in place to avoid cases like this, i.e. ensuring that the config used by the user makes sense calculation-wise.
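A minimal sketch of what such a check could look like. It assumes the constraint in question is that the decoder's total upsampling factor (the product of `upsample_rates`, times `gen_istft_hop_size` for the iSTFT variants, times `subbands` for the multi-band variants) must equal `hop_length`, and that `segment_size` should be a whole number of frames. The function name `check_config` and both rules are assumptions for illustration, not code from this repo.

```python
import math


def check_config(config: dict) -> None:
    """Raise ValueError if the (assumed) size constraints do not hold."""
    data, model, train = config["data"], config["model"], config["train"]

    # Total number of output samples the decoder produces per input frame.
    factor = math.prod(model["upsample_rates"])
    multi_band = model.get("mb_istft_vits") or model.get("ms_istft_vits")
    if multi_band or model.get("istft_vits"):
        factor *= model["gen_istft_hop_size"]  # assumed: iSTFT adds its own hop
    if multi_band:
        factor *= model["subbands"]  # assumed: subband synthesis upsamples again

    if factor != data["hop_length"]:
        raise ValueError(
            f"decoder upsamples by {factor}, but hop_length is {data['hop_length']}"
        )
    if train["segment_size"] % data["hop_length"] != 0:
        raise ValueError("segment_size is not a multiple of hop_length")


# Illustrative values only: 4 * 4 * 4 * 4 == 256, and 8192 == 32 * 256.
example = {
    "train": {"segment_size": 8192},
    "data": {"hop_length": 256},
    "model": {
        "mb_istft_vits": True,
        "upsample_rates": [4, 4],
        "gen_istft_hop_size": 4,
        "subbands": 4,
    },
}
check_config(example)  # passes silently
```

Calling something like this once, right after the config is loaded, would turn a silent length mismatch into an immediate error message.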
Hi all!
First of all, thank you for sharing such wonderful code.
I trained using the KSS dataset for 490 epochs, but the quality is not as good as I expected.
It seems that the TTS speaks a bit fast.
wav
What might have gone wrong during the training?
{
"train": {
"log_interval": 200,
"eval_interval": 3000,
"seed": 1234,
"epochs": 20000,
"learning_rate": 2e-4,
"betas": [0.8, 0.99],
"eps": 1e-9,
"batch_size": 32,
"fp16_run": false,
"lr_decay": 0.999875,
"segment_size": 8192,
"init_lr_ratio": 1,
"warmup_epochs": 0,
"c_mel": 45,
"c_kl": 1.0,
"fft_sizes": [384, 683, 171],
"hop_sizes": [30, 60, 10],
"win_lengths": [150, 300, 60],
"window": "hann_window"
},
"data": {
"use_mel_posterior_encoder": true,
"training_files":"kss/kss_cjke_train.txt.cleaned",
"validation_files":"kss/kss_cjke_val.txt.cleaned",
"text_cleaners":["cjke_cleaners2"],
"max_wav_value": 32768.0,
"sampling_rate": 44100,
"filter_length": 1024,
"hop_length": 256,
"win_length": 1024,
"n_mel_channels": 80,
"mel_fmin": 0.0,
"mel_fmax": null,
"add_blank": true,
"n_speakers": 0,
"cleaned_text": true
},
"model": {
"use_mel_posterior_encoder": true,
"use_transformer_flows": true,
"transformer_flow_type": "pre_conv",
"use_spk_conditioned_encoder": false,
"use_noise_scaled_mas": true,
"use_duration_discriminator": true,
"ms_istft_vits": false,
"mb_istft_vits": true,
"istft_vits": false,
"subbands": 4,
"gen_istft_n_fft": 16,
"gen_istft_hop_size": 4,
"inter_channels": 192,
"hidden_channels": 96,
"filter_channels": 768,
"n_heads": 2,
"n_layers": 3,
"kernel_size": 3,
"p_dropout": 0.1,
"resblock": "1",
"resblock_kernel_sizes": [3,7,11],
"resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
"upsample_rates": [4,4],
"upsample_initial_channel": 256,
"upsample_kernel_sizes": [16,16],
"n_layers_q": 3,
"use_spectral_norm": false,
"use_sdp": false
}
}