KSS dataset for 490 epochs but the quality is not as good as I expected #4

ggpid opened this issue Sep 11, 2023 · 6 comments

ggpid commented Sep 11, 2023

First of all, thank you for sharing such wonderful code.
I trained using the KSS dataset for 490 epochs, but the quality is not as good as I expected.
It seems that the TTS speaks a bit fast.
wav
What might have gone wrong during the training?

{
  "train": {
    "log_interval": 200,
    "eval_interval": 3000,
    "seed": 1234,
    "epochs": 20000,
    "learning_rate": 2e-4,
    "betas": [0.8, 0.99],
    "eps": 1e-9,
    "batch_size": 32,
    "fp16_run": false,
    "lr_decay": 0.999875,
    "segment_size": 8192,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0,
    "fft_sizes": [384, 683, 171],
    "hop_sizes": [30, 60, 10],
    "win_lengths": [150, 300, 60],
    "window": "hann_window"
  },
  "data": {
    "use_mel_posterior_encoder": true,
    "training_files": "kss/kss_cjke_train.txt.cleaned",
    "validation_files": "kss/kss_cjke_val.txt.cleaned",
    "text_cleaners": ["cjke_cleaners2"],
    "max_wav_value": 32768.0,
    "sampling_rate": 44100,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "n_mel_channels": 80,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 0,
    "cleaned_text": true
  },
  "model": {
    "use_mel_posterior_encoder": true,
    "use_transformer_flows": true,
    "transformer_flow_type": "pre_conv",
    "use_spk_conditioned_encoder": false,
    "use_noise_scaled_mas": true,
    "use_duration_discriminator": true,
    "ms_istft_vits": false,
    "mb_istft_vits": true,
    "istft_vits": false,
    "subbands": 4,
    "gen_istft_n_fft": 16,
    "gen_istft_hop_size": 4,
    "inter_channels": 192,
    "hidden_channels": 96,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 3,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [3, 7, 11],
    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    "upsample_rates": [4, 4],
    "upsample_initial_channel": 256,
    "upsample_kernel_sizes": [16, 16],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "use_sdp": false
  }
}


p0p4k commented Sep 11, 2023

@ggpid Please let me know the outcome of the debug test, because I am interested in KSS dataset performance as well. Thanks.
(cc. p0p4k/vits2_pytorch#49)

ggpid closed this as completed Sep 12, 2023

FENRlR commented Sep 13, 2023

Alongside p0p4k's suggestion, I found there was an issue where durations became slower after reducing the sampling rate to 16000 Hz.
MasayaKawamura/MB-iSTFT-VITS#7 (comment)

Also, there are repos specifically targeting the 44100 Hz sampling rate.
https://github.com/tonnetonne814/MB-iSTFT-VITS-44100-Ja/blob/main/configs/jsut_44100.json
https://github.com/tonnetonne814/unofficial-vits2-44100-Ja/blob/main/configs/vits2_jsut_nosdp.json

These have doubled segment size, FFT sizes, etc. in common (changes stated below),
plus subbands, which is mb/ms-istft exclusive.

{
  "train": {
    "segment_size": 16384,
    "fft_sizes": [768, 1366, 342],
    "hop_sizes": [60, 120, 20], 
    "win_lengths": [300, 600, 120],
  },
  "data": {
    "sampling_rate": 44100,
    "filter_length": 2048,
    "hop_length": 512, 
    "win_length": 2048, 
    "add_blank": false, 
  },
  "model": {
      "subbands": 8,
      "upsample_initial_channel": 512,
    }
}
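
As a rough illustration of why those 44.1 kHz repos double the hop and FFT sizes (an editorial back-of-the-envelope sketch, not code from either repo): doubling hop_length along with the sampling rate keeps the per-frame hop duration the same, whereas keeping hop_length at 256 at 44100 Hz halves it.

# Frame hop duration in milliseconds for the combinations discussed in this thread.
combos = {
    "22050 Hz, hop 256 (common default)": (22050, 256),
    "44100 Hz, hop 256 (config in this issue)": (44100, 256),
    "44100 Hz, hop 512 (44.1 kHz repos above)": (44100, 512),
}
for name, (sr, hop) in combos.items():
    print(f"{name}: {1000 * hop / sr:.2f} ms per frame")
# -> 11.61 ms, 5.80 ms, 11.61 ms respectively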

FENRlR reopened this Sep 13, 2023

w11wo commented Sep 14, 2023

I've bumped into this issue a couple of times while training 44.1kHz TTS models. To fix your issue, you have to modify the config such that the product of upsample_rates == hop_length.

At the moment, you have "upsample_rates": [4,4], whose product is 16 and != "hop_length": 256.

This 44.1kHz config here is correct; it has "upsample_rates": [8,8,2,2,2] whose product is 512 == "hop_length": 512.

So there are a couple of ways to fix your issue. The easiest option would be to just follow the config linked above, but if you still want to use a smaller hop length of 256, then I'd suggest you use:

"upsample_rates": [8,8,2,2],
"upsample_initial_channel": 512,
"upsample_kernel_sizes": [16,16,4,4],

which is the default setup from here.

Hope this helps!
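
As a quick illustration of the rule above (an editorial sketch; upsampling_matches_hop is a hypothetical helper, not part of the repo):

import math

# Check whether prod(upsample_rates) matches hop_length. This covers the plain
# HiFi-GAN decoder path; the next comment discusses how subbands and
# gen_istft_hop_size enter the calculation for the iSTFT variants.
def upsampling_matches_hop(upsample_rates, hop_length):
    return math.prod(upsample_rates) == hop_length

print(upsampling_matches_hop([4, 4], 256))           # False: 16 != 256 (config in this issue)
print(upsampling_matches_hop([8, 8, 2, 2, 2], 512))  # True  (linked 44.1 kHz config)
print(upsampling_matches_hop([8, 8, 2, 2], 256))     # True  (suggested fix above)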


FENRlR commented Sep 14, 2023

@w11wo Thank you for the detailed explanation. But I still wonder if it is okay to go with those settings from the original VITS, because "upsample_rates": [4,4] ([8,8] without "subbands": 4) and "upsample_kernel_sizes": [16,16] are the changes that MB-iSTFT-VITS introduced over the original VITS defaults while keeping the same hop_length. Perhaps with a doubled hop_length, the calculation should instead start from that default "upsample_rates": [4,4], e.g. "upsample_rates": [4,4,2] (it's from the same author though), or use doubled subbands.
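
To make that concrete, here is a rough sketch of the arithmetic behind this comment, assuming the MB-iSTFT decoder's total upsampling factor is prod(upsample_rates) * gen_istft_hop_size * subbands (an assumption inferred from the defaults quoted in this thread, not verified against the code):

import math

def effective_upsampling(upsample_rates, gen_istft_hop_size, subbands):
    # Assumed total upsampling of the MB-iSTFT decoder; see the caveat above.
    return math.prod(upsample_rates) * gen_istft_hop_size * subbands

print(effective_upsampling([4, 4], 4, 4))     # 256 -> matches the default hop_length 256
print(effective_upsampling([4, 4, 2], 4, 4))  # 512 -> extra upsample stage for hop_length 512
print(effective_upsampling([4, 4], 4, 8))     # 512 -> doubled subbands for hop_length 512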


w11wo commented Sep 18, 2023

To be honest, I'm not too sure which of the changes you've suggested will work in the end, though they seem reasonable. It's probably a good idea to have assertion checks in place to avoid cases like this, i.e. ensuring that the config used by the user makes sense calculation-wise.
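
For example, something along these lines could sit next to the config loading (an editorial sketch; validate_decoder_config and the assumed iSTFT upsampling relationship are not existing repo code):

import math

def validate_decoder_config(hps):
    # Sketch of a config sanity check; field names follow the JSON configs in this thread.
    rates = hps.model.upsample_rates
    factor = math.prod(rates)
    if hps.model.mb_istft_vits or hps.model.ms_istft_vits or hps.model.istft_vits:
        # Assumed relationship for the iSTFT-based decoders (see the discussion above).
        factor *= hps.model.gen_istft_hop_size
    if hps.model.mb_istft_vits or hps.model.ms_istft_vits:
        factor *= hps.model.subbands
    assert factor == hps.data.hop_length, (
        f"decoder upsampling factor {factor} != hop_length {hps.data.hop_length}"
    )
    assert len(rates) == len(hps.model.upsample_kernel_sizes), (
        "upsample_rates and upsample_kernel_sizes must have the same length"
    )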


bzp83 commented May 22, 2024

Hi all!
Any chance someone could share a config that will produce a good 44.1 kHz model? Anything I change from the configs provided doesn't work; I get errors about wrong tensor sizes and the like. Thanks!
