KSS dataset for 490 epochs but the quality is not as good as I expected #4

ggpid opened this issue Sep 11, 2023 · 6 comments

ggpid commented Sep 11, 2023

First of all, thank you for sharing such wonderful code.
I trained using the KSS dataset for 490 epochs, but the quality is not as good as I expected.
It seems that the TTS speaks a bit fast.
wav
What might have gone wrong during the training?

{
  "train": {
    "log_interval": 200,
    "eval_interval": 3000,
    "seed": 1234,
    "epochs": 20000,
    "learning_rate": 2e-4,
    "betas": [0.8, 0.99],
    "eps": 1e-9,
    "batch_size": 32,
    "fp16_run": false,
    "lr_decay": 0.999875,
    "segment_size": 8192,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0,
    "fft_sizes": [384, 683, 171],
    "hop_sizes": [30, 60, 10],
    "win_lengths": [150, 300, 60],
    "window": "hann_window"
  },
  "data": {
    "use_mel_posterior_encoder": true,
    "training_files": "kss/kss_cjke_train.txt.cleaned",
    "validation_files": "kss/kss_cjke_val.txt.cleaned",
    "text_cleaners": ["cjke_cleaners2"],
    "max_wav_value": 32768.0,
    "sampling_rate": 44100,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "n_mel_channels": 80,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 0,
    "cleaned_text": true
  },
  "model": {
    "use_mel_posterior_encoder": true,
    "use_transformer_flows": true,
    "transformer_flow_type": "pre_conv",
    "use_spk_conditioned_encoder": false,
    "use_noise_scaled_mas": true,
    "use_duration_discriminator": true,
    "ms_istft_vits": false,
    "mb_istft_vits": true,
    "istft_vits": false,
    "subbands": 4,
    "gen_istft_n_fft": 16,
    "gen_istft_hop_size": 4,
    "inter_channels": 192,
    "hidden_channels": 96,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 3,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [3, 7, 11],
    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    "upsample_rates": [4, 4],
    "upsample_initial_channel": 256,
    "upsample_kernel_sizes": [16, 16],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "use_sdp": false
  }
}


p0p4k commented Sep 11, 2023

@ggpid Please let me know the outcome of the debug test, because I am interested in KSS dataset performance as well. Thanks.
(cc. p0p4k/vits2_pytorch#49)

ggpid closed this as completed Sep 12, 2023

FENRlR commented Sep 13, 2023

Alongside p0p4k's suggestion, I found there was an issue where durations became slower after reducing the sampling rate to 16000 Hz.
MasayaKawamura/MB-iSTFT-VITS#7 (comment)

Also, there are repos specifically targeting the 44100 Hz sampling rate.
https://github.com/tonnetonne814/MB-iSTFT-VITS-44100-Ja/blob/main/configs/jsut_44100.json
https://github.com/tonnetonne814/unofficial-vits2-44100-Ja/blob/main/configs/vits2_jsut_nosdp.json

These have doubled segment size, FFT sizes, etc. in common (changes stated below),
plus subbands, which is mb/ms-istft exclusive.

{
  "train": {
    "segment_size": 16384,
    "fft_sizes": [768, 1366, 342],
    "hop_sizes": [60, 120, 20], 
    "win_lengths": [300, 600, 120],
  },
  "data": {
    "sampling_rate": 44100,
    "filter_length": 2048,
    "hop_length": 512, 
    "win_length": 2048, 
    "add_blank": false, 
  },
  "model": {
      "subbands": 8,
      "upsample_initial_channel": 512,
    }
}
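
As a rough illustration of why those 44.1 kHz repos double the hop and FFT sizes (an editorial back-of-the-envelope sketch, not code from either repo): doubling hop_length along with the sampling rate keeps the per-frame hop duration the same, whereas keeping hop_length at 256 at 44100 Hz halves it.

# Frame hop duration in milliseconds for the combinations discussed in this thread.
combos = {
    "22050 Hz, hop 256 (common default)": (22050, 256),
    "44100 Hz, hop 256 (config in this issue)": (44100, 256),
    "44100 Hz, hop 512 (44.1 kHz repos above)": (44100, 512),
}
for name, (sr, hop) in combos.items():
    print(f"{name}: {1000 * hop / sr:.2f} ms per frame")
# -> 11.61 ms, 5.80 ms, 11.61 ms respectively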

FENRlR reopened this Sep 13, 2023

w11wo commented Sep 14, 2023

I've bumped into this issue a couple of times while training 44.1kHz TTS models. To fix your issue, you have to modify the config such that the product of upsample_rates == hop_length.

At the moment, you have "upsample_rates": [4,4], whose product is 16 and != "hop_length": 256.

This 44.1kHz config here is correct; it has "upsample_rates": [8,8,2,2,2] whose product is 512 == "hop_length": 512.

So there are a couple of ways to fix your issue. The easiest option would be to just follow the config linked above, but if you still want to use a smaller hop length of 256, then I'd suggest you use:

"upsample_rates": [8,8,2,2],
"upsample_initial_channel": 512,
"upsample_kernel_sizes": [16,16,4,4],

which is the default setup from here.

Hope this helps!
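
As a quick illustration of the rule above (an editorial sketch; upsampling_matches_hop is a hypothetical helper, not part of the repo):

import math

# Check whether prod(upsample_rates) matches hop_length. This covers the plain
# HiFi-GAN decoder path; the next comment discusses how subbands and
# gen_istft_hop_size enter the calculation for the iSTFT variants.
def upsampling_matches_hop(upsample_rates, hop_length):
    return math.prod(upsample_rates) == hop_length

print(upsampling_matches_hop([4, 4], 256))           # False: 16 != 256 (config in this issue)
print(upsampling_matches_hop([8, 8, 2, 2, 2], 512))  # True  (linked 44.1 kHz config)
print(upsampling_matches_hop([8, 8, 2, 2], 256))     # True  (suggested fix above)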


FENRlR commented Sep 14, 2023

@w11wo Thank you for the detailed explanation. But I still wonder if it is okay to go with those settings from the original VITS, because "upsample_rates": [4,4] ([8,8] without "subbands": 4) and "upsample_kernel_sizes": [16,16] are the changes that MB-iSTFT-VITS introduced over the original VITS defaults while keeping the same hop_length. Perhaps with a doubled hop_length, the calculation should instead start from that default "upsample_rates": [4,4], e.g. "upsample_rates": [4,4,2] (it's from the same author though), or use doubled subbands.
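
To make that concrete, here is a rough sketch of the arithmetic behind this comment, assuming the MB-iSTFT decoder's total upsampling factor is prod(upsample_rates) * gen_istft_hop_size * subbands (an assumption inferred from the defaults quoted in this thread, not verified against the code):

import math

def effective_upsampling(upsample_rates, gen_istft_hop_size, subbands):
    # Assumed total upsampling of the MB-iSTFT decoder; see the caveat above.
    return math.prod(upsample_rates) * gen_istft_hop_size * subbands

print(effective_upsampling([4, 4], 4, 4))     # 256 -> matches the default hop_length 256
print(effective_upsampling([4, 4, 2], 4, 4))  # 512 -> extra upsample stage for hop_length 512
print(effective_upsampling([4, 4], 4, 8))     # 512 -> doubled subbands for hop_length 512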


w11wo commented Sep 18, 2023

To be honest, I'm not too sure which of the changes you've suggested will work in the end, though they seem reasonable. It's probably a good idea to have assertion checks in place to avoid cases like this, i.e. ensuring that the config used by the user makes sense calculation-wise.
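
For example, something along these lines could sit next to the config loading (an editorial sketch; validate_decoder_config and the assumed iSTFT upsampling relationship are not existing repo code):

import math

def validate_decoder_config(hps):
    # Sketch of a config sanity check; field names follow the JSON configs in this thread.
    rates = hps.model.upsample_rates
    factor = math.prod(rates)
    if hps.model.mb_istft_vits or hps.model.ms_istft_vits or hps.model.istft_vits:
        # Assumed relationship for the iSTFT-based decoders (see the discussion above).
        factor *= hps.model.gen_istft_hop_size
    if hps.model.mb_istft_vits or hps.model.ms_istft_vits:
        factor *= hps.model.subbands
    assert factor == hps.data.hop_length, (
        f"decoder upsampling factor {factor} != hop_length {hps.data.hop_length}"
    )
    assert len(rates) == len(hps.model.upsample_kernel_sizes), (
        "upsample_rates and upsample_kernel_sizes must have the same length"
    )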


bzp83 commented May 22, 2024

Hi all!
Any chance someone could share a config that will produce a good 44.1 kHz model? Anything I change from the configs provided doesn't work; I get errors about wrong tensor sizes and the like. Thanks!
