
High duration loss #40

Open
w11wo opened this issue Mar 26, 2024 · 17 comments

w11wo commented Mar 26, 2024

Hi @p0p4k, thanks for making this repo!

I am currently trying to train a 44.1kHz English model, but my model is struggling with a rather high duration loss when compared against your TensorBoard logs. It currently looks as follows:

[screenshot: TensorBoard duration loss curve]

It seems like the other loss terms are correct.

Also, when the generated mel-spectrogram is passed to a vocoder, the audio is very much wrong in pronunciation -- maybe only half right.

My P-Flow config can be found here, and the corresponding HiFi-GAN vocoder config can be found here.

Could you please let me know where I might be wrong? Thanks in advance!


Oleksandr2505 commented Mar 26, 2024

Hello. Make sure you set "english_cleaners2" in synthesise.py.
The cleaner there has to match the one you trained with when you generate English speech: if you put "english_cleaners3" in your JSON config, then use that in synthesise.py as well.

sequence = torch.tensor(intersperse(text_to_sequence(stressed_text, ['english_cleaners2']), 0), dtype=torch.long)
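
For reference, a minimal sketch of keeping the inference cleaners in sync with training (the import paths follow the Matcha-style layout this fork uses and are assumptions; adjust them to your checkout):

```python
# Minimal sketch: keep the inference cleaners in sync with the training config.
# Import paths are assumed (Matcha-style layout); adjust to the actual modules.
import torch

from pflow.text import text_to_sequence      # assumed import path
from pflow.utils.utils import intersperse    # assumed import path

cleaners = ["english_cleaners3"]  # must match the "cleaners" entry of the training config

text = "Printing, in the only sense with which we are at present concerned."
sequence = torch.tensor(
    intersperse(text_to_sequence(text, cleaners), 0),  # insert blank token id 0 between symbols
    dtype=torch.long,
)[None]  # add a batch dimension
print(sequence.shape)
```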


w11wo commented Mar 26, 2024

Hi @Oleksandr2505.

Yes, I have changed both the training and inference phonemizer to english_cleaners3, so it's definitely not that issue. I've also checked the phonemization output and it is correct.

It's also unrelated to the issue I raised, which is more on the training duration loss, not inference.


Oleksandr2505 commented Mar 26, 2024

I am training my model on a 4-hour English dataset at 22.05 kHz; I recently passed 1,000 epochs and it sounds good. Maybe you should try switching the vocoder: at the start I had huge artifacts, and it got better once I used a vocoder trained on the VCTK v1 dataset. Maybe you should look for another vocoder that matches the 44.1 kHz sample rate. You also wrote about the training duration loss -- is it something different from the regular loss?
Btw, these are my logs:

[screenshot: TensorBoard loss curves]


w11wo commented Mar 26, 2024

Yeah, your loss curves look very much like mine.

I'd say I probably don't have an issue with the vocoder, since it's not an issue of artifacts, but pronunciation quality. And yes, the HiFi-GAN I'm using has been trained on 44.1kHz, which isn't a problem.

My main suspect is my rather high duration loss (~2), versus the TensorBoard graphs posted in the README, which drop below 1.0 early on -- that is the trend I expected. It's likely that this affects the pronunciation quality too.

I just wanted to make sure that my 44.1kHz P-Flow config is correct, in case I'm missing something. But thanks for your suggestions nonetheless.
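
For anyone debugging a similar setup, here is a minimal sketch of the mismatch check worth running first (the key names follow common Matcha-style YAML and upstream HiFi-GAN JSON configs and are assumptions here, as are the file paths): the acoustic model and the vocoder must agree on the mel/audio settings.

```python
# Minimal sketch: verify the acoustic model and vocoder agree on mel/audio settings.
# Key names follow common Matcha-style YAML and HiFi-GAN JSON configs; adjust as needed.
import json
import yaml  # pip install pyyaml

with open("configs/data/my_44khz_dataset.yaml") as f:   # hypothetical P-Flow data config
    pflow_cfg = yaml.safe_load(f)
with open("hifigan/config_44khz.json") as f:            # hypothetical HiFi-GAN config
    hifi_cfg = json.load(f)

pairs = [
    ("sample_rate", "sampling_rate"),
    ("n_fft", "n_fft"),
    ("hop_length", "hop_size"),
    ("win_length", "win_size"),
    ("n_feats", "num_mels"),
    ("f_min", "fmin"),
    ("f_max", "fmax"),
]
for pflow_key, hifi_key in pairs:
    a, b = pflow_cfg.get(pflow_key), hifi_cfg.get(hifi_key)
    status = "OK" if a == b else "MISMATCH"
    print(f"{status}: pflow {pflow_key}={a} vs hifigan {hifi_key}={b}")
```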


Oleksandr2505 commented Mar 26, 2024

Yeah, actually I see that the author got less than 1, but we don't know his exact config, so we can only test and try. I have this parameter very close to 1 as well; eventually it may get under it, but I can't say for sure. One more tip: when I trained a Ukrainian model and then tried to run inference with the English model, I was getting words that were only half right. It was mainly because of the cleaners, but I had also changed things in the pflow/text folder, like "symbols" and others. So make sure you did not put anything other than English symbols or numbers there. That's how I fixed it.

[screenshot: TensorBoard loss curves]
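
To make that symbol check concrete, here is a minimal sketch (the import paths, cleaner name, and "path|text" filelist format are assumptions based on the Matcha-style layout): any character the cleaner produces that is missing from the symbol set gets dropped or mis-mapped and can cause exactly this kind of half-right pronunciation.

```python
# Minimal sketch: find cleaned characters that are not covered by the symbol set.
# Assumes Matcha-style modules (pflow.text.symbols, cleaners) and a "path|text" filelist.
from collections import Counter

from pflow.text.symbols import symbols             # assumed import path
from pflow.text.cleaners import english_cleaners3  # assumed cleaner name

known = set(symbols)
unknown = Counter()

with open("filelists/train.txt", encoding="utf-8") as f:   # hypothetical filelist path
    for line in f:
        text = line.rstrip("\n").split("|")[-1]
        for ch in english_cleaners3(text):
            if ch not in known:
                unknown[ch] += 1

print("Characters produced by the cleaner but missing from symbols:")
for ch, count in unknown.most_common():
    print(repr(ch), count)
```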

@Oleksandr2505

Btw, I would like to know: does your model synthesize decent, sufficiently long pauses between sentences or not? I could not find how to control it.


w11wo commented Mar 26, 2024

@Oleksandr2505 I can't tell, to be honest. The pronunciation isn't good enough to judge the pauses between words. But from my experiments with other TTS models (VITS, VITS2), you just have to train on audio that contains pauses as long as the ones you want.

If you want more controllable pauses, I'd recommend something like FastSpeech2, since its duration prediction is explicit and deterministic, I think. VITS/VITS2/P-Flow are better for more expressive speech.
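
Since P-Flow does not expose an explicit pause control, one workaround is to synthesize sentence by sentence and concatenate the clips with explicit silence. A minimal sketch below, where synthesise_to_wav is a hypothetical wrapper around your own inference code:

```python
# Minimal sketch: control inter-sentence pauses by concatenating per-sentence audio
# with explicit silence. `synthesise_to_wav` stands in for your own inference function.
import numpy as np

SAMPLE_RATE = 44_100          # match your vocoder's sampling rate
PAUSE_SECONDS = 0.4           # desired gap between sentences

def synthesise_to_wav(sentence: str) -> np.ndarray:
    """Hypothetical wrapper: run P-Flow + vocoder and return a float32 waveform."""
    raise NotImplementedError

def synthesise_paragraph(sentences: list[str]) -> np.ndarray:
    silence = np.zeros(int(PAUSE_SECONDS * SAMPLE_RATE), dtype=np.float32)
    chunks = []
    for i, sentence in enumerate(sentences):
        if i > 0:
            chunks.append(silence)            # fixed-length pause between sentences
        chunks.append(synthesise_to_wav(sentence))
    return np.concatenate(chunks)
```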

@Oleksandr2505

@w11wo, thank you for the info!


w11wo commented Mar 28, 2024

Update: I tried many different setups and found quite an interesting change that finally led to decent performance.

The duration loss (which contributes to the overall loss) converges to a lower value faster if the model's vocab size is 178, derived from the default LJSpeech + espeak setup. In the logs above, where I used gruut as the phonemizer, I had set a smaller vocab size of 78 in the model config. Simply increasing the vocab size back to 178 led to better convergence of the duration loss (for some reason). However, I'm still unable to get a loss of < 1, unlike the one posted in the README.
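
For anyone hitting the same thing, here is a minimal sketch of the sanity check involved (import paths are assumptions, as before): n_vocab in the model config must be at least the size of the symbol set, and every token id the cleaner produces must stay below it.

```python
# Minimal sketch: make sure the configured n_vocab covers every token id in use.
from pflow.text import text_to_sequence       # assumed import path
from pflow.text.symbols import symbols        # assumed import path

n_vocab = 178                                 # value from configs/model/pflow.yaml
cleaners = ["english_cleaners3"]              # must match the training config

assert n_vocab >= len(symbols), (
    f"n_vocab={n_vocab} is smaller than the symbol set ({len(symbols)})"
)

sample = "The quick brown fox jumps over the lazy dog."
ids = text_to_sequence(sample, cleaners)
assert max(ids) < n_vocab, f"token id {max(ids)} is out of range for n_vocab={n_vocab}"
print(f"symbols={len(symbols)}, n_vocab={n_vocab}, max id in sample={max(ids)}")
```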

@patriotyk

But do you see a difference in audio output between the models with vocab sizes 178 and 78? Or are only the metrics different?


w11wo commented Apr 3, 2024

@patriotyk Yes, I do hear a massive difference in audio output. I think the duration loss values reflect the overall performance of the audio as well, given that the validation loss also decreases with the increased vocab size.

Eventually I could get the duration loss to ~1.0, which is much better than the initial experiments with the smaller vocab size.

The odd thing is why vocab size even impacts the duration loss -- I'm clueless about this.


patriotyk commented Apr 3, 2024

@w11wo I have found that configs/model/pflow.yaml contains an n_vocab equal to 178. I think it should be 78 in your case, and maybe you will get an even better (less than 1.0) duration loss, because then all 78 symbols are real and you will not have unused symbols.


w11wo commented Apr 3, 2024

@patriotyk Not really. What I meant in my comments above is that I started with a vocab size of 78, which led to the high loss graph attached at the top of this issue. Only by increasing it to 178 could I achieve a low duration loss of about 1.

@patriotyk

Ah, I misunderstood you. By changing the vocab you meant also changing this config value; I thought you were changing the real symbol count without updating it in the config.


w11wo commented Apr 3, 2024

Ah, yeah. Whenever I changed the symbol list, I also made sure to change the vocab size value in the config. It's weird that an actual symbol set of 78 would benefit from increasing the model config's vocab size.
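
One thing worth noting (speculation, not a confirmed explanation): n_vocab only sets the size of the text embedding table, so rows beyond the real symbol count are simply unused parameters; the one case where it matters directly is when token ids exceed the smaller table. A toy illustration of that failure mode (all numbers are made up):

```python
# Toy illustration: an embedding sized to n_vocab=78 fails (IndexError) if the
# tokenizer still emits ids based on a 178-symbol table; the larger table is fine.
import torch
import torch.nn as nn

small = nn.Embedding(num_embeddings=78, embedding_dim=192)
large = nn.Embedding(num_embeddings=178, embedding_dim=192)

token_ids = torch.tensor([[3, 41, 120, 77]])   # id 120 only exists in the larger table

print(large(token_ids).shape)                  # torch.Size([1, 4, 192]) -- works
try:
    small(token_ids)
except IndexError as e:
    print("small table:", e)                   # index out of range in self
```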

@Oleksandr2505

@w11wo Hi, I just recalled that you mentioned training your model with a 44.1 kHz HiFi-GAN vocoder. Where did you get it, or how did you train it?


w11wo commented Jun 3, 2024

Hi @Oleksandr2505. I ended up training a 44.1kHz HiFi-GAN from scratch, which took about 6 days on a 1xH100 GPU. You can find our fork of the HiFi-GAN training code/repo here.
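
For anyone else retargeting HiFi-GAN to 44.1 kHz, a minimal sketch of the consistency check that usually matters before a long run (key names follow the upstream jik876/hifi-gan config.json; the file path is a placeholder): the product of upsample_rates must equal hop_size, and the sampling rate and mel settings must match the acoustic model.

```python
# Minimal sketch: sanity-check a HiFi-GAN config before a long 44.1 kHz training run.
# Key names follow the upstream jik876/hifi-gan config.json format.
import json
import math

with open("config_44khz.json") as f:     # hypothetical config path
    cfg = json.load(f)

upsample_product = math.prod(cfg["upsample_rates"])
assert upsample_product == cfg["hop_size"], (
    f"upsample_rates multiply to {upsample_product}, but hop_size is {cfg['hop_size']}"
)
assert cfg["segment_size"] % cfg["hop_size"] == 0, "segment_size must be a multiple of hop_size"
print(f"sampling_rate={cfg['sampling_rate']}, hop_size={cfg['hop_size']}: config looks consistent")
```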
