
High duration loss #40

Open
w11wo opened this issue Mar 26, 2024 · 17 comments

w11wo commented Mar 26, 2024

Hi @p0p4k, thanks for making this repo!

I am currently trying to train a 44.1kHz English model, but my model is struggling with a rather high duration loss when compared against your TensorBoard logs. It currently looks as follows:

[screenshot: TensorBoard duration loss curve]

It seems like the other loss terms are correct.

Also, when the generated mel-spectrogram is passed to a vocoder, the audio is very much wrong in pronunciation -- maybe only half right.

My P-Flow config can be found here, and the corresponding HiFi-GAN vocoder config can be found here.

Could you please let me know where I might be wrong? Thanks in advance!


Oleksandr2505 commented Mar 26, 2024

Hello. Make sure you set "english_cleaners2" in synthesise.py.
The cleaner there has to match the one you trained with when you generate English speech: if you put "english_cleaners3" in your JSON config, then use that in synthesise.py as well.

sequence = torch.tensor(intersperse(text_to_sequence(stressed_text, ['english_cleaners2']), 0), dtype=torch.long)
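
For reference, a minimal sketch of keeping the inference cleaners in sync with training (the import paths follow the Matcha-style layout this fork uses and are assumptions; adjust them to your checkout):

```python
# Minimal sketch: keep the inference cleaners in sync with the training config.
# Import paths are assumed (Matcha-style layout); adjust to the actual modules.
import torch

from pflow.text import text_to_sequence      # assumed import path
from pflow.utils.utils import intersperse    # assumed import path

cleaners = ["english_cleaners3"]  # must match the "cleaners" entry of the training config

text = "Printing, in the only sense with which we are at present concerned."
sequence = torch.tensor(
    intersperse(text_to_sequence(text, cleaners), 0),  # insert blank token id 0 between symbols
    dtype=torch.long,
)[None]  # add a batch dimension
print(sequence.shape)
```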


w11wo commented Mar 26, 2024

Hi @Oleksandr2505.

Yes, I have changed both the training and inference phonemizer to english_cleaners3, so it's definitely not that issue. I've also checked the phonemization output and it is correct.

It's also unrelated to the issue I raised, which is more on the training duration loss, not inference.


Oleksandr2505 commented Mar 26, 2024

I am training my model on a 4-hour English dataset at 22.05 kHz; I recently passed 1,000 epochs and it sounds good. Maybe you should try switching the vocoder: at the start I had huge artifacts, and it got better once I used a vocoder trained on the VCTK v1 dataset. Maybe you should look for another vocoder that matches the 44.1 kHz sample rate. You also wrote about the training duration loss -- is it something different from the regular loss?
Btw, these are my logs:

[screenshot: TensorBoard loss curves]


w11wo commented Mar 26, 2024

Yeah, your loss curves look very much like mine.

I'd say I probably don't have an issue with the vocoder, since it's not an issue of artifacts, but pronunciation quality. And yes, the HiFi-GAN I'm using has been trained on 44.1kHz, which isn't a problem.

My main suspect is my rather high duration loss (~2), versus the TensorBoard graphs posted in the README, which drop below 1.0 early on -- that is the trend I expected. It's likely that this affects the pronunciation quality too.

I just wanted to make sure that my 44.1kHz P-Flow config is correct, in case I'm missing something. But thanks for your suggestions nonetheless.
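
For anyone debugging a similar setup, here is a minimal sketch of the mismatch check worth running first (the key names follow common Matcha-style YAML and upstream HiFi-GAN JSON configs and are assumptions here, as are the file paths): the acoustic model and the vocoder must agree on the mel/audio settings.

```python
# Minimal sketch: verify the acoustic model and vocoder agree on mel/audio settings.
# Key names follow common Matcha-style YAML and HiFi-GAN JSON configs; adjust as needed.
import json
import yaml  # pip install pyyaml

with open("configs/data/my_44khz_dataset.yaml") as f:   # hypothetical P-Flow data config
    pflow_cfg = yaml.safe_load(f)
with open("hifigan/config_44khz.json") as f:            # hypothetical HiFi-GAN config
    hifi_cfg = json.load(f)

pairs = [
    ("sample_rate", "sampling_rate"),
    ("n_fft", "n_fft"),
    ("hop_length", "hop_size"),
    ("win_length", "win_size"),
    ("n_feats", "num_mels"),
    ("f_min", "fmin"),
    ("f_max", "fmax"),
]
for pflow_key, hifi_key in pairs:
    a, b = pflow_cfg.get(pflow_key), hifi_cfg.get(hifi_key)
    status = "OK" if a == b else "MISMATCH"
    print(f"{status}: pflow {pflow_key}={a} vs hifigan {hifi_key}={b}")
```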


Oleksandr2505 commented Mar 26, 2024

Yeah, actually I see that the author got less than 1, but we don't know his exact config, so we can only test and try. I have this parameter very close to 1 as well; eventually it may get under it, but I can't say for sure. One more tip: when I trained a Ukrainian model and then tried to run inference with the English model, I was getting words that were only half right. It was mainly because of the cleaners, but I had also changed things in the pflow/text folder, like "symbols" and others. So make sure you did not put anything other than English symbols or numbers there. That's how I fixed it.

[screenshot: TensorBoard loss curves]
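
To make that symbol check concrete, here is a minimal sketch (the import paths, cleaner name, and "path|text" filelist format are assumptions based on the Matcha-style layout): any character the cleaner produces that is missing from the symbol set gets dropped or mis-mapped and can cause exactly this kind of half-right pronunciation.

```python
# Minimal sketch: find cleaned characters that are not covered by the symbol set.
# Assumes Matcha-style modules (pflow.text.symbols, cleaners) and a "path|text" filelist.
from collections import Counter

from pflow.text.symbols import symbols             # assumed import path
from pflow.text.cleaners import english_cleaners3  # assumed cleaner name

known = set(symbols)
unknown = Counter()

with open("filelists/train.txt", encoding="utf-8") as f:   # hypothetical filelist path
    for line in f:
        text = line.rstrip("\n").split("|")[-1]
        for ch in english_cleaners3(text):
            if ch not in known:
                unknown[ch] += 1

print("Characters produced by the cleaner but missing from symbols:")
for ch, count in unknown.most_common():
    print(repr(ch), count)
```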

@Oleksandr2505

Btw, I would like to know: does your model synthesize decent, sufficiently long pauses between sentences or not? I could not find how to control it.


w11wo commented Mar 26, 2024

@Oleksandr2505 I can't tell, to be honest. The pronunciation isn't good enough to judge the pauses between words. But from my experiments with other TTS models (VITS, VITS2), you just have to train on audio that contains pauses as long as the ones you want.

If you want more controllable pauses, I'd recommend something like FastSpeech2, since its duration prediction is explicit and deterministic, I think. VITS/VITS2/P-Flow are better for more expressive speech.
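
Since P-Flow does not expose an explicit pause control, one workaround is to synthesize sentence by sentence and concatenate the clips with explicit silence. A minimal sketch below, where synthesise_to_wav is a hypothetical wrapper around your own inference code:

```python
# Minimal sketch: control inter-sentence pauses by concatenating per-sentence audio
# with explicit silence. `synthesise_to_wav` stands in for your own inference function.
import numpy as np

SAMPLE_RATE = 44_100          # match your vocoder's sampling rate
PAUSE_SECONDS = 0.4           # desired gap between sentences

def synthesise_to_wav(sentence: str) -> np.ndarray:
    """Hypothetical wrapper: run P-Flow + vocoder and return a float32 waveform."""
    raise NotImplementedError

def synthesise_paragraph(sentences: list[str]) -> np.ndarray:
    silence = np.zeros(int(PAUSE_SECONDS * SAMPLE_RATE), dtype=np.float32)
    chunks = []
    for i, sentence in enumerate(sentences):
        if i > 0:
            chunks.append(silence)            # fixed-length pause between sentences
        chunks.append(synthesise_to_wav(sentence))
    return np.concatenate(chunks)
```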

@Oleksandr2505

@w11wo, thank you for the info!


w11wo commented Mar 28, 2024

Update: I tried many different setups and found quite an interesting change that finally led to decent performance.

The duration loss (which contributes to the overall loss) converges to a lower value faster if the model's vocab size is 178, derived from the default LJSpeech + espeak setup. In the logs above, where I used gruut as the phonemizer, I had set a smaller vocab size of 78 in the model config. Simply increasing the vocab size back to 178 led to better convergence of the duration loss (for some reason). However, I'm still unable to get a loss of < 1, unlike the one posted in the README.
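
For anyone hitting the same thing, here is a minimal sketch of the sanity check involved (import paths are assumptions, as before): n_vocab in the model config must be at least the size of the symbol set, and every token id the cleaner produces must stay below it.

```python
# Minimal sketch: make sure the configured n_vocab covers every token id in use.
from pflow.text import text_to_sequence       # assumed import path
from pflow.text.symbols import symbols        # assumed import path

n_vocab = 178                                 # value from configs/model/pflow.yaml
cleaners = ["english_cleaners3"]              # must match the training config

assert n_vocab >= len(symbols), (
    f"n_vocab={n_vocab} is smaller than the symbol set ({len(symbols)})"
)

sample = "The quick brown fox jumps over the lazy dog."
ids = text_to_sequence(sample, cleaners)
assert max(ids) < n_vocab, f"token id {max(ids)} is out of range for n_vocab={n_vocab}"
print(f"symbols={len(symbols)}, n_vocab={n_vocab}, max id in sample={max(ids)}")
```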

@patriotyk

But do you see a difference in audio output between the models with vocab sizes 178 and 78? Or are only the metrics different?


w11wo commented Apr 3, 2024

@patriotyk Yes, I do hear a massive difference in audio output. I think the duration loss values reflect the overall performance of the audio as well, given that the validation loss also decreases with the increased vocab size.

Eventually I could get the duration loss to ~1.0, which is much better than the initial experiments with the smaller vocab size.

The odd thing is why vocab size even impacts the duration loss -- I'm clueless about this.


patriotyk commented Apr 3, 2024

@w11wo I have found that configs/model/pflow.yaml contains an n_vocab equal to 178. I think it should be 78 in your case, and maybe you will get an even better (less than 1.0) duration loss, because then all 78 symbols are real and you will not have unused symbols.


w11wo commented Apr 3, 2024

@patriotyk Not really. What I meant in my comments above is that I started with a vocab size of 78, which led to the high loss graph attached at the top of this issue. Only by increasing it to 178 could I achieve a low duration loss of about 1.

@patriotyk

Ah, I misunderstood you. By changing the vocab you meant also changing this config value; I thought you were changing the real symbol count without updating it in the config.


w11wo commented Apr 3, 2024

Ah, yeah. Whenever I changed the symbol list, I also made sure to change the vocab size value in the config. It's weird that an actual symbol set of 78 would benefit from increasing the model config's vocab size.
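
One thing worth noting (speculation, not a confirmed explanation): n_vocab only sets the size of the text embedding table, so rows beyond the real symbol count are simply unused parameters; the one case where it matters directly is when token ids exceed the smaller table. A toy illustration of that failure mode (all numbers are made up):

```python
# Toy illustration: an embedding sized to n_vocab=78 fails (IndexError) if the
# tokenizer still emits ids based on a 178-symbol table; the larger table is fine.
import torch
import torch.nn as nn

small = nn.Embedding(num_embeddings=78, embedding_dim=192)
large = nn.Embedding(num_embeddings=178, embedding_dim=192)

token_ids = torch.tensor([[3, 41, 120, 77]])   # id 120 only exists in the larger table

print(large(token_ids).shape)                  # torch.Size([1, 4, 192]) -- works
try:
    small(token_ids)
except IndexError as e:
    print("small table:", e)                   # index out of range in self
```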

@Oleksandr2505

@w11wo Hi, I just recalled that you mentioned training your model with a 44.1 kHz HiFi-GAN vocoder. Where did you get it, or how did you train it?


w11wo commented Jun 3, 2024

Hi @Oleksandr2505. I ended up training a 44.1kHz HiFi-GAN from scratch, which took about 6 days on a 1xH100 GPU. You can find our fork of the HiFi-GAN training code/repo here.
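
For anyone else retargeting HiFi-GAN to 44.1 kHz, a minimal sketch of the consistency check that usually matters before a long run (key names follow the upstream jik876/hifi-gan config.json; the file path is a placeholder): the product of upsample_rates must equal hop_size, and the sampling rate and mel settings must match the acoustic model.

```python
# Minimal sketch: sanity-check a HiFi-GAN config before a long 44.1 kHz training run.
# Key names follow the upstream jik876/hifi-gan config.json format.
import json
import math

with open("config_44khz.json") as f:     # hypothetical config path
    cfg = json.load(f)

upsample_product = math.prod(cfg["upsample_rates"])
assert upsample_product == cfg["hop_size"], (
    f"upsample_rates multiply to {upsample_product}, but hop_size is {cfg['hop_size']}"
)
assert cfg["segment_size"] % cfg["hop_size"] == 0, "segment_size must be a multiple of hop_size"
print(f"sampling_rate={cfg['sampling_rate']}, hop_size={cfg['hop_size']}: config looks consistent")
```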
