High duration loss #40
Comments
Hello. Make sure you specified "english_cleaners2" in synthesise.py: sequence = torch.tensor(intersperse(text_to_sequence(stressed_text, ['english_cleaners2']), 0), dtype=torch.long,
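For readers unfamiliar with the `intersperse(..., 0)` call in the line above: in VITS-style codebases this helper usually places a blank token ID between every pair of symbol IDs, as well as at both ends. The sketch below is an assumption about what the helper does here, not a copy of this repository's code, and the token IDs are made up for illustration.

```python
def intersperse(tokens, item):
    """Return a new list with `item` before, between, and after every token."""
    result = [item] * (len(tokens) * 2 + 1)
    result[1::2] = tokens  # original tokens land on the odd positions
    return result


# Hypothetical symbol IDs standing in for the output of text_to_sequence(...)
token_ids = [12, 45, 7]
print(intersperse(token_ids, 0))  # [0, 12, 0, 45, 0, 7, 0]
```

Note that the blank reuses ID 0, so interspersing does not by itself increase the number of distinct symbol IDs the model has to embed.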
Hi @Oleksandr2505. Yes, I have changed the phonemizer for both training and inference. It's also unrelated to the issue I raised, which concerns the duration loss during training, not inference.
I am training my model on a 4-hour English dataset at 22,050 Hz. I recently passed 1,000 epochs and it sounds good. Maybe you should try switching the vocoder: at the start I had huge artifacts, but after I took a vocoder trained on the VCTK v1 dataset it got better. Maybe you should look for another vocoder that matches the 44.1 kHz sample rate. You also mentioned the training duration loss; is it something different from the regular loss?
Yeah, your loss curves look very much like mine. I'd say I probably don't have an issue with the vocoder, since the problem is not artifacts but pronunciation quality. And yes, the HiFi-GAN I'm using was trained on 44.1kHz audio, so that isn't a problem. My main suspect is my rather high duration loss (~2), compared with the TensorBoard graphs posted in the README, where it easily drops below 1.0 early on, which I took to be the expected trend. It's likely that this affects the pronunciation quality too. I just wanted to make sure that my 44.1kHz P-Flow config is correct, in case I'm missing something. But thanks for your suggestions nonetheless.
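To put a duration loss of ~2 versus < 1 in perspective: in Glow-TTS-style models the duration loss is typically the squared error between predicted and aligned log-durations, summed and divided by the number of text tokens. Whether this repository computes it exactly that way is an assumption, and the durations below are invented for illustration. A minimal pure-Python sketch under that assumption:

```python
import math


def duration_loss(pred_log_dur, target_dur_frames, eps=1e-8):
    """Squared error in the log-duration domain, averaged over tokens."""
    assert len(pred_log_dur) == len(target_dur_frames)
    total = 0.0
    for logw_pred, frames in zip(pred_log_dur, target_dur_frames):
        logw_target = math.log(frames + eps)
        total += (logw_pred - logw_target) ** 2
    return total / len(pred_log_dur)


# Hypothetical aligned durations (in mel frames) for four tokens.
target = [2, 5, 9, 3]
# A predictor that is off by a factor of e on every token gives a loss of about 1.
pred = [math.log(d) + 1.0 for d in target]
print(round(duration_loss(pred, target), 3))
```

Under this reading, a loss near 2 would mean the predicted durations are off by a factor of roughly 4 per token in the root-mean-square sense (e raised to the square root of 2), which would be consistent with clearly audible timing and pronunciation problems.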
Yeah, actually I see that the author got less than 1, but we don't know his exact config, so we can only test and try. My value for this loss is very close to 1 as well; eventually it may get under 1, but I can't be 100% sure. One more tip: when I was training a Ukrainian model and then tried to run inference with the English model, I was getting about half the words right and half wrong. It was mainly because of the cleaners, but I had also been changing things in the pflow/text folder, such as "symbols" and others. So make sure you did not add anything else in place of the English symbols or numbers. That is how I fixed it.
By the way, I would like to know: does your model synthesize decent, sufficiently long pauses between sentences? I could not find a way to control this.
@Oleksandr2505 I can't tell, to be honest. The pronunciation isn't good enough to judge the pauses between words. But per my experiments with other TTS models (VITS, VITS2), you just have to train on audio that has pauses as long as you want them. If you want a model with more customizable pauses, I'd recommend something like FastSpeech2, which I think is more controllable since its duration prediction is deterministic. VITS/VITS2/P-Flow are better for more expressive speech.
Thank you for the info!
Update: I tried many different setups and found an interesting change that finally led to decent performance. The duration loss (which contributes to the overall loss) converges to a lower value faster if the model's vocab size is 178, the value derived from the default LJSpeech + espeak setup. In the logs above, where I used gruut as the phonemizer, I used a smaller vocab size of 78 in the model config. Simply increasing the vocab size back to 178 led to better convergence of the duration loss (for some reason). However, I'm still unable to get a loss of < 1, unlike the one posted in the README.
But do you notice a difference in audio output between the models with vocab size 178 and 78, or are only the metrics different?
@patriotyk Yes, I do hear a massive difference in audio output. I think the duration loss values reflect the overall performance of the audio as well, given that the validation loss also decreases with the increased vocab size. Eventually I could get the duration loss to ~1.0, which is much better than the initial experiments with the smaller vocab size. The odd thing is why vocab size even impacts the duration loss -- I'm clueless about this.
@w11wo I have found that
@patriotyk Not really. What I meant by my comments above is that I started with a vocab size of 78, which led to the high loss graph attached at the top of this issue. It is only by increasing it to 178 that I could achieve a low duration loss of about 1.
Ah, I misunderstood you. By changing the vocab you also meant changing this value. I thought you were changing the actual symbol count without updating the config.
Ah yeah. Whenever I changed the symbol list, I also made sure to change the vocab size value in the config. It's weird that a model with an actual symbol count of 78 would benefit from increasing the config's vocab size.
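The bookkeeping described in this exchange (keeping the symbol list and the configured vocab size in step) can be checked with a small sanity test before training. This is only a sketch: `symbols`, `n_vocab`, and `check_vocab_size` are hypothetical names chosen for illustration, not this repository's actual identifiers.

```python
def check_vocab_size(symbols, n_vocab):
    """Raise if the configured embedding size cannot hold every symbol ID.

    Returns the number of embedding rows that no symbol would ever use.
    """
    if len(set(symbols)) != len(symbols):
        raise ValueError("symbol list contains duplicates")
    max_id = len(symbols) - 1
    if n_vocab <= max_id:
        raise ValueError(
            f"n_vocab={n_vocab} is too small for {len(symbols)} symbols"
        )
    return n_vocab - len(symbols)


# Toy symbol list; a real one would come from the text front end.
symbols = ["_", " ", "a", "b", "c"]
unused = check_vocab_size(symbols, n_vocab=8)
print(unused)
```

With 78 actual symbols and a configured size of 178, a check like this would report 100 unused rows. It only rules out an undersized embedding table; it says nothing about why the larger setting trained better.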
@w11wo Hi, I just recalled that you mentioned training the model with a 44.1kHz HiFi-GAN vocoder. Where did you get it, or how did you train it?
Hi @Oleksandr2505. I ended up training a 44.1kHz HiFi-GAN from scratch, which took about 6 days on a single H100 GPU. You can find our fork of the HiFi-GAN training code/repo here.
Hi @p0p4k, thanks for making this repo!
I am currently trying to train a 44.1kHz English model, but my model is struggling with a rather high duration loss when compared against your TensorBoard logs. It currently looks as follows:
It seems like the other loss terms are correct.
Also, when the generated mel-spectrogram is passed to a vocoder, the audio is very much wrong in pronunciation -- maybe only half right.
My P-Flow config can be found here, and the corresponding HiFi-GAN vocoder config can be found here.
Could you please let me know where I might be wrong? Thanks in advance!