
NAN loss #31

Open
0913ktg opened this issue Feb 22, 2024 · 8 comments

@0913ktg commented Feb 22, 2024

Hello p0p4k,

I've begun training a PFlow Korean model using the code you shared. However, I encountered a NaN loss during training. I used a publicly available Korean dataset and structured the filelist in single-speaker format as filename|text.

Although the dataset contains over 2,000 speakers, it lacks speaker labels, so I trained with a single-speaker setting. I understand that differences in data and preprocessing might lead to various issues, but if you have any insights into the potential causes of the NaN loss, I would greatly appreciate your advice.

It's snowing heavily in Korea right now. Have a great day.

[screenshot: training logs]

At first, training seems to go well, but then suddenly something goes wrong.

[screenshots: loss curves]

@0913ktg (Author) commented Feb 22, 2024

The training environment used CUDA 11.8, PyTorch 2.1.2, torchaudio 2.1.2, and torchvision 0.16.2, with DDP training on four NVIDIA A100-SXM4 (80 GB) cards.
The dataset consists of 253K audio-text pairs with a batch size of 256, and the text was phonemically converted using the Korean grapheme-to-phoneme module g2pk.
We are currently retraining the model with the batch size reduced to 64.
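
For reference, a minimal sketch of how such a filelist can be phonemized with g2pk (not the exact script I used; the file paths and the filename|text format are assumptions based on the description above):

```python
from g2pk import G2p  # pip install g2pk

g2p = G2p()

# Assumed paths; each line of the input filelist is "filename|text".
with open("filelists/train.txt", encoding="utf-8") as fin, \
     open("filelists/train_phoneme.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        filename, text = line.rstrip("\n").split("|", maxsplit=1)
        fout.write(f"{filename}|{g2p(text)}\n")  # grapheme-to-phoneme conversion
```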

@p0p4k (Owner) commented Feb 22, 2024

KSS dataset? It is snowing a lot today, so be careful!

@p0p4k (Owner) commented Feb 22, 2024

Ah, it is not the KSS dataset but a multi-speaker dataset! Maybe there is too much variance; can you try taking a small subset of 3-4 speakers and training on that first?

@p0p4k (Owner) commented Feb 22, 2024

In my case, I sometimes got NaN loss because of dataset issues.

@0913ktg (Author) commented Feb 22, 2024

After changing the batch size to 64, the model is no longer showing NaN loss. I will continue to monitor and share the results.

Additionally, there is a part where the original mel-spectrogram is added to TensorBoard with add_image without removing zero-padding. It would be beneficial to trim the zero-padding using the batch's y_lengths before logging, along the lines of the sketch below.
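
A rough sketch of what I mean (hypothetical; the y, y_lengths, plot_tensor names and tag strings are only placeholders for whatever the repo's validation logging actually uses):

```python
from torch.utils.tensorboard import SummaryWriter

def log_original_mels(writer: SummaryWriter, y, y_lengths, plot_tensor, step, n=2):
    """Log reference mels with the zero-padded frames removed (sketch)."""
    for i in range(min(n, y.size(0))):
        mel = y[i, :, : int(y_lengths[i])]   # keep only the valid frames of sample i
        writer.add_image(
            f"original/{i}",
            plot_tensor(mel.cpu()),          # assumed helper returning an HWC image array
            step,
            dataformats="HWC",
        )
```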

Lastly, while GPU usage was observed at 100% with p0p4k's vits2 repo, this repo does not seem to utilize the GPU as efficiently.

I wanted to inquire if there are any ongoing developments related to this.

Thank you always for your prompt response.

@p0p4k (Owner) commented Feb 22, 2024

About GPU usage, it might be because of the dataloader. We might have to investigate that. Keep me updated with samples. Good day!

@Tera2Space commented
Try disabling fp16 and using fp32 instead.
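
For example, if training runs through a PyTorch Lightning Trainer, full fp32 would look roughly like this (a hypothetical sketch; the exact config key in this repo may differ, and older Lightning versions import from pytorch_lightning instead):

```python
from lightning.pytorch import Trainer

# Full fp32 training instead of fp16 mixed precision; with Hydra/YAML configs,
# the equivalent change is the trainer's `precision` entry.
trainer = Trainer(
    devices=4,
    accelerator="gpu",
    strategy="ddp",
    precision="32-true",   # instead of "16-mixed"
)
```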

@matteotesta commented
That is due to the matmul of query and key overflowing in float16. You can find a solution to that problem in Sec. 2.4 of this paper (https://arxiv.org/pdf/2105.13290.pdf); see Eq. 4.
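
A minimal sketch of that trick (PB-relax, Eq. 4 of the CogView paper) for standard scaled dot-product attention; not taken from this repo:

```python
import torch

def pb_relax_scores(q, k, alpha=32.0):
    """Attention logits computed so they cannot overflow in fp16."""
    d_k = q.size(-1)
    # Scale queries down by alpha before the matmul, subtract the per-row max,
    # then scale back up; the softmax result is unchanged (row-wise constant shift).
    scores = torch.matmul(q / (alpha * d_k ** 0.5), k.transpose(-2, -1))
    scores = (scores - scores.amax(dim=-1, keepdim=True)) * alpha
    return scores  # feed into softmax as usual (add any attention mask first)
```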
