
NAN loss #31

Open
0913ktg opened this issue Feb 22, 2024 · 8 comments

@0913ktg commented Feb 22, 2024

Hello p0p4k,

I've begun training a PFlow Korean model using the code you shared. However, I encountered a NaN loss during training. I used a publicly available Korean dataset and structured the filelist in single-speaker format as filename|text.

Although the dataset contains over 2,000 speakers, it lacks speaker labels, so I trained with a single-speaker setting. I understand that differences in data and preprocessing might lead to various issues, but if you have any insights into the potential causes of the NaN loss, I would greatly appreciate your advice.

It's snowing heavily in Korea right now. Have a great day.

[screenshot: training logs]

At first, training seems to go well, but then suddenly something goes wrong.

[screenshots: loss curves]

@0913ktg (Author) commented Feb 22, 2024

The training environment used CUDA 11.8, PyTorch 2.1.2, torchaudio 2.1.2, and torchvision 0.16.2, with DDP training on four NVIDIA A100-SXM4 (80 GB) cards.
The dataset consists of 253K audio-text pairs with a batch size of 256, and the text was phonemically converted using the Korean grapheme-to-phoneme module g2pk.
We are currently retraining the model with the batch size reduced to 64.
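
For reference, a minimal sketch of how such a filelist can be phonemized with g2pk (not the exact script I used; the file paths and the filename|text format are assumptions based on the description above):

```python
from g2pk import G2p  # pip install g2pk

g2p = G2p()

# Assumed paths; each line of the input filelist is "filename|text".
with open("filelists/train.txt", encoding="utf-8") as fin, \
     open("filelists/train_phoneme.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        filename, text = line.rstrip("\n").split("|", maxsplit=1)
        fout.write(f"{filename}|{g2p(text)}\n")  # grapheme-to-phoneme conversion
```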

@p0p4k (Owner) commented Feb 22, 2024

KSS dataset? It is snowing a lot today, so be careful!

@p0p4k (Owner) commented Feb 22, 2024

Ah, it is not the KSS dataset but a multi-speaker dataset! Maybe there is too much variance; can you try taking a small subset of 3-4 speakers and training on that first?

@p0p4k (Owner) commented Feb 22, 2024

In my case, I sometimes got NaN loss because of dataset issues.

@0913ktg (Author) commented Feb 22, 2024

After changing the batch size to 64, the model is no longer showing NaN loss. I will continue to monitor and share the results.

Additionally, there is a part where the original mel-spectrogram is added to TensorBoard with add_image without removing zero-padding. It would be beneficial to trim the zero-padding using the batch's y_lengths before logging, along the lines of the sketch below.
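
A rough sketch of what I mean (hypothetical; the y, y_lengths, plot_tensor names and tag strings are only placeholders for whatever the repo's validation logging actually uses):

```python
from torch.utils.tensorboard import SummaryWriter

def log_original_mels(writer: SummaryWriter, y, y_lengths, plot_tensor, step, n=2):
    """Log reference mels with the zero-padded frames removed (sketch)."""
    for i in range(min(n, y.size(0))):
        mel = y[i, :, : int(y_lengths[i])]   # keep only the valid frames of sample i
        writer.add_image(
            f"original/{i}",
            plot_tensor(mel.cpu()),          # assumed helper returning an HWC image array
            step,
            dataformats="HWC",
        )
```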

Lastly, while GPU usage was observed at 100% with p0p4k's vits2 repo, this repo does not seem to utilize the GPU as efficiently.

I wanted to inquire if there are any ongoing developments related to this.

Thank you always for your prompt response.

@p0p4k (Owner) commented Feb 22, 2024

About GPU usage, it might be because of the dataloader. We might have to investigate that. Keep me updated with samples. Good day!

@Tera2Space commented
Try disabling fp16 and using fp32 instead.
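
For example, if training runs through a PyTorch Lightning Trainer, full fp32 would look roughly like this (a hypothetical sketch; the exact config key in this repo may differ, and older Lightning versions import from pytorch_lightning instead):

```python
from lightning.pytorch import Trainer

# Full fp32 training instead of fp16 mixed precision; with Hydra/YAML configs,
# the equivalent change is the trainer's `precision` entry.
trainer = Trainer(
    devices=4,
    accelerator="gpu",
    strategy="ddp",
    precision="32-true",   # instead of "16-mixed"
)
```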

@matteotesta commented
That is due to the matmul of query and key overflowing in float16. You can find a solution to that problem in Sec. 2.4 of this paper (https://arxiv.org/pdf/2105.13290.pdf); see Eq. 4.
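
A minimal sketch of that trick (PB-relax, Eq. 4 of the CogView paper) for standard scaled dot-product attention; not taken from this repo:

```python
import torch

def pb_relax_scores(q, k, alpha=32.0):
    """Attention logits computed so they cannot overflow in fp16."""
    d_k = q.size(-1)
    # Scale queries down by alpha before the matmul, subtract the per-row max,
    # then scale back up; the softmax result is unchanged (row-wise constant shift).
    scores = torch.matmul(q / (alpha * d_k ** 0.5), k.transpose(-2, -1))
    scores = (scores - scores.amax(dim=-1, keepdim=True)) * alpha
    return scores  # feed into softmax as usual (add any attention mask first)
```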
