Crash in MAS #24

patriotyk · 2024-02-06T22:13:55Z

We are experiencing a strange issue. With one of our big datasets (about 300 hours), MAS crashes randomly. The core dump shows the following line:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007feaf9884576 in __pyx_f_5pflow_5utils_15monotonic_align_4core_maximum_path_each (__pyx_v_path=..., __pyx_v_value=..., __pyx_v_path=..., __pyx_v_value=..., __pyx_optional_args=0x0, __pyx_v_t_x=289, __pyx_v_t_y=279)
    at pflow/utils/monotonic_align/core.c:17615
17615       if (__pyx_t_7) {

We have tried everything we could think of, but nothing helped.
The only thing that avoided the crash was replacing MAS with AlignerNet, but that led to another issue: a crash at inference. Maybe the synthesis method requires some changes too?

I have successfully trained pflowtts on a single-speaker dataset, which is a subset of this bigger dataset, and it sounds great. A demo is here: https://tts.patriotyk.name

I have also built a Docker image and pushed it to a registry. It can be used to reproduce this issue: you just need to pull and run it. I can share the URL in a private message if you need it.

p0p4k · 2024-02-07T11:09:57Z

For debugging MAS: I am not good at C++ yet, but I can suggest one thing. Run only MAS (encoder + spectrogram); everything else in the batch can be deleted. At some batch it will fail. Open that batch and run one sample at a time, and you will find the sample that causes the problem.
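
One cheap thing to rule out before bisecting batches: the core dump in the first post shows t_x=289 and t_y=279. Assuming t_x is the number of text tokens and t_y the number of mel frames (the usual convention in Glow-TTS-style MAS code, but an assumption about this build), that sample has more tokens than frames, so no monotonic alignment exists for it. A kernel compiled without bounds checks could then read outside its arrays. A minimal sketch of a pre-filter, where the sample ids and lengths are hypothetical:

```python
from typing import Iterable, List, Tuple


def find_unalignable(lengths: Iterable[Tuple[str, int, int]]) -> List[str]:
    """Return ids of samples whose text is longer than their mel spectrogram.

    Each item is (sample_id, text_len, mel_len). A monotonic alignment needs
    at least one mel frame per text token, so text_len > mel_len has no path.
    """
    return [sid for sid, text_len, mel_len in lengths if text_len > mel_len]


# Hypothetical lengths; the second row mirrors the values in the core dump.
samples = [("a.wav", 120, 400), ("b.wav", 289, 279), ("c.wav", 50, 50)]
bad = find_unalignable(samples)
print(bad)
```

If the pipeline inserts blank tokens between phonemes (not confirmed in this thread), text_len should be measured after that step.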

patriotyk · 2024-02-07T19:04:27Z

Thank you for your fast answer.
We have found files that cause the crash, but they look normal. After removing such a file we are able to run training, but then it crashes on another one. It may also run for a few epochs without crashing, so every file has already been used for training successfully, and then it crashes. Maybe it would be easier to switch to AlignerNet? I have uncommented the code that you commented out in the constructor and in the forward method, and commented out the call to MAS. This works fine: it trains without crashes, but inference crashes. Maybe you could help us with this? Do I need to change something in the synthesise method for it to work properly?

p0p4k · 2024-02-07T22:44:08Z

Sure, I'll fix AlignerNet synthesis this week. What is the error during inference?

patriotyk · 2024-02-09T21:18:03Z

Oh, you have edited your answer. It was a crash. If you need more info, I can run it again and tell you. But I think I may have made a mistake. Maybe you can push the changes to a separate branch and I will compare.

p0p4k · 2024-02-10T02:19:38Z

Someone else tried AlignerNet and it worked OK for them, so I am not sure how to debug this without an error message. If it's a crash, maybe it's a dataset issue? Does AlignerNet train on the small subset?

Tera2Space · 2024-02-10T03:34:52Z

> I have uncommented code that you commented in constructor and in forward method and commented call to MAS. This works fine, it trains without crashes but inference crashes.

But AlignerNet isn't used during inference.

p0p4k · 2024-02-10T04:36:44Z

True. If it crashes during training, I think it is a dataset issue.

patriotyk · 2024-02-10T07:22:48Z

No, it doesn't crash during training. I got the crash in the synthesis method after loading the trained checkpoint.

p0p4k · 2024-02-10T11:47:21Z

Send some random input to the duration predictor. Does it predict something?

Tera2Space · 2024-02-10T12:54:24Z

> No it doesn't crash during the training. I got crash after I loaded trained checkpoint on synthesis method.

Is the error something like "out of memory"?

patriotyk · 2024-02-13T10:26:41Z

> > No it doesn't crash during the training. I got crash after I loaded trained checkpoint on synthesis method.
>
> Is the error something like "out of memory"?

This evening I will try to run it again and tell you more details.

Tera2Space · 2024-02-14T14:42:45Z

> > > No it doesn't crash during the training. I got crash after I loaded trained checkpoint on synthesis method.
> >
> > Is the error something like "out of memory"?
>
> This evening I will try to run it again and tell you more details.

If you want, we can talk on Telegram (https://t.me/TeraSpace). I speak Ukrainian and Russian.

patriotyk · 2024-02-14T20:45:02Z

@p0p4k I have tried again and yes, at inference I got an out-of-memory error, the same as @Tera2Space mentioned.

p0p4k · 2024-02-14T20:46:36Z

So, maybe try inference on a short sentence? Does that work? If it does, then it is just a memory issue and not a code issue.

patriotyk · 2024-02-14T21:02:25Z

It is a short sentence. No, it doesn't work. It runs for a very long time and then crashes. https://drive.google.com/file/d/1WaIYiloaf3oDVtkWb5LH8YN0XWW2KXbR/view?usp=drivesdk

Tera2Space · 2024-02-15T00:27:49Z

Yep, same here. I believe AlignerNet didn't converge, so the duration predictor learned wrong alignments. At inference the audio then becomes very long and causes the out-of-memory error.
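
To make this hypothesis checkable, here is a minimal sketch of how runaway durations translate into frame counts. It assumes the duration predictor outputs log-durations that are exponentiated and rounded up, as in Glow-TTS-style synthesis code (an assumption about this repository). The function name and the max_frames_per_token cap are hypothetical.

```python
import math
from typing import List


def frames_from_log_durations(
    logw: List[float],
    length_scale: float = 1.0,
    max_frames_per_token: int = 50,  # hypothetical safety cap
) -> List[int]:
    """Convert predicted log-durations to integer frame counts, with a clamp."""
    durations = []
    for lw in logw:
        frames = math.ceil(math.exp(lw) * length_scale)
        # Keep at least one frame per token and cap runaway predictions.
        durations.append(max(1, min(frames, max_frames_per_token)))
    return durations


good = frames_from_log_durations([1.0, 1.5, 0.5])
runaway = frames_from_log_durations([12.0, 15.0, 9.0])
print(sum(good), sum(runaway))
```

Printing the unclamped sum for a short sentence would show whether the predicted length alone can explain the allocation size. The clamp hides the symptom but does not fix a duration predictor that learned wrong alignments.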

Tera2Space · 2024-02-15T00:29:55Z

> So, maybe try a small sentence inference? Does that work? If it does, then just memory issue and not code issue.

1500 GiB of VRAM... I think it's a code issue, because it also happens during evaluation while training, whereas with MAS it works fine.

p0p4k · 2024-02-15T02:19:42Z

Give the model some random durations instead of using the duration predictor and try to see the output. (One duration integer per phoneme)
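
A minimal sketch of this suggestion, using NumPy: draw one random integer duration per phoneme and expand the durations into a 0/1 alignment matrix of shape (phonemes, frames). The function name, the sentence length, and the duration range are hypothetical. Where the matrix plugs into the model depends on the synthesise method, which is not shown in this thread.

```python
import numpy as np


def path_from_durations(durations: np.ndarray) -> np.ndarray:
    """Expand per-phoneme integer durations into a 0/1 alignment matrix.

    Returns an array of shape (num_phonemes, total_frames) where row i is 1
    on the frames assigned to phoneme i and 0 elsewhere.
    """
    ends = np.cumsum(durations)
    starts = ends - durations
    frames = np.arange(ends[-1])
    inside = (frames[None, :] >= starts[:, None]) & (frames[None, :] < ends[:, None])
    return inside.astype(np.int64)


rng = np.random.default_rng(0)
num_phonemes = 6  # hypothetical sentence length
durations = rng.integers(low=2, high=10, size=num_phonemes)  # 2..9 frames each
attn = path_from_durations(durations)
print(durations, attn.shape)
```

If synthesis works with such durations, the decoder is likely fine and the problem sits in the duration predictor or the alignments it was trained on.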

patriotyk · 2024-02-16T18:51:00Z

Sorry, but I don't know how to do that.

Tera2Space · 2024-02-17T13:22:38Z

I will try clamping the aligner's output later.

Tera2Space · 2024-03-07T14:49:31Z

Now I'm wondering if the problem might be that we feed AlignerNet the text encoder outputs, which are passed through a convolution (to get the same dimension as the mel frames)? While I was testing the pitch predictor, it didn't work when conditioned on the text encoder output, but it worked when I used x_emb directly.

I will test it, and if it works I will create a PR.

p0p4k · 2024-03-07T20:35:24Z

> Now I’m wondering if the problem might be that we use text encoder outputs as input to alignernet, which(text encoder outputs) are passed through convolution (to get dimensions like mel frame)? Because while I was testing pitch predictor it didn't work when conditioned on output of text encoder, but when i tried to use x_emb directly it worked.
>
> I will test and if work I will create PR

Very interesting 🤔

lexkoro · 2024-05-03T02:26:43Z

Here is another wild guess: would you mind trying the numpy version of the maximum_path search?
https://github.com/coqui-ai/TTS/blob/dbf1a08a0d4e47fdad6172e433eeb34bc6b13b4e/TTS/tts/utils/helpers.py#L197

A long time ago I also had problems with segfaults. Running the training under gdb showed that they were related to maximum_path, and using the numpy version fixed it.
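
For readers who want to see what such a fallback looks like, below is a minimal single-sample sketch of the same dynamic program in NumPy. It is not the Coqui implementation linked above. The function name and the explicit t_x > t_y check are additions for illustration, and t_x / t_y are assumed to be the text and mel lengths.

```python
import numpy as np


def maximum_path_numpy(value: np.ndarray, t_x: int, t_y: int,
                       max_neg_val: float = -1e9) -> np.ndarray:
    """Monotonic alignment search for one sample.

    value: log-likelihoods, text tokens by mel frames, at least (t_x, t_y).
    Returns a 0/1 path with value's shape and exactly one token per frame.
    """
    if t_x < 1 or t_x > t_y:
        raise ValueError(f"no monotonic path for t_x={t_x}, t_y={t_y}")
    # q[x, y] is the best cumulative score of a path ending at (x, y).
    q = np.full((t_x, t_y), max_neg_val, dtype=np.float64)
    q[0, 0] = value[0, 0]
    for y in range(1, t_y):
        # Only cells from which the last token is still reachable.
        for x in range(max(0, t_x + y - t_y), min(t_x, y + 1)):
            stay = q[x, y - 1]
            move = q[x - 1, y - 1] if x > 0 else max_neg_val
            q[x, y] = max(stay, move) + value[x, y]
    # Backtrack from the last token at the last frame.
    path = np.zeros(value.shape, dtype=np.int64)
    index = t_x - 1
    for y in range(t_y - 1, -1, -1):
        path[index, y] = 1
        if index != 0 and (index == y or q[index, y - 1] < q[index - 1, y - 1]):
            index -= 1
    return path


scores = np.array([[0.0, 0.0, -5.0],
                   [-5.0, -5.0, 0.0]])
path = maximum_path_numpy(scores, t_x=2, t_y=3)
print(path)
```

Plain Python loops are much slower than the Cython kernel, but an invalid sample raises a ValueError here instead of killing the process, which makes the offending file easy to log.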

Tera2Space mentioned this issue Mar 7, 2024

AlignerNet instead of MAS p0p4k/vits2_pytorch#81

Open

patriotyk mentioned this issue May 3, 2024

Segmentation fault while training on a new language #44

Closed
