about zero-shot inference #37

Open
0913ktg opened this issue Mar 7, 2024 · 31 comments
@0913ktg

0913ktg commented Mar 7, 2024

Hello p0p4k,

I'm reaching out to you again with a question.

Thanks to your great help, I've successfully trained a Korean pflow model and run inference with it. During inference I observed a few limitations, but I confirmed that satisfactory voices are synthesized for seen speakers.

I used data from about 3,000 male and female speakers, only utilizing audio files longer than 4.1 seconds. I ran distributed training with a batch size of 64 on four NVIDIA A100 40G GPUs, completing 160 epochs (500k steps).

However, when synthesizing speech using unseen speakers' voices as prompts, I found that while the content is synthesized well, the prompt speaker's voice characteristics are not carried over to the synthesized audio.

This phenomenon was observed for both male and female speakers, and the inference code was written by referring to synthesis.ipynb (it is almost identical).

I'm looking into why the speaker's voice characteristics are not applied in zero-shot inference.

If you have experienced the same issue or know anything about it, I would appreciate your help. If there's any additional information I should provide, please comment below.

Thank you.
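For reference, my zero-shot call looks roughly like the sketch below (a minimal sketch based on synthesis.ipynb; names such as `zero_shot_synthesise`, `text_to_sequence`, and `mel_spectrogram` are placeholders rather than the repo's exact API):

```python
import torch

@torch.inference_mode()
def zero_shot_synthesise(model, vocoder, text, prompt_wav,
                         text_to_sequence, mel_spectrogram,
                         n_timesteps=10, sr=22050, hop_length=256):
    """Hypothetical zero-shot call: condition generation on an unseen speaker's mel prompt."""
    # Tokenize the target text with the same cleaner/phonemizer used in training.
    x = torch.tensor(text_to_sequence(text), dtype=torch.long)[None]
    x_lengths = torch.tensor([x.shape[-1]])

    # Mel prompt from the unseen speaker, cropped to roughly 3 s as in P-Flow training.
    prompt_mel = mel_spectrogram(prompt_wav)                 # (1, n_mels, T)
    prompt_mel = prompt_mel[..., : int(3 * sr / hop_length)]

    # Flow-matching decoder generates a mel conditioned on text + speech prompt.
    out = model.synthesise(x, x_lengths, n_timesteps=n_timesteps, prompt=prompt_mel)
    return vocoder(out["mel"]).squeeze()                     # waveform via a HiFi-GAN-style vocoder
```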

@0913ktg
Author

0913ktg commented Mar 7, 2024

  • unseen speaker prompt inference mel-spectrogram [screenshot]
  • seen speaker prompt inference mel-spectrogram [screenshot]

@p0p4k
Owner

p0p4k commented Mar 7, 2024

So that means the speech prompt encoder is not extracting exact speaker style info, but rather memorizing the seen speaker information. We might have to change the architecture a little bit in that case.

@0913ktg
Author

0913ktg commented Mar 7, 2024

Thank you for your response.
I will try modifying it to extract speaker characteristics, comparing against what the paper describes.

If I achieve good results, I will make sure to share them with you.
Thank you once again for your invaluable help.
Have a great day!

@p0p4k
Owner

p0p4k commented Mar 7, 2024

Maybe take a look at audiobox/voicebox architecture as well.

@yiwei0730

> Maybe take a look at audiobox/voicebox architecture as well.

Maybe the NaturalSpeech 2 speech prompt encoder can help? But I'm not sure it would really be useful. Do you have any suggestions?

NS2 Speech Prompt Encoder

| Hyperparameter | Value |
| --- | --- |
| Transformer Layers | 6 |
| Attention Heads | 8 |
| Hidden Size | 512 |
| Conv1D Filter Size | 2048 |
| Conv1D Kernel Size | 9 |
| Dropout | 0.2 |
| Parameters | 69M |

P-Flow Speech-prompted Text Encoder

| Hyperparameter | Value |
| --- | --- |
| Phoneme Embedding Dim | 192 |
| PreNet Conv Layers | 3 |
| PreNet Hidden Dim | 192 |
| PreNet Kernel Size | 5 |
| PreNet Dropout | 0.5 |
| Transformer Layers | 6 |
| Transformer Hidden Dim | 192 |
| Transformer Feed-forward Hidden Dim | 768 |
| Transformer Attention Heads | 2 |
| Transformer Dropout | 0.1 |
| Prompt Embedding Dim | 192 |
| Number of Parameters | 3.37M |
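For concreteness, here is a rough PyTorch sketch of an NS2-style speech prompt encoder using the hyper-parameters listed above (self-attention plus a Conv1D feed-forward). This is an illustrative re-implementation rather than the NaturalSpeech 2 authors' code, and the exact layer composition (hence the 69M parameter count) will not match exactly:

```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    """Conv1D feed-forward block (filter size 2048, kernel size 9, dropout 0.2)."""
    def __init__(self, d_model=512, d_ff=2048, kernel_size=9, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d_model, d_ff, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Conv1d(d_ff, d_model, kernel_size, padding=kernel_size // 2),
            nn.Dropout(dropout),
        )

    def forward(self, x):                        # x: (B, T, d_model)
        return self.net(x.transpose(1, 2)).transpose(1, 2)

class PromptEncoderLayer(nn.Module):
    """Self-attention (8 heads) + Conv1D FFN with post-norm residual connections."""
    def __init__(self, d_model=512, n_heads=8, dropout=0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = ConvFFN(d_model, dropout=dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        return self.norm2(x + self.ffn(x))

class SpeechPromptEncoder(nn.Module):
    """6-layer, hidden-size-512 prompt encoder over mel frames."""
    def __init__(self, n_mels=80, d_model=512, n_layers=6):
        super().__init__()
        self.proj_in = nn.Linear(n_mels, d_model)
        self.layers = nn.ModuleList([PromptEncoderLayer(d_model) for _ in range(n_layers)])

    def forward(self, prompt_mel):               # (B, n_mels, T) -> (B, T, d_model)
        x = self.proj_in(prompt_mel.transpose(1, 2))
        for layer in self.layers:
            x = layer(x)
        return x
```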

@p0p4k
Owner

p0p4k commented Mar 11, 2024

Yes, it can help. We can yank lucidrains's code. Can you do a PR?

@yiwei0730

I'm glad you think this solution could be useful, but I work for a company and can't upload the code to GitHub. I will report back in time if I have any new progress.

@0913ktg
Author

0913ktg commented Mar 12, 2024

Hi yiwei0730.
Thank you for your advice.
I'll do some testing and share the results with you.
Thank you.

@0913ktg
Author

0913ktg commented Mar 13, 2024

Hello p0p4k, yiwei0730,

I have incorporated the prompt encoder part from the 'https://github.com/adelacvg/NS2VC' repository to extract prompt features for the text encoder.

The reason I chose this model is that it uses mel spectrograms for training, as opposed to ns2, which utilizes codec representations.

I plan to conduct two experiments: one adding the prompt encoder to the model structure mentioned in the paper, and another incorporating it into p0p4k's structure.

I will share the results once they are available.
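Conceptually, the wiring into the speech-prompted text encoder looks like the sketch below (a hedged sketch using the dimensions from the tables in the earlier comment; `SpeechPromptEncoder` stands in for the NS2VC prompt encoder, and the real code in p0p4k's repo differs in detail, e.g. the prenet and duration predictor are omitted here):

```python
import torch
import torch.nn as nn

class SpeechPromptedTextEncoder(nn.Module):
    """Prepend projected prompt features to phoneme embeddings, P-Flow style."""
    def __init__(self, prompt_encoder, n_vocab, d_text=192, d_prompt=512,
                 n_layers=6, n_heads=2, d_ff=768, dropout=0.1):
        super().__init__()
        self.prompt_encoder = prompt_encoder            # e.g. an NS2VC-style prompt encoder
        self.prompt_proj = nn.Linear(d_prompt, d_text)  # 512 -> 192 to match the text channels
        self.phoneme_emb = nn.Embedding(n_vocab, d_text)
        layer = nn.TransformerEncoderLayer(d_text, n_heads, dim_feedforward=d_ff,
                                           dropout=dropout, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)

    def forward(self, phonemes, prompt_mel):
        prompt_h = self.prompt_proj(self.prompt_encoder(prompt_mel))  # (B, Tp, 192)
        text_h = self.phoneme_emb(phonemes)                           # (B, Tt, 192)
        h = self.transformer(torch.cat([prompt_h, text_h], dim=1))    # joint attention over prompt + text
        return h[:, prompt_h.shape[1]:]                               # keep only the text positions
```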

@yiwei0730

If you need training assistance, I can provide some support. I think a zero-shot model that works with less data is also a good development direction.
In addition, I have trained on 1,500 hours of data (zh+en) with the NS2 repo you are using. The similarity is not good enough on speakers outside the training set (MOS = 4, but SMOS = 3). I'm not sure whether that is because the codec was not trained as a separate first stage, as in the original training setup.

@0913ktg
Author

0913ktg commented Mar 15, 2024

Hello,

I have conducted an experiment by adding the ns2 prompt encoder to the P-Flow text encoder. This was applied to both the structure provided by p0p4k and the one presented in the paper, with some noticeable differences observed.

When adding the ns2 prompt encoder to the paper's structure, there was a significant improvement in the clarity of the mel-spectrogram. There was a notable reduction in noise, and the frequency values were more distinct. However, there is an issue where the output is always in a female voice, regardless of the gender of the prompt voice (even when male voice prompts are used).

On the other hand, adding the ns2 prompt encoder to p0p4k's structure resulted in relatively more noise compared to adding it to the structure from the paper. Additionally, as before, the output was in a male voice regardless of the prompt.

In conclusion, it seems necessary to continue experimenting with the addition of the prompt encoder to the text encoder structure described in the paper and adjusting the parameters accordingly.

If there are any other models you think would be worth trying, please feel free to share.

Thank you.

@0913ktg
Author

0913ktg commented Mar 15, 2024

  • adding the ns2 prompt encoder to the paper's structure (59 epochs, batch size 64) [image]
  • adding the ns2 prompt encoder to p0p4k's structure (59 epochs, batch size 64) [image]

@p0p4k
Owner

p0p4k commented Mar 15, 2024

Interesting 🤔

@0913ktg
Author

0913ktg commented Mar 15, 2024

Is zero-shot TTS possible with this model?

@p0p4k
Owner

p0p4k commented Mar 15, 2024

In the pflow blog, the authors say it is possible if we use more data and a bigger model size.

@yiwei0730

How much data is enough to train the model for zero-shot TTS? For example, would 2K hours of Chinese + English data be enough?

@p0p4k
Owner

p0p4k commented Mar 15, 2024

Not sure, because they didn't specify the exact data they used. The Audiobox paper uses around 60k hours?

@0913ktg
Author

0913ktg commented Mar 15, 2024

The Korean data I used for training is 1186 hours.

@yiwei0730

> Data: We train P-Flow on LibriTTS [41]. LibriTTS training set consists of 580 hours of data from 2,456 speakers. We specifically use data that is longer than 3 seconds for speech prompting, yielding a 256 hours subset. For evaluation, we follow the experiments in [37, 19] and use LibriSpeech test-clean, assuring no overlap exists with our training data. We resample all datasets to 22kHz.

I saw this in the paper. It is just 580 hours.

@yiwei0730

> In the pflow blog, the authors say it is possible if we use more data and a bigger model size.

Where can I find that blog?

@0913ktg
Author

0913ktg commented Mar 15, 2024

The authors even wrote that zero-shot TTS of comparable quality to VALL-E is possible with less data.
[image]

@yiwei0730

> The authors even wrote that zero-shot TTS of comparable quality to VALL-E is possible with less data. [image]

Right! That's why I'm following this paper.
I found that it is the only model claiming that it does not require a large dataset for zero-shot TTS, but I don't know whether that is because it uses an English dataset, so its capability may be overestimated.
I find that languages similar to English often achieve good results, but East Asian languages such as Chinese, Japanese, and Korean do not seem to do as well (for example, with SeamlessM4T).

@p0p4k
Owner

p0p4k commented Mar 15, 2024

https://pflow-demo.github.io/projects/pflow/ [screenshot]
https://openreview.net/forum?id=zNA7u7wtIN [screenshot]

@0913ktg
Author

0913ktg commented Mar 15, 2024

I can't play the demo audio. p0p4k, can you play it?

@yiwei0730

I have seen this website, but the audio files cannot be played.
And the author's reply on OpenReview should mean that more data gives better results, but a basic amount of data should still give a baseline level of performance.

@p0p4k
Owner

p0p4k commented Mar 15, 2024

Audio files could be played when they released the paper. If this repo doesn't give great results right now, all we can do is change the speech prompt encoder and train for longer.

@p0p4k
Owner

p0p4k commented Mar 17, 2024

@0913ktg can you add pos embeddings to the speech_prompt_text_encoder before transformers? I think I missed that part. Please send a PR and I will approve it. Thanks!
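(For reference, a minimal sketch of what adding positional embeddings before the transformer could look like: standard sinusoidal PE added to the concatenated prompt + text hidden states. The actual change pushed to the repo may use a different scheme.)

```python
import math
import torch

def sinusoidal_pe(length, d_model, device=None):
    """Standard sinusoidal positional embeddings, shape (length, d_model)."""
    pos = torch.arange(length, device=device).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, device=device) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(length, d_model, device=device)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Inside the encoder's forward pass, before the transformer stack:
#   h = torch.cat([prompt_h, text_h], dim=1)                 # (B, Tp + Tt, 192)
#   h = h + sinusoidal_pe(h.size(1), h.size(2), h.device)    # add PE so attention sees frame order
#   h = self.transformer(h)
```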

@yiwei0730

yiwei0730 commented Mar 20, 2024

> [quoting @0913ktg's Mar 15 comment on adding the ns2 prompt encoder to both structures]

I would like to ask: what do you think the SMOS and MOS are after adding PE to the training? Can the zero-shot method work with it? And what about adding a fine-tuning step with a small amount of data?

@p0p4k
Owner

p0p4k commented Mar 20, 2024

I added PE in a recent push, and it seems to give better ZS results. I would like to see you guys train and report as well.

@yiwei0730

> [quoting @0913ktg's original post]

@0913ktg sorry to bother you, I would like to ask if you can upload some synthesized audio files so that I can listen to the quality.

@rishikksh20

@0913ktg Have you tried the NS2VC prompt encoder with Pflow?
