about zero-shot inference #37

Open
0913ktg opened this issue Mar 7, 2024 · 31 comments
@0913ktg

0913ktg commented Mar 7, 2024

Hello p0p4k,

I'm reaching out to you again with a question.

Thanks to your great help, I've successfully trained a Korean pflow model and run inference with it. During inference I observed a few limitations, but I confirmed that satisfactory voices are synthesized for seen speakers.

I used data from about 3,000 male and female speakers, only utilizing audio files longer than 4.1 seconds. I ran distributed training with a batch size of 64 on four NVIDIA A100 40G GPUs, completing 160 epochs (500k steps).

However, when synthesizing speech using unseen speakers' voices as prompts, I found that while the content is synthesized well, the prompt speaker's voice characteristics are not carried over to the synthesized audio.

This phenomenon was observed for both male and female speakers, and the inference code was written by referring to synthesis.ipynb (it is almost identical).

I'm looking into why the speaker's voice characteristics are not applied in zero-shot inference.

If you have experienced the same issue or know anything about it, I would appreciate your help. If there's any additional information I should provide, please comment below.

Thank you.
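For reference, my zero-shot call looks roughly like the sketch below (a minimal sketch based on synthesis.ipynb; names such as `zero_shot_synthesise`, `text_to_sequence`, and `mel_spectrogram` are placeholders rather than the repo's exact API):

```python
import torch

@torch.inference_mode()
def zero_shot_synthesise(model, vocoder, text, prompt_wav,
                         text_to_sequence, mel_spectrogram,
                         n_timesteps=10, sr=22050, hop_length=256):
    """Hypothetical zero-shot call: condition generation on an unseen speaker's mel prompt."""
    # Tokenize the target text with the same cleaner/phonemizer used in training.
    x = torch.tensor(text_to_sequence(text), dtype=torch.long)[None]
    x_lengths = torch.tensor([x.shape[-1]])

    # Mel prompt from the unseen speaker, cropped to roughly 3 s as in P-Flow training.
    prompt_mel = mel_spectrogram(prompt_wav)                 # (1, n_mels, T)
    prompt_mel = prompt_mel[..., : int(3 * sr / hop_length)]

    # Flow-matching decoder generates a mel conditioned on text + speech prompt.
    out = model.synthesise(x, x_lengths, n_timesteps=n_timesteps, prompt=prompt_mel)
    return vocoder(out["mel"]).squeeze()                     # waveform via a HiFi-GAN-style vocoder
```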

@0913ktg
Author

0913ktg commented Mar 7, 2024

  • unseen speaker prompt inference mel-spectrogram [screenshot]
  • seen speaker prompt inference mel-spectrogram [screenshot]

@p0p4k
Owner

p0p4k commented Mar 7, 2024

So that means the speech prompt encoder is not extracting exact speaker style info, but rather memorizing the seen speaker information. We might have to change the architecture a little bit in that case.

@0913ktg
Author

0913ktg commented Mar 7, 2024

Thank you for your response.
I will try modifying it to extract speaker characteristics, comparing against what the paper describes.

If I achieve good results, I will make sure to share them with you.
Thank you once again for your invaluable help.
Have a great day!

@p0p4k
Owner

p0p4k commented Mar 7, 2024

Maybe take a look at audiobox/voicebox architecture as well.

@yiwei0730

> Maybe take a look at audiobox/voicebox architecture as well.

Maybe the NaturalSpeech 2 speech prompt encoder can help? But I'm not sure it would really be useful. Do you have any suggestions?

NS2 Speech Prompt Encoder

| Hyperparameter | Value |
| --- | --- |
| Transformer Layers | 6 |
| Attention Heads | 8 |
| Hidden Size | 512 |
| Conv1D Filter Size | 2048 |
| Conv1D Kernel Size | 9 |
| Dropout | 0.2 |
| Parameters | 69M |

P-Flow Speech-prompted Text Encoder

| Hyperparameter | Value |
| --- | --- |
| Phoneme Embedding Dim | 192 |
| PreNet Conv Layers | 3 |
| PreNet Hidden Dim | 192 |
| PreNet Kernel Size | 5 |
| PreNet Dropout | 0.5 |
| Transformer Layers | 6 |
| Transformer Hidden Dim | 192 |
| Transformer Feed-forward Hidden Dim | 768 |
| Transformer Attention Heads | 2 |
| Transformer Dropout | 0.1 |
| Prompt Embedding Dim | 192 |
| Number of Parameters | 3.37M |
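For concreteness, here is a rough PyTorch sketch of an NS2-style speech prompt encoder using the hyper-parameters listed above (self-attention plus a Conv1D feed-forward). This is an illustrative re-implementation rather than the NaturalSpeech 2 authors' code, and the exact layer composition (hence the 69M parameter count) will not match exactly:

```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    """Conv1D feed-forward block (filter size 2048, kernel size 9, dropout 0.2)."""
    def __init__(self, d_model=512, d_ff=2048, kernel_size=9, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d_model, d_ff, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Conv1d(d_ff, d_model, kernel_size, padding=kernel_size // 2),
            nn.Dropout(dropout),
        )

    def forward(self, x):                        # x: (B, T, d_model)
        return self.net(x.transpose(1, 2)).transpose(1, 2)

class PromptEncoderLayer(nn.Module):
    """Self-attention (8 heads) + Conv1D FFN with post-norm residual connections."""
    def __init__(self, d_model=512, n_heads=8, dropout=0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = ConvFFN(d_model, dropout=dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        return self.norm2(x + self.ffn(x))

class SpeechPromptEncoder(nn.Module):
    """6-layer, hidden-size-512 prompt encoder over mel frames."""
    def __init__(self, n_mels=80, d_model=512, n_layers=6):
        super().__init__()
        self.proj_in = nn.Linear(n_mels, d_model)
        self.layers = nn.ModuleList([PromptEncoderLayer(d_model) for _ in range(n_layers)])

    def forward(self, prompt_mel):               # (B, n_mels, T) -> (B, T, d_model)
        x = self.proj_in(prompt_mel.transpose(1, 2))
        for layer in self.layers:
            x = layer(x)
        return x
```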

@p0p4k
Owner

p0p4k commented Mar 11, 2024

Yes, it can help. We can yank lucidrains's code. Can you do a PR?

@yiwei0730

I'm glad you think this solution could be useful, but I work for a company and can't upload the code to GitHub. I will report back in time if I have any new progress.

@0913ktg
Author

0913ktg commented Mar 12, 2024

Hi yiwei0730.
Thank you for your advice.
I'll do some testing and share the results with you.
Thank you.

@0913ktg
Author

0913ktg commented Mar 13, 2024

Hello p0p4k, yiwei0730,

I have incorporated the prompt encoder part from the 'https://github.com/adelacvg/NS2VC' repository to extract prompt features for the text encoder.

The reason I chose this model is that it uses mel spectrograms for training, as opposed to ns2, which utilizes codec representations.

I plan to conduct two experiments: one adding the prompt encoder to the model structure mentioned in the paper, and another incorporating it into p0p4k's structure.

I will share the results once they are available.
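Conceptually, the wiring into the speech-prompted text encoder looks like the sketch below (a hedged sketch using the dimensions from the tables in the earlier comment; `SpeechPromptEncoder` stands in for the NS2VC prompt encoder, and the real code in p0p4k's repo differs in detail, e.g. the prenet and duration predictor are omitted here):

```python
import torch
import torch.nn as nn

class SpeechPromptedTextEncoder(nn.Module):
    """Prepend projected prompt features to phoneme embeddings, P-Flow style."""
    def __init__(self, prompt_encoder, n_vocab, d_text=192, d_prompt=512,
                 n_layers=6, n_heads=2, d_ff=768, dropout=0.1):
        super().__init__()
        self.prompt_encoder = prompt_encoder            # e.g. an NS2VC-style prompt encoder
        self.prompt_proj = nn.Linear(d_prompt, d_text)  # 512 -> 192 to match the text channels
        self.phoneme_emb = nn.Embedding(n_vocab, d_text)
        layer = nn.TransformerEncoderLayer(d_text, n_heads, dim_feedforward=d_ff,
                                           dropout=dropout, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)

    def forward(self, phonemes, prompt_mel):
        prompt_h = self.prompt_proj(self.prompt_encoder(prompt_mel))  # (B, Tp, 192)
        text_h = self.phoneme_emb(phonemes)                           # (B, Tt, 192)
        h = self.transformer(torch.cat([prompt_h, text_h], dim=1))    # joint attention over prompt + text
        return h[:, prompt_h.shape[1]:]                               # keep only the text positions
```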

@yiwei0730

If you need training assistance, I can provide some support. I think a zero-shot model that works with less data is also a good development direction.
In addition, I have trained on 1,500 hours of data (zh+en) with the NS2 repo you are using. The similarity is not good enough on speakers outside the training set (MOS = 4, but SMOS = 3). I'm not sure whether that is because the codec was not trained as a separate first stage, as in the original training setup.

@0913ktg
Author

0913ktg commented Mar 15, 2024

Hello,

I have conducted an experiment by adding the ns2 prompt encoder to the P-Flow text encoder. This was applied to both the structure provided by p0p4k and the one presented in the paper, with some noticeable differences observed.

When adding the ns2 prompt encoder to the paper's structure, there was a significant improvement in the clarity of the mel-spectrogram. There was a notable reduction in noise, and the frequency values were more distinct. However, there is an issue where the output is always in a female voice, regardless of the gender of the prompt voice (even when male voice prompts are used).

On the other hand, adding the ns2 prompt encoder to p0p4k's structure resulted in relatively more noise compared to adding it to the structure from the paper. Additionally, as before, the output was in a male voice regardless of the prompt.

In conclusion, it seems necessary to continue experimenting with the addition of the prompt encoder to the text encoder structure described in the paper and adjusting the parameters accordingly.

If there are any other models you think would be worth trying, please feel free to share.

Thank you.

@0913ktg
Author

0913ktg commented Mar 15, 2024

  • adding the ns2 prompt encoder to the paper's structure (59 epochs, batch size 64) [image]
  • adding the ns2 prompt encoder to p0p4k's structure (59 epochs, batch size 64) [image]

@p0p4k
Owner

p0p4k commented Mar 15, 2024

Interesting 🤔

@0913ktg
Author

0913ktg commented Mar 15, 2024

Is zero-shot TTS possible with this model?

@p0p4k
Owner

p0p4k commented Mar 15, 2024

In the pflow blog, the authors say it is possible if we use more data and a bigger model size.

@yiwei0730

How much data is enough to train the model for zero-shot TTS? For example, would 2K hours of Chinese + English data be enough?

@p0p4k
Owner

p0p4k commented Mar 15, 2024

Not sure, because they didn't specify the exact data they used. The Audiobox paper uses around 60k hours?

@0913ktg
Author

0913ktg commented Mar 15, 2024

The Korean data I used for training is 1186 hours.

@yiwei0730

> Data: We train P-Flow on LibriTTS [41]. LibriTTS training set consists of 580 hours of data from 2,456 speakers. We specifically use data that is longer than 3 seconds for speech prompting, yielding a 256 hours subset. For evaluation, we follow the experiments in [37, 19] and use LibriSpeech test-clean, assuring no overlap exists with our training data. We resample all datasets to 22kHz.

I saw this in the paper. It is just 580 hours.

@yiwei0730

> In the pflow blog, the authors say it is possible if we use more data and a bigger model size.

Where can I find that blog?

@0913ktg
Author

0913ktg commented Mar 15, 2024

The authors even wrote that zero-shot TTS of comparable quality to VALL-E is possible with less data.
[image]

@yiwei0730

> The authors even wrote that zero-shot TTS of comparable quality to VALL-E is possible with less data. [image]

Right! That's why I'm following this paper.
I found that it is the only model claiming that it does not require a large dataset for zero-shot TTS, but I don't know whether that is because it uses an English dataset, so its capability may be overestimated.
I find that languages similar to English often achieve good results, but East Asian languages such as Chinese, Japanese, and Korean do not seem to do as well (for example, with SeamlessM4T).

@p0p4k
Owner

p0p4k commented Mar 15, 2024

https://pflow-demo.github.io/projects/pflow/ [screenshot]
https://openreview.net/forum?id=zNA7u7wtIN [screenshot]

@0913ktg
Author

0913ktg commented Mar 15, 2024

I can't play the demo audio. p0p4k, can you play it?

@yiwei0730

I have seen this website, but the audio files cannot be played.
And the author's reply on OpenReview should mean that more data gives better results, but a basic amount of data should still give a baseline level of performance.

@p0p4k
Owner

p0p4k commented Mar 15, 2024

Audio files could be played when they released the paper. If this repo doesn't give great results right now, all we can do is change the speech prompt encoder and train for longer.

@p0p4k
Owner

p0p4k commented Mar 17, 2024

@0913ktg can you add pos embeddings to the speech_prompt_text_encoder before transformers? I think I missed that part. Please send a PR and I will approve it. Thanks!
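(For reference, a minimal sketch of what adding positional embeddings before the transformer could look like: standard sinusoidal PE added to the concatenated prompt + text hidden states. The actual change pushed to the repo may use a different scheme.)

```python
import math
import torch

def sinusoidal_pe(length, d_model, device=None):
    """Standard sinusoidal positional embeddings, shape (length, d_model)."""
    pos = torch.arange(length, device=device).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, device=device) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(length, d_model, device=device)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Inside the encoder's forward pass, before the transformer stack:
#   h = torch.cat([prompt_h, text_h], dim=1)                 # (B, Tp + Tt, 192)
#   h = h + sinusoidal_pe(h.size(1), h.size(2), h.device)    # add PE so attention sees frame order
#   h = self.transformer(h)
```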

@yiwei0730

yiwei0730 commented Mar 20, 2024

> [quoting @0913ktg's Mar 15 comment on adding the ns2 prompt encoder to both structures]

I would like to ask: what do you think the SMOS and MOS are after adding PE to the training? Can the zero-shot method work with it? And what about adding a fine-tuning step with a small amount of data?

@p0p4k
Owner

p0p4k commented Mar 20, 2024

I added PE in a recent push, and it seems to give better ZS results. I would like to see you guys train and report as well.

@yiwei0730

> [quoting @0913ktg's original post]

@0913ktg sorry to bother you, I would like to ask if you can upload some synthesized audio files so that I can listen to the quality.

@rishikksh20

@0913ktg Have you tried the NS2VC prompt encoder with Pflow?
