about zero-shot inference #37
Comments
So that means the speech prompt encoder is not extracting exact speaker style info, but rather memorizing the seen speaker information. We might have to change the architecture a little bit in that case.
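One way to check that hypothesis: pool the prompt encoder output into a single vector per utterance and compare same-speaker versus different-speaker cosine similarity on unseen speakers. This is only a hedged sketch, not repo code; `prompt_encoder` and the mel shapes are assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical diagnostic, not repo code: `prompt_encoder` stands for whatever
# module maps a mel prompt of shape (1, n_mels, T) to features (1, T', D).
@torch.no_grad()
def speaker_embedding(prompt_encoder, prompt_mel):
    feats = prompt_encoder(prompt_mel)      # (1, T', D)
    return feats.mean(dim=1).squeeze(0)     # temporal mean pool -> (D,)

@torch.no_grad()
def prompt_similarity(prompt_encoder, mel_a, mel_b):
    a = speaker_embedding(prompt_encoder, mel_a)
    b = speaker_embedding(prompt_encoder, mel_b)
    return F.cosine_similarity(a, b, dim=0).item()

# If the encoder extracts style rather than memorizing, two utterances of the
# same unseen speaker should score clearly higher than two different unseen
# speakers; if both scores look alike, the encoder is not separating speakers.
```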
Thank you for your response. If I achieve good results, I will make sure to share them with you.
Maybe take a look at the Audiobox/Voicebox architecture as well.
Yes, it can help. We can yank
I'm glad you find this solution useful, but I work for a company and can't upload the code to GitHub. I will report back in time if I have any new progress.
Hi yiwei0730.
Hello p0p4k, yiwei0730, I have incorporated the prompt encoder part from the 'https://github.com/adelacvg/NS2VC' repository to extract prompt features for the text encoder. The reason I chose this model is that it uses mel spectrograms for training, as opposed to NS2, which uses codec representations. I plan to conduct two experiments: one adding the prompt encoder to the model structure described in the paper, and another incorporating it into p0p4k's structure. I will share the results once they are available.
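For reference, a minimal sketch of that kind of integration: a mel prompt encoder whose features the text encoder attends over with cross-attention. Module names, sizes, and the cross-attention choice are assumptions for illustration, not the NS2VC or P-Flow code.

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Encodes a mel-spectrogram prompt into a sequence of speaker/style features."""
    def __init__(self, n_mels=80, d_model=192, n_layers=4, n_heads=2):
        super().__init__()
        self.proj = nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, prompt_mel):                 # (B, n_mels, T_p)
        h = self.proj(prompt_mel).transpose(1, 2)  # (B, T_p, d_model)
        return self.encoder(h)                     # (B, T_p, d_model)

class PromptConditionedTextEncoder(nn.Module):
    """Text encoder that attends over prompt features via cross-attention."""
    def __init__(self, vocab_size=100, d_model=192, n_heads=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.self_attn = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, tokens, prompt_feats):       # tokens: (B, T_t)
        x = self.self_attn(self.emb(tokens))       # (B, T_t, d_model)
        ctx, _ = self.cross_attn(x, prompt_feats, prompt_feats)
        return x + ctx                             # speaker-conditioned text states
```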
If you need training assistance, I can provide some usage support. I also think a ZS model trained with less data is a good development strategy.
Hello, I have conducted an experiment adding the NS2 prompt encoder to the P-Flow text encoder. This was applied both to the structure provided by p0p4k and to the one presented in the paper, with some noticeable differences. When adding the NS2 prompt encoder to the paper's structure, there was a significant improvement in the clarity of the mel-spectrogram: noise was notably reduced and the frequency values were more distinct. However, the output is always in a female voice, regardless of the gender of the prompt voice (even when male voice prompts are used). Adding the NS2 prompt encoder to p0p4k's structure resulted in relatively more noise than with the paper's structure, and, as before, the output did not follow the prompt voice; here it was always a male voice. In conclusion, it seems necessary to continue experimenting with adding the prompt encoder to the text encoder structure described in the paper and tuning the parameters accordingly. If there are any other models you think would be worth trying, please feel free to share. Thank you.
Interesting 🤔
Is zero-shot TTS possible with this model?
In the P-Flow blog, the authors say it is possible if we use more data and a bigger model.
How much data is enough to train the model for ZS-TTS, for example 2k hours of Chinese+English data?
Not sure, because they didn't give the exact data they used. The Audiobox paper uses around 60k hours?
The Korean data I used for training is 1186 hours. |
I saw this in the paper; it is just 580 hours: "Data: We train P-Flow on LibriTTS [41]. LibriTTS training set consists of 580 hours of data from 2,456 speakers. We specifically use data that is longer than 3 seconds for speech prompting, yielding a 256 hours subset. For evaluation, we follow the experiments in [37, 19] and use LibriSpeech test-clean, assuring no overlap exists with our training data. We resample all datasets to 22kHz."
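For anyone reproducing that 3-second prompt filter, here is a hedged sketch; the filelist format (path|speaker|text) and file names are assumptions, not the actual preprocessing script.

```python
import torchaudio

# Keep only clips longer than 3 s, as in the paper's speech-prompting subset.
MIN_SECONDS = 3.0

def filter_filelist(in_path, out_path):
    kept = []
    with open(in_path, encoding="utf-8") as f:
        for line in f:
            wav_path = line.strip().split("|")[0]
            info = torchaudio.info(wav_path)           # sample_rate, num_frames, ...
            if info.num_frames / info.sample_rate > MIN_SECONDS:
                kept.append(line)
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(kept)

# Example (hypothetical paths):
# filter_filelist("filelists/train_all.txt", "filelists/train_over3s.txt")
```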
Where is the blog you mentioned?
I can't play the demo audio; can you, p0p4k?
I have seen this website, but the audio files cannot be played. |
Audio files could be played when they released the paper. If this repo doesn't give great results right now, all we can do is change the speech prompt encoder and train for longer. |
@0913ktg can you add positional embeddings to the speech_prompt_text_encoder before the transformers? I think I missed that part. Please send a PR and I will approve it. Thanks!
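A minimal sketch of what that fix could look like: add sinusoidal positional encodings to the prompt+text sequence right before the transformer blocks. Module and dimension names are illustrative assumptions, not the actual repo code.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pe(length, d_model, device=None):
    """Standard sinusoidal positional encoding; d_model assumed even."""
    pos = torch.arange(length, device=device).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2, device=device).float()
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(length, d_model, device=device)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                   # (length, d_model)

class SpeechPromptTextEncoder(nn.Module):
    def __init__(self, d_model=192, n_heads=2, n_layers=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):                       # x: (B, T, d_model), prompt + text features
        x = x + sinusoidal_pe(x.size(1), x.size(2), x.device)  # PE before the transformers
        return self.transformer(x)
```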
I would like to ask what you think the SMOS and MOS are after adding PE to the training. Can you implement the ZS method? What about adding a fine-tuning method with less data?
I added PE in a recent push, and it seems to give better ZS results. I would like to see you guys train and report as well.
@0913ktg sorry to bother you, I would like to ask if you can upload some synthesized audio files so that I can listen to the quality. |
@0913ktg Have you tried the NS2VC prompt encoder with P-Flow?
Hello p0p4k,
I'm reaching out to you again with a question.
Thanks to your great help, I've successfully trained the Korean pflow model and run inference. During inference I observed a few limitations, but I confirmed that satisfactory voices are synthesized for seen speakers.
I used data from about 3,000 male and female speakers, only utilizing voice files with durations longer than 4.1 seconds. I ran distributed training with a batch size of 64 on 4 NVIDIA A100 40G GPUs, completing 160 epochs (500k steps).
However, when synthesizing voices using unseen speakers' voices as prompts, I found that while the voice content is well synthesized, the speakers' voices are not applied to the synthesized sound.
This phenomenon was observed for both male and female speakers, and the inference code was written with reference to synthesis.ipynb (almost identical).
I'm looking into why the speaker's voice characteristics are not applied in zero-shot inference.
If you have experienced the same issue or know anything about it, I would appreciate your help. If there's any additional information I should provide, please comment below.
Thank you.