
Query regarding SoundStorm USLM implementation #1

Open
rishikksh20 opened this issue Sep 11, 2023 · 7 comments

Comments

@rishikksh20

@ZhangXInFD Did you simply replace the 'NAR' of USLM with a trained SoundStorm model on top of the speech tokenizer for the zero-shot TTS task?
Although the quality of SoundStorm is much better, have you noticed any speed advantage when using SoundStorm compared to the original USLM?

@rishikksh20
Author

By the way, thanks for the training code implementation.

@ZhangXInFD
Owner

Thanks for your attention!
For the first question: yes, we simply replaced the 'NAR' of USLM for the zero-shot TTS task. Compared to VALL-E, stage 2 of USLM can be viewed as a semantic -> acoustic process, so we can apply advanced semantic -> acoustic techniques like SoundStorm in stage 2 to improve audio generation quality. This is one of the advantages of SpeechTokenizer over SoundStream and Encodec. On the other hand, compared to a genuine semantic (e.g., HuBERT, w2v-BERT) -> acoustic (e.g., SoundStream, Encodec) pipeline, SoundStorm requires fewer iterations when applied to SpeechTokenizer, benefiting from its information decoupling. In fact, in our experiments a single iteration yielded quite satisfactory generation quality.
For the second question, we have not measured the time costs of 'NAR' and SoundStorm. In theory, since SoundStorm also generates tokens layer by layer at inference, its time complexity should be on the same order of magnitude as NAR's. Moreover, if SoundStorm iterates multiple times when decoding the first layer (i.e., RVQ-2), it would theoretically take more time than 'NAR'. In our experiments, SoundStorm iterates only once when decoding RVQ-2. If SoundStorm were to generate all tokens at once, its time efficiency might be higher than NAR's, but we have not yet evaluated the audio quality produced this way. Once the model is fully trained, we may run related experiments. As of now, our SoundStorm has not been trained to its full potential, but the results are already quite promising. The biggest advantage of SoundStorm over 'NAR' lies in audio generation quality.
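The iterative, confidence-based decoding described above (a single pass per layer in the extreme case) can be sketched as follows. This is not the repository's code: it is a minimal MaskGIT-style sketch, and `decode_layer`, `predict`, and the cosine schedule details are illustrative assumptions.

```python
import math

def decode_layer(seq_len, iterations, predict):
    """Fill one RVQ layer by iterative parallel decoding (MaskGIT-style).

    `predict` stands in for the model: given the partially filled sequence
    (None = masked) it returns a (token, confidence) pair per position.
    With iterations=1 this degenerates to a single parallel pass, as used
    for RVQ-2 in the experiments described above.
    """
    tokens = [None] * seq_len                 # all positions start masked
    for step in range(iterations):
        preds = predict(tokens)               # (token, confidence) per position
        masked = [i for i in range(seq_len) if tokens[i] is None]
        # cosine schedule: how many positions stay masked after this step
        if step == iterations - 1:
            keep_masked = 0                   # last step commits everything
        else:
            keep_masked = int(seq_len * math.cos(math.pi / 2 * (step + 1) / iterations))
        n_commit = max(0, len(masked) - keep_masked)
        # commit the most confident predictions among still-masked positions
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens
```

With more iterations, only the most confident predictions are committed each pass and the rest are re-predicted with more context, which is where the quality/speed trade-off discussed above comes from.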

@rishikksh20
Author

Yes, SoundStorm yields better quality due to its use of a Conformer; I don't expect much of a speed advantage either. When I get time and resources I will train SpeechTokenizer and USLM (SoundStorm) on the large LibriLight, MLS, and GigaSpeech datasets; I think that will yield production-level quality. Meanwhile, please share the SpeechTokenizer training code if possible.
Please do share fully trained samples here.

@ZhangXInFD
Owner

We will soon release a SpeechTokenizer trained on a larger dataset, but open-sourcing the training code may face some delays: the semantic distillation step during training required modifications to the relevant model code within fairseq, and organizing that code and deciding on the most suitable way to release it may take a significant amount of time. Given our other ongoing projects, we cannot currently estimate a timeline for the release of the training code.
Some samples of voice conversion and unprompted generation are provided here.

@lifeiteng

> We will soon release a SpeechTokenizer trained on a larger dataset.

When will the model weights be released?

@0417keito

When you replaced VALL-E's NAR with SoundStorm, did you adopt SoundStorm's mask strategy, or did you leave the mask strategy unchanged?

@ZhangXInFD
Owner

@0417keito We adopt SoundStorm's mask strategy.
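For readers unfamiliar with it, SoundStorm's training-time masking (following MaskGIT) targets one RVQ layer at a time: coarser layers are kept intact as conditioning, a cosine-scheduled fraction of the chosen layer is masked, and all finer layers are fully masked. Below is a minimal sketch under those assumptions; `soundstorm_mask` and the exact schedule are illustrative, not the repository's implementation, and details such as span masking and conditioning on semantic tokens are omitted.

```python
import math
import random

def soundstorm_mask(codes, mask_id, rng=random):
    """Build one SoundStorm-style training example from an RVQ code grid.

    codes: list of layers (layer 0 = coarsest), each a list of token ids.
    Pick a target layer q; keep layers < q intact as conditioning, mask a
    cosine-scheduled fraction of layer q, and fully mask layers > q.
    """
    n_layers, seq_len = len(codes), len(codes[0])
    q = rng.randrange(n_layers)                  # layer the loss is computed on
    p = math.cos(math.pi / 2 * rng.random())     # masking ratio in (0, 1]
    n_mask = max(1, int(seq_len * p))
    masked_pos = set(rng.sample(range(seq_len), n_mask))
    out = []
    for layer, row in enumerate(codes):
        if layer < q:
            out.append(list(row))                # conditioning layers: untouched
        elif layer == q:
            out.append([mask_id if i in masked_pos else t
                        for i, t in enumerate(row)])
        else:
            out.append([mask_id] * seq_len)      # finer layers: fully masked
    return q, out
```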
