Make it multi-language? #2
I think it is possible to do it. I'll do it after I am sure this version of the model works at least for one language.
The LJSpeech sample sounds promising. Will you be able to reuse the weights for multi-speaker (VCTK?) training? If "yes", I'll start training on a single-speaker (non-English) dataset.
Yes, the weights can be reused.
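Warm-starting the multi-speaker run from a single-speaker checkpoint could look roughly like the sketch below. The module names, shapes, and checkpoint path are made up for illustration; the point is that only tensors whose shapes match carry over, and anything that changes shape (e.g. a new speaker-embedding table) has to be dropped and re-initialized.

```python
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    """Stand-in for the real model: a shared encoder plus a speaker table."""
    def __init__(self, n_speakers: int = 1):
        super().__init__()
        self.encoder = nn.Linear(80, 192)             # shared weights, reusable as-is
        self.spk_emb = nn.Embedding(n_speakers, 192)  # grows in the multi-speaker run

# Pretend this is the single-speaker LJSpeech checkpoint.
torch.save(TinyTTS(n_speakers=1).state_dict(), "ljspeech_single.pt")

# Warm-start a multi-speaker model (VCTK has ~110 speakers).
multi = TinyTTS(n_speakers=110)
state = torch.load("ljspeech_single.pt")
state.pop("spk_emb.weight")  # shape changed, so drop it and keep the fresh init
missing, unexpected = multi.load_state_dict(state, strict=False)
print("freshly initialized:", missing)  # ['spk_emb.weight']
```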
Going by RADMMM, the title of this issue/wish should be "Make it multi-accented".
True. I am doing a multi-speaker training run on my end as well; let's see if the generations are good enough without extra conditioning first. Good luck!
Does the multi-speaker (VCTK) training look good, @p0p4k?
VCTK should work, but it should be easier to fit LibriTTS. The main issue with VCTK is that there's a lot of silence at the beginning and end of some samples, and automatic trimming methods are normally not accurate and end up clipping phonemes.
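For anyone preprocessing VCTK anyway, a conservative trim is one way to reduce the phoneme-clipping risk. This is just a sketch with librosa; the top_db value, the 50 ms margin, the sample rate, and the file name are assumptions, not values from this repo.

```python
import librosa
import soundfile as sf

# Hypothetical VCTK utterance; the sample rate is an assumption.
y, sr = librosa.load("p225_001.wav", sr=22050)

# A relatively high top_db keeps quiet onsets/offsets (fricatives, breaths)
# that aggressive thresholds would cut.
y_trim, (start, end) = librosa.effects.trim(y, top_db=40)

# A small margin around the detected region further guards against
# clipping leading or trailing phonemes.
pad = int(0.05 * sr)  # 50 ms
sf.write("p225_001_trimmed.wav", y[max(0, start - pad):min(len(y), end + pad)], sr)
```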
LibriTTS sounds like this @ 200k steps with guided sampling - https://voca.ro/1e0tSbWgbyuu |
@p0p4k the sample sounds good; I think with more training it will get a lot better. I think multilinguality is easy to implement in this repo. The problem occurs when you use a prompt from a native speaker of one language and generate speech in another language.
On another note, can adding some noise to the prompt help the model extract the "voice" better? I tried a zero-shot voice clone and it didn't perform that well.
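One way to test the noise idea, sketched below with made-up shapes: perturb the prompt mel during training so the model can't copy fine spectral detail and has to rely on speaker identity instead. The noise scale is a tunable assumption, not a value from this repo.

```python
import torch

def corrupt_prompt(prompt_mel: torch.Tensor, noise_std: float = 0.1) -> torch.Tensor:
    """prompt_mel: (batch, n_mels, frames) log-mel segment used as the speech prompt."""
    return prompt_mel + noise_std * torch.randn_like(prompt_mel)

prompt = torch.randn(2, 80, 150)  # dummy ~3-second prompt
noisy = corrupt_prompt(prompt)    # fed to the prompt encoder instead of the clean mel
```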
I was wondering if "injecting" language info would be possible, similar to what xtts does by injecting a special language token, e.g. [en], into the GPT input.
Features from a 3-second speech prompt might not be enough (nor desired) to capture the language of the sample text (in order to do cross-language speaker cloning). However, concatenating the speech prompt with some kind of language ID (a precomputed language feature vector?) might enable ML (multi-language) in addition to MS (multi-speaker).
At inference, changing this part of the prompt might enable inline language switching.
There might be a better way, of course, e.g. passing the info directly to the encoder PreNet. Anyway, it would be great to see this feature; the VITS-based YourTTS does something similar.
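A rough sketch of the concatenation idea, with made-up module names and shapes: a learned language embedding is prepended as an extra token to the encoded speech prompt, so the prompt carries both speaker and language information. Nothing here is the repo's actual API.

```python
import torch
import torch.nn as nn

class LangConditionedPrompt(nn.Module):
    def __init__(self, n_langs: int = 8, d_model: int = 192):
        super().__init__()
        # One learned vector per language, e.g. 0 -> [en], 1 -> [de], ...
        self.lang_emb = nn.Embedding(n_langs, d_model)

    def forward(self, prompt_feats: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        """prompt_feats: (B, T, d_model) encoded 3-sec speech prompt;
        lang_id: (B,) integer language index."""
        lang_tok = self.lang_emb(lang_id).unsqueeze(1)     # (B, 1, d_model)
        return torch.cat([lang_tok, prompt_feats], dim=1)  # (B, 1 + T, d_model)

cond = LangConditionedPrompt()
out = cond(torch.randn(1, 150, 192), torch.tensor([0]))
print(out.shape)  # torch.Size([1, 151, 192])
```

Swapping lang_id per text segment at inference would be the inline language switching mentioned above; the PreNet variant would instead add the same embedding to the text-encoder input.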