
Make it multi-language? #2

Open
zidsi opened this issue Nov 13, 2023 · 10 comments

Comments

@zidsi

zidsi commented Nov 13, 2023

I was wondering if "injecting" language info would be possible, similar to what XTTS does by injecting a special language token, e.g. [en], into the GPT input.

Features from the 3-second speech prompt might not be enough (nor desirable) to capture the language of the sample text (in order to do cross-language speaker cloning). However, concatenating the "speech prompt" with some kind of language id (a precomputed language-feature vector?) might enable ML (multi-language) in addition to MS (multi-speaker).

At inference, changing this part of the prompt might enable inline language switching.

There might be a better way, of course, e.g. passing the info directly to the encoder PreNet. Anyway, it would be great to see this feature; the VITS-based YourTTS does a similar thing.
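
A minimal sketch of the concatenation idea in PyTorch (the module and argument names are made up for illustration; nothing here is from this repo):

```python
import torch
import torch.nn as nn

class LanguageConditionedPrompt(nn.Module):
    """Hypothetical: prepend a learned per-language vector to the
    speech-prompt features before they reach the prompt encoder."""

    def __init__(self, n_languages: int, prompt_dim: int):
        super().__init__()
        # one learned vector per language, same width as the prompt features
        self.lang_emb = nn.Embedding(n_languages, prompt_dim)

    def forward(self, prompt_feats: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # prompt_feats: (B, T, C) features of the 3-second speech prompt
        # lang_id:      (B,) integer language ids, e.g. 0 = "en", 1 = "sl"
        lang_tok = self.lang_emb(lang_id).unsqueeze(1)     # (B, 1, C)
        return torch.cat([lang_tok, prompt_feats], dim=1)  # (B, 1+T, C)
```

At inference, swapping `lang_id` per text segment would be one way to get the inline language switching mentioned above.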

@p0p4k
Owner

p0p4k commented Nov 13, 2023

I think it is possible to do it. I'll do it after I am sure this version of the model works for at least one language.

@zidsi
Author

zidsi commented Nov 15, 2023

The LJSpeech sample sounds promising. Will you be able to reuse the weights for multi-speaker (VCTK?) training? If yes, I'll start training on a single-speaker (non-English) dataset.

@p0p4k
Owner

p0p4k commented Nov 16, 2023

Yes, can reuse.

@zidsi
Author

zidsi commented Nov 20, 2023

According to RADMMM, the title of this issue/wish should be "Make it multi-accented". The authors say: "We refer to our conditioning as accent instead of language, because we consider language to be implicit in the phoneme sequence."

But let's first see how well the 3-second conditioning works for multi-speaker.

@p0p4k
Owner

p0p4k commented Nov 20, 2023

True. I am running a multi-speaker training on my end as well; let's first see if the generations are good enough without extra conditioning. Good luck!

@vuong-ts

vuong-ts commented Dec 2, 2023

Does the multi-speaker (VCTK) training look good, @p0p4k?

@rafaelvalle

VCTK should work, but it should be easier to fit LibriTTS. The main issue with VCTK is that there's a lot of silence at the beginning and end of some samples, and automatic trimming methods are normally not accurate and end up trimming phonemes.
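
If anyone tries VCTK anyway, one conservative workaround is to trim with a strict energy threshold and keep a fixed safety margin; a minimal sketch using librosa (the `top_db` and `margin_s` defaults are guesses to tune per dataset, not values from any paper):

```python
import librosa

def conservative_trim(path: str, top_db: float = 40.0, margin_s: float = 0.05):
    """Trim leading/trailing silence but keep a safety margin so quiet
    onset/offset phonemes are less likely to be cut."""
    wav, sr = librosa.load(path, sr=None)
    # librosa returns the trimmed signal and the (start, end) sample indices
    _, (start, end) = librosa.effects.trim(wav, top_db=top_db)
    pad = int(margin_s * sr)
    return wav[max(0, start - pad):min(len(wav), end + pad)], sr
```

The trade-off is some residual silence, which is usually less harmful than clipped phonemes.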
Accent and language control should be possible with one-hot embeddings. VCTK and the CML-Dataset are great candidates.
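
A minimal sketch of what such conditioning could look like, with the accent embedding added to the phoneme-encoder hidden states (the placement in the network is an assumption, not the RADMMM recipe; an `nn.Embedding` over integer ids is the dense equivalent of a one-hot vector times a projection matrix):

```python
import torch
import torch.nn as nn

class AccentConditioner(nn.Module):
    """Hypothetical: add a learned per-accent bias to the text-encoder input."""

    def __init__(self, n_accents: int, d_model: int):
        super().__init__()
        self.accent_emb = nn.Embedding(n_accents, d_model)

    def forward(self, phoneme_hidden: torch.Tensor, accent_id: torch.Tensor) -> torch.Tensor:
        # phoneme_hidden: (B, T, d_model); accent_id: (B,)
        return phoneme_hidden + self.accent_emb(accent_id).unsqueeze(1)
```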

@p0p4k
Owner

p0p4k commented Dec 2, 2023

LibriTTS sounds like this at 200k steps with guided sampling: https://voca.ro/1e0tSbWgbyuu
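
For readers unfamiliar with the term: guided sampling amplifies the conditional flow during the ODE solve. A generic classifier-free-guidance-style Euler sketch of the idea (illustrative only; this is not necessarily the exact guidance rule used by P-Flow or this repo's sampler):

```python
import torch

@torch.no_grad()
def euler_guided_sample(v_field, x, cond, n_steps: int = 10, guidance: float = 1.0):
    """v_field(x, t, cond) predicts dx/dt; cond=None means unconditional.
    guidance > 1 extrapolates away from the unconditional prediction."""
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.size(0),), i * dt, device=x.device)
        v_cond = v_field(x, t, cond)
        v_uncond = v_field(x, t, None)
        x = x + (v_uncond + guidance * (v_cond - v_uncond)) * dt
    return x
```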

@rishikksh20

@p0p4k The sample sounds good; I think with more training it will get a lot better. Multilinguality should be easy to implement in this repo. I think the problem occurs when you use a prompt from a native speaker of one language and generate speech in another language.

@p0p4k
Owner

p0p4k commented Dec 2, 2023

On another note, could adding some noise to the prompt help the model extract the "voice" better? I tried a zero-shot voice clone and it didn't perform that well.
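
One way to test that would be to corrupt the prompt during training at a fixed signal-to-noise ratio, so the model cannot copy the prompt verbatim and has to rely on speaker identity; a hypothetical sketch (the default `snr_db` is a made-up starting point, and whether this helps zero-shot cloning would need an ablation):

```python
import torch

def noisy_prompt(prompt: torch.Tensor, snr_db: float = 20.0) -> torch.Tensor:
    """Add Gaussian noise to the prompt features at a target SNR in dB."""
    signal_power = prompt.pow(2).mean()
    noise_power = signal_power / (10 ** (snr_db / 10))
    return prompt + torch.randn_like(prompt) * noise_power.sqrt()
```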
