Finetuning Multiple Voices #163
-
Howdy, I'm not sure if I'm just misunderstanding the instructions, but how does one fine-tune different voices separately? Say I have three different voices that all sound different: would I just fine-tune them using the instructions, or is there a way to split them up so the finetunes only apply to each individual voice and don't interfere with the other voices?
Replies: 1 comment 1 reply
-
Hi @shadowsoze

What I am about to tell you is based on my understanding; however, I would refer you to the Coqui documentation to research the full details:

Coqui Documentation here
Coqui Discussion forum here

As mentioned on the front page of the Github:

It's important to note that I am not the developer of any TTS models utilized by AllTalk, nor do I claim to be an expert on them, including understanding all their nuances, issues, and quirks. For specific TTS model concerns, I've provided links to the original developers in the Help section for direct assistance.

With that caveat out of the way... When you generate TTS with the XTTS model, you provide a voice sample wav file and ask the model to reproduce (clone) the sound of that wav file with the text you send it. The speaker (wav file) is not embedded into the model, nor can you reference the speaker directly from within the XTTS model's layers/neurons. XTTS is a "multi speaker model", designed to make a best effort at reproducing the speech of the person in the wav sample you provide, and it has been trained on multiple speakers and languages from the get-go.

The finetuning process teaches the model, from a collection of wav files, to be better at reproducing (cloning) a specific sound (voice). With finetuning you are basically nudging the model in a certain direction and saying to it: "I know you already reproduce some existing speakers very well from their sample wav files, but you are not reproducing the sound from these other sample wav files very well, so I want you to train your neurons to do better at this, and I will give you a lot of samples to update your brain/neurons with." Think of it like training a person to reproduce different accents: the more you train on each accent, the better that person gets. The training process is basically the same as any other AI learning process.
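If it helps to see what "conditioning on a wav file at generation time" looks like in practice, here is a minimal sketch using the Coqui TTS Python API. The file paths and text are placeholders, and this reflects my own usage rather than anything AllTalk-specific, so double-check the details against the Coqui docs linked above:

```python
# Minimal sketch of XTTS voice cloning with the Coqui TTS Python API.
# The wav path, text, and output path are placeholders.
from TTS.api import TTS

# Load the public multi-speaker, multi-lingual XTTS v2 checkpoint
# (downloaded automatically on first use).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# The speaker is NOT stored inside the model; it is conditioned on this
# reference wav at generation time, which is why one model can clone
# many different voices.
tts.tts_to_file(
    text="Hello, this is a voice cloning test.",
    speaker_wav="voices/person_a.wav",   # reference sample of the target voice
    language="en",
    file_path="output_person_a.wav",
)
```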
Typically, you will train on one voice at a time, as this lets you measure how well the model is learning to reproduce that speaker/voice. If you complete all your epochs and the voice still isn't reproducing correctly from your sample wav file(s), you can train the model further with more epochs on that same collection of wavs. If instead you try to train multiple voices at once, you may overtrain the model on some of those samples; it is hard to verify that the model is learning specifically what you want it to learn when you teach it several things at the same time.

So let's say you have trained the model on person A and you are happy with the result, and you now want to train it on person B. You would run a new finetuning process/round. This teaches the model to reproduce that new set of data (audio files) better, or to put it another way, you are now nudging some of the layers and neurons to be better at doing that.

At this point you may ask questions like: what is the limit on how many different speakers you can train? Will training on multiple voices affect reproduction of some of my other wav files/samples? The answer is that, yes, there are limits somewhere down the line, and yes, at some point you may affect the model's quality at reproducing other speakers. Where those limits are, I have no idea; it will depend on many factors, including how many rounds of training you do.
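One practical way to keep those per-speaker rounds from interfering with each other is to keep one self-contained dataset folder per speaker and only ever point a finetuning round at a single folder. The sketch below uses only the Python standard library; the folder layout and the metadata.csv filename are assumptions based on common Coqui-style dataset conventions, not an AllTalk requirement:

```python
# Sketch: one dataset folder per speaker, so each finetuning round only
# ever sees one voice. Layout and metadata.csv naming are assumed
# conventions here, not something mandated by AllTalk.
from pathlib import Path

DATASETS = Path("finetune_datasets")

def list_speaker_datasets(root: Path) -> list[Path]:
    """Return per-speaker dataset folders that look ready to train on."""
    ready = []
    for speaker_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        wavs = list(speaker_dir.glob("wavs/*.wav"))
        has_metadata = (speaker_dir / "metadata.csv").exists()
        if wavs and has_metadata:
            ready.append(speaker_dir)
        else:
            print(f"Skipping {speaker_dir.name}: needs wavs/ and metadata.csv")
    return ready

if __name__ == "__main__":
    for dataset in list_speaker_datasets(DATASETS):
        # Run ONE finetuning round per speaker (e.g. the AllTalk finetune
        # script pointed at this folder), listen to the results, and only
        # then move on to the next speaker.
        print(f"Next finetune round: train only on {dataset}")
```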
Hopefully that fills in some gaps from my understanding; however, as mentioned, the Coqui documentation and research papers can all be found at the links above. Thanks