Finetuning Multiple Voices #163
-
Howdy, I'm not sure if I'm just misunderstanding the instructions, but how does one fine-tune different voices separately? Say I have three different voices that all sound different: would I just fine-tune them using the instructions, or is there a way to split them up so the finetunes only apply to each individual voice and don't interfere with the other voices?
Replies: 1 comment 1 reply
-
Hi @shadowsoze

What I am about to tell you is based on my understanding; however, I would refer you to the Coqui documentation to research the full details:

Coqui Documentation here
Coqui Discussion forum here

As mentioned on the front page of the Github:

It's important to note that I am not the developer of any TTS models utilized by AllTalk, nor do I claim to be an expert on them, including understanding all their nuances, issues, and quirks. For specific TTS model concerns, I've provided links to the original developers in the Help section for direct assistance.

With that caveat out of the way... When you generate TTS with the XTTS model, you provide a voice sample wav file and ask the model to reproduce (clone) the sound of that wav file with the text you send it. The speaker (wav file) is not embedded into the model, nor can you reference the speaker directly from within the XTTS model's layers/neurons. XTTS is a "multi speaker model", designed to make a best effort at reproducing the speech of the person in the wav sample you provide, and it has been trained on multiple speakers and languages from the get-go.

The finetuning process teaches the model, from a collection of wav files, to be better at reproducing (cloning) a specific sound (voice). With finetuning you are basically nudging the model in a certain direction and saying to it: "I know you already reproduce some existing speakers very well from their sample wav files, but you are not reproducing the sound from these other sample wav files very well, so I want you to train your neurons to do better at this, and I will give you a lot of samples to update your brain/neurons with." Think of it like training a person to reproduce different accents: the more you train on each accent, the better that person gets. The training process is basically the same as any other AI learning process.
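If it helps to see what "conditioning on a wav file at generation time" looks like in practice, here is a minimal sketch using the Coqui TTS Python API. The file paths and text are placeholders, and this reflects my own usage rather than anything AllTalk-specific, so double-check the details against the Coqui docs linked above:

```python
# Minimal sketch of XTTS voice cloning with the Coqui TTS Python API.
# The wav path, text, and output path are placeholders.
from TTS.api import TTS

# Load the public multi-speaker, multi-lingual XTTS v2 checkpoint
# (downloaded automatically on first use).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# The speaker is NOT stored inside the model; it is conditioned on this
# reference wav at generation time, which is why one model can clone
# many different voices.
tts.tts_to_file(
    text="Hello, this is a voice cloning test.",
    speaker_wav="voices/person_a.wav",   # reference sample of the target voice
    language="en",
    file_path="output_person_a.wav",
)
```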
Typically, you will train on one voice at a time, as this lets you measure how well the model is learning to reproduce that speaker/voice. If you complete all your epochs and the voice still isn't reproducing correctly from your sample wav file(s), you can train the model further with more epochs on that same collection of wavs. If instead you try to train multiple voices at once, you may overtrain the model on some of those samples; it is hard to verify that the model is learning specifically what you want it to learn when you teach it several things at the same time.

So let's say you have trained the model on person A and you are happy with the result, and you now want to train it on person B. You would run a new finetuning process/round. This teaches the model to reproduce that new set of data (audio files) better, or to put it another way, you are now nudging some of the layers and neurons to be better at doing that.

At this point you may ask questions like: what is the limit on how many different speakers you can train? Will training on multiple voices affect reproduction of some of my other wav files/samples? The answer is that, yes, there are limits somewhere down the line, and yes, at some point you may affect the model's quality at reproducing other speakers. Where those limits are, I have no idea; it will depend on many factors, including how many rounds of training you do.
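One practical way to keep those per-speaker rounds from interfering with each other is to keep one self-contained dataset folder per speaker and only ever point a finetuning round at a single folder. The sketch below uses only the Python standard library; the folder layout and the metadata.csv filename are assumptions based on common Coqui-style dataset conventions, not an AllTalk requirement:

```python
# Sketch: one dataset folder per speaker, so each finetuning round only
# ever sees one voice. Layout and metadata.csv naming are assumed
# conventions here, not something mandated by AllTalk.
from pathlib import Path

DATASETS = Path("finetune_datasets")

def list_speaker_datasets(root: Path) -> list[Path]:
    """Return per-speaker dataset folders that look ready to train on."""
    ready = []
    for speaker_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        wavs = list(speaker_dir.glob("wavs/*.wav"))
        has_metadata = (speaker_dir / "metadata.csv").exists()
        if wavs and has_metadata:
            ready.append(speaker_dir)
        else:
            print(f"Skipping {speaker_dir.name}: needs wavs/ and metadata.csv")
    return ready

if __name__ == "__main__":
    for dataset in list_speaker_datasets(DATASETS):
        # Run ONE finetuning round per speaker (e.g. the AllTalk finetune
        # script pointed at this folder), listen to the results, and only
        # then move on to the next speaker.
        print(f"Next finetune round: train only on {dataset}")
```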
Hopefully that fills in some gaps from my understanding; however, as mentioned, the Coqui documentation and research papers can all be found at the links above. Thanks