AllTalk version 2 progress. #211
Replies: 12 comments 19 replies
-
Ohh that sounds really promising. I'm getting more and more excited. 😁
-
This week (so far):
-
I'm a novice, but I have an M3 Max MacBook and would volunteer to help test if you need it.
-
I realised that some of my updates on here show as hidden behind "Show 3 previous replies", so if you miss them, they are here.
-
DeepSpeed for Linux - Finally simple installs
https://github.com/erew123/alltalk_tts/releases/tag/DeepSpeed-14.2-Linux
Will be able to automate the standalone version of DeepSpeed on Linux now.
-
Voice 2 RVC pipeline
AKA record your own voice and make it sound like someone else (no idea how good this will be, but it's there)
-
Working on code documentation and tidy up
Spent quite a bit of time on more code cleanup. The below is an example of part of one of the 4x TTS engine scripts. They've been cleaned up to show examples of how to import a new TTS engine, so that in theory, other people can use them as examples to import other engines in future (see the sketch at the end of this comment). Left to do before I throw a beta out:
Later on, after the BETA is out, I'll be dealing with bugs (maybe) and looking to set up the modular loader aspect of AllTalk, aka do/don't load finetuning as part of the interface, or other such aspects. Then also hitting the other aspects of feature requests. Though a day off may be kind of nice!
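For anyone curious what "an engine script as an example" could mean in practice, here is a minimal, purely illustrative sketch; the class and method names below are hypothetical, not AllTalk's actual code:

```python
# Hypothetical sketch of a pluggable TTS engine module.
# Class/method names are illustrative only, not AllTalk's real code.

class ExampleTTSEngine:
    """Wraps one TTS engine behind a small common interface."""

    def __init__(self, settings: dict):
        # Per-engine defaults (voice, language, etc.) come from settings.
        self.settings = settings
        self.model = None

    def available_models(self) -> list[str]:
        # Report the models this engine can load, so a loader
        # could expose them through the API.
        return ["example_model_v1", "example_model_v2"]

    def load_model(self, model_name: str):
        # Load/initialise the chosen model here.
        self.model = f"loaded:{model_name}"

    def generate(self, text: str, voice: str, language: str, output_path: str):
        # Produce an audio file at output_path from the given text/voice.
        if self.model is None:
            raise RuntimeError("Call load_model() before generate().")
        # ... engine-specific synthesis would happen here ...
        with open(output_path, "wb") as f:
            f.write(b"")  # placeholder for real audio data
```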
-
XTTS Finetuning
Has been given some spit and polish (mostly spit). A couple of updated features: no more being restricted to just the 2x models to use for training; choose your own folder for output, with over-write protection; set the maximum duration of audio files that Whisper can generate when making training data. A few bits tidied around the interface etc.
-
Final steps...
-
Everyone..... I will probably not be commenting on here about the BETA any more...... :( Because I have finally managed to get the BETA up and available :)
-
Hi @erew123, this is a really fun and good new TTS engine; can it be added to AllTalk? https://github.com/SWivid/F5-TTS
-
For the past month or so, I've been working on version 2 of AllTalk. The plan with version 2 is to try to resolve the issues, annoyances and limitations that appear in version 1, as well as expand AllTalk's simplicity, feature set, etc.
Screenshots a few posts down
Please note, what you see below is NOT a finished version. It's currently a functional version just so I can work on code in the back end.
Things I have achieved so far in the code:
Introduced a Gradio management interface. Because of the complexity of all the settings you will have access to, Gradio gives a simple interface for managing them. However, if you want to turn off loading the Gradio interface, you will have the option to do that. I also aim to make the Gradio interface modular, so you can choose which sections of it load in. This will allow you to keep things as lightweight as possible where you need to.
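To show what "modular sections" could look like, here is a minimal Gradio sketch; the section names and settings dictionary are assumptions for illustration, not AllTalk's actual configuration:

```python
import gradio as gr

# Hypothetical settings controlling which sections of the management
# interface get built (illustrative only).
interface_settings = {
    "load_generate_tab": True,
    "load_finetune_tab": False,   # e.g. skip finetuning to stay lightweight
    "load_settings_tab": True,
}

with gr.Blocks(title="Management interface sketch") as demo:
    if interface_settings["load_generate_tab"]:
        with gr.Tab("Generate"):
            gr.Markdown("TTS generation controls would go here.")
    if interface_settings["load_finetune_tab"]:
        with gr.Tab("Finetune"):
            gr.Markdown("Finetuning controls would go here.")
    if interface_settings["load_settings_tab"]:
        with gr.Tab("Settings"):
            gr.Markdown("Global/engine settings would go here.")

if __name__ == "__main__":
    demo.launch()
```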
Split out the model loaders, not only to make model selection/loading easier: AllTalk now discovers all available models to load and passes that list down through the API. Because of how I've split this out, it is now possible (in theory) to add any TTS engine into AllTalk. You will also be able to set certain default values on a TTS-engine-by-TTS-engine basis. I'm 80% of the way through this section of code, at which point I can test adding a few new TTS engines.
I intend to make a well-documented template so that anyone with a little coding experience should be able to set up a new TTS engine within AllTalk.
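As a rough idea of how discovery and per-engine defaults could fit together, here is a hedged sketch; the folder layout, module names and function names are all assumptions, not AllTalk's real code:

```python
import importlib
from pathlib import Path

# Hypothetical layout: one sub-folder per engine under ./engines, each
# containing an engine module that exposes a small common interface
# (available_models(), load_model(), generate(), ...).
ENGINES_DIR = Path("engines")

def discover_engines() -> dict[str, list[str]]:
    """Return {engine_name: [model names]} for every engine folder found."""
    discovered = {}
    for folder in sorted(p for p in ENGINES_DIR.iterdir() if p.is_dir()):
        module = importlib.import_module(f"engines.{folder.name}.engine")
        engine = module.ExampleTTSEngine(settings={})  # hypothetical class name
        discovered[folder.name] = engine.available_models()
    return discovered

def merge_settings(api_defaults: dict, engine_defaults: dict, request: dict) -> dict:
    """Request values win, then engine defaults, then global API defaults."""
    merged = dict(api_defaults)
    merged.update(engine_defaults)
    merged.update({k: v for k, v in request.items() if v is not None})
    return merged
```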
curl -X POST "http://127.0.0.1:7851/api/tts-generate" -d "text_input=All of this is text spoken by the character. This is text not inside quotes, though that doesn't matter in the slightest" -d "text_filtering=standard" -d "character_voice_gen=female_01.wav" -d "narrator_enabled=false" -d "narrator_voice_gen=male_01.wav" -d "text_not_inside=character" -d "language=en" -d "output_file_name=myoutputfile" -d "output_file_timestamp=true" -d "autoplay=true" -d "autoplay_volume=0.8"
With the new API and model-settings breakout, you can just send over the settings you want to the API, and the rest, where not specified, will be pulled from the API default settings and the TTS engine settings. So you will now be able to send over a request as simple as the following to have TTS generated:
curl -X POST "http://127.0.0.1:7851/api/tts-generate" -d "text_input=All of this is text spoken by the character. This is text not inside quotes, though that doesn't matter in the slightest"
You will be able to add in any of the other available settings, in any combination you wish, with the missing settings always being pulled from the defaults where they aren't specified. This should simplify the development of other products/applications that want to work with AllTalk.
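For illustration, the same minimal-plus-overrides pattern from Python might look like the following; the endpoint and parameter names are taken from the curl examples above, everything else is assumed:

```python
import requests

# Endpoint and parameter names taken from the curl examples above;
# anything not specified falls back to API/engine defaults.
API_URL = "http://127.0.0.1:7851/api/tts-generate"

# Minimal request: only the text, everything else uses defaults.
minimal = {"text_input": "Hello from the minimal request."}

# Same request, overriding just a couple of settings.
with_overrides = {
    "text_input": "Hello again, this time with a chosen voice.",
    "character_voice_gen": "female_01.wav",
    "language": "en",
}

for payload in (minimal, with_overrides):
    response = requests.post(API_URL, data=payload)
    print(response.status_code, response.text[:200])
```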
Screenshot a few posts down
Variables now carry a prefix for the part of the code they belong to, e.g. tgwui_variablename, meaning that you don't confuse that variable name with another variable name within the script. I'm also intending to put debugging code/print-outs into AllTalk. The idea is that anyone wanting to update/work with AllTalk's code should be able to read and understand it. There are lots of other little changes/updates, too numerous to mention. I still have a decent way to go before there is a BETA version of AllTalk to play with. I would estimate anywhere between 100 and 200 hours of work, between coding, documenting, testing etc.
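On the debugging print-out idea, a minimal sketch (the flag dictionary and helper function are assumptions, not AllTalk's actual code) could be:

```python
# Hypothetical per-area debug flags; in a real setup these would come
# from a settings file rather than being hard-coded.
DEBUG_FLAGS = {"tts": True, "api": False}

def debug_print(area: str, message: str) -> None:
    """Print a tagged message only when that area's debug flag is on."""
    if DEBUG_FLAGS.get(area, False):
        print(f"[DEBUG {area}] {message}")

debug_print("tts", "Loading model")        # printed
debug_print("api", "Request received")     # suppressed
```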
Things I still have to work on:
I am not at this time asking for additional feature ideas OR TTS engines to integrate into V2. I'm trying to get the base code working, stable and clear to understand, with a few additional TTS engines added in.
I hope at some point to release a BETA version, and then I'm happy to take feedback and work on bugs, along with other possible features/integrations.
Thanks