Notes from TTS Experimentation

For the TTS pipeline, we tried all of the top models from HuggingFace and Reddit.

The goal was to use models that were easy to set up and sounded less robotic, with the ability to include sound effects such as laughter.

Parler-TTS

Minimal code to run their models:

import torch
import IPython.display as ipd
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

# Use a GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

# Define text and description
text_prompt = "This is where the actual words to be spoken go"
description = """
Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.
"""

# The description conditions the voice; the text prompt holds the words to speak
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()

# Play the audio inline in a notebook
ipd.Audio(audio_arr, rate=model.config.sampling_rate)

The really cool aspect of these models is the ability to prompt with a description, which can change the speaker profile and pacing of the outputs.

Surprisingly, Parler's mini model sounded more natural.

Their repo lists speaker names that we can use in the description prompt.
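Because the description is plain text, it can be assembled from a few fields before it is tokenized. The following is a minimal sketch, assuming a hypothetical helper `build_description` with illustrative field names; it is not part of Parler-TTS, and the wording simply mirrors the example description used in the code above.

```python
def build_description(speaker: str, style: str, pace: str, recording: str) -> str:
    """Assemble a Parler-TTS style description string (hypothetical helper)."""
    return (
        f"{speaker}'s voice is {style} in delivery, "
        f"speaking at a {pace} pace with {recording}."
    )


# Field values below reproduce the example description from these notes
description = build_description(
    speaker="Laura",
    style="expressive and dramatic",
    pace="fast",
    recording="a very close recording that almost has no background noise",
)
print(description)
```

Swapping the `speaker` or `pace` argument is then enough to try a different voice profile without retyping the whole description.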

Suno/Bark

Minimal code to run bark:

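import torch
from IPython.display import Audio
from transformers import AutoProcessor, BarkModel

device = "cuda" if torch.cuda.is_available() else "cpu"
# Checkpoint name is an assumption; the notes do not say which Bark checkpoint was used
processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark").to(device)

voice_preset = "v2/en_speaker_6"
sampling_rate = 24000

text_prompt = """
Exactly! [sigh] And the distillation part is where you take a LARGE-model, and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
"""
inputs = processor(text_prompt, voice_preset=voice_preset).to(device)

speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)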

Similar to the Parler models, Suno has a library of speakers.

v9 from their library sounded robotic, so we use Parler for our first speaker and the best-sounding speaker from Bark.

The incredible thing about the Bark model is the ability to add sound effects such as [Laugh], [Gasps], [Sigh], and [clears throat]. Writing words in capitals causes the model to emphasize them.

Adding a hyphen (-) gives a break in the text. We utilize this knowledge when we re-write the transcript using the 8B model to add effects to our transcript.

Note: The authors suggest using an ellipsis (...) for this. However, it didn't work as effectively as adding a hyphen during trials.
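The markup conventions above can be applied to plain text with a small helper. This is a rough sketch only: `add_bark_markup` is a hypothetical function written for illustration, not part of Bark, and in these notes the actual re-writing is done by the 8B model rather than by rules like these.

```python
import re


def add_bark_markup(text: str, emphasize=(), effect: str = "") -> str:
    """Hypothetical helper: apply Bark-style markup to plain text."""
    for word in emphasize:
        # Whole-word, case-insensitive match; write the word in capitals for emphasis
        text = re.sub(
            rf"\b{re.escape(word)}\b",
            lambda m: m.group(0).upper(),
            text,
            flags=re.IGNORECASE,
        )
    # Swap "..." for a hyphen, which gave a more reliable break in these notes
    text = text.replace("...", " -")
    if effect:
        # Sound effects are written in square brackets, e.g. [sigh]
        text = f"[{effect}] {text}"
    return text


marked = add_bark_markup(
    "That is a large model... and we compress it.",
    emphasize=["large"],
    effect="sigh",
)
print(marked)  # [sigh] That is a LARGE model - and we compress it.
```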

Hyper-parameters:

Bark models have two parameters we can tweak: temperature and semantic_temperature.

Below are the notes from a sweep. The prompt and speaker were fixed, and this was a vibe test to see which combination gives the best results. Each entry lists temperature and semantic_temperature, in that order:

First, fix temperature and sweep semantic_temperature

  • 0.7, 0.2: Quite bland and boring
  • 0.7, 0.3: An improvement over the previous one
  • 0.7, 0.4: Further improvement
  • 0.7, 0.5: This one didn't work
  • 0.7, 0.6: So-So, didn't stand out
  • 0.7, 0.7: The best so far
  • 0.7, 0.8: Further improvement
  • 0.7, 0.9: Mixed feelings on this one

Now sweeping the temperature

  • 0.1, 0.9: Very Robotic
  • 0.2, 0.9: Less Robotic but not convincing
  • 0.3, 0.9: Slight improvement, still not fun
  • 0.4, 0.9: Still has a robotic tinge
  • 0.5, 0.9: The laugh was weird on this one, and the voice modulates so much that it feels like the speaker is changing
  • 0.6, 0.9: Most consistent voice but has a robotic after-taste
  • 0.7, 0.9: Very robotic and laugh was weird
  • 0.8, 0.9: Completely ignored the laughter, but it was more natural
  • 0.9, 0.9: We have a winner probably
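The two-phase sweep above can be enumerated programmatically. The sketch below only builds the list of pairs; `sweep_grid` is a hypothetical helper name, and the generation call is left as a comment because it needs the Bark model and inputs to be loaded first.

```python
def sweep_grid():
    """Return the (temperature, semantic_temperature) pairs listed in the notes."""
    # Phase 1: temperature fixed at 0.7, semantic_temperature from 0.2 to 0.9
    phase_one = [(0.7, round(s / 10, 1)) for s in range(2, 10)]
    # Phase 2: semantic_temperature fixed at 0.9, temperature from 0.1 to 0.9
    phase_two = [(round(t / 10, 1), 0.9) for t in range(1, 10)]
    return phase_one + phase_two


for temperature, semantic_temperature in sweep_grid():
    # With Bark loaded, each pair would be passed as:
    # model.generate(**inputs, temperature=temperature,
    #                semantic_temperature=semantic_temperature)
    print(f"temperature={temperature}, semantic_temperature={semantic_temperature}")
```

The pair (0.7, 0.9) shows up in both phases, which matches the notes: it was listened to twice and judged slightly differently each time.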

After this, about 30 more sweeps were done with the promising combinations.

Best results are at:

speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

Notes from other models that were tested:

Promising directions to explore in future:

  • MeloTTS: This is the most popular (ever) on HuggingFace
  • WhisperSpeech: Sounded quite natural as well
  • F5-TTS: Was the latest release at this time; however, it felt a bit robotic
  • E2-TTS: r/LocalLLaMA claims this to be a little better; however, it didn't pass the vibe test
  • xTTS: It has great documentation and also seems promising

Some more models that weren't tested:

In other words, we leave this as an exercise to readers :D