From a8059a4ea6e9397f878237ae0ceec7f13a913491 Mon Sep 17 00:00:00 2001
From: S V Praveen <43694567+svp19@users.noreply.github.com>
Date: Thu, 13 Jun 2024 16:04:50 +0530
Subject: [PATCH] Update README.md

---
 README.md | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/README.md b/README.md
index a1ae672..57b21ad 100644
--- a/README.md
+++ b/README.md
@@ -1,17 +1,16 @@
-# IndicVoices-R
-Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS
+# IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS
 
+### Abstract
 Recent advancements in text-to-speech (TTS) synthesis show that large-scale models trained with extensive web data produce highly natural-sounding output. However, such data is scarce for Indian languages due to the lack of high-quality, manually subtitled data on platforms like LibriVox or YouTube. To address this gap, we enhance existing large-scale ASR datasets containing natural conversations collected in low-quality environments to generate high-quality TTS training data. Our pipeline leverages the cross-lingual generalization of denoising and speech enhancement models trained on English and applied to Indian languages. This results in IndicVoices-R (IV-R), the largest multilingual Indian TTS dataset derived from an ASR dataset, with 1,704 hours of high-quality speech from 10,496 speakers across 22 Indian languages. IV-R matches the quality of gold-standard TTS datasets like LJSpeech, LibriTTS, and IndicTTS. We also introduce the IV-R Benchmark, the first to assess zero-shot, few-shot, and many-shot speaker generalization capabilities of TTS models on Indian voices, ensuring diversity in age, gender, and style. We demonstrate that fine-tuning an English pre-trained model on a combined dataset of high-quality IndicTTS and our IV-R dataset results in better zero-shot speaker generalization compared to fine-tuning on the IndicTTS dataset alone. Further, our evaluation reveals limited zero-shot generalization for Indian voices in TTS models trained on prior datasets, which we improve by fine-tuning the model on our data containing a diverse set of speakers across language families. We open-source all data and code, releasing the first TTS model for all 22 official Indian languages.
 
 
-## Resources
+### Resources
 
 Download the data [here](https://ai4bharat.iitm.ac.in/indicvoices_r/)
 
 ### Manifest Format
 
-```
-{
+````
     "filename": "<AUDIOS/audios>/2533274790514854_chunk_4.wav",                          # Points to the wav file
     "text": "<TRANSCRIPT>",                   # Transcript for audio, we use Normalized version of the transcript
     "duration": <DURATION>,                                                          #  Audio duration in seconds
@@ -37,7 +36,7 @@ Download the data [here](https://ai4bharat.iitm.ac.in/indicvoices_r/)
     "utterance_pitch_mean": xx.xx,
     "utterance_pitch_std": xx.xx,
     "cer": 0.xx,
-```
+````
 
 ### LICENSE