From 72f90698fed7e3765106a1c2b8c1587332eaca8d Mon Sep 17 00:00:00 2001
From: imradhe
Date: Fri, 13 Sep 2024 20:00:54 +0530
Subject: [PATCH] Removed lorem and added content

---
 frontend/components/Datasets.tsx | 13 +------------
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/frontend/components/Datasets.tsx b/frontend/components/Datasets.tsx
index c9f448f..d19eb4f 100644
--- a/frontend/components/Datasets.tsx
+++ b/frontend/components/Datasets.tsx
@@ -126,18 +126,7 @@ export default function Datasets() {
- Lorem ipsum dolor sit amet consectetur adipisicing elit. Ex facere,
- eaque accusamus nostrum quo voluptates omnis beatae soluta sequi!
- Magni, iste excepturi. Repudiandae vero dolore sapiente veritatis
- minus, accusantium praesentium non voluptatibus, vitae et pariatur
- beatae, consectetur qui ducimus. Sunt, temporibus quidem, recusandae
- facere necessitatibus expedita saepe quos eos vitae harum reiciendis
- sequi id aspernatur itaque voluptatum optio delectus vero, voluptas
- numquam rerum corrupti cupiditate. Error libero beatae excepturi
- maxime quia velit blanditiis ea officiis, repellendus et vero
- aspernatur totam amet! Laborum, sunt accusantium. Tenetur expedita
- cum numquam, veritatis exercitationem eos voluptates a autem
- molestias possimus aperiam repellendus quaerat doloribus.
+ Early on in our journey, we recognized that advancing Indian technology necessitates large-scale datasets. Thus, building and collecting extensive datasets across multiple verticals has become a critical endeavor at AI4Bharat. Thanks to generous grants from MeitY, we are spearheading pioneering efforts in data collection as part of the Data Management Unit of Bhashini. Our nationwide initiative aims to gather 15,000 hours of transcribed data from over 400 districts, encompassing all 22 scheduled languages of India. In parallel, our in-house team of over 100 translators is diligently creating a parallel corpus with 2.2 million translation pairs across 22 languages. To produce studio-quality data for expressive TTS systems, we have established recording studios in our lab, where professional voice artists contribute their expertise. Additionally, our annotators are meticulously labeling pages for Document Layout Parsing, accommodating the diverse scripts of India. To accelerate the development of Indic Large Language Models (LLMs), we are focused on building pipelines for curating and synthetically generating pre-training data, collecting contextually grounded prompts, and creating evaluation datasets that reflect India’s rich linguistic tapestry. Collecting and annotating data at this scale demands standardization of processes and tools. To meet this challenge, AI4Bharat has invested in developing various open-source data collection and annotation tools, aiming to enhance these efforts not only within India but also in multilingual regions across the globe.