Commit

Removed lorem and added content
imradhe committed Sep 13, 2024
1 parent 3f9b0b9 commit 72f9069
Showing 1 changed file with 1 addition and 12 deletions.
13 changes: 1 addition & 12 deletions frontend/components/Datasets.tsx
@@ -126,18 +126,7 @@ export default function Datasets() {
 </Heading>
 
 <Text color={"black"}>
-Lorem ipsum dolor sit amet consectetur adipisicing elit. Ex facere,
-eaque accusamus nostrum quo voluptates omnis beatae soluta sequi!
-Magni, iste excepturi. Repudiandae vero dolore sapiente veritatis
-minus, accusantium praesentium non voluptatibus, vitae et pariatur
-beatae, consectetur qui ducimus. Sunt, temporibus quidem, recusandae
-facere necessitatibus expedita saepe quos eos vitae harum reiciendis
-sequi id aspernatur itaque voluptatum optio delectus vero, voluptas
-numquam rerum corrupti cupiditate. Error libero beatae excepturi
-maxime quia velit blanditiis ea officiis, repellendus et vero
-aspernatur totam amet! Laborum, sunt accusantium. Tenetur expedita
-cum numquam, veritatis exercitationem eos voluptates a autem
-molestias possimus aperiam repellendus quaerat doloribus.
+Early on in our journey, we recognized that advancing Indian technology necessitates large-scale datasets. Thus, building and collecting extensive datasets across multiple verticals has become a critical endeavor at AI4Bharat. Thanks to generous grants from MeitY, we are spearheading pioneering efforts in data collection as part of the Data Management Unit of Bhashini. Our nationwide initiative aims to gather 15,000 hours of transcribed data from over 400 districts, encompassing all 22 scheduled languages of India. In parallel, our in-house team of over 100 translators is diligently creating a parallel corpus with 2.2 million translation pairs across 22 languages. To produce studio-quality data for expressive TTS systems, we have established recording studios in our lab, where professional voice artists contribute their expertise. Additionally, our annotators are meticulously labeling pages for Document Layout Parsing, accommodating the diverse scripts of India. To accelerate the development of Indic Large Language Models (LLMs), we are focused on building pipelines for curating and synthetically generating pre-training data, collecting contextually grounded prompts, and creating evaluation datasets that reflect India’s rich linguistic tapestry. Collecting and annotating data at this scale demands standardization of processes and tools. To meet this challenge, AI4Bharat has invested in developing various open-source data collection and annotation tools, aiming to enhance these efforts not only within India but also in multilingual regions across the globe.
 </Text>
 </Box>
 </Flex>