Commit

Modified Area Info Paras and Added Recorder Package
Shanks0465 committed Aug 28, 2024
1 parent 5d78d1a commit 659e670
Showing 5 changed files with 151 additions and 8 deletions.
32 changes: 29 additions & 3 deletions frontend/components/Datasets.tsx
@@ -15,6 +15,7 @@ import {
SkeletonCircle,
SkeletonText,
Link,
+ Image as ChakraImage,
} from "@chakra-ui/react";
import Image from "next/image";
import axios from "axios";
@@ -127,19 +128,44 @@ export default function Datasets() {
>
<Box
position={"relative"}
height={"300px"}
rounded={"2xl"}
boxShadow={"2xl"}
width={"full"}
overflow={"hidden"}
>
- <Image
+ <ChakraImage
alt={"Hero Image"}
fill
src={`${imagePrefix}/assets/data-collection.png`}
/>
</Box>
</Flex>
+ <Text>
+ Early on in our journey, we recognized that advancing Indian
+ technology necessitates large-scale datasets. Thus, building and
+ collecting extensive datasets across multiple verticals has become a
+ critical endeavor at AI4Bharat. Thanks to generous grants from
+ MeitY, we are spearheading pioneering efforts in data collection as
+ part of the Data Management Unit of Bhashini. Our nationwide
+ initiative aims to gather 15,000 hours of transcribed data from over
+ 400 districts, encompassing all 22 scheduled languages of India. In
+ parallel, our in-house team of over 100 translators is diligently
+ creating a parallel corpus with 2.2 million translation pairs across
+ 22 languages. To produce studio-quality data for expressive TTS
+ systems, we have established recording studios in our lab, where
+ professional voice artists contribute their expertise. Additionally,
+ our annotators are meticulously labeling pages for Document Layout
+ Parsing, accommodating the diverse scripts of India. To accelerate
+ the development of Indic Large Language Models (LLMs), we are
+ focused on building pipelines for curating and synthetically
+ generating pre-training data, collecting contextually grounded
+ prompts, and creating evaluation datasets that reflect India’s rich
+ linguistic tapestry. Collecting and annotating data at this scale
+ demands standardization of processes and tools. To meet this
+ challenge, AI4Bharat has invested in developing various open-source
+ data collection and annotation tools, aiming to enhance these
+ efforts not only within India but also in multilingual regions
+ across the globe.
+ </Text>
</Stack>
{isLoading ? (
<SimpleGrid columns={{ base: 1, md: 3 }} spacing={10}>
17 changes: 16 additions & 1 deletion frontend/components/Dynamic/Area.tsx
@@ -23,7 +23,7 @@ const areaInfo: { [key: string]: { title: string; description: string } } = {
nmt: {
title: "Machine Translation",
description:
- "AI4Bharat is a pioneering initiative focused on building open-source AI solutions that address challenges unique to India. One of their significant contributions is in the field of machine translation, where they aim to bridge the linguistic diversity of the country. AI4Bharat has developed state-of-the-art models that facilitate the translation of text between Indian languages, enabling seamless communication across different linguistic communities. Their work includes creating large-scale datasets, fine-tuning models for regional languages, and ensuring these tools are accessible to developers and researchers. This initiative not only promotes inclusivity but also helps preserve the rich linguistic heritage of India by making digital content available in multiple languages.",
+ "Our machine translation models, including IndicTransv2, are built on large-scale datasets mined from the web and carefully curated human translations, catering to all 22 Indian languages and competing with commercial models as validated on multiple benchmarks.",
},
llm: {
title: "Large Language Models",
@@ -40,6 +40,21 @@ const areaInfo: { [key: string]: { title: string; description: string } } = {
models, while ensuring diversity in their generation capabilities, thereby advancing the frontier of
language technology for India’s diverse linguistic landscape.`,
},
+ asr: {
+ title: "Automatic Speech Recognition",
+ description:
+ "Our ASR models, including IndicWav2Vec and IndicWhisper, are trained on rich datasets like Kathbath, Shrutilipi and IndicVoices, covering multiple Indian languages.",
+ },
+ tts: {
+ title: "Speech Synthesis",
+ description:
+ "AI4Bharat’s TTS efforts, exemplified by AI4BTTS, focus on creating natural-sounding synthetic voices for Indian languages using a mix of web-crawled data and carefully curated datasets like Rasa.",
+ },
+ xlit: {
+ title: "Transliteration",
+ description:
+ "AI4Bharat’s transliteration models, like IndicXlit, are optimized for converting text between scripts of Indian languages and English, leveraging large scale datasets such as Aksharantar",
+ },
};

const fetchAreaData = async (slug: string) => {
8 changes: 4 additions & 4 deletions frontend/components/Features.tsx
@@ -103,7 +103,7 @@ export default function Features() {
/>
}
description={
- "AI4Bharat has pioneered the development of multilingual LLMs tailored for Indian languages, such as IndicBERT, IndicBART, and Airavata trained on extensive, diverse datasets like IndicCorpora and Sangraha."
+ "Our machine translation models, including IndicTransv2, are built on large-scale datasets mined from the web and carefully curated human translations, catering to all 22 Indian languages and competing with commercial models as validated on multiple benchmarks."
}
href={`${imagePrefix}/areas/nmt`}
/>
@@ -118,7 +118,7 @@ export default function Features() {
/>
}
description={
- "AI4Bharat has pioneered the development of multilingual LLMs tailored for Indian languages, such as IndicBERT, IndicBART, and Airavata trained on extensive, diverse datasets like IndicCorpora and Sangraha."
+ "AI4Bharat’s transliteration models, like IndicXlit, are optimized for converting text between scripts of Indian languages and English, leveraging large scale datasets such as Aksharantar"
}
href={`${imagePrefix}/areas/xlit`}
/>
@@ -133,7 +133,7 @@ export default function Features() {
/>
}
description={
- "AI4Bharat has pioneered the development of multilingual LLMs tailored for Indian languages, such as IndicBERT, IndicBART, and Airavata trained on extensive, diverse datasets like IndicCorpora and Sangraha."
+ "Our ASR models, including IndicWav2Vec and IndicWhisper, are trained on rich datasets like Kathbath, Shrutilipi and IndicVoices, covering multiple Indian languages."
}
href={`${imagePrefix}/areas/asr`}
/>
@@ -148,7 +148,7 @@ export default function Features() {
/>
}
description={
- "AI4Bharat has pioneered the development of multilingual LLMs tailored for Indian languages, such as IndicBERT, IndicBART, and Airavata trained on extensive, diverse datasets like IndicCorpora and Sangraha."
+ "AI4Bharat’s TTS efforts, exemplified by AI4BTTS, focus on creating natural-sounding synthetic voices for Indian languages using a mix of web-crawled data and carefully curated datasets like Rasa."
}
href={`${imagePrefix}/areas/tts`}
/>
101 changes: 101 additions & 0 deletions frontend/package-lock.json

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions frontend/package.json
@@ -23,6 +23,7 @@
"markdown-to-jsx": "^7.5.0",
"next": "14.2.5",
"react": "^18",
+ "react-audio-voice-recorder": "^2.2.0",
"react-dom": "^18",
"react-icons": "^5.3.0",
"react-markdown": "^9.0.1",
