Home
Aravinth Bheemaraj edited this page Oct 23, 2020 · 2 revisions
This repository contains parallel language corpus links for popular Indian languages developed as part of the Anuvaad project.
Please reach out to [email protected] for any clarification/interpretation/usage of the linked datasets.
Current status of the parallel corpus*:
Language Pair | Parallel Corpus Count |
---|---|
English-Hindi | 228,631 |
This dataset is growing every day!
The goal is to build a high-quality parallel corpus for Indian languages across various domains (judicial, educational, medical, news, etc.). It can eventually be used to train ML models for specific use cases.
Read more about Anuvaad @ http://anuvaad.org/
The code for building the datasets listed below is available at https://github.com/project-anuvaad/anuvaad-corpus-tools
PIB (2016-2020) - Created from the parallel reports available on the PIB site
Year | En-Hi pairs count |
---|---|
2020 | 65,149 |
2019 | 41,695 |
2018 | 50,628 |
2017 | 32,113 |
2016 | 39,046 |
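As a quick consistency check, the yearly PIB counts above can be summed and compared with the English-Hindi total reported in the status table. The sketch below only restates the numbers from this page; the variable names are illustrative and are not part of the Anuvaad tooling.

```python
# Yearly English-Hindi pair counts, copied from the PIB table on this page.
pib_pairs_by_year = {
    2020: 65149,
    2019: 41695,
    2018: 50628,
    2017: 32113,
    2016: 39046,
}

# Total reported in the "Parallel Corpus Count" column for English-Hindi.
reported_total = 228631

computed_total = sum(pib_pairs_by_year.values())
print(f"Computed total: {computed_total:,}")
print(f"Matches reported total: {computed_total == reported_total}")
```

With the figures as listed, the yearly counts add up to 228,631, so the English-Hindi total on this page corresponds to the PIB data alone.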