This project scrapes data from the PoshanTracker website to build a parallel English-Indic dataset. The Indic-language data is then translated into English with two models, indicTrans and helsinki-nlp/opus-mt-mul-en. Finally, BLEU and chrF scores are calculated for the translations.
Here is a list of all webpages scraped:
Assuming you have pip, python, pandas and numpy installed,
- Clone the repository
git clone https://github.com/Srija616/translationTask.git
- Install the requirements:
pip install -r requirements.txt
- Tasks 1, 2 and 3 require the indic-nlp library and the indicTrans translation model, so run the script.sh file:
bash script.sh
Note: After cloning indicTrans, add an __init__.py file to the indicTrans directory. Also, in indicTrans\inference\engine.py, change the import statement
from inference.custom_interactive import Translator
to
from indicTrans.inference.custom_interactive import Translator
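The two manual steps in the note can also be scripted. The following is only a sketch: `patch_engine_import` and `apply_patch` are hypothetical helpers that are not part of the repository, and the sketch assumes the old import sits alone on its own line in engine.py.

```python
from pathlib import Path

OLD_IMPORT = "from inference.custom_interactive import Translator"
NEW_IMPORT = "from indicTrans.inference.custom_interactive import Translator"


def patch_engine_import(source: str) -> str:
    """Return the source text with the old import line replaced by the new one."""
    patched = []
    for line in source.splitlines(keepends=True):
        # Only touch lines that consist of exactly the old import statement.
        if line.strip() == OLD_IMPORT:
            line = line.replace(OLD_IMPORT, NEW_IMPORT)
        patched.append(line)
    return "".join(patched)


def apply_patch(indictrans_dir: Path) -> None:
    """Create __init__.py and rewrite the import in inference/engine.py (assumed layout)."""
    (indictrans_dir / "__init__.py").touch()
    engine = indictrans_dir / "inference" / "engine.py"
    text = engine.read_text(encoding="utf-8")
    engine.write_text(patch_engine_import(text), encoding="utf-8")
```

Calling `apply_patch(Path("indicTrans"))` from the repository root after running script.sh would apply both changes. Running it a second time leaves the file unchanged, because the already patched line no longer matches the old import.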
After installing the requirements, run the web_scraper.py file. The web_scraper.py module extracts data from PoshanTracker and its internally linked webpages.
The extracted data is divided into three sets:
- raw_data: Bilingual CSV files in the en-indiclang format. raw_data is aligned manually (or with a small script, align.py). In Task 2, cleaning.py then processes the data to remove duplicates and text from other languages (in particular, separating English text from Indic text).
  Languages: ['bn', 'gu', 'hi', 'kn', 'ml', 'mr', 'or', 'pa', 'ta', 'te'] i.e. [Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Panjabi, Tamil, Telugu]
- unsupported_clean: For languages that are not supported by fasttext-langdetect, monolingual data is cleaned to remove duplicates, numeric text and text with fewer than 2 characters, and added here.
  Languages: ['as', 'ne', 'doi', 'kok', 'sd', 'brx', 'mai', 'mni', 'sat', 'ur', 'ks'] i.e. [Assamese, Nepali, Dogri, Konkani, Sindhi, Bodo, Maithili, Manipuri, Santhali, Urdu, Kashmiri]
- supported_clean: For languages that are supported by fasttext-langdetect and are also in the common_supported languages (i.e. translation is supported by both indicTrans and Helsinki), monolingual data is cleaned to remove other languages, duplicates, numeric text and text with fewer than 2 characters, and added here.
  Languages: ['bn', 'gu', 'hi', 'kn', 'ml', 'mr', 'or', 'pa', 'ta', 'te', 'en'] i.e. [Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Panjabi, Tamil, Telugu, English]
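The language-independent cleaning rules listed above (duplicates, numeric text, text with fewer than 2 characters) could look roughly like the sketch below. It is not the code in cleaning.py: the function names and the set of characters treated as number separators are assumptions, and the language-filtering step is left out because it depends on fasttext-langdetect.

```python
import re

# Characters that may appear inside a number, e.g. "1,234.50" or "12/05".
# This particular set is an assumption made for the sketch.
_SEPARATORS = re.compile(r"[\s.,:/%+\-]")


def is_numeric_text(text):
    """Return True if the text is made up only of digits and number separators."""
    digits_only = _SEPARATORS.sub("", text)
    # str.isdigit() also accepts Indic digits such as Devanagari numerals.
    return digits_only.isdigit()


def clean_sentences(sentences):
    """Drop duplicates, numeric text and text shorter than 2 characters, keeping order."""
    seen = set()
    cleaned = []
    for sentence in sentences:
        text = sentence.strip()
        if len(text) < 2 or is_numeric_text(text):
            continue
        if text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

Keeping the first occurrence of each sentence and preserving order matters for the bilingual files, where row positions are what tie an Indic sentence to its English counterpart.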
To adapt the scraper to another website, change the function get_language_driver in web_scraper.py to suit that site. Also update the list variables languages and langugae_codes in the config.py module to match the site; a single language is fine too. I have added the result of scraping www.google.com to the repository in the scrape_external.csv file. If you do not want to call get_language_driver, comment out the following code block in the get_all_data function.
    try:
        driver = get_language_driver(lang, driver, url)
    except:
        print (f"Unable to get a language driver for {language_isocode}, possibly you have sent too many get requests in a short time and getting HTTP 429 error")
        return []
Input directory: raw_data (contains the output of Task 1 after semi-automated alignment with align.py)
Output directory: raw_data_new
After aligning the text in raw_data with the help of align.py, clean the text using cleaning.py
For testing purposes, use raw_data_new as the input directory as well. cleaning.py performs the following tasks:
- Preprocessing - Remove duplicates, remove data from other languages, remove data that is numeric or of length less than 2 characters.
- Language detection - Done for all languages irrespective of support with the fasttext-langdetect library.
NOTE: Low-resource languages that share a script or vocabulary with a high-resource language are detected as that high-resource language: for example, Maithili and Sindhi as Hindi, Konkani as Marathi, and Manipuri as Bengali. The library officially supports Sindhi and Maithili, yet it fails to detect them correctly even when the complete data is provided for detection. Unsupported languages such as Dogri, Bodo and Santhali are detected as English.
Original language | Detected language |
---|---|
Bengali | Bengali |
Gujarati | Gujarati |
Hindi | Hindi |
Kannada | Kannada |
Malayalam | Malayalam |
Marathi | Marathi |
Odia | Odia |
Panjabi | Panjabi |
Tamil | Tamil |
Telugu | Telugu |
Assamese | Assamese |
Bodo | English |
Dogri | English |
Konkani | Marathi |
Maithili | Hindi |
Manipuri | Bengali |
Nepali | Nepali |
Santhali | English |
Sindhi | Hindi |
Urdu | Urdu |
Kashmiri | Urdu |
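One way to see why a shared script makes these languages hard to separate is to reduce each text to its dominant Unicode script. The sketch below is not how fasttext-langdetect works; `dominant_script` is a hypothetical helper that uses only the standard library and reads the script from the first word of each character's Unicode name.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most frequent script among the letters of text, or None."""
    counts = Counter()
    for char in text:
        if not char.isalpha():
            continue  # skip digits, punctuation, spaces and combining vowel signs
        name = unicodedata.name(char, "")
        if name:
            # e.g. "DEVANAGARI LETTER KA" -> "DEVANAGARI".
            # Only the first word is kept, so "OL CHIKI" is reported as "OL".
            counts[name.split(" ")[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Any two texts written in the same script give the same answer here, so a script check alone cannot tell such languages apart, which is consistent with the confusions in the table. It might still be useful as a sanity check, for instance to see whether rows detected as English are actually Latin-script text.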
Run indicTrans.py; the output is stored in the translated_indicTrans directory.
Run helsinki.py; the output is stored in the translated_helsinki directory.
Run calculate_metrics.py to get the chrF and BLEU scores.
No. | language_code | bleu_indicTrans | chrf_indicTrans | bleu_helsinki | chrf_helsinki |
---|---|---|---|---|---|
1 | bn | 40.361451397114195 | 67.29100086720587 | 15.008718294521943 | 38.459893968335855 |
2 | gu | 40.09242647814781 | 62.38047934152201 | 17.519435117029424 | 41.67385242030943 |
3 | hi | 35.46963979966534 | 61.12925209053163 | 11.225427865557949 | 38.59222728367872 |
4 | kn | 32.147355804677694 | 56.59004522220443 | 15.618151502462212 | 38.06799170855191 |
5 | ml | 37.8225649633747 | 61.7368338997521 | 15.2814778557767 | 39.55520687508669 |
6 | mr | 36.618912232159545 | 61.060112673233135 | 15.732443365648534 | 40.030517925746025 |
7 | or | 40.61712977226381 | 64.53967600298299 | 11.310679447675124 | 36.57590226441299 |
8 | pa | 33.19827639988097 | 56.28109096317498 | 16.424478002794018 | 40.2022064113979 |
9 | ta | 32.618353590741094 | 55.570154385196666 | 14.990324396869909 | 38.452540317035634 |
10 | te | 39.97781730506786 | 62.885759163173674 | 20.18320840530373 | 43.50483959356902 |
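For readers unfamiliar with chrF, the sketch below shows the idea behind the metric: character n-gram precision and recall, averaged over n-gram orders and combined into an F-score that weights recall more heavily. It is a simplified sentence-level version, not the code in calculate_metrics.py, so its numbers will not reproduce the table. The function names, the choice to ignore spaces, and the defaults (orders 1 to 6, beta of 2) are assumptions of this sketch.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified chrF on a 0-100 scale for one hypothesis/reference pair."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return 100 * (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

Because it works on characters rather than whole words, a score of this kind still gives partial credit to near-miss transliterations such as "Posan" for "Poshan", whereas a word-level metric like BLEU counts them as complete misses.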
My observations on the translations:
- The indicTrans translations are far better than those of helsinki-nlp/opus-mt-mul-en for most of the Indic languages.
- helsinki-nlp/opus-mt-mul-en produces garbage text for certain languages such as Bengali and Odia. E.g. "State Helpdesk" in Bengali is translated to "The Kingdom Hall of Jehovah’s Witnesses".
- Some words do not seem to be part of the vocabulary. When it cannot translate, indicTrans does a better job of transliteration than the Helsinki model. However, I also noticed that a word like "Poshan" is rendered inconsistently, as "Posian", "Posan" or "animal" [refer to the Bengali translations].
- indicTrans adds certain characters such as %s or (s) before or after translations. The translation itself is correct, but these extra characters could affect the BLEU and chrF scores.
- The amount of data differs from language to language, which can affect the scores. This is due to differences in how the data was extracted from the website.
- While the absolute scores may not paint an accurate picture, the relative scores of Helsinki and indicTrans agree with my own observations of the data: indicTrans is significantly better than helsinki-nlp/opus-mt-mul-en.
Observations for the project
- Cleaning the data for each language was not difficult. However, ensuring that all data was extracted from the website and aligning it with the English text was challenging and required looking at each CSV individually. After manually cleaning a number of languages, I could see patterns and guesstimate the indices and text around which the alignment could have gone wrong. I have not investigated why this happened only in some of the languages (the errors were not consistent across languages); looking at how the website adds this data could help.