- Make sure you have a Python (e.g. conda) environment with the packages from `requirements.txt`.
- Run the `download_parallel_corpora.py` script (find the instructions below). It will download all primary data + NLLB Seed (but will not download the 3 datasets mentioned below which require logging in, etc.) and prepare them so that they satisfy a common directory/file structure (see the header of the `download_parallel_corpora.py` script for more info).
- (Optional - advanced) If you wish to analyze the data for each of the languages in the primary dataset, check out the `analyse_data.py` script.
- (Optional - advanced) Check out some notes we compiled by manually analyzing the datasets available for download from our `download_parallel_corpora.py` script here.
- Similarly to above, just use the `download_MaCoCu` function from the `download_parallel_corpora.py` script and comment out everything else.
- Also run the `download_opus.py` script to get the OPUS data.
You can find more information down below.
Slavic languages supported by NLLB (and the script that is supported):
- Belarusian (Cyrl)
- Bosnian (Latn)
- Bulgarian (Cyrl)
- Czech (Latn)
- Croatian (Latn)
- Macedonian (Cyrl)
- Polish (Latn)
- Silesian (Latn)
- Russian (Cyrl)
- Slovak (Latn)
- Slovenian aka Slovene (Latn)
- Serbian (Cyrl)
- Ukrainian (Cyrl)
Baltic:
- Lithuanian (Latn)
- Latvian (Latn)
- Latgalian (Latn)
The only Baltic language that is not supported is Samogitian (~400k speakers).
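For scripting over these languages, the lists above can be written as a lookup table. The sketch below assumes the FLORES-200 style `lang_Script` codes (for example `bel_Cyrl`). The codes and the names `SLAVIC`, `BALTIC` and `by_script` are illustrative and not part of the scripts in this repository, so check them against the language list of the NLLB release you use.

```python
# Language name -> FLORES-200 style code (ISO 639-3 language + ISO 15924 script).
# Assumption: these follow the NLLB / FLORES-200 naming; verify before relying on them.
SLAVIC = {
    "Belarusian": "bel_Cyrl",
    "Bosnian": "bos_Latn",
    "Bulgarian": "bul_Cyrl",
    "Czech": "ces_Latn",
    "Croatian": "hrv_Latn",
    "Macedonian": "mkd_Cyrl",
    "Polish": "pol_Latn",
    "Silesian": "szl_Latn",
    "Russian": "rus_Cyrl",
    "Slovak": "slk_Latn",
    "Slovenian": "slv_Latn",
    "Serbian": "srp_Cyrl",
    "Ukrainian": "ukr_Cyrl",
}

BALTIC = {
    "Lithuanian": "lit_Latn",
    "Latvian": "lvs_Latn",  # listed as Standard Latvian in FLORES-200
    "Latgalian": "ltg_Latn",
}


def by_script(table, script):
    """Return the language names in `table` written in `script`, sorted by name."""
    suffix = "_" + script
    return sorted(name for name, code in table.items() if code.endswith(suffix))
```

For example, `by_script(SLAVIC, "Cyrl")` selects the Cyrillic-script languages from the Slavic list.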
The script `download_parallel_corpora.py` is provided for convenience to automate the download of many publicly available sources of MT data which were used to train NLLB models. You should provide a parent directory into which to save the data. Usage is as follows:
python download_parallel_corpora.py --directory $DOWNLOAD_DIRECTORY
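If you prefer to launch the download from Python, for instance as one step of a larger setup script, a minimal sketch could look like the following. Only the script name and the `--directory` flag come from the usage line above. The helper names `build_download_command` and `run_download` are hypothetical, and creating the target directory up front is an assumption (the script may already do this itself).

```python
import subprocess
import sys
from pathlib import Path


def build_download_command(download_dir, script="download_parallel_corpora.py"):
    """Build the argument list for the download script (hypothetical helper)."""
    directory = Path(download_dir).expanduser()
    # sys.executable keeps the call inside the currently active environment.
    return [sys.executable, script, "--directory", str(directory)]


def run_download(download_dir):
    """Create the parent directory if needed and run the download script."""
    Path(download_dir).expanduser().mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run(build_download_command(download_dir), check=True)
```

Calling `run_download("~/nllb_data")` from the directory that contains the script would then be equivalent to the command shown above, with `~/nllb_data` as an example location.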
Note that there are a number of other ad hoc datasets for which we are not able to automate this process because they require an account or login of some kind:
- Chichewa News (https://zenodo.org/record/4315018#.YaTuk_HMJDY)
- GELR (Ewe-Eng) (https://www.kaggle.com/yvicherita/ewe-language-corpus)
- Lorelei (https://catalog.ldc.upenn.edu/LDC2021T02)
Important note on JW300 (described in https://aclanthology.org/P19-1310/): at the time of final publication, the JW300 corpus was no longer publicly available for MT training because of licensing issues with the Jehovah's Witnesses organization, though it had already been used for the NLLB project. We hope that it may soon be made available again.
NLLB-Seed datasets are included along with the public datasets above to create our primary dataset. NLLB-Seed data can be downloaded from here.
LASER3 encoders and mined bitext metadata are open sourced in the LASER repository. The global mining pipeline and the monolingual data filtering pipelines are released and available in our stopes repository.
A helper script to perform backtranslation can be found in `examples/nllb/modeling/scripts/backtranslation/generate_backtranslations.sh`. It will take a corpus that’s been binarized using the stopes `prepare_data` pipeline and backtranslate all its shards. Please check the backtranslation README file for further guidance on how to run this helper script.
Data that has been backtranslated will then need to be extracted into a parallel corpus. The script `examples/nllb/modeling/scripts/backtranslation/extract_fairseq_bt.py` automates this task. Further information can be found in the README above.
Once backtranslated data has been extracted, it can be treated as any other bitext corpus. Please follow the instructions for data filtering and preparation below.
Data preparation is fully managed by the stopes pipelines. Specifically:
- Data filtering is performed using the stopes `filtering` pipeline. Please check the corresponding README file and example configuration for more details.
- Once filtered, data can then be preprocessed/binarized with the stopes `prepare_data` pipeline.
Encoding the datasets is done using the new SPM-200 model, which was trained on the 200+ languages used in the NLLB project. For more details see link.
SPM-200 Artifacts | download links |
---|---|
Model | link |
Dictionary | link |
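To inspect how the SPM-200 model segments text outside the stopes pipelines, a short sketch with the `sentencepiece` Python package may help. The model path is a placeholder for wherever you saved the model from the table above, and `load_spm` and `encode_lines` are hypothetical helper names. This is not the code the pipelines themselves run.

```python
def load_spm(model_path):
    """Load a SentencePiece model from disk (requires `pip install sentencepiece`)."""
    import sentencepiece as spm  # imported lazily so the helper below has no hard dependency

    return spm.SentencePieceProcessor(model_file=str(model_path))


def encode_lines(processor, lines):
    """Encode each line into a list of subword pieces.

    `processor` is any object with an `encode(text, out_type=str)` method,
    such as the SentencePieceProcessor returned by `load_spm`.
    """
    return [processor.encode(line.strip(), out_type=str) for line in lines]
```

Usage would be along the lines of `encode_lines(load_spm("path/to/spm200.model"), ["Dobro jutro!"])`, where the path is a placeholder for the downloaded model file.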