NLLB LASER3 and xsim release

facebookresearch · Jul 6, 2022 · 8d66f0e
1 parent f4689ad
commit 8d66f0e
Showing 12 changed files with 1,053 additions and 131 deletions.
diff --git a/README.md b/README.md
@@ -3,6 +3,8 @@
 LASER is a library to calculate and use multilingual sentence embeddings.
 
 **NEWS**
+* 2022/07/06 Updated LASER models with support for over 200 languages are [**now available**](nllb/README.md)
+* 2022/07/06 Multilingual similarity search (**xsim**) evaluation pipeline [**released**](tasks/xsim/README.md)
 * 2022/05/03 [**Librivox S2S is available**](tasks/librivox-s2s): Speech-to-Speech translations automatically mined in Librivox [9]
 * 2019/11/08 [**CCMatrix is available**](tasks/CCMatrix): Mining billions of high-quality parallel sentences on the WEB [8]
 * 2019/07/31 Gilles Bodard and Jérémy Rapin provided a [**Docker environment**](docker) to use LASER
@@ -11,24 +13,17 @@ LASER is a library to calculate and use multilingual sentence embeddings.
 * 2019/02/13 The code to perform bitext mining is [**now available**](tasks/bucc)
 
 **CURRENT VERSION:**
-* We now provide an encoder which was trained on [**93 languages**](#supported-languages), written in 23 different alphabets [6].
-  This includes all European languages, many Asian and Indian languages, Arabic, Persian, Hebrew, ...,
-  as well as various minority languages and dialects.
-* We provide a [*test set for more than 100 languages*](data/tatoeba/v1)
-  based on the [*Tatoeba corpus*](https://tatoeba.org/eng).
-* Switch to PyTorch 1.0
-
-All these languages are encoded by the same BiLSTM encoder, and there is no need
-to specify the input language (but tokenization is language specific).
+* We now provide updated LASER models which support over 200 languages. See the [NLLB README](nllb/README.md) for details, including how to download the models and run inference.
+
 According to our experience, the sentence encoder also supports code-switching, i.e.
 the same sentences can contain words in several different languages.
 
-We have also some evidence that the encoder can generalizes to other
+We have also some evidence that the encoder can generalize to other
 languages which have not been seen during training, but which are in
 a language family which is covered by other languages.
 
-A detailed description how the multilingual sentence embeddings are trained can
-be found in [6], together with an extensive experimental evaluation.
+A detailed description of how the multilingual sentence embeddings are trained can
+be found in [10], together with an experimental evaluation.
 
 ## Dependencies
 * Python 3.6
@@ -41,13 +36,15 @@ be found in [6], together with an extensive experimental evaluation.
 * [mecab 0.996](https://pypi.org/project/JapaneseTokenizer/), Japanese segmenter
 * tokenization from the Moses encoder (installed automatically)
 * [FastBPE](https://github.com/glample/fastBPE), fast C++ implementation of byte-pair encoding (installed automatically)
-* [Fairseq](https://github.com/pytorch/fairseq), sequence modeling toolkit (`pip install fairseq==0.10.2`)
+* [Fairseq](https://github.com/pytorch/fairseq), sequence modeling toolkit (`pip install fairseq==0.12.1`)
+* [tabulate](https://pypi.org/project/tabulate), pretty-print tabular data (`pip install tabulate`)
+* [pandas](https://pypi.org/project/pandas), data analysis toolkit (`pip install pandas`)
 * [Sentencepiece](https://github.com/google/sentencepiece), subword tokenization (installed automatically)
 
 ## Installation
 * set the environment variable 'LASER' to the root of the installation, e.g.
   `export LASER="${HOME}/projects/laser"`
-* download encoders from Amazon s3 by `bash ./install_models.sh`
+* download encoders from Amazon s3, e.g. by running `bash ./nllb/download_models.sh`
 * download third party software by `bash ./install_external_tools.sh`
 * download the data used in the example tasks (see description for each task)
 
@@ -76,7 +73,7 @@ LASER is BSD-licensed, as found in the [`LICENSE`](LICENSE) file in the root dir
 
 ## Supported languages
 
-Our model was trained on the following languages:
+The original LASER model was trained on the following languages:
 
 Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali,
 Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer,
@@ -94,6 +91,10 @@ We have also observed that the model seems to generalize well to other (minority
 Asturian, Egyptian Arabic, Faroese, Kashubian, North Moluccan Malay, Nynorsk Norwegian, Piedmontese, Sorbian, Swabian,
 Swiss German or Western Frisian.
 
+### LASER3
+
+Updated LASER models, referred to as *[LASER3](nllb/README.md)*, supplement the above list with support for 147 languages. The full list of supported languages is available [here](nllb/README.md#list-of-available-laser3-encoders).
+
 ## References
 
 [1] Holger Schwenk and Matthijs Douze,
@@ -130,3 +131,5 @@ Swiss German or Western Frisian.
 [9] Paul-Ambroise Duquenne, Hongyu Gong, Holger Schwenk,
     [*Multimodal and Multilingual Embeddings for Large-Scale Speech Mining,*](https://papers.nips.cc/paper/2021/hash/8466f9ace6a9acbe71f75762ffc890f1-Abstract.html), NeurIPS 2021, pages 15748-15761.
 
+[10] Kevin Heffernan, Onur Celebi, and Holger Schwenk,
+     [*Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages*](https://arxiv.org/abs/2205.12654)
diff --git a/nllb/README.md b/nllb/README.md
@@ -0,0 +1,182 @@
+![No Language Left Behind](nllb_laser3.png?raw=true "NLLB - LASER3")
+
+# LASER3 - No Language Left Behind
+
+As part of the project No Language Left Behind (NLLB) we have developed new LASER encoders, referred to here as LASER3. Each LASER3 encoder
+has a particular focus language which it supports, and the full list of available LASER3 encoders can be found at [the bottom of this README](#list-of-available-laser3-encoders). 
+
+We have also included an updated version of the original LASER encoder: LASER2. This improved model supports the same [languages](https://github.com/facebookresearch/LASER/#supported-languages) which LASER was trained on. For more details on how both the LASER2 and LASER3 encoders were trained, please see [Heffernan et al., 2022](https://arxiv.org/abs/2205.12654).
+
+## Downloading encoders
+
+To download the available encoders, please run the `download_models.sh` script within this directory. 
+```
+bash ./download_models.sh
+```
+LASER2 is always downloaded, and by default all LASER3 encoders are downloaded as well. Because the full set of LASER3 encoders can take up a lot of disk space, you can instead download individual LASER3 encoders by passing their language codes as arguments (see [full list](#list-of-available-laser3-encoders)).
+For example: `bash ./download_models.sh wol_Latn zul_Latn ...`
+
+By default, this download script places all models in the directory from which it is called.
+
+**Note**: LASER3 encoders for each focus language are in the format: `laser3-{language_code}`.
+
+## Embedding texts
+
+Once encoders are downloaded, you can then begin embedding texts by following the instructions [here](/tasks/embed/README.md).
+
+For example: `./LASER/tasks/embed/embed.sh [INFILE] [OUTFILE] wol_Latn`
+
+## List of available LASER3 encoders
+
+| Code | Language |
+|   :---:  |  :---:  |
+| ace_Latn | Acehnese (Latin script) |
+| aka_Latn | Akan |
+| als_Latn | Tosk Albanian |
+| amh_Ethi | Amharic |
+| asm_Beng | Assamese |
+| awa_Deva | Awadhi |
+| ayr_Latn | Central Aymara |
+| azb_Arab | South Azerbaijani |
+| azj_Latn | North Azerbaijani |
+| bak_Cyrl | Bashkir |
+| bam_Latn | Bambara |
+| ban_Latn | Balinese |
+| bel_Cyrl | Belarusian |
+| bem_Latn | Bemba |
+| ben_Beng | Bengali |
+| bho_Deva | Bhojpuri |
+| bjn_Latn | Banjar (Latin script) |
+| bod_Tibt | Standard Tibetan |
+| bug_Latn | Buginese |
+| ceb_Latn | Cebuano |
+| cjk_Latn | Chokwe |
+| ckb_Arab | Central Kurdish |
+| crh_Latn | Crimean Tatar |
+| cym_Latn | Welsh |
+| dik_Latn | Southwestern Dinka |
+| diq_Latn | Southern Zaza |
+| dyu_Latn | Dyula |
+| dzo_Tibt | Dzongkha |
+| ewe_Latn | Ewe |
+| fao_Latn | Faroese |
+| fij_Latn | Fijian |
+| fon_Latn | Fon |
+| fur_Latn | Friulian |
+| fuv_Latn | Nigerian Fulfulde |
+| gaz_Latn | West Central Oromo |
+| gla_Latn | Scottish Gaelic |
+| gle_Latn | Irish |
+| grn_Latn | Guarani |
+| guj_Gujr | Gujarati |
+| hat_Latn | Haitian Creole |
+| hau_Latn | Hausa |
+| hin_Deva | Hindi |
+| hne_Deva | Chhattisgarhi |
+| hye_Armn | Armenian |
+| ibo_Latn | Igbo |
+| ilo_Latn | Ilocano |
+| ind_Latn | Indonesian |
+| jav_Latn | Javanese |
+| kab_Latn | Kabyle |
+| kac_Latn | Jingpho |
+| kam_Latn | Kamba |
+| kan_Knda | Kannada |
+| kas_Arab | Kashmiri (Arabic script) |
+| kas_Deva | Kashmiri (Devanagari script) |
+| kat_Geor | Georgian |
+| kaz_Cyrl | Kazakh |
+| kbp_Latn | Kabiyè |
+| kea_Latn | Kabuverdianu |
+| khk_Cyrl | Halh Mongolian |
+| khm_Khmr | Khmer |
+| kik_Latn | Kikuyu |
+| kin_Latn | Kinyarwanda |
+| kir_Cyrl | Kyrgyz |
+| kmb_Latn | Kimbundu |
+| kmr_Latn | Northern Kurdish |
+| knc_Arab | Central Kanuri (Arabic script) |
+| knc_Latn | Central Kanuri (Latin script) |
+| kon_Latn | Kikongo |
+| lao_Laoo | Lao |
+| lij_Latn | Ligurian |
+| lim_Latn | Limburgish |
+| lin_Latn | Lingala |
+| lmo_Latn | Lombard |
+| ltg_Latn | Latgalian |
+| ltz_Latn | Luxembourgish |
+| lua_Latn | Luba-Kasai |
+| lug_Latn | Ganda |
+| luo_Latn | Luo |
+| lus_Latn | Mizo |
+| mag_Deva | Magahi |
+| mai_Deva | Maithili |
+| mal_Mlym | Malayalam |
+| mar_Deva | Marathi |
+| min_Latn | Minangkabau (Latin script) |
+| mlt_Latn | Maltese |
+| mni_Beng | Meitei (Bengali script) |
+| mos_Latn | Mossi |
+| mri_Latn | Maori |
+| mya_Mymr | Burmese |
+| npi_Deva | Nepali |
+| nso_Latn | Northern Sotho |
+| nus_Latn | Nuer |
+| nya_Latn | Nyanja |
+| ory_Orya | Odia |
+| pag_Latn | Pangasinan |
+| pan_Guru | Eastern Panjabi |
+| pap_Latn | Papiamento |
+| pbt_Arab | Southern Pashto |
+| pes_Arab | Western Persian |
+| plt_Latn | Plateau Malagasy |
+| prs_Arab | Dari |
+| quy_Latn | Ayacucho Quechua |
+| run_Latn | Rundi |
+| sag_Latn | Sango |
+| san_Deva | Sanskrit |
+| sat_Beng | Santali |
+| scn_Latn | Sicilian |
+| shn_Mymr | Shan |
+| sin_Sinh | Sinhala |
+| smo_Latn | Samoan |
+| sna_Latn | Shona |
+| snd_Arab | Sindhi |
+| som_Latn | Somali |
+| sot_Latn | Southern Sotho |
+| srd_Latn | Sardinian |
+| ssw_Latn | Swati |
+| sun_Latn | Sundanese |
+| swh_Latn | Swahili |
+| szl_Latn | Silesian |
+| tam_Taml | Tamil |
+| taq_Latn | Tamasheq (Latin script) |
+| tat_Cyrl | Tatar |
+| tel_Telu | Telugu |
+| tgk_Cyrl | Tajik |
+| tgl_Latn | Tagalog |
+| tha_Thai | Thai |
+| tir_Ethi | Tigrinya |
+| tpi_Latn | Tok Pisin |
+| tsn_Latn | Tswana |
+| tso_Latn | Tsonga |
+| tuk_Latn | Turkmen |
+| tum_Latn | Tumbuka |
+| tur_Latn | Turkish |
+| twi_Latn | Twi |
+| tzm_Tfng | Central Atlas Tamazight |
+| uig_Arab | Uyghur |
+| umb_Latn | Umbundu |
+| urd_Arab | Urdu |
+| uzn_Latn | Northern Uzbek |
+| vec_Latn | Venetian |
+| war_Latn | Waray |
+| wol_Latn | Wolof |
+| xho_Latn | Xhosa |
+| ydd_Hebr | Eastern Yiddish |
+| yor_Latn | Yoruba |
+| zsm_Latn | Standard Malay |
+| zul_Latn | Zulu |
+
+
+
diff --git a/nllb/download_models.sh b/nllb/download_models.sh
@@ -0,0 +1,89 @@
+#!/bin/bash
+# Copyright (c) Facebook, Inc. and its affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+#
+# LASER  Language-Agnostic SEntence Representations
+# is a toolkit to calculate multilingual sentence embeddings
+# and to use them for document classification, bitext filtering
+# and mining
+# 
+#-------------------------------------------------------
+#
+# This bash script installs NLLB LASER2 and LASER3 sentence encoders from Amazon s3
+
+# default to download to current directory
+mdir=$(pwd)
+
+echo "Directory for model download: ${mdir}"
+
+version=1  # model version
+
+echo "Downloading networks..."
+
+if [ ! -d ${mdir} ] ; then
+  echo " - creating model directory: ${mdir}"
+  mkdir -p ${mdir}
+fi
+
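+# Download a single file from $s3 (set below) into the current directory,
+# unless it already exists in ${mdir}.
+function download {
+    file=$1
+    if [ -f "${mdir}/${file}" ] ; then
+        echo " - ${mdir}/${file} already downloaded"
+    else
+        echo " - ${s3}/${file}"
+        wget -q "${s3}/${file}"
+    fi
+}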
+
+cd ${mdir}  # move to model directory
+
+# available encoders
+s3="https://dl.fbaipublicfiles.com/nllb/laser"
+
+# LASER2 (download by default)
+if [ ! -f ${mdir}/laser2.pt ] ; then
+    echo " - $s3/laser2.pt"
+    wget --trust-server-names -q https://tinyurl.com/nllblaser2
+else 
+    echo " - ${mdir}/laser2.pt already downloaded"
+fi
+download "laser2.spm"
+download "laser2.cvocab"
+
+# LASER3 models
+if [ ! $# -eq 0 ]; then
+    # chosen model subset from command line
+    langs=("$@")
+else
+    # all available LASER3 models
+    langs=(ace_Latn aka_Latn als_Latn amh_Ethi asm_Beng awa_Deva ayr_Latn azb_Arab azj_Latn bak_Cyrl bam_Latn ban_Latn bel_Cyrl \
+        bem_Latn ben_Beng bho_Deva bjn_Latn bod_Tibt bug_Latn ceb_Latn cjk_Latn ckb_Arab crh_Latn cym_Latn dik_Latn diq_Latn \
+        dyu_Latn dzo_Tibt ewe_Latn fao_Latn fij_Latn fon_Latn fur_Latn fuv_Latn gaz_Latn gla_Latn gle_Latn grn_Latn guj_Gujr \
+        hat_Latn hau_Latn hin_Deva hne_Deva hye_Armn ibo_Latn ilo_Latn ind_Latn jav_Latn kab_Latn kac_Latn kam_Latn kan_Knda \
+        kas_Arab kas_Deva kat_Geor kaz_Cyrl kbp_Latn kea_Latn khk_Cyrl khm_Khmr kik_Latn kin_Latn kir_Cyrl kmb_Latn kmr_Latn \
+        knc_Arab knc_Latn kon_Latn lao_Laoo lij_Latn lim_Latn lin_Latn lmo_Latn ltg_Latn ltz_Latn lua_Latn lug_Latn luo_Latn \
+        lus_Latn mag_Deva mai_Deva mal_Mlym mar_Deva min_Latn mlt_Latn mni_Beng mos_Latn mri_Latn mya_Mymr npi_Deva nso_Latn \
+        nus_Latn nya_Latn ory_Orya pag_Latn pan_Guru pap_Latn pbt_Arab pes_Arab plt_Latn prs_Arab quy_Latn run_Latn sag_Latn \
+        san_Deva sat_Beng scn_Latn shn_Mymr sin_Sinh smo_Latn sna_Latn snd_Arab som_Latn sot_Latn srd_Latn ssw_Latn sun_Latn \
+        swh_Latn szl_Latn tam_Taml taq_Latn tat_Cyrl tel_Telu tgk_Cyrl tgl_Latn tha_Thai tir_Ethi tpi_Latn tsn_Latn tso_Latn \
+        tuk_Latn tum_Latn tur_Latn twi_Latn tzm_Tfng uig_Arab umb_Latn urd_Arab uzn_Latn vec_Latn war_Latn wol_Latn xho_Latn \
+        ydd_Hebr yor_Latn zsm_Latn zul_Latn)
+fi
+
+spm_langs=(amh_Ethi ayr_Latn azj_Latn bak_Cyrl bel_Cyrl bod_Tibt ckb_Arab crh_Latn dik_Latn dzo_Tibt fur_Latn \
+           fuv_Latn grn_Latn kab_Latn kac_Latn kaz_Cyrl kir_Cyrl kmr_Latn lij_Latn lim_Latn lmo_Latn ltg_Latn \
+           mya_Mymr pbt_Arab pes_Arab prs_Arab sat_Beng scn_Latn srd_Latn szl_Latn taq_Latn tgk_Cyrl tir_Ethi \
+           tzm_Tfng vec_Latn)
+
+for lang in ${langs[@]}; do
+    download "laser3-$lang.v$version.pt";
+    for spm_lang in ${spm_langs[@]}; do
+        if [[ $lang == $spm_lang ]] ; then
+            download "laser3-$lang.v$version.spm";
+            download "laser3-$lang.v$version.cvocab";
+        fi 
+    done
+done
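
A note on the loop above: every requested language gets a `.pt` checkpoint, but only the languages in `spm_langs` also get their own `.spm` and `.cvocab` files. The following is a minimal sketch, not part of this commit, that prints the file names the script would request for a given set of language codes without touching the network. The helper name `list_laser3_files` is hypothetical; the `version` value and the `spm_langs` list are copied from the script.

```shell
#!/bin/bash
# Sketch only: list the files download_models.sh would request per language.
# `list_laser3_files` is a hypothetical helper, not part of the commit.

version=1  # same model version as in download_models.sh

# Languages with their own SentencePiece model and vocabulary
# (copied from spm_langs in download_models.sh).
spm_langs=(amh_Ethi ayr_Latn azj_Latn bak_Cyrl bel_Cyrl bod_Tibt ckb_Arab crh_Latn dik_Latn dzo_Tibt fur_Latn \
           fuv_Latn grn_Latn kab_Latn kac_Latn kaz_Cyrl kir_Cyrl kmr_Latn lij_Latn lim_Latn lmo_Latn ltg_Latn \
           mya_Mymr pbt_Arab pes_Arab prs_Arab sat_Beng scn_Latn srd_Latn szl_Latn taq_Latn tgk_Cyrl tir_Ethi \
           tzm_Tfng vec_Latn)

list_laser3_files() {
    local lang spm_lang
    for lang in "$@"; do
        # Every LASER3 language has an encoder checkpoint.
        echo "laser3-${lang}.v${version}.pt"
        for spm_lang in "${spm_langs[@]}"; do
            if [[ "$lang" == "$spm_lang" ]]; then
                # Only some languages have a dedicated SPM model and vocabulary.
                echo "laser3-${lang}.v${version}.spm"
                echo "laser3-${lang}.v${version}.cvocab"
            fi
        done
    done
}

list_laser3_files wol_Latn amh_Ethi
```

Running this for `wol_Latn` and `amh_Ethi` lists one file for the former and three for the latter, which can help estimate what a partial download will place in the model directory.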
diff --git a/nllb/nllb_laser3.png b/nllb/nllb_laser3.png