1 parent f4689ad, commit 8d66f0e
Showing 12 changed files with 1,053 additions and 131 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,182 @@ | ||
![No Language Left Behind](nllb_laser3.png?raw=true "NLLB - LASER3") | ||
|
||
# LASER3 - No Language Left Behind | ||
|
||
As part of the No Language Left Behind (NLLB) project, we have developed new LASER encoders, referred to here as LASER3. Each LASER3 encoder | ||
supports one particular focus language; the full list of available LASER3 encoders can be found at [the bottom of this README](#list-of-available-laser3-encoders). | ||
|
||
We have also included an updated version of the original LASER encoder: LASER2. This improved model supports the same [languages](https://github.com/facebookresearch/LASER/#supported-languages) that LASER was trained on. For details on how both the LASER2 and LASER3 encoders were trained, please see [Heffernan et al., 2022](https://arxiv.org/abs/2205.12654). | ||
|
||
## Downloading encoders | ||
|
||
To download the available encoders, please run the `download_models.sh` script within this directory. | ||
``` | ||
bash ./download_models.sh | ||
``` | ||
LASER2 and all LASER3 encoders are downloaded by default. Because the full set of LASER3 encoders may take up a lot of disk space, you can instead download individual LASER3 encoders by passing their language codes as arguments (see [full list](#list-of-available-laser3-encoders)). | ||
For example: `bash ./download_models.sh wol_Latn zul_Latn ...` | ||
|
||
By default, this download script places all downloaded models in the directory from which it is called. | ||
|
||
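Since the script saves into the directory it is called from, one way to keep the models in a separate location is to call it from there. The following is a minimal sketch of that idea, not part of the repository: `download_subset`, the target directory argument, and the `DRY_RUN` variable are hypothetical names introduced for illustration.

```bash
#!/bin/bash
# Sketch: download a subset of LASER3 encoders into a chosen directory.
# download_subset, target_dir and DRY_RUN are hypothetical names for this example.

download_subset() {
    local script=$1 target_dir=$2
    shift 2
    # Resolve the script path before changing directory.
    local script_abs
    script_abs="$(cd "$(dirname "$script")" && pwd)/$(basename "$script")"
    mkdir -p "$target_dir"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only show what would be run.
        echo "cd $target_dir && bash $script_abs $*"
    else
        # The script saves models into the current directory,
        # so it is run from inside target_dir.
        (cd "$target_dir" && bash "$script_abs" "$@")
    fi
}

# Example call (assumed paths):
#   download_subset ./download_models.sh "$HOME/laser_models" wol_Latn zul_Latn
```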
**Note**: LASER3 encoders for each focus language are named in the format `laser3-{language_code}`, followed by a version suffix and file extension (for example, `laser3-wol_Latn.v1.pt`). | ||
|
||
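To make the naming convention in the note above concrete, the sketch below lists the files that `download_models.sh` fetches for a given language code. The `expected_files` helper is hypothetical; the file name pattern, the model version (1), and the list of languages that come with their own `.spm` and `.cvocab` files are copied from the download script, which fetches only the `.pt` file for all other languages.

```bash
#!/bin/bash
# Sketch: list the files expected for one LASER3 encoder.
# expected_files is a hypothetical helper; version and spm_langs
# mirror the values used in download_models.sh.

version=1

spm_langs=(amh_Ethi ayr_Latn azj_Latn bak_Cyrl bel_Cyrl bod_Tibt ckb_Arab
           crh_Latn dik_Latn dzo_Tibt fur_Latn fuv_Latn grn_Latn kab_Latn
           kac_Latn kaz_Cyrl kir_Cyrl kmr_Latn lij_Latn lim_Latn lmo_Latn
           ltg_Latn mya_Mymr pbt_Arab pes_Arab prs_Arab sat_Beng scn_Latn
           srd_Latn szl_Latn taq_Latn tgk_Cyrl tir_Ethi tzm_Tfng vec_Latn)

expected_files() {
    local lang=$1 spm_lang
    # Every LASER3 encoder has a model checkpoint.
    echo "laser3-$lang.v$version.pt"
    for spm_lang in "${spm_langs[@]}"; do
        if [[ "$lang" == "$spm_lang" ]]; then
            # Languages in spm_langs also have their own vocabulary files.
            echo "laser3-$lang.v$version.spm"
            echo "laser3-$lang.v$version.cvocab"
            break
        fi
    done
}

# Example calls:
#   expected_files wol_Latn
#   expected_files amh_Ethi
```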
## Embedding texts | ||
|
||
Once the encoders are downloaded, you can embed texts by following the instructions [here](/tasks/embed/README.md). | ||
|
||
For example: `./LASER/tasks/embed/embed.sh [INFILE] [OUTFILE] wol_Latn` | ||
|
||
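When several input files share a language, the command above can be wrapped in a loop. This is a rough sketch under assumptions: the `embed_all` function, the `EMBED_SCRIPT` variable, and the `.emb` output suffix are hypothetical and chosen only for illustration; the argument order (input file, output file, language code) follows the example above.

```bash
#!/bin/bash
# Sketch: embed several text files with one LASER3 encoder.
# embed_all, EMBED_SCRIPT and the .emb suffix are hypothetical names.

embed_all() {
    local lang=$1
    shift
    local embed_script=${EMBED_SCRIPT:-./LASER/tasks/embed/embed.sh}
    local infile
    for infile in "$@"; do
        if [ ! -f "$infile" ]; then
            echo "skipping missing file: $infile" >&2
            continue
        fi
        # Same argument order as the example above: INFILE OUTFILE LANGUAGE.
        "$embed_script" "$infile" "$infile.emb" "$lang"
    done
}

# Example call (assumed file names):
#   embed_all wol_Latn train.txt dev.txt
```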
## List of available LASER3 encoders | ||
|
||
| Code | Language | | ||
| :---: | :---: | | ||
| ace_Latn | Acehnese (Latin script) | | ||
| aka_Latn | Akan | | ||
| als_Latn | Tosk Albanian | | ||
| amh_Ethi | Amharic | | ||
| asm_Beng | Assamese | | ||
| awa_Deva | Awadhi | | ||
| ayr_Latn | Central Aymara | | ||
| azb_Arab | South Azerbaijani | | ||
| azj_Latn | North Azerbaijani | | ||
| bak_Cyrl | Bashkir | | ||
| bam_Latn | Bambara | | ||
| ban_Latn | Balinese | | ||
| bel_Cyrl | Belarusian | | ||
| bem_Latn | Bemba | | ||
| ben_Beng | Bengali | | ||
| bho_Deva | Bhojpuri | | ||
| bjn_Latn | Banjar (Latin script) | | ||
| bod_Tibt | Standard Tibetan | | ||
| bug_Latn | Buginese | | ||
| ceb_Latn | Cebuano | | ||
| cjk_Latn | Chokwe | | ||
| ckb_Arab | Central Kurdish | | ||
| crh_Latn | Crimean Tatar | | ||
| cym_Latn | Welsh | | ||
| dik_Latn | Southwestern Dinka | | ||
| diq_Latn | Southern Zaza | | ||
| dyu_Latn | Dyula | | ||
| dzo_Tibt | Dzongkha | | ||
| ewe_Latn | Ewe | | ||
| fao_Latn | Faroese | | ||
| fij_Latn | Fijian | | ||
| fon_Latn | Fon | | ||
| fur_Latn | Friulian | | ||
| fuv_Latn | Nigerian Fulfulde | | ||
| gaz_Latn | West Central Oromo | | ||
| gla_Latn | Scottish Gaelic | | ||
| gle_Latn | Irish | | ||
| grn_Latn | Guarani | | ||
| guj_Gujr | Gujarati | | ||
| hat_Latn | Haitian Creole | | ||
| hau_Latn | Hausa | | ||
| hin_Deva | Hindi | | ||
| hne_Deva | Chhattisgarhi | | ||
| hye_Armn | Armenian | | ||
| ibo_Latn | Igbo | | ||
| ilo_Latn | Ilocano | | ||
| ind_Latn | Indonesian | | ||
| jav_Latn | Javanese | | ||
| kab_Latn | Kabyle | | ||
| kac_Latn | Jingpho | | ||
| kam_Latn | Kamba | | ||
| kan_Knda | Kannada | | ||
| kas_Arab | Kashmiri (Arabic script) | | ||
| kas_Deva | Kashmiri (Devanagari script) | | ||
| kat_Geor | Georgian | | ||
| kaz_Cyrl | Kazakh | | ||
| kbp_Latn | Kabiyè | | ||
| kea_Latn | Kabuverdianu | | ||
| khk_Cyrl | Halh Mongolian | | ||
| khm_Khmr | Khmer | | ||
| kik_Latn | Kikuyu | | ||
| kin_Latn | Kinyarwanda | | ||
| kir_Cyrl | Kyrgyz | | ||
| kmb_Latn | Kimbundu | | ||
| kmr_Latn | Northern Kurdish | | ||
| knc_Arab | Central Kanuri (Arabic script) | | ||
| knc_Latn | Central Kanuri (Latin script) | | ||
| kon_Latn | Kikongo | | ||
| lao_Laoo | Lao | | ||
| lij_Latn | Ligurian | | ||
| lim_Latn | Limburgish | | ||
| lin_Latn | Lingala | | ||
| lmo_Latn | Lombard | | ||
| ltg_Latn | Latgalian | | ||
| ltz_Latn | Luxembourgish | | ||
| lua_Latn | Luba-Kasai | | ||
| lug_Latn | Ganda | | ||
| luo_Latn | Luo | | ||
| lus_Latn | Mizo | | ||
| mag_Deva | Magahi | | ||
| mai_Deva | Maithili | | ||
| mal_Mlym | Malayalam | | ||
| mar_Deva | Marathi | | ||
| min_Latn | Minangkabau (Latin script) | | ||
| mlt_Latn | Maltese | | ||
| mni_Beng | Meitei (Bengali script) | | ||
| mos_Latn | Mossi | | ||
| mri_Latn | Maori | | ||
| mya_Mymr | Burmese | | ||
| npi_Deva | Nepali | | ||
| nso_Latn | Northern Sotho | | ||
| nus_Latn | Nuer | | ||
| nya_Latn | Nyanja | | ||
| ory_Orya | Odia | | ||
| pag_Latn | Pangasinan | | ||
| pan_Guru | Eastern Panjabi | | ||
| pap_Latn | Papiamento | | ||
| pbt_Arab | Southern Pashto | | ||
| pes_Arab | Western Persian | | ||
| plt_Latn | Plateau Malagasy | | ||
| prs_Arab | Dari | | ||
| quy_Latn | Ayacucho Quechua | | ||
| run_Latn | Rundi | | ||
| sag_Latn | Sango | | ||
| san_Deva | Sanskrit | | ||
| sat_Beng | Santali | | ||
| scn_Latn | Sicilian | | ||
| shn_Mymr | Shan | | ||
| sin_Sinh | Sinhala | | ||
| smo_Latn | Samoan | | ||
| sna_Latn | Shona | | ||
| snd_Arab | Sindhi | | ||
| som_Latn | Somali | | ||
| sot_Latn | Southern Sotho | | ||
| srd_Latn | Sardinian | | ||
| ssw_Latn | Swati | | ||
| sun_Latn | Sundanese | | ||
| swh_Latn | Swahili | | ||
| szl_Latn | Silesian | | ||
| tam_Taml | Tamil | | ||
| taq_Latn | Tamasheq (Latin script) | | ||
| tat_Cyrl | Tatar | | ||
| tel_Telu | Telugu | | ||
| tgk_Cyrl | Tajik | | ||
| tgl_Latn | Tagalog | | ||
| tha_Thai | Thai | | ||
| tir_Ethi | Tigrinya | | ||
| tpi_Latn | Tok Pisin | | ||
| tsn_Latn | Tswana | | ||
| tso_Latn | Tsonga | | ||
| tuk_Latn | Turkmen | | ||
| tum_Latn | Tumbuka | | ||
| tur_Latn | Turkish | | ||
| twi_Latn | Twi | | ||
| tzm_Tfng | Central Atlas Tamazight | | ||
| uig_Arab | Uyghur | | ||
| umb_Latn | Umbundu | | ||
| urd_Arab | Urdu | | ||
| uzn_Latn | Northern Uzbek | | ||
| vec_Latn | Venetian | | ||
| war_Latn | Waray | | ||
| wol_Latn | Wolof | | ||
| xho_Latn | Xhosa | | ||
| ydd_Hebr | Eastern Yiddish | | ||
| yor_Latn | Yoruba | | ||
| zsm_Latn | Standard Malay | | ||
| zul_Latn | Zulu | | ||
|
||
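Every code in the table has the same shape: a three-letter lowercase language code, an underscore, and a four-letter script code with an initial capital. The sketch below checks that shape before a code is passed to the download script. The helper names are hypothetical, and the check covers only the format, not whether an encoder actually exists for the code.

```bash
#!/bin/bash
# Sketch: check the shape of a language code such as wol_Latn.
# is_wellformed_code and split_code are hypothetical helpers.

# Three lowercase letters, underscore, one uppercase and three lowercase letters.
code_shape='^[[:lower:]]{3}_[[:upper:]][[:lower:]]{3}$'

is_wellformed_code() {
    [[ $1 =~ $code_shape ]]
}

split_code() {
    # Print the language part and the script part separated by a space.
    echo "${1%%_*} ${1##*_}"
}

# Example calls:
#   is_wellformed_code kas_Deva && split_code kas_Deva
```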
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,89 @@ | ||
#!/bin/bash | ||
# Copyright (c) Facebook, Inc. and its affiliates. | ||
# All rights reserved. | ||
# | ||
# This source code is licensed under the BSD-style license found in the | ||
# LICENSE file in the root directory of this source tree. | ||
# | ||
# LASER Language-Agnostic SEntence Representations | ||
# is a toolkit to calculate multilingual sentence embeddings | ||
# and to use them for document classification, bitext filtering | ||
# and mining | ||
# | ||
#------------------------------------------------------- | ||
# | ||
# This bash script downloads the NLLB LASER2 and LASER3 sentence encoders from Amazon S3 | ||
|
||
# default to download to current directory | ||
mdir=$(pwd) | ||
|
||
echo "Directory for model download: ${mdir}" | ||
|
||
version=1 # model version | ||
|
||
echo "Downloading networks..." | ||
|
||
if [ ! -d "${mdir}" ] ; then | ||
  echo " - creating model directory: ${mdir}" | ||
  mkdir -p "${mdir}" | ||
fi | ||
|
||
# download one file from $s3 unless it already exists in the model directory | ||
function download { | ||
  local file=$1 | ||
  if [ -f "${mdir}/${file}" ] ; then | ||
    echo " - ${mdir}/${file} already downloaded" | ||
  else | ||
    echo " - ${s3}/${file}" | ||
    wget -q "${s3}/${file}" | ||
  fi | ||
} | ||
|
||
cd "${mdir}" # move to model directory | ||
|
||
# available encoders | ||
s3="https://dl.fbaipublicfiles.com/nllb/laser" | ||
|
||
# LASER2 (download by default) | ||
if [ ! -f "${mdir}/laser2.pt" ] ; then | ||
  echo " - $s3/laser2.pt" | ||
  wget --trust-server-names -q https://tinyurl.com/nllblaser2 | ||
else | ||
  echo " - ${mdir}/laser2.pt already downloaded" | ||
fi | ||
download "laser2.spm" | ||
download "laser2.cvocab" | ||
|
||
# LASER3 models | ||
if [ $# -ne 0 ]; then | ||
  # chosen model subset from command line, stored as an array | ||
  langs=("$@") | ||
else | ||
# all available LASER3 models | ||
langs=(ace_Latn aka_Latn als_Latn amh_Ethi asm_Beng awa_Deva ayr_Latn azb_Arab azj_Latn bak_Cyrl bam_Latn ban_Latn bel_Cyrl \ | ||
bem_Latn ben_Beng bho_Deva bjn_Latn bod_Tibt bug_Latn ceb_Latn cjk_Latn ckb_Arab crh_Latn cym_Latn dik_Latn diq_Latn \ | ||
dyu_Latn dzo_Tibt ewe_Latn fao_Latn fij_Latn fon_Latn fur_Latn fuv_Latn gaz_Latn gla_Latn gle_Latn grn_Latn guj_Gujr \ | ||
hat_Latn hau_Latn hin_Deva hne_Deva hye_Armn ibo_Latn ilo_Latn ind_Latn jav_Latn kab_Latn kac_Latn kam_Latn kan_Knda \ | ||
kas_Arab kas_Deva kat_Geor kaz_Cyrl kbp_Latn kea_Latn khk_Cyrl khm_Khmr kik_Latn kin_Latn kir_Cyrl kmb_Latn kmr_Latn \ | ||
knc_Arab knc_Latn kon_Latn lao_Laoo lij_Latn lim_Latn lin_Latn lmo_Latn ltg_Latn ltz_Latn lua_Latn lug_Latn luo_Latn \ | ||
lus_Latn mag_Deva mai_Deva mal_Mlym mar_Deva min_Latn mlt_Latn mni_Beng mos_Latn mri_Latn mya_Mymr npi_Deva nso_Latn \ | ||
nus_Latn nya_Latn ory_Orya pag_Latn pan_Guru pap_Latn pbt_Arab pes_Arab plt_Latn prs_Arab quy_Latn run_Latn sag_Latn \ | ||
san_Deva sat_Beng scn_Latn shn_Mymr sin_Sinh smo_Latn sna_Latn snd_Arab som_Latn sot_Latn srd_Latn ssw_Latn sun_Latn \ | ||
swh_Latn szl_Latn tam_Taml taq_Latn tat_Cyrl tel_Telu tgk_Cyrl tgl_Latn tha_Thai tir_Ethi tpi_Latn tsn_Latn tso_Latn \ | ||
tuk_Latn tum_Latn tur_Latn twi_Latn tzm_Tfng uig_Arab umb_Latn urd_Arab uzn_Latn vec_Latn war_Latn wol_Latn xho_Latn \ | ||
ydd_Hebr yor_Latn zsm_Latn zul_Latn) | ||
fi | ||
|
||
spm_langs=(amh_Ethi ayr_Latn azj_Latn bak_Cyrl bel_Cyrl bod_Tibt ckb_Arab crh_Latn dik_Latn dzo_Tibt fur_Latn \ | ||
fuv_Latn grn_Latn kab_Latn kac_Latn kaz_Cyrl kir_Cyrl kmr_Latn lij_Latn lim_Latn lmo_Latn ltg_Latn \ | ||
mya_Mymr pbt_Arab pes_Arab prs_Arab sat_Beng scn_Latn srd_Latn szl_Latn taq_Latn tgk_Cyrl tir_Ethi \ | ||
tzm_Tfng vec_Latn) | ||
|
||
for lang in ${langs[@]}; do | ||
  download "laser3-$lang.v$version.pt" | ||
  # some languages also have their own spm and cvocab files | ||
  for spm_lang in "${spm_langs[@]}"; do | ||
    if [[ "$lang" == "$spm_lang" ]] ; then | ||
      download "laser3-$lang.v$version.spm" | ||
      download "laser3-$lang.v$version.cvocab" | ||
      break | ||
    fi | ||
  done | ||
done |