babylonhealth/MultilingualFactorAnalysis

Note: This repository is no longer actively maintained by Babylon Health. For further assistance, please reach out to the paper authors.

Accepted submission <1022> to the ACL conference (a fork of the MUSE repository to reproduce the experiments in the submission).

This is a fork of the MUSE repository with changes to reproduce the experiments of Multilingual Factor Analysis, presented at ACL 2019. Results on both the MUSE and Dinu et al. (2014) datasets can be reproduced with this repository.

The implementation of the methods in the paper (IBFA and MBFA) is in alignment_functions.py, and the function that builds the dictionary of triples is at the bottom of utils.py; everything else serves to reproduce the experimental results.
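As a toy illustration only (the actual models live in alignment_functions.py): the maximum-likelihood solution of inter-battery factor analysis is closely related to CCA, which can be sketched in a few lines of NumPy on synthetic data generated from a single shared latent factor:

```python
import numpy as np

def cca_directions(X, Y, k=1, reg=1e-6):
    """Top-k canonical directions via SVD of the whitened cross-covariance."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n

    def inv_sqrt(C):  # inverse matrix square root of an SPD matrix
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    return Wx @ U[:, :k], Wy @ Vt.T[:, :k], s[:k]

# Two "views" generated from one shared latent factor, as in factor analysis
rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 1))
X = z @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(1000, 5))
Y = z @ rng.normal(size=(1, 4)) + 0.1 * rng.normal(size=(1000, 4))
A, B, corrs = cca_directions(X, Y)
# corrs[0] is close to 1: the two views share one strong common factor
```

This is a sketch of the underlying idea, not the paper's estimator; refer to alignment_functions.py for the actual IBFA and MBFA implementations.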

Dependencies

  • Python 3 with NumPy and SciPy
  • Pandas and scikit-learn
  • PyTorch
  • Faiss (recommended) for fast nearest neighbor search (CPU or GPU).

Faiss is optional for GPU users (though Faiss-GPU greatly speeds up nearest neighbor search) and highly recommended for CPU users.
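To make the optional dependency concrete, here is a hedged sketch of what such a fallback typically looks like (not code from this repository), assuming unit-normalized vectors so that inner product equals cosine similarity:

```python
import numpy as np

def nearest_neighbors(queries, database, k=5):
    """Exact top-k by inner product; uses Faiss when available, NumPy otherwise."""
    try:
        import faiss  # much faster for large vocabularies
        index = faiss.IndexFlatIP(database.shape[1])
        index.add(database.astype(np.float32))
        _, ids = index.search(queries.astype(np.float32), k)
        return ids
    except ImportError:
        scores = queries @ database.T
        return np.argsort(-scores, axis=1)[:, :k]

rng = np.random.default_rng(0)
db = rng.normal(size=(100, 16)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)
ids = nearest_neighbors(db[:3], db, k=1)
print(ids.ravel().tolist())  # each vector is its own nearest neighbor: [0, 1, 2]
```

Both branches compute the same exact result here; Faiss also offers approximate indexes that trade accuracy for speed on large vocabularies.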

The changes for the ACL submission require Python 3; we recommend Python 3.6 with a virtual environment. After cloning the repository, the following steps should be enough to set up the dependencies:

  1. Make the environment: python -m venv env
  2. Activate it: source env/bin/activate
  3. Install required modules: pip install -r requirements.txt
  4. Download evaluation datasets and all relevant monolingual word embeddings (refer to the original MUSE instructions below)
  5. Download the Dinu et al. (2014) embeddings and dictionary from https://zenodo.org/record/2654864 (If the link does not work, refer to https://arxiv.org/pdf/1412.6568.pdf instead).

After installing the prerequisites and downloading the relevant embeddings and dictionaries, run:

python supervised_multiview.py --cuda 0 --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec 

to reproduce the IBFA row (for en-es) in Table 1.

Run

python multilingual_alignment_experiments.py 

to reproduce the en-it IBFA row in Table 2. The --swap and --expert arguments will allow you to reproduce all results for IBFA in Tables 2 and 3.

Run

python create_embeddings.py

to create aligned en-es embeddings, which can then be used to reproduce the corresponding rows in Tables 4 and 5. This will also generate a samples_en_es.txt file that contains the sampled pairs of words in Table 8.

Run

python supervised_multiview.py --cuda 0 --src_lang en --tgt_lang it --src_emb data/wiki.en.vec --tgt_emb data/wiki.it.vec 

to reproduce the IBFA results in Table 6.

Run

python supervised_multiview.py --cuda 0 --src_lang en --tgt_lang fr --aux_lang it --src_emb data/wiki.en.vec --tgt_emb data/wiki.fr.vec --aux_emb data/wiki.it.vec --fitting_method em

to reproduce the MBFA results in Table 7.

Everything below this line is from the original MUSE repository

Get evaluation datasets

To download monolingual and cross-lingual word embeddings evaluation datasets:

  • Our 110 bilingual dictionaries
  • 28 monolingual word similarity tasks for 6 languages, and the English word analogy task
  • Cross-lingual word similarity tasks from SemEval2017
  • Sentence translation retrieval with Europarl corpora

You can simply run:

cd data/
wget https://dl.fbaipublicfiles.com/arrival/vectors.tar.gz
wget https://dl.fbaipublicfiles.com/arrival/wordsim.tar.gz
wget https://dl.fbaipublicfiles.com/arrival/dictionaries.tar.gz

Alternatively, you can also download the data with:

cd data/
./get_evaluation.sh

Note: requires bash 4. The download of Europarl is disabled by default (it is slow); you can enable it in get_evaluation.sh.

Get monolingual word embeddings

For pre-trained monolingual word embeddings, we highly recommend fastText Wikipedia embeddings, or using fastText to train your own word embeddings from your corpus.

You can download the English (en) and Spanish (es) embeddings this way:

# English fastText Wikipedia embeddings
curl -Lo data/wiki.en.vec https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec
# Spanish fastText Wikipedia embeddings
curl -Lo data/wiki.es.vec https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.es.vec

Align monolingual word embeddings

This project includes two ways to obtain cross-lingual word embeddings:

  • Supervised: using a training bilingual dictionary (or identical character strings as anchor points), learn a mapping from the source to the target space using (iterative) Procrustes alignment.
  • Unsupervised: without any parallel data or anchor point, learn a mapping from the source to the target space using adversarial training and (iterative) Procrustes refinement.

For more details on these approaches, please refer to the MUSE paper, Word Translation Without Parallel Data [1].

The supervised way: iterative Procrustes (CPU|GPU)

To learn a mapping between the source and the target space, simply run:

python supervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5 --dico_train default

By default, dico_train will point to our ground-truth dictionaries (downloaded above); when set to "identical_char" it will use identical character strings between source and target languages to form a vocabulary. Logs and embeddings will be saved in the dumped/ directory.
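For reference, the Procrustes step has a closed-form solution: with paired row matrices X (source) and Y (target), the orthogonal W minimizing ||X W - Y||_F is W = U V^T from the SVD of X^T Y. A minimal NumPy sketch (the repository's implementation differs in details such as refinement iterations and dictionary induction):

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal W minimizing ||X @ W - Y||_F (Schoenemann, 1966)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # a random orthogonal "true" map
X = rng.normal(size=(50, 8))                  # toy source embeddings
W = procrustes(X, X @ Q)                      # align source to target X @ Q
print(np.allclose(W, Q))  # recovers the true map: True
```

The iterative variant alternates this closed-form solve with re-inducing a better dictionary from the current alignment.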

The unsupervised way: adversarial training and refinement (CPU|GPU)

To learn a mapping using adversarial training and iterative Procrustes refinement, run:

python unsupervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5

By default, the validation metric is the mean cosine of word pairs from a synthetic dictionary built with CSLS (cross-domain similarity local scaling). For some language pairs (e.g. en-zh), we recommend centering the embeddings using --normalize_embeddings center.
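For reference, CSLS reduces hubness by penalizing each word with its average similarity to its K nearest neighbors in the other language (K = 10 in the MUSE paper). A small NumPy sketch, assuming rows are unit-normalized embeddings already mapped to a common space:

```python
import numpy as np

def csls_scores(src, tgt, k=10):
    """CSLS(x, y) = 2*cos(x, y) - r_tgt(x) - r_src(y), where r_* is the mean
    cosine similarity to the k nearest neighbors in the other space."""
    sims = src @ tgt.T                                   # cosines (unit rows)
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)   # per source word
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)   # per target word
    return 2 * sims - r_src[:, None] - r_tgt[None, :]

rng = np.random.default_rng(0)
src = rng.normal(size=(40, 16)); src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = rng.normal(size=(40, 16)); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
scores = csls_scores(src, tgt)
translations = scores.argmax(axis=1)  # CSLS-based translation candidates
```

This brute-force version is O(n^2) in vocabulary size; the repository uses Faiss to make the neighbor searches tractable at scale.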

Evaluate monolingual or cross-lingual embeddings (CPU|GPU)

We also include a simple script to evaluate the quality of monolingual or cross-lingual word embeddings on several tasks:

Monolingual

python evaluate.py --src_lang en --src_emb data/wiki.en.vec --max_vocab 200000

Cross-lingual

python evaluate.py --src_lang en --tgt_lang es --src_emb data/wiki.en-es.en.vec --tgt_emb data/wiki.en-es.es.vec --max_vocab 200000

Word embedding format

By default, the aligned embeddings are exported to a text format at the end of experiments: --export txt. Exporting embeddings to a text file can take a while if you have a lot of embeddings. For a very fast export, you can set --export pth to export the embeddings in a PyTorch binary file, or simply disable the export (--export "").

When loading embeddings, the model can load:

  • PyTorch binary files previously generated by MUSE (.pth files)
  • fastText binary files previously generated by fastText (.bin files)
  • text files (text file with one word embedding per line)

The first two options are very fast and can load 1 million embeddings in a few seconds, while loading text files can take a while.
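For reference, the text format starts with a header line "vocab_size dimension", followed by one word and its space-separated vector per line. A minimal illustrative reader (read_vec is a hypothetical helper, not part of this repository):

```python
import numpy as np

def read_vec(path, max_vocab=None):
    """Parse the word2vec/fastText .vec text format."""
    words, vectors = [], []
    with open(path, encoding="utf-8") as f:
        count, dim = map(int, f.readline().split())  # header: "count dim"
        for i, line in enumerate(f):
            if max_vocab is not None and i >= max_vocab:
                break
            word, vec = line.rstrip().split(" ", 1)
            words.append(word)
            vectors.append(np.array(vec.split(), dtype=np.float32))
    return words, np.vstack(vectors)

# Round-trip a tiny toy file
import os, tempfile
path = os.path.join(tempfile.mkdtemp(), "toy.vec")
with open(path, "w", encoding="utf-8") as f:
    f.write("2 3\nhola 0.1 0.2 0.3\nmundo 0.4 0.5 0.6\n")
words, emb = read_vec(path)
print(words, emb.shape)  # ['hola', 'mundo'] (2, 3)
```

Loading line by line like this is what makes the text format slow for large vocabularies, which is why the binary .pth and .bin options are preferable.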

Download

We provide multilingual embeddings and ground-truth bilingual dictionaries. These embeddings are fastText embeddings that have been aligned in a common space.

Multilingual word Embeddings

We release fastText Wikipedia supervised word embeddings for 30 languages, aligned in a single vector space.

Arabic, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Macedonian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian, and Vietnamese.

You can visualize crosslingual nearest neighbors using demo.ipynb.

Ground-truth bilingual dictionaries

We created 110 large-scale ground-truth bilingual dictionaries using an internal translation tool. The dictionaries handle the polysemy of words well. We provide a train and test split of 5000 and 1500 unique source words, as well as a larger set of up to 100k pairs. Our goal is to ease the development and evaluation of cross-lingual word embeddings and multilingual NLP.
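For illustration (load_dictionary is a hypothetical helper, not part of the release), each dictionary file contains one whitespace-separated source/target pair per line, so a polysemous source word can map to several translations:

```python
from collections import defaultdict

def load_dictionary(lines):
    """Group a MUSE-style bilingual dictionary (one 'src tgt' pair per line)
    into source word -> list of translations, preserving polysemy."""
    translations = defaultdict(list)
    for line in lines:
        src, tgt = line.split()
        translations[src].append(tgt)
    return dict(translations)

sample = ["cat gato", "cat gata", "dog perro"]
print(load_dictionary(sample))  # {'cat': ['gato', 'gata'], 'dog': ['perro']}
```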

European languages in every direction

Full, train, and test dictionaries are available for every ordered pair of German, English, Spanish, French, Italian, and Portuguese.

Other languages to English (e.g. {fr,es}-en)

Full, train, and test dictionaries to English are available for: Afrikaans, Albanian, Arabic, Bengali, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malay, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Tamil, Thai, Turkish, Ukrainian, and Vietnamese.

English to other languages (e.g. en-{fr,es})

Full, train, and test dictionaries from English to each of the 45 languages listed above are also available.

References

Please cite [1] if you found the resources in this repository useful.

Word Translation Without Parallel Data

[1] A. Conneau*, G. Lample*, L. Denoyer, MA. Ranzato, H. Jégou, Word Translation Without Parallel Data

* Equal contribution. Order has been determined with a coin flip.

@article{conneau2017word,
  title={Word Translation Without Parallel Data},
  author={Conneau, Alexis and Lample, Guillaume and Ranzato, Marc'Aurelio and Denoyer, Ludovic and J{\'e}gou, Herv{\'e}},
  journal={arXiv preprint arXiv:1710.04087},
  year={2017}
}

MUSE is the project that originated the work on unsupervised machine translation using monolingual data only [2].

Unsupervised Machine Translation Using Monolingual Corpora Only

[2] G. Lample, A. Conneau, L. Denoyer, MA. Ranzato, Unsupervised Machine Translation Using Monolingual Corpora Only

@article{lample2017unsupervised,
  title={Unsupervised Machine Translation Using Monolingual Corpora Only},
  author={Lample, Guillaume and Conneau, Alexis and Denoyer, Ludovic and Ranzato, Marc'Aurelio},
  journal={arXiv preprint arXiv:1711.00043},
  year={2017}
}


Contact: [email protected] [email protected]
