MetaVec is a monolingual and cross-lingual meta-embedding generation framework.
MetaVec outperforms every previously proposed meta-embedding generation method. Our best-meta embedding achieves the best-published results in a wide range of intrinsic evaluation tasks.
You can download our pre-computed meta-embedding. This meta embeddings combines FastText, Numberbatch, JOINTChyb and Paragram.
Click to Download | Words | Dimensions | Size | Link |
---|---|---|---|---|
MetaVec | 4,573,185 | 300 | 11.8GB | https://adimen.si.ehu.es/~igarcia/embeddings/MetaVec.zip |
MetaVec is a very large embedding. We provide reduced vocabulary versions of the Meta-Embedding. The vector representations are the same, but we include only a subset of the words in the vocabulary.
Click to Download | Words | Dimensions | Size | |
---|---|---|---|---|
MetaVec 2M | 1,999,995 | 300 | 4.8GB | Only words in crawl-300d-2M.vec (FastText Common Crawl) vocabulary |
MetaVec 1M | 830,063 | 300 | 2GB | Only words in wiki-news-300d-1M.vec (FastText Wikipedia) vocabulary |
MetaVec 0.2M | 186,647 | 300 | 0.45GB | Only words in the 200,000 Most Common English Words List according to the Google's Trillion Words Corpus |
@inproceedings{garcia-ferrero-etal-2021-benchmarking-meta,
title = "Benchmarking Meta-embeddings: What Works and What Does Not",
author = "Garc{\'\i}a-Ferrero, Iker and
Agerri, Rodrigo and
Rigau, German",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-emnlp.333",
pages = "3957--3972",
If you only want to generate meta-embeddings you need:
- Python 3 (3.7 or greater)
- Numpy
- tqdm
- Cupy, if you have an Nvidia GPU you can install cupy to use the GPU instead of the CPU for faster matrix operations: https://docs.cupy.dev/en/stable/install.html
If you want to run the evaluation framework we have prepared a conda environment file that will install all the required dependencies. This file will create a conda environment called "metavec" with the required dependencies to generate meta-embeddings, run the intrinsic and run the extrinsic evaluation framework.
conda env create -f environment.yml
conda activate metavec
If you don't want to use conda you can inspect the "environment.yml" file to see the required dependencies and manually install them.
If you just want to generate meta-embeddings you only need to clone the VecMap repository inside the MetaVec directory. The following command will do that for you:
sh get_third_party.sh
If you want to generate meta-embedding and evaluate them you need to clone the Vecmap, word embedding benchmarks and Jiant repositories. For the last two ones, we will download a modified version to run the same configuration used in our paper. The script will also install and download the nltk and spacy required packages for evaluation. To run this command you first need to install the evaluation framework dependencies.
sh get_third_party.sh all
Here is an example command to generate a meta-embedding using FastText, JointcHYB, paragram and numberbatch as source embeddings.
- embeddings: Path of the source embeddings you want to combine to generate a meta-embedding
- rotate_to: Path to the embedding to which all source embedding will be aligned using VecMap, it doesn't need to be one of the source embeddings.
- output_path: Path where the generated meta-embedding will be saved
python run_metavec.py \
--embeddings embeddings/crawl-300d-2M.vec embeddings/JOINTC-HYB-ENES.emb embeddings/paragram_ws353.vec embeddings/numberbatch-en.txt \
--rotate_to embeddings/crawl-300d-2M.vec \
--output_path embeddings/MetaVec.vec
See embeddings/README.md for instructions on how to download the source embeddings that we test in our paper.
The intrinsic evaluation uses the Word Embedding Benchmark toolkit: https://github.com/kudkudak/word-embeddings-benchmarks
Evaluate a Word Embedding
python instrinsic_evaluation.py -i embeddings/MetaVec.vec
Evaluate all the Word Embeddings in a directory
python instrinsic_evaluation.py -d embeddings/
If you want to set a custom output directory for the evaluation results use the "--output_dir path" argument.
The extrinsic evaluation uses the Jiant-V1 toolkit: https://github.com/nyu-mll/jiant-v1-legacy
Evaluate a Word Embedding
python extrinsic_evaluation.py -i embeddings/MetaVec.vec
Evaluate all the Word Embeddings in a directory
python extrinsic_evaluation.py -d embeddings/
If you want to set a custom output directory for the evaluation results use the "--output_dir path" argument.