This repository contains the language resources reported in the LREC 2020 paper of the same title by Elmurod Kuriyozov, Yerai Doval and Carlos Gomez-Rodriguez. The paper itself is here.
If you use these resources in your research, please cite the paper as follows:
@inproceedings{kuriyozov2020cross,
title={Cross-Lingual Word Embeddings for Turkic Languages},
author={Kuriyozov, Elmurod and Doval, Yerai and G{\'o}mez-Rodr{\'\i}guez, Carlos},
booktitle={Proceedings of the 12th Language Resources and Evaluation Conference},
pages={4054--4062},
year={2020}
}
There are dictionaries obtained from existing resources for Turkish-English and Uzbek-English (the Kazakh-English dictionary reported in the paper cannot be shared due to licence issues).
The Turkish-English dictionary was obtained from MUSE.
The Uzbek-English dictionary was obtained from The Uzbek Glossary.
The Kazakh-English dictionary file cannot be shared directly, but it can be obtained from The Leneshmid Dictionary.
There are also dictionaries from five Turkic languages (Turkish, Uzbek, Azeri, Kazakh and Kyrgyz) to English, obtained using the Google Translate API.
Sizes (in words):
Turkish - English: 9350
Uzbek - English: 7958
Azeri - English: 7422
Kazakh - English: 8454
Kyrgyz - English: 7974
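As a rough illustration of how such a dictionary might be read, here is a minimal sketch. It assumes the files follow the plain-text layout used by the MUSE dictionaries: one entry per line, with the source word and its English translation separated by whitespace. The files in this repository may be laid out differently, so check them first. The function name `load_bilingual_dictionary` and the sample entries are hypothetical and only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def load_bilingual_dictionary(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse 'source target' lines into {source: [target, ...]}.

    Assumption: one whitespace-separated pair per line (MUSE-style layout).
    """
    pairs: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            # Skip blank lines and entries that are not a single word pair.
            continue
        source, target = parts
        # A source word may have several translations; keep each one once.
        if target not in pairs[source]:
            pairs[source].append(target)
    return dict(pairs)


# Tiny in-memory sample standing in for a real dictionary file (made-up data).
sample_lines = ["kitap book", "su water", "kitap volume", ""]
tr_en = load_bilingual_dictionary(sample_lines)
print(len(tr_en), tr_en["kitap"])
```

To read a real file, pass an open file handle instead of the sample list, for example `with open(path, encoding="utf-8") as f: load_bilingual_dictionary(f)`, where `path` points to the downloaded dictionary.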
Pre-trained word embeddings for these five Turkic languages are already available; one such set, which we used in our experiments, is FastText.
Apart from that, we trained our own word embeddings with the skip-gram model of FastText using Large Corpora of Turkic Languages (Baisa et al. 2012).
All pre-trained word embeddings can be downloaded from the links below.
Turkish FastText skip-gram 300d word-embeddings - Download
Uzbek FastText skip-gram 300d word-embeddings - Download
Azeri FastText skip-gram 300d word-embeddings - Download
Kazakh FastText skip-gram 300d word-embeddings - Download
Kyrgyz FastText skip-gram 300d word-embeddings - Download
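Once downloaded, the embeddings can be loaded with a few lines of code. The sketch below assumes the files are in the FastText text (`.vec`) format: a header line with the vocabulary size and the dimension, followed by one word per line with its space-separated vector values. If the downloads are in a different format (for example binary `.bin` models), this reader does not apply. The names `load_vec` and `cosine` and the toy two-dimensional vectors are hypothetical.

```python
import io
import math
from typing import Dict, Iterable, List


def load_vec(lines: Iterable[str], limit: int = 0) -> Dict[str, List[float]]:
    """Read embeddings in the FastText text format into {word: vector}.

    Assumption: the first line is '<vocab_size> <dim>'; every other line is
    '<word> <v1> ... <v_dim>'. A positive `limit` stops after that many words.
    """
    it = iter(lines)
    _vocab_size, dim = (int(x) for x in next(it).split()[:2])
    vectors: Dict[str, List[float]] = {}
    for line in it:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            # Skip lines that do not match the declared dimension.
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit and len(vectors) >= limit:
            break
    return vectors


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy 2-dimensional data standing in for a real 300d file (made-up values).
sample = io.StringIO("3 2\nkitap 1.0 0.0\ndaftar 0.8 0.6\nsu 0.0 1.0\n")
emb = load_vec(sample)
print(cosine(emb["kitap"], emb["daftar"]))
```

For a real file, pass an open handle, for example `with open(path, encoding="utf-8") as f: load_vec(f, limit=50000)`; the `limit` argument keeps memory use down when only the most frequent words are needed.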