Open data from https://tezaurs.lv - an extensive dictionary and thesaurus of Latvian, comprising more than 315,000 lexical entries.
This is unmaintained - the latest data releases are in the CLARIN repository at https://repository.clarin.lv/repository/xmlui/handle/20.500.12574/66
- Wordlists with metadata (under wordlists).
- Synonymic references (under wordlists).
- Glosses, etc. (under entries).
- Multi-word expressions (extracted from a balanced 10M text corpus of Latvian) - under mwe.
- Mapping of Tēzaurs entries to core WordNet synsets (experimental) - under wordnet.
See entries.txt and references.txt under wordlists. Entries is a list of main headwords. References is a list of derivatives of the main headwords. Named entities, acronyms, abbreviations, prefixes, etc. are not included.
Data format: tab-separated records consisting of 9 fields:
- Headword.
- Homonym / homograph index (0..N).
- Universal POS tag, or NULL.
- Inflectional paradigm¹ (0..N), or NULL.
- Infinitive stem¹ (if the paradigm is 15 or 18), or NULL.
- Comma-separated present stems¹ (if the paradigm is 15 or 18), or NULL.
- Comma-separated past stems¹ (if the paradigm is 15 or 18), or NULL.
- Verb prefix² (if the paradigm is 15 or 18), or NULL.
- Comma-separated list of sources, or NULL, or REF in case of references.
¹ Used by the Tēzaurs inflection service with the following parameters:
- http://api.tezaurs.lv/v1/inflections/{word}?paradigm={inflectionalParadigm}&stem1={infinitiveStem}&stem2={presentStem}&stem3={pastStem} for the paradigms 15 and 18;
- http://api.tezaurs.lv/v1/inflections/{word}?paradigm={inflectionalParadigm} for other paradigms;
- or no parameters - http://api.tezaurs.lv/v1/inflections/{word} - if you feel lucky.
² To be used by http://api.tezaurs.lv/v1/transcriptions/{word}
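The record layout and the inflection URL patterns above can be sketched in Python as follows. This is a minimal illustration, not part of the Tēzaurs tooling: the names `WordlistRecord`, `parse_record` and `inflection_url` are hypothetical, the paradigm and homonym index are assumed to be plain integers, and when several present or past stems are listed the sketch simply takes the first one, which may not be what the service expects in every case.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import quote, urlencode

API_BASE = "http://api.tezaurs.lv/v1/inflections/"
STEM_PARADIGMS = {15, 18}  # paradigms for which the stem fields are filled in


@dataclass
class WordlistRecord:
    """One line of entries.txt or references.txt (hypothetical container)."""
    headword: str
    homonym_index: int
    pos: Optional[str]
    paradigm: Optional[int]
    infinitive_stem: Optional[str]
    present_stems: List[str]
    past_stems: List[str]
    verb_prefix: Optional[str]
    sources: List[str]
    is_reference: bool


def _optional(value: str) -> Optional[str]:
    # The literal string NULL marks a missing value.
    return None if value == "NULL" else value


def _split(value: str) -> List[str]:
    # Comma-separated field; NULL becomes an empty list.
    return [] if value == "NULL" else [v.strip() for v in value.split(",")]


def parse_record(line: str) -> WordlistRecord:
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    headword, homonym, pos, paradigm, inf, present, past, prefix, sources = fields
    is_ref = sources == "REF"  # REF marks a line from the references list
    return WordlistRecord(
        headword=headword,
        homonym_index=int(homonym),  # assumption: always an integer
        pos=_optional(pos),
        paradigm=None if paradigm == "NULL" else int(paradigm),  # assumption
        infinitive_stem=_optional(inf),
        present_stems=_split(present),
        past_stems=_split(past),
        verb_prefix=_optional(prefix),
        sources=[] if is_ref else _split(sources),
        is_reference=is_ref,
    )


def inflection_url(rec: WordlistRecord) -> str:
    """Build the request URL following the three patterns in footnote 1."""
    url = API_BASE + quote(rec.headword)
    if rec.paradigm is None:
        return url  # the "feel lucky" variant without parameters
    params = {"paradigm": rec.paradigm}
    has_stems = rec.infinitive_stem and rec.present_stems and rec.past_stems
    if rec.paradigm in STEM_PARADIGMS and has_stems:
        # Assumption: use the first listed present and past stem.
        params["stem1"] = rec.infinitive_stem
        params["stem2"] = rec.present_stems[0]
        params["stem3"] = rec.past_stems[0]
    return url + "?" + urlencode(params)
```

To use it, read a wordlist line by line (assuming UTF-8 encoding), pass each line to `parse_record`, and pass the result to `inflection_url`. Percent-encoding is handled by `quote` and `urlencode`, so headwords with Latvian diacritics are safe to put in the URL.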
See synonyms.txt under wordlists.
Data format: tab-separated records consisting of 2 fields:
- Headword.
- Comma-separated synonymic references.
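As a rough sketch of how the two-field synonym records might be loaded, the Python function below builds a dictionary from headword to its list of synonymic references. The function name `parse_synonyms` is hypothetical, and merging repeated headwords into one list is an assumption of this sketch, not something the data format specifies.

```python
from typing import Dict, Iterable, List


def parse_synonyms(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each headword to its comma-separated synonymic references."""
    synonyms: Dict[str, List[str]] = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Field 1 is the headword, field 2 the comma-separated references.
        headword, _, refs = line.partition("\t")
        items = [r.strip() for r in refs.split(",") if r.strip()]
        # Assumption: if a headword occurs more than once, merge its lists.
        synonyms.setdefault(headword, []).extend(items)
    return synonyms
```

Since the function accepts any iterable of lines, an open file object can be passed directly, for example `parse_synonyms(open("wordlists/synonyms.txt", encoding="utf-8"))`, where the UTF-8 encoding is assumed.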
Spektors, A., Auziņa, I., Darģis, R., Grūzītis, N., Paikens, P., Pretkalniņa, L., Rituma, L., Saulīte, B. Tezaurs.lv: the largest open lexical database for Latvian. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC), 2016
Pretkalniņa, L., Paikens, P. Extending Tēzaurs.lv Online Dictionary into a Morphological Lexicon. Human Language Technologies - The Baltic Perspective. Frontiers in Artificial Intelligence and Applications, vol. 307, IOS Press, 2018
Paikens, P., Grūzītis, N., Rituma, L., Nešpore, G., Lipskis, V., Pretkalniņa, L., Spektors, A. Enriching an Explanatory Dictionary with FrameNet and PropBank Corpus Examples. Proceedings of the 6th Biennial Conference on Electronic Lexicography (eLex), 2019
- Tēzaurs API
- NLP-PIPE: Latvian NLP Tool Pipeline
- Full Stack of Latvian Language Resources for NLU and NLG
This work is partially supported by the Latvian State research programmes Letonika (Project No. 3), NexIT (Project No. 1) and SOPHIS (Project No. 2). The latest development is supported by the European Regional Development Fund under the grant agreement No. 1.1.1.1/16/A/219 (Full Stack of Language Resources for Natural Language Understanding and Generation in Latvian) and by the Latvian State research programme Latvian Language (VPP-IZM-2018/2-0002).
Tēzaurs data sets by AiLab are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Please cite the relevant publications if you use Tēzaurs data or API in your research, and let us know if you use them in your products or services. Your citations and feedback are important to secure funding for the further development of Tēzaurs data sets and API.
Project coordinator: Andrejs Spektors, [email protected]
Team members: Ilze Auziņa, Guntis Bārzdiņš, Roberts Darģis, Mikus Grasmanis, Normunds Grūzītis, Gunta Nešpore-Bērzkalne, Pēteris Paikens, Ilmārs Poikāns, Lauma Pretkalniņa, Laura Rituma, Baiba Valkovska (Saulīte), Artūrs Znotiņš