Tēzaurs

Open data from https://tezaurs.lv - an extensive dictionary and thesaurus of Latvian, comprising more than 315,000 lexical entries.

This repository is unmaintained - the latest data releases are in the CLARIN repository at https://repository.clarin.lv/repository/xmlui/handle/20.500.12574/66

Available datasets

Wordlists with metadata (under wordlists).
Synonymic references (under wordlists).
Glosses, etc. (under entries).

Additional datasets

Multi-word expressions (extracted from a balanced 10M text corpus of Latvian) - under mwe.
Mapping of Tēzaurs entries to core WordNet synsets (experimental) - under wordnet.

Wordlists

Morphology and other metadata

See entries.txt and references.txt under wordlists. Entries is a list of main headwords. References is a list of derivatives of the main headwords. Named entities, acronyms, abbreviations, prefixes, etc. are not included.

Data format: tab-separated records consisting of 9 fields:

Headword.
Homonym / homograph index (0..N).
Universal POS tag, or NULL.
Inflectional paradigm¹ (0..N), or NULL.
Infinitive stem¹ (if the paradigm is 15 or 18), or NULL.
Comma-separated present stems¹ (if the paradigm is 15 or 18), or NULL.
Comma-separated past stems¹ (if the paradigm is 15 or 18), or NULL.
Verb prefix² (if the paradigm is 15 or 18), or NULL.
Comma-separated list of sources, or NULL, or REF in case of references.
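The nine-field layout above can be read with a few lines of Python. The following is a minimal sketch, not part of the Tēzaurs tooling: the class name WordlistRecord, the function parse_record and the sample line are all hypothetical, and the sketch assumes that the homonym index and the paradigm are always plain integers and that NULL is written literally as the string NULL.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WordlistRecord:
    """One line of entries.txt or references.txt (hypothetical container)."""
    headword: str
    homonym_index: int
    pos: Optional[str]
    paradigm: Optional[int]
    infinitive_stem: Optional[str]
    present_stems: List[str]
    past_stems: List[str]
    verb_prefix: Optional[str]
    sources: List[str]  # ["REF"] for lines taken from references.txt


def _nullable(value: str) -> Optional[str]:
    # Assumption: a missing value is the literal string NULL.
    return None if value == "NULL" else value


def _split_list(value: str) -> List[str]:
    # Comma-separated field; NULL becomes an empty list.
    if value == "NULL":
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_record(line: str) -> WordlistRecord:
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    paradigm = _nullable(fields[3])
    return WordlistRecord(
        headword=fields[0],
        homonym_index=int(fields[1]),
        pos=_nullable(fields[2]),
        paradigm=int(paradigm) if paradigm is not None else None,
        infinitive_stem=_nullable(fields[4]),
        present_stems=_split_list(fields[5]),
        past_stems=_split_list(fields[6]),
        verb_prefix=_nullable(fields[7]),
        sources=_split_list(fields[8]),
    )


# Made-up record for illustration only; not copied from entries.txt.
SAMPLE_LINE = "nest\t0\tVERB\t15\tnes\tnes,nesu\tnes\tNULL\tLLVV"
record = parse_record(SAMPLE_LINE)
```

Keeping the stem and source fields as lists avoids re-splitting them later, at the cost of losing the distinction between NULL and an empty field.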

¹ Used by the Tēzaurs inflection service with the following parameters:

http://api.tezaurs.lv/v1/inflections/{word}?paradigm={inflectionalParadigm}&stem1={infinitiveStem}&stem2={presentStem}&stem3={pastStem} for the paradigms 15 and 18;
http://api.tezaurs.lv/v1/inflections/{word}?paradigm={inflectionalParadigm} for other paradigms;
or no parameters - http://api.tezaurs.lv/v1/inflections/{word} - if you feel lucky.

² To be used by http://api.tezaurs.lv/v1/transcriptions/{word}
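The three URL forms listed under footnote ¹ can be assembled as in the sketch below. It only builds the URL string and sends no request, since this README does not say whether the service is still reachable. The function name inflection_url and the example stems are hypothetical. Because the wordlist fields may hold several comma-separated stems while the URL template takes one value per parameter, the sketch assumes the caller passes a single stem of each kind.

```python
from typing import Optional
from urllib.parse import quote, urlencode

BASE_URL = "http://api.tezaurs.lv/v1/inflections/"

# Paradigms that, according to the field list above, carry explicit stems.
STEM_PARADIGMS = (15, 18)


def inflection_url(
    word: str,
    paradigm: Optional[int] = None,
    infinitive_stem: Optional[str] = None,
    present_stem: Optional[str] = None,
    past_stem: Optional[str] = None,
) -> str:
    """Build an inflection-service URL from wordlist fields (hypothetical helper)."""
    url = BASE_URL + quote(word, safe="")  # percent-encode non-ASCII letters
    if paradigm is None:
        # The "feel lucky" form: no parameters at all.
        return url
    params = {"paradigm": paradigm}
    if paradigm in STEM_PARADIGMS:
        if None in (infinitive_stem, present_stem, past_stem):
            raise ValueError("paradigms 15 and 18 need all three stems")
        params["stem1"] = infinitive_stem
        params["stem2"] = present_stem
        params["stem3"] = past_stem
    return url + "?" + urlencode(params)


# Made-up arguments for illustration only.
plain = inflection_url("galds")
with_paradigm = inflection_url("galds", paradigm=1)
with_stems = inflection_url("nest", 15, "nes", "nes", "nes")
```

For a record with several present or past stems, one option is to call the helper once per stem combination; whether the service accepts a comma-separated value directly is not stated here.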

Synonyms

See synonyms.txt under wordlists.

Data format: tab-separated records consisting of 2 fields:

Headword.
Comma-separated synonymic references.
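A minimal sketch of reading this two-field format into a dictionary follows. The function name parse_synonyms and the sample rows are hypothetical, and the sketch assumes that a headword appearing on more than one line should have its references merged.

```python
from typing import Dict, Iterable, List


def parse_synonyms(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each headword to its list of synonymic references."""
    result: Dict[str, List[str]] = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # First field is the headword, the rest is the comma-separated list.
        headword, _, refs = line.partition("\t")
        synonyms = [item.strip() for item in refs.split(",") if item.strip()]
        # Assumption: repeated headwords are merged rather than overwritten.
        result.setdefault(headword, []).extend(synonyms)
    return result


# Made-up rows for illustration only; not copied from synonyms.txt.
SAMPLE_ROWS = ["a\tb,c\n", "d\te\n", "a\tf\n"]
synonyms = parse_synonyms(SAMPLE_ROWS)
```

To load the real file, the same function can be given an open file object, for example `parse_synonyms(open("wordlists/synonyms.txt", encoding="utf-8"))`; the UTF-8 encoding is an assumption.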

Publications

Spektors, A., Auziņa, I., Darģis, R., Grūzītis, N., Paikens, P., Pretkalniņa, L., Rituma, L., Saulīte, B. Tezaurs.lv: the largest open lexical database for Latvian. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC), 2016

Pretkalniņa, L., Paikens, P. Extending Tēzaurs.lv Online Dictionary into a Morphological Lexicon. Human Language Technologies - The Baltic Perspective. Frontiers in Artificial Intelligence and Applications, vol. 307, IOS Press, 2018

Paikens, P., Grūzītis, N., Rituma, L., Nešpore, G., Lipskis, V., Pretkalniņa, L., Spektors, A. Enriching an Explanatory Dictionary with FrameNet and PropBank Corpus Examples. Proceedings of the 6th Biennial Conference on Electronic Lexicography (eLex), 2019

Related work

Acknowledgements

This work is partially supported by the Latvian State research programmes Letonika (Project No. 3), NexIT (Project No. 1) and SOPHIS (Project No. 2). The latest development is supported by the European Regional Development Fund under grant agreement No. 1.1.1.1/16/A/219 (Full Stack of Language Resources for Natural Language Understanding and Generation in Latvian) and by the Latvian State research programme Latvian Language (VPP-IZM-2018/2-0002).

Licence

Tēzaurs data sets by AiLab are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite the relevant publications if you use Tēzaurs data or API in your research, and let us know if you use Tēzaurs data or API in your products or services. Your citations and feedback are important to secure funding for the further development of Tēzaurs data sets and API.

Contacts

Project coordinator: Andrejs Spektors, [email protected]

Team members: Ilze Auziņa, Guntis Bārzdiņš, Roberts Darģis, Mikus Grasmanis, Normunds Grūzītis, Gunta Nešpore-Bērzkalne, Pēteris Paikens, Ilmārs Poikāns, Lauma Pretkalniņa, Laura Rituma, Baiba Valkovska (Saulīte), Artūrs Znotiņš
