LUMII-AILab/Tezaurs

Tēzaurs

Open data from https://tezaurs.lv - an extensive dictionary and thesaurus of Latvian, comprising more than 315,000 lexical entries.

This repository is unmaintained. The latest data releases are in the CLARIN repository at https://repository.clarin.lv/repository/xmlui/handle/20.500.12574/66

Available datasets

  1. Wordlists with metadata (under wordlists).
  2. Synonymic references (under wordlists).
  3. Glosses, etc. (under entries).

Additional datasets

  1. Multi-word expressions (extracted from a balanced 10M text corpus of Latvian) - under mwe.
  2. Mapping of Tēzaurs entries to core WordNet synsets (experimental) - under wordnet.

Wordlists

Morphology and other metadata

See entries.txt and references.txt under wordlists. entries.txt lists the main headwords; references.txt lists derivatives of the main headwords. Named entities, acronyms, abbreviations, prefixes, etc. are not included.

Data format: tab-separated records consisting of 9 fields:

  1. Headword.
  2. Homonym / homograph index (0..N).
  3. Universal POS tag, or NULL.
  4. Inflectional paradigm¹ (0..N), or NULL.
  5. Infinitive stem¹ (if the paradigm is 15 or 18), or NULL.
  6. Comma-separated present stems¹ (if the paradigm is 15 or 18), or NULL.
  7. Comma-separated past stems¹ (if the paradigm is 15 or 18), or NULL.
  8. Verb prefix² (if the paradigm is 15 or 18), or NULL.
  9. Comma-separated list of sources, or NULL, or REF in case of references.

¹ Used by Tēzaurs inflection service with the following parameters:

² To be used by http://api.tezaurs.lv/v1/transcriptions/{word}
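
The nine-field record format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the Tēzaurs tooling: it assumes the files are UTF-8, have no header row, and spell missing values as the literal string NULL. The names WordlistRecord, parse_record and read_wordlist are hypothetical, and the sample line is invented for illustration rather than copied from the data.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class WordlistRecord:
    """One row of entries.txt or references.txt (hypothetical container)."""
    headword: str
    homonym_index: int
    pos: Optional[str]
    paradigm: Optional[int]
    infinitive_stem: Optional[str]
    present_stems: List[str]
    past_stems: List[str]
    verb_prefix: Optional[str]
    sources: List[str]
    is_reference: bool


def _optional(value: str) -> Optional[str]:
    # Assumption: a missing value is written as the literal string "NULL".
    return None if value == "NULL" else value


def _comma_list(value: str) -> List[str]:
    # NULL becomes an empty list; otherwise split on commas and trim spaces.
    if value == "NULL":
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


def parse_record(line: str) -> WordlistRecord:
    """Parse one tab-separated line into a WordlistRecord."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    paradigm = _optional(fields[3])
    is_reference = fields[8] == "REF"  # field 9 is REF for references
    return WordlistRecord(
        headword=fields[0],
        homonym_index=int(fields[1]),
        pos=_optional(fields[2]),
        paradigm=None if paradigm is None else int(paradigm),
        infinitive_stem=_optional(fields[4]),
        present_stems=_comma_list(fields[5]),
        past_stems=_comma_list(fields[6]),
        verb_prefix=_optional(fields[7]),
        sources=[] if is_reference else _comma_list(fields[8]),
        is_reference=is_reference,
    )


def read_wordlist(path: str) -> Iterator[WordlistRecord]:
    """Yield records from a wordlist file, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_record(line)


# Invented sample line; the source names SRC1 and SRC2 are placeholders.
sample = "vārds\t1\tNOUN\t1\tNULL\tNULL\tNULL\tNULL\tSRC1,SRC2"
record = parse_record(sample)
```

Keeping the stem fields as lists, even when empty, means callers can iterate over them without first checking for NULL.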

Synonyms

See synonyms.txt under wordlists.

Data format: tab-separated records consisting of 2 fields:

  1. Headword.
  2. Comma-separated synonymic references.
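
The two-field synonym format above maps naturally onto a dictionary from headword to a list of references. This is a rough sketch under the same assumptions as before (UTF-8, no header row); the function names parse_synonyms and load_synonyms are hypothetical, and merging repeated headwords into one list is a design choice of this sketch, not something the data format specifies.

```python
from typing import Dict, Iterable, List


def parse_synonyms(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Build a headword -> synonymic references mapping from TSV lines."""
    synonyms: Dict[str, List[str]] = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Field 1 is the headword, field 2 the comma-separated references.
        headword, _, references = line.partition("\t")
        items = [ref.strip() for ref in references.split(",") if ref.strip()]
        # If a headword occurs on several lines, its references are merged.
        synonyms.setdefault(headword, []).extend(items)
    return synonyms


def load_synonyms(path: str) -> Dict[str, List[str]]:
    """Read a synonyms file from disk."""
    with open(path, encoding="utf-8") as handle:
        return parse_synonyms(handle)


# Invented sample lines for illustration only.
sample_lines = ["a\tb,c\n", "d\te\n"]
mapping = parse_synonyms(sample_lines)
```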

Publications

Spektors, A., Auziņa, I., Darģis, R., Grūzītis, N., Paikens, P., Pretkalniņa, L., Rituma, L., Saulīte, B. Tezaurs.lv: the largest open lexical database for Latvian. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC), 2016

Pretkalniņa, L., Paikens, P. Extending Tēzaurs.lv Online Dictionary into a Morphological Lexicon. Human Language Technologies - The Baltic Perspective. Frontiers in Artificial Intelligence and Applications, vol. 307, IOS Press, 2018

Paikens, P., Grūzītis, N., Rituma, L., Nešpore, G., Lipskis, V., Pretkalniņa, L., Spektors, A. Enriching an Explanatory Dictionary with FrameNet and PropBank Corpus Examples. Proceedings of the 6th Biennial Conference on Electronic Lexicography (eLex), 2019

Related work

Acknowledgements

This work is partially supported by the Latvian State research programmes: Letonika (Project No. 3), NexIT (Project No. 1) and SOPHIS (Project No. 2). The latest development is supported by European Regional Development Fund under the grant agreement No. 1.1.1.1/16/A/219 (Full Stack of Language Resources for Natural Language Understanding and Generation in Latvian) and by the Latvian State research programme Latvian Language (VPP-IZM-2018/2-0002).

Licence

Tēzaurs data sets by AiLab are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite the relevant publications if you use Tēzaurs data or API in your research, and let us know if you use Tēzaurs data or API in your products or services. Your citations and feedback are important to secure funding for the further development of Tēzaurs data sets and API.

Contacts

Project coordinator: Andrejs Spektors, [email protected]

Team members: Ilze Auziņa, Guntis Bārzdiņš, Roberts Darģis, Mikus Grasmanis, Normunds Grūzītis, Gunta Nešpore-Bērzkalne, Pēteris Paikens, Ilmārs Poikāns, Lauma Pretkalniņa, Laura Rituma, Baiba Valkovska (Saulīte), Artūrs Znotiņš
