This directory contains various linguistic frequency distributions, represented by two-column TSV files where the first column is the linguistic representation (usually, though not always, a token or word) and the second its frequency.
The LNRE calculator can ingest these files and produce useful descriptive statistics.
- Tweets by @dril:
  - Token frequencies, summary, graph
- Yahoo! Horoscopes, 2010:
  - Token unigram frequencies, summary, graph
  - Token bigram frequencies, summary, graph
  - Token trigram frequencies, summary, graph
- The Bible, King James Version:
  - Token frequencies, summary, graph
- English News Crawl, 2017:
  - Token frequencies, summary, graph
- English syntax from the Wall St. Journal portion of the Penn Treebank:
  - Word/XPOS ("emissions") frequencies, summary, graph
  - Binarized and lexicalized (v = 1, h = 1) CFG rule ("production rule") frequencies, summary, graph
- English syntax from the English Web Treebank:
  - Word/dependency relation pair frequencies, summary, graph
  - Word/headword pair ("bilexical dependency") frequencies, summary, graph
  - Word/head UPOS pair frequencies, summary, graph
  - UPOS/headword pair frequencies, summary, graph
- Czech morphology from Prague Dependency Treebank:
  - Token frequencies, summary, graph
  - Lemma frequencies, summary, graph
  - XPOS frequencies, summary, graph
  - UPOS frequencies, summary, graph
  - Universal Dependencies morphology tag frequencies, summary, graph
  - UniMorph morphology tag frequencies, summary, graph
- French phonology from Lexique:
  - Phoneme frequencies, summary, graph
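
As a rough illustration of the two-column TSV format described at the top, the sketch below reads such a file and computes a few basic counts: the total number of tokens (N), the number of distinct types (V), and the number of types seen exactly once or twice (V1, V2). This is not the LNRE calculator's implementation. The function names `read_frequencies` and `summarize` are hypothetical, the data is a toy example rather than real corpus counts, and the code assumes the files have no header row.

```python
import csv
import io
from collections import Counter


def read_frequencies(handle):
    """Reads (item, frequency) pairs from a two-column TSV file handle."""
    freqs = {}
    # QUOTE_NONE keeps quote characters in tokens from being interpreted.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # Skips blank or malformed lines.
        item, count = row
        # Sums counts in case an item appears on more than one line.
        freqs[item] = freqs.get(item, 0) + int(count)
    return freqs


def summarize(freqs):
    """Computes token count, type count, and low-frequency type counts."""
    # Maps each frequency m to the number of types occurring m times.
    spectrum = Counter(freqs.values())
    return {
        "N": sum(freqs.values()),  # Total tokens.
        "V": len(freqs),           # Distinct types.
        "V1": spectrum[1],         # Types occurring exactly once.
        "V2": spectrum[2],         # Types occurring exactly twice.
    }


# Toy data standing in for one of the TSV files (not real corpus counts).
toy = io.StringIO("the\t5\nof\t3\ncat\t1\ndog\t1\nsat\t2\n")
summary = summarize(read_frequencies(toy))
print(summary)  # {'N': 12, 'V': 5, 'V1': 2, 'V2': 1}
```

To apply this to a file on disk, pass a handle opened with `open(path, encoding="utf-8", newline="")` in place of the `io.StringIO` object.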