Improvements to corpus #27

Glitchy-Tozier · 2022-03-28T20:36:11Z

A random collection of ideas that would improve usage or optimization.

Add a README to /corpus that explains where the sentences for those word-lists came from and how they were processed.
Add a README to every folder, explaining which files were used and in what proportions (for example, in which percentages "deu_mixed_wiki_web…" incorporates "wiki" and "web").
Include different countries for existing languages: Some collections (for example "web-public") separate their results by country.
- English: I suggest taking files from the US, Great Britain, and Australia and weighting them 1:1:1.
- German: As there are far fewer Austrians than Germans, we could tilt the weighting in favor of the German corpora when adding "Austrian German". While I'd prefer going 1:1, I know many would disagree. The ratio of the two populations is currently roughly 9:1, so that might be an acceptable starting point.
http://www.adnw.de/index.php?n=Main.Bewertungsverfahren mentions multiple sources of n-gram frequencies. It would be interesting to incorporate some of them to further strengthen the validity of our corpus.
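
To make the weighting idea (1:1:1 for English, roughly 9:1 for German vs. Austrian German) more concrete, here is a minimal sketch of how per-country n-gram counts could be mixed. It is not how this repo currently builds its corpus; the function name `mix_ngram_counts` and the toy counts are made up for illustration. It assumes each corpus is first normalized to relative frequencies, so that a larger file does not dominate simply because it contains more text.

```python
from collections import Counter


def mix_ngram_counts(corpora, weights):
    """Blend several n-gram count tables into one relative-frequency table.

    corpora: list of dicts mapping n-gram -> raw count (one dict per country).
    weights: list of numbers such as [9, 1]; only their ratio matters.
    Hypothetical helper, not part of the existing code base.
    """
    if len(corpora) != len(weights):
        raise ValueError("need exactly one weight per corpus")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    mixed = Counter()
    for counts, weight in zip(corpora, weights):
        corpus_size = sum(counts.values())
        if corpus_size == 0:
            raise ValueError("cannot mix an empty corpus")
        share = weight / total_weight
        for ngram, count in counts.items():
            # Normalize within the corpus first, then apply its share.
            mixed[ngram] += share * count / corpus_size
    return dict(mixed)


if __name__ == "__main__":
    # Toy bigram counts, invented purely for this example.
    german = {"ch": 60, "ei": 40}
    austrian = {"ch": 50, "oa": 50}
    for ngram, freq in sorted(mix_ngram_counts([german, austrian], [9, 1]).items()):
        print(f"{ngram}\t{freq:.3f}")
```

Because every corpus is normalized before weighting, the resulting frequencies sum to 1 and switching from 9:1 to 1:1 only requires changing the `weights` argument.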


iandoug · 2022-10-19T13:17:57Z

Can't help with German but I did a corpus exercise last year for English. Get the latest versions of the PDF and .zip here:
https://zenodo.org/record/5501838

Also for Spanish, not as cleaned up as the English but better than straight from Leipzig:
https://zenodo.org/record/5501931

Glitchy-Tozier added the "enhancement" (New feature or request) label · Mar 28, 2022
