A random collection of ideas that would improve usage or optimization.
Add a README to /corpus, which explains where the sentences for those word-lists came from and what was done to them.
Add a README to every folder, explaining which files were used and in what proportions (for example, in what percentages does "deu_mixed_wiki_web…" incorporate "wiki" and "web"?).
Include different countries for existing languages: some collections (for example "web-public") separate the results by country.
English: I suggest taking files from the US, Great Britain, and Australia and weighting them 1:1:1.
German: As there are far fewer Austrians than Germans, we could tilt the weighting in favor of German corpora when adding "Austrian German". While I'd prefer going 1:1, I know many would not agree with this. Currently, the ratio of the two populations is roughly 9:1, so this might be an acceptable starting point.
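To make the weighting idea concrete, here is a rough sketch of how per-country word lists could be merged with ratios such as 1:1:1 or 9:1. It is only an illustration, not how the existing corpora were built: the function name `merge_weighted` and the input layout (one dictionary of raw word counts per country) are assumptions. Each list is first normalized to relative frequencies, so that a larger file does not dominate simply because it contains more sentences, and only then scaled by its weight.

```python
def merge_weighted(corpora, weights):
    """Merge per-country word counts into one relative-frequency table.

    corpora: dict mapping a corpus name to a dict of raw word counts
    weights: dict mapping the same names to relative weights (e.g. 9 and 1)
    Both names and layout are hypothetical, chosen for this sketch.
    """
    total_weight = sum(weights[name] for name in corpora)
    merged = {}
    for name, counts in corpora.items():
        corpus_size = sum(counts.values())
        if corpus_size == 0:
            continue  # skip empty word lists
        share = weights[name] / total_weight
        for word, count in counts.items():
            # relative frequency inside this corpus, scaled by the corpus share
            merged[word] = merged.get(word, 0.0) + share * count / corpus_size
    return merged


# Toy counts, invented purely for illustration
germany = {"Januar": 10, "und": 90}
austria = {"Jänner": 10, "und": 90}
merged = merge_weighted(
    {"germany": germany, "austria": austria},
    {"germany": 9, "austria": 1},
)
```

With a 9:1 weighting, a word that only occurs in the Austrian list still shows up in the merged table, just with a tenth of the influence it would have at 1:1.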
Can't help with German but I did a corpus exercise last year for English. Get the latest versions of the PDF and .zip here: https://zenodo.org/record/5501838