ec-doris/language-identification-benchmark
language-identification-benchmark

Repo to test several language ID tools on several types of texts. See https://github.com/ec-doris/drivein-cdk/issues/269.

The state of the art and a survey of language identification are available in this PhD thesis and this survey paper.

Current tools

Not considered for now:

Use

python main.py $COLLECTION

where $COLLECTION is currently one of kohesio, emea, eubooks, europarl, subs, or wikimatrix.

Results will be plotted in the results folder.
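
To benchmark every collection in one go, the command above can be wrapped in a small driver. The following is only a sketch: `build_command` and `run_all` are hypothetical helper names that are not part of this repo, and it assumes that `main.py` sits in the current working directory and takes the collection name as its only argument, as shown above.

```python
import subprocess
import sys

# Collection names as listed in this README.
COLLECTIONS = ["kohesio", "emea", "eubooks", "europarl", "subs", "wikimatrix"]


def build_command(collection, script="main.py"):
    """Return the argv list equivalent to `python main.py <collection>`."""
    if collection not in COLLECTIONS:
        raise ValueError(f"unknown collection: {collection!r}")
    # sys.executable reuses the interpreter that runs this driver.
    return [sys.executable, script, collection]


def run_all(collections=COLLECTIONS):
    """Run the benchmark once per collection, stopping on the first failure."""
    for name in collections:
        subprocess.run(build_command(name), check=True)


if __name__ == "__main__":
    run_all()
```

Validating the name before launching avoids starting a run with a misspelled collection.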

Data

Data can be downloaded with python download.py, although it is already included in the repo. The script shuffles the sentences, so results will probably vary after a fresh download.
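
If reproducible runs matter, one common approach is to shuffle with a fixed seed. The sketch below illustrates the idea only; `shuffle_sentences` and its `seed` parameter are hypothetical and do not describe how download.py is actually implemented.

```python
import random


def shuffle_sentences(sentences, seed=None):
    """Return a shuffled copy; a fixed seed gives the same order every time."""
    rng = random.Random(seed)  # private generator, leaves global state alone
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


sample = ["Guten Tag.", "Bonjour.", "Good morning.", "Buenos días."]
first = shuffle_sentences(sample, seed=42)
second = shuffle_sentences(sample, seed=42)
print(first == second)  # True: same seed, same order
```

With `seed=None` the order differs from run to run, which matches the variation described above.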

References
