Larger corpus? #6

Open
SpongebobSquamirez opened this issue Oct 26, 2018 · 2 comments

Comments

@SpongebobSquamirez

(This is a suggested improvement)

The corpus currently used is very small and seems to have been thrown together by the original author (who called it "quick and dirty"). A larger corpus would be much appreciated. I've been using this library on and off for the past year with mixed results, and its main problem seems to be the small number of words it can detect (e.g. it couldn't properly detect contractions until those were added to the corpus).

Something like the following might be good:
https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html
or
https://www.corpusdata.org/formats.asp

@keredson
Owner

true. but i wanted to keep the default model small (<1M). (i actually think i pared down that original model somewhat, but my memory is fuzzy at this point). i'm open to better language models about the same size tho.

i just implemented importing your own model file. check out "Custom Language Models" in the readme.
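For anyone wanting to try a larger corpus, here is a rough sketch of turning raw text into a ranked word list. It uses only the Python standard library. The output format (a gzipped text file, one word per line, most frequent first) is an assumption on my part, so check "Custom Language Models" in the readme for what the library actually expects. The names `rank_words`, `write_model`, and `max_words` are hypothetical helpers for illustration, not part of the library.

```python
import gzip
import re
from collections import Counter

# Lowercase words, optionally with one apostrophe part (e.g. "don't").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def rank_words(text, max_words=None):
    """Return distinct words ordered by decreasing frequency.

    Ties are broken alphabetically so the result is deterministic.
    max_words (hypothetical parameter) caps the list to keep the file small.
    """
    counts = Counter(WORD_RE.findall(text.lower()))
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked if max_words is None else ranked[:max_words]


def write_model(words, path):
    """Write one word per line to a gzipped UTF-8 text file.

    Assumed format only: confirm against the readme before relying on it.
    """
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for word in words:
            f.write(word + "\n")
```

Capping the list with `max_words` is one way to trade coverage for file size if the default model needs to stay small.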

@SpongebobSquamirez
Author

Thanks for adding custom language models. Haven't tried it out yet but hopefully plugging in these new corpora is straightforward.
