The word split is kinda too aggressive #9

Open
shushi2016 opened this issue Aug 16, 2019 · 2 comments
Comments

@shushi2016

Thanks for this great work. I tried it out and found that the split is sometimes too aggressive for my needs. For example, 'occupational' is split into 'occ', 'u', 'p', 'a', 't' and 'ional', and 'particulate' into 'part', 'icu', 'late'. Strangely, this doesn't always happen: sometimes 'occupational' and 'particulate' come back intact. Any thoughts on why this happens?

@keredson
Owner

Hey Shi,

Not really sure. For those words I'm seeing:

>>> import wordninja
>>> wordninja.split('occupational')
['occupational']
>>> wordninja.split('particulate')
['particulate']

Are those the exact strings you're trying?

The algo is deterministic. It shouldn't give you different outcomes for the same string across multiple calls.

Derek
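One way to check whether two inputs really are "the exact strings" is to print their code points, since invisible or look-alike characters do not show up in a normal print. The snippet below is only a sketch: `describe_chars` and `suspicious_chars` are hypothetical helper names, and the soft hyphen in `sample` is an invented example, not something reported in this thread.

```python
import unicodedata


def describe_chars(text):
    """Return (char, code point, Unicode name) for every character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def suspicious_chars(text):
    """Return only the characters that are not plain ASCII letters or digits."""
    return [
        item for item in describe_chars(text)
        if not (item[0].isascii() and item[0].isalnum())
    ]


# Invented example: a soft hyphen (U+00AD) hidden inside the word.
sample = "occu\u00adpational"
print(repr(sample))
for item in suspicious_chars(sample):
    print(item)
```

If `suspicious_chars` returns an empty list for a string that still splits badly, hidden characters are probably not the cause.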

@SpongebobSquamirez

I don't remember the details anymore, but I also ran into this issue a while back (and, if I recall, I solved it). I think it had to do with one of three things: ligatures (like the combined fi character), trying to split really long strings (which triggered a bug for some reason), or a word with a typo or a missing letter at the beginning or end of the string. See if one of those is the cause of your problem. NLP is not fun.
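For the ligature case, a possible pre-processing step is Unicode NFKC normalization, which folds compatibility characters such as the single "fi" ligature (U+FB01) back into plain letters. This is a sketch under that assumption: `normalize_for_split` is a hypothetical helper name, and the thread does not confirm that this was the actual fix.

```python
import unicodedata


def normalize_for_split(text):
    """Fold compatibility characters (e.g. ligatures) to plain letters and trim whitespace."""
    return unicodedata.normalize("NFKC", text).strip()


# "fi" below is the single ligature character U+FB01, not two letters.
raw = "arti\ufb01cialintelligence"
cleaned = normalize_for_split(raw)
print(cleaned)
# cleaned can then be passed on, e.g. wordninja.split(cleaned)
```

Normalizing before the split keeps the clean-up separate from the splitting itself, so the two steps can be checked independently.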

Using something like hunspell, or some other correction based on Levenshtein distance, might help fix any rogue-character issues.
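To illustrate the Levenshtein idea without depending on hunspell, here is a minimal pure-Python sketch. The names `levenshtein` and `closest_word`, the `max_distance` threshold, and the two-word `VOCAB` list are all assumptions for illustration; a real setup would use a full dictionary and likely a faster library.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def closest_word(token, vocabulary, max_distance=1):
    """Return the nearest vocabulary word within max_distance, else the token unchanged."""
    best = min(vocabulary, key=lambda word: levenshtein(token, word))
    return best if levenshtein(token, best) <= max_distance else token


# Hypothetical stand-in for a real dictionary.
VOCAB = ["occupational", "particulate"]

# A token with its first letter missing is one edit away from the intended word.
print(closest_word("ccupational", VOCAB))
```

Tokens that are far from every vocabulary entry are returned unchanged, so the correction step should not rewrite words it does not recognize.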


3 participants