
Not detecting word correctly? #126

Open · ktbarrett opened this issue Jul 31, 2021 · 15 comments

@ktbarrett

I have the word pybind11 in my text and have added pybind11 to my spelling list. However, the checker seems to ignore the trailing digits and treat the word as just pybind, which produces a spelling error. I could add pybind to the spelling list, but then pybind12 would also be accepted, which isn't ideal. Why aren't the trailing digits considered part of the word? I didn't see any configuration option for changing that behavior.

@dhellmann
Member

The pyenchant tokenizer does not consider digits as being part of the character set making up valid words, so they are removed from the tokens it returns.

There is an option to pass additional valid characters to the tokenizer, but that is not exposed by the sphinx integration layer. I don't think it would help in your case, because the word cannot start or end with those extra characters (the feature is meant to handle situations like apostrophes).

You could write a new tokenizer that did include digits in the set of characters making up valid words, but that might be extreme for your case. I think for that to work properly you would need to have the tokenizer accepted in pyenchant so that its name can be used with get_tokenizer().

It looks like pybind11 is the name of a package. The PyPI package name filter won't work because the trailing 11 is removed, so the package name won't match the token. What I usually do in cases like that is either add the base package name to the spelling list or add literal markup around the name in the text so the spell checker ignores it.
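
For illustration, a minimal sketch of that behavior using pyenchant's public tokenizer API; the exact offsets are illustrative, but dropping digits is the behavior described above:

```python
# Sketch: how pyenchant's stock English tokenizer treats digits.
# get_tokenizer() yields (word, offset) pairs; digits are not word
# characters, so "pybind11" comes back as just "pybind".
from enchant.tokenize import get_tokenizer

tknzr = get_tokenizer("en_US")
print(list(tknzr("add pybind11 to the list")))
# roughly: [('add', 0), ('pybind', 4), ('to', 13), ('the', 16), ('list', 20)]

# The "extra valid characters" option mentioned above defaults to the
# apostrophe, and those characters may only appear inside a word, not
# at its start or end, so it cannot rescue the trailing "11".
```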

@dmerejkowsky

dmerejkowsky commented Sep 9, 2021

Hi, PyEnchant maintainer here.

> You could write a new tokenizer that did include digits in the set of characters making up valid words, but that might be extreme for your case

I concur. Also, writing a tokenizer is rarely fun.

> What I usually do in cases like that is either add the base package name to the spelling list or add literal markup around the name in the text so the spell checker ignores it.

I think this is the way to go :)

@dhellmann
Member

Thanks for the advice, @dmerejkowsky, and for maintaining PyEnchant!

@webknjaz
Member

webknjaz commented Dec 1, 2021

I just want to add that when it sees words like isn't, sphinxcontrib-spelling reports isn as a mistake.
Also, I've hit a rather interesting case of dynamically generated identifiers in titles. In particular, I have a changelog document that renders a dynamically generated section title with 0.2.7.dev72+gb0796fb in it. This project then complains about dev and gb being invalid. I can ignore dev/alpha/beta to fix the first part, but the SCM identifier is more complicated: the leading g means it's a Git repository, and the following characters are an abbreviated commit SHA. There are a lot of possible g[0-9a-f]+ combinations here, and listing that many words is not really sustainable.
@dhellmann do you think you could expose something to deal with this? Ideally it could be specified per document, or at an even finer granularity. I cannot even use a custom spelling_filters implementation, because the tokenizer never recognizes the full version as a single word. Could you maybe fire an event with a callback to modify the string before passing it to pyenchant? I could cut the problematic word out in that callback, and the rest would still be checked.

P.S. Here's the MyST Markdown snippet I have:

```md
## [{sub-ref}`release`] - as of {sub-ref}`today` _{subscript}`/UNRELEASED DRAFT/`_
```
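
For reference, a hypothetical sketch of what a spelling_filters class for this could look like; as noted above, it only ever sees the fragments the tokenizer emits (dev, gb, cdb, ...), never the full version string, so it's a lossy workaround at best:

```python
# Hypothetical filter for sphinxcontrib-spelling's spelling_filters
# option: skip tokens that look like pre-release tags or pieces of an
# abbreviated commit hash.
import re

from enchant.tokenize import Filter

class ScmFragmentFilter(Filter):
    # Pre-release tags (dev, dev72, alpha1, ...) and short runs of hex
    # characters, optionally g-prefixed (gb, cdb, b0796fb, ...). The
    # hex rule can shadow real words such as "face", which is part of
    # why a filter is a poor fit for this problem.
    _fragment = re.compile(r"(?i)^(?:dev|alpha|beta)\d*$|^g?[0-9a-f]{1,7}$")

    def _skip(self, word):
        return bool(self._fragment.match(word))
```

It would be registered through the spelling_filters option in conf.py.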

@webknjaz
Member

webknjaz commented Dec 1, 2021

Just realized that the misspelled sequences may also be in the middle of the commit hash: 0.2.7.dev72+g3cdb8a9 has cdb reported as a mistake.

@ktbarrett
Author

ktbarrett commented Dec 1, 2021

The pyenchant tokenizer is limited in what it can do, yet it involves a lot of scaffolding code, so trying to extend it further is not a good way forward. A proper solution would take advantage of user dictionaries to extend tokenization: match against the dictionary first with something like Aho-Corasick, and if that fails, backtrack and fall back on the current approach of language-specific lexeme matching before handing the token to enchant.

EDIT: Depending on the performance, startswith may suffice.
EDIT 2: startswith may have problems: it would have to check that the matched word is followed by a language-specific terminal character (space, punctuation, dashes, but not digits, per the original issue); see the sketch below. So I guess that means this functionality should be incorporated into pyenchant.
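
Something like this toy sketch of the dictionary-first idea (names made up; plain Python, no pyenchant): match user-supplied literal tokens with startswith() and only accept them when they end at a terminal character, as EDIT 2 requires:

```python
# Toy sketch: try each user-supplied literal token (e.g. "pybind11")
# with startswith(), accepting a match only when it is followed by a
# language-specific terminal character (or end of text), not a digit.
TERMINALS = set(" \t\n.,;:!?()[]{}\"'-")

def match_user_word(text, pos, user_words):
    """Return the user word starting at pos, or None."""
    for word in user_words:
        end = pos + len(word)
        if text.startswith(word, pos) and (end == len(text) or text[end] in TERMINALS):
            return word
    return None

# match_user_word("add pybind11 now", 4, {"pybind11"})  -> "pybind11"
# match_user_word("add pybind112 now", 4, {"pybind11"}) -> None
```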

@tgross35

What would need to happen (and in which repo) to get this working? I'll happily code something if it means that "doesn't" and "ad-hoc" officially become words again.

@tgross35

Looks like maybe the blocker is pyenchant/pyenchant#286.

@dmerejkowsky any chance you'd be able to take a look at that issue?

@dhellmann
Member

See also #147, but yes, we need a supportable API to set the word separators.

@tgross35

Thanks for being so responsive. Too bad that pyenchant seems to be in coast mode now.

@dhellmann
Member

I think the main issue is that none of the maintainers of either project have a real need for the feature, so we're prioritizing other things.

There's some advice in this ticket and in the PR I linked that could lead someone to implement something and propose it to PyEnchant. It would also be possible to wrap the existing tokenizer and recombine split contractions (for example, recognize "don" and "t" as consecutive tokens and emit "don't" as one token); a rough sketch follows. It's a bit hacky but ought to work, and that sort of implementation could live independently of either project; we could provide an interface for configuring it through Sphinx.
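
A rough, untested sketch of that wrapper idea, assuming the (word, offset) pairs that pyenchant tokenizers yield; it checks both the ASCII and the typographic apostrophe, since smart quotes are a likely source of the split:

```python
# Rough sketch: wrap a stock pyenchant tokenizer and re-join tokens
# that were split at an apostrophe, so "don" + "t" is emitted as one
# "don't" token. Untested.
from enchant.tokenize import get_tokenizer

def tokenize_with_contractions(text, lang="en_US"):
    merged = []
    for word, pos in get_tokenizer(lang)(text):
        if merged:
            prev, prev_pos = merged[-1]
            between = text[prev_pos + len(prev):pos]
            if between in ("'", "\u2019"):  # straight or curly apostrophe
                merged[-1] = (prev + between + word, prev_pos)
                continue
        merged.append((word, pos))
    return merged
```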

@tgross35

Interesting ideas, thank you for the insight. A PR against pyenchant would definitely be best; it seems like they may need some maintenance help getting through issues/PRs, though (any chance you'd be interested, if Dimitri offered? You're active and clearly know your tokenization stuff :)).

I could volunteer to code something (it seems like stevepiercy might have offered as well), but I want to make sure not to duplicate efforts between a wrapper and a PR without first getting some input from pyenchant.

@dhellmann
Member

I'm not taking on new projects right now, so you won't be duplicating anything I'm doing. If you're interested, coordinating with the other folks who have expressed interest is a good idea.

@dmerejkowsky

> [They] may need some maintenance help getting through issues/PRs, though

Actually, I'm the sole maintainer of PyEnchant right now. This was not so bad for a long time because I was using PyEnchant very regularly: it's how I implemented a better spell checker for Kakoune. But nowadays I use yet another spell checker for Kakoune called skyspell, which means I no longer use PyEnchant...

So yeah, it would be best for PyEnchant (and its downstream projects like this one) if someone would volunteer to help maintain PyEnchant. Feel free to contact me for the details.

@dhellmann
Member

> So yeah, it would be best for PyEnchant (and its downstream projects like this one) if someone would volunteer to help maintain PyEnchant. Feel free to contact me for the details.

Thanks, @dmerejkowsky
