
Not detecting word correctly? #126

Open · ktbarrett opened this issue Jul 31, 2021 · 15 comments

@ktbarrett

I have the word pybind11 in my text and have added pybind11 to my spelling list. However, the checker seems to ignore the trailing digits and treat the word as just pybind, which produces a spelling error. I could add pybind to the spelling list, but then pybind12 would also be accepted, which isn't ideal. Why aren't the trailing digits considered part of the word? I didn't see any configuration option for changing that behavior.

@dhellmann
Member

The pyenchant tokenizer does not consider digits as being part of the character set making up valid words, so they are removed from the tokens it returns.

There is an option to pass additional valid characters to the tokenizer, but that is not exposed by the sphinx integration layer. I don't think it would help in your case, because the word cannot start or end with those extra characters (the feature is meant to handle situations like apostrophes).

You could write a new tokenizer that did include digits in the set of characters making up valid words, but that might be extreme for your case. I think for that to work properly you would need to have the tokenizer accepted in pyenchant so that its name can be used with get_tokenizer().

It looks like pybind11 is the name of a package. The PyPI package name filter won't work because the trailing 11 is removed, so the package name won't match the token. What I usually do in cases like that is either add the base package name to the spelling list or add literal markup around the name in the text so the spell checker ignores it.
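
For illustration, a minimal sketch of that behavior using pyenchant's public tokenizer API; the exact offsets are illustrative, but dropping digits is the behavior described above:

```python
# Sketch: how pyenchant's stock English tokenizer treats digits.
# get_tokenizer() yields (word, offset) pairs; digits are not word
# characters, so "pybind11" comes back as just "pybind".
from enchant.tokenize import get_tokenizer

tknzr = get_tokenizer("en_US")
print(list(tknzr("add pybind11 to the list")))
# roughly: [('add', 0), ('pybind', 4), ('to', 13), ('the', 16), ('list', 20)]

# The "extra valid characters" option mentioned above defaults to the
# apostrophe, and those characters may only appear inside a word, not
# at its start or end, so it cannot rescue the trailing "11".
```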

@dmerejkowsky

dmerejkowsky commented Sep 9, 2021

Hi, PyEnchant maintainer here.

> You could write a new tokenizer that did include digits in the set of characters making up valid words, but that might be extreme for your case

I concur. Also, writing a tokenizer is rarely fun.

> What I usually do in cases like that is either add the base package name to the spelling list or add literal markup around the name in the text so the spell checker ignores it.

I think this is the way to go :)

@dhellmann
Member

Thanks for the advice, @dmerejkowsky, and for maintaining PyEnchant!

@webknjaz
Member

webknjaz commented Dec 1, 2021

I just want to add that when it sees words like isn't, sphinxcontrib-spelling reports isn as a mistake.
Also, I've hit a rather interesting case of dynamically generated identifiers in titles. In particular, I have a changelog document that renders a dynamically generated section title with 0.2.7.dev72+gb0796fb in it. This project then complains about dev and gb being invalid. I can ignore dev/alpha/beta to fix the first part, but the SCM identifier is more complicated: the leading g means it's a Git repository, and the following characters are an abbreviated commit SHA. There are a lot of possible g[0-9a-f]+ combinations here, and listing that many words is not really sustainable.
@dhellmann do you think you could expose something to deal with this? Ideally it could be specified per document, or at an even finer granularity. I cannot even use a custom spelling_filters implementation, because the tokenizer never recognizes the full version as a single word. Could you maybe fire an event with a callback to modify the string before passing it to pyenchant? I could cut the problematic word out in that callback, and the rest would still be checked.

P.S. Here's the MyST Markdown snippet I have:

```md
## [{sub-ref}`release`] - as of {sub-ref}`today` _{subscript}`/UNRELEASED DRAFT/`_
```
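
For reference, a hypothetical sketch of what a spelling_filters class for this could look like; as noted above, it only ever sees the fragments the tokenizer emits (dev, gb, cdb, ...), never the full version string, so it's a lossy workaround at best:

```python
# Hypothetical filter for sphinxcontrib-spelling's spelling_filters
# option: skip tokens that look like pre-release tags or pieces of an
# abbreviated commit hash.
import re

from enchant.tokenize import Filter

class ScmFragmentFilter(Filter):
    # Pre-release tags (dev, dev72, alpha1, ...) and short runs of hex
    # characters, optionally g-prefixed (gb, cdb, b0796fb, ...). The
    # hex rule can shadow real words such as "face", which is part of
    # why a filter is a poor fit for this problem.
    _fragment = re.compile(r"(?i)^(?:dev|alpha|beta)\d*$|^g?[0-9a-f]{1,7}$")

    def _skip(self, word):
        return bool(self._fragment.match(word))
```

It would be registered through the spelling_filters option in conf.py.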

@webknjaz
Member

webknjaz commented Dec 1, 2021

Just realized that the misspelled sequences may also be in the middle of the commit hash: 0.2.7.dev72+g3cdb8a9 has cdb reported as a mistake.

@ktbarrett
Author

ktbarrett commented Dec 1, 2021

The pyenchant tokenizer is limited in what it can do, yet it involves a lot of scaffolding code, so trying to extend it further is not a good way forward. A proper solution would take advantage of user dictionaries to extend tokenization: match against the dictionary first with something like Aho-Corasick, and if that fails, backtrack and fall back on the current approach of language-specific lexeme matching before handing the token to enchant.

EDIT: Depending on the performance, startswith may suffice.
EDIT 2: startswith may have problems: it would have to check that the matched word is followed by a language-specific terminal character (space, punctuation, dashes, but not digits, per the original issue); see the sketch below. So I guess that means this functionality should be incorporated into pyenchant.
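
Something like this toy sketch of the dictionary-first idea (names made up; plain Python, no pyenchant): match user-supplied literal tokens with startswith() and only accept them when they end at a terminal character, as EDIT 2 requires:

```python
# Toy sketch: try each user-supplied literal token (e.g. "pybind11")
# with startswith(), accepting a match only when it is followed by a
# language-specific terminal character (or end of text), not a digit.
TERMINALS = set(" \t\n.,;:!?()[]{}\"'-")

def match_user_word(text, pos, user_words):
    """Return the user word starting at pos, or None."""
    for word in user_words:
        end = pos + len(word)
        if text.startswith(word, pos) and (end == len(text) or text[end] in TERMINALS):
            return word
    return None

# match_user_word("add pybind11 now", 4, {"pybind11"})  -> "pybind11"
# match_user_word("add pybind112 now", 4, {"pybind11"}) -> None
```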

@tgross35

What would need to happen (and in which repo) to get this working? I'll happily code something if it means that "doesn't" and "ad-hoc" officially become words again.

@tgross35

Looks like maybe the blocker is pyenchant/pyenchant#286.

@dmerejkowsky any chance you'd be able to take a look at that issue?

@dhellmann
Member

See also #147, but yes, we need a supportable API to set the word separators.

@tgross35

Thanks for being so responsive. Too bad that pyenchant seems to be in coast mode now.

@dhellmann
Member

I think the main issue is that none of the maintainers of either project have a real need for the feature, so we're prioritizing other things.

There's some advice in this ticket and in the PR I linked that could lead someone to implement something and propose it to PyEnchant. It would also be possible to wrap the existing tokenizer and recombine split contractions (for example, recognize "don" and "t" as consecutive tokens and emit "don't" as one token); a rough sketch follows. It's a bit hacky but ought to work, and that sort of implementation could live independently of either project; we could provide an interface for configuring it through Sphinx.
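
A rough, untested sketch of that wrapper idea, assuming the (word, offset) pairs that pyenchant tokenizers yield; it checks both the ASCII and the typographic apostrophe, since smart quotes are a likely source of the split:

```python
# Rough sketch: wrap a stock pyenchant tokenizer and re-join tokens
# that were split at an apostrophe, so "don" + "t" is emitted as one
# "don't" token. Untested.
from enchant.tokenize import get_tokenizer

def tokenize_with_contractions(text, lang="en_US"):
    merged = []
    for word, pos in get_tokenizer(lang)(text):
        if merged:
            prev, prev_pos = merged[-1]
            between = text[prev_pos + len(prev):pos]
            if between in ("'", "\u2019"):  # straight or curly apostrophe
                merged[-1] = (prev + between + word, prev_pos)
                continue
        merged.append((word, pos))
    return merged
```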

@tgross35

Interesting ideas, thank you for the insight. A PR against pyenchant would definitely be best; it seems like they may need some maintenance help getting through issues/PRs, though (any chance you'd be interested, if Dimitri offered? You're active and clearly know your tokenization stuff :)).

I could volunteer to code something (it seems like stevepiercy might have offered as well), but I want to make sure not to duplicate efforts between a wrapper and a PR without first getting some input from pyenchant.

@dhellmann
Member

I'm not taking on new projects right now, so you won't be duplicating anything I'm doing. If you're interested, coordinating with the other folks who have expressed interest is a good idea.

@dmerejkowsky

> [They] may need some maintenance help getting through issues/PRs, though

Actually, I'm the sole maintainer of PyEnchant right now. This was not so bad for a long time because I was using PyEnchant very regularly: it's how I implemented a better spell checker for Kakoune. But nowadays I use yet another spell checker for Kakoune called skyspell, which means I no longer use PyEnchant...

So yeah, it would be best for PyEnchant (and its downstream projects like this one) if someone would volunteer to help maintain PyEnchant. Feel free to contact me for the details.

@dhellmann
Member

> So yeah, it would be best for PyEnchant (and its downstream projects like this one) if someone would volunteer to help maintain PyEnchant. Feel free to contact me for the details.

Thanks, @dmerejkowsky
