Improve spell checker tokenization and reporting #72924
Update the spelling dictionary for #72924
Summary
Infrastructure "Improve spell checker tokenization and reporting"
Purpose of change
The current spell checker does not handle words with apostrophes correctly. For example, in #72910 (comment) it reports the `isn` in `isn't`.
The spell checker is also not case-sensitive, and it seems the default dictionary from `pyspellchecker` only contains non-proper-nouns.
Describe the solution
The spell checker's tokenizer now treats an apostrophe as part of a word, unless the apostrophe starts or ends the word or is followed by an `s` that ends the word. In the first two cases, it is difficult to distinguish the apostrophe from a single quote; in the last case, the full word is a genitive or a contraction of `xxx is/has`, so checking the part before `'s` should be enough.
The spell checker code now also takes the case of the word into account. For any all-lowercase word in the dictionary, the original, initial-caps, and all-uppercase forms are considered correct; for any other word, only the original and all-uppercase forms are considered correct.
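The two rules above can be illustrated with a minimal sketch. This is not the code from this PR: the names `tokenize`, `accepted_forms`, and `is_correct` are hypothetical, words are assumed to consist of ASCII letters only, and the dictionary is modeled as a plain Python set rather than the `pyspellchecker` dictionary.

```python
import re

# Assumption: a word is a run of ASCII letters, optionally joined by
# apostrophes that have letters on both sides. Leading and trailing
# apostrophes are never matched, so they are treated as quote marks.
_WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def tokenize(text):
    """Split text into words, keeping only word-internal apostrophes."""
    tokens = []
    for match in _WORD.finditer(text):
        word = match.group()
        # A trailing 's is a genitive or "is/has", so only the stem is checked.
        if word.endswith("'s"):
            word = word[:-2]
        tokens.append(word)
    return tokens


def accepted_forms(entry):
    """Return the spellings accepted for one dictionary entry."""
    # Every entry is accepted as written and in all-uppercase.
    forms = {entry, entry.upper()}
    # All-lowercase entries are also accepted with an initial capital.
    if entry.islower():
        forms.add(entry.capitalize())
    return forms


def is_correct(word, dictionary):
    """Check a token against every entry's accepted forms."""
    return any(word in accepted_forms(entry) for entry in dictionary)


words = {"dog", "isn't", "Python"}  # hypothetical dictionary
print(tokenize("The dog's bowl isn't 'empty'"))
# ['The', 'dog', 'bowl', "isn't", 'empty']
print(is_correct("Dog", words), is_correct("python", words))
# True False
```

In this sketch `Dog` and `DOG` pass because `dog` is an all-lowercase entry, while `python` fails because `Python` is only accepted as written or as `PYTHON`. Scanning every entry per lookup is slow for a large dictionary; precomputing the union of accepted forms into one set would be the obvious refinement.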
Describe alternatives you've considered
Testing
Tested locally and was able to fix spelling mistakes in #72910 without getting a million reports on `isn`. The spell checker also distinguished words with different cases, so I was able to fix words with incorrect capitalization in #72910.
The spell checker GitHub action was tested by running it on top of #72910 in this PR; the results are in #72924 (comment). Words not in the dictionary are correctly reported, apostrophes are correctly handled, and capitalized/all-uppercase words are correctly handled. Note that it reports some correctly spelled words because the updated dictionary is not added in this PR.
Additional context
The updated dictionary contains a lot of changed lines so it will be submitted as a separate PR. This PR should be usable with the existing dictionary though.