divvunspell having problems with Cyrillic capital letters (it seems) #39

Closed
Trondtr opened this issue May 10, 2024 · 11 comments
@Trondtr

Trondtr commented May 10, 2024

To repeat:

uit-mac-443 lang-mns (main)$ e Эспоо|humnsNorm
Эспоо	Эспоо+N+Prop+Sem/Plc+Sg+Nom	0,000000

uit-mac-443 lang-mns (main)$ e Эспоо|divvunspell suggest -a tools/spellcheckers/mns.zhfst 
Reading from stdin...
Input: Эспоо		[INCORRECT]

uit-mac-443 lang-mns (main)$ e Эспоо|hfst-ospell -S -n 5 tools/spellcheckers/mns.zhfst 
"Эспоо" is in the lexicon...

The form Эспоо is in the normative fst, and is recognised by hfst-ospell, but not by divvunspell, even though both spellers access the same freshly compiled version of the speller, mns.zhfst.

The lemma (Эспоо) is from the shared-urj-Cyrl, but that should not be relevant (it is in the resulting fst). The editdist.default.txt file declares the е/э pair (and, on a side note, also the composed long э̄). Capital letters are not declared explicitly:

uit-mac-443 lang-mns (main)$ grep э tools/spellcheckers/editdist.default.txt 
э
э̄
е	э	2
э	е	2
 
@Trondtr
Author

Trondtr commented May 10, 2024

More examples:

Ленинградский
Ленинградский	Ленинградский+N+Prop+Sem/Plc+Sg+Nom	0,000000

Лениӈрадский
Лениӈрадский	Лениӈрадский+?	inf

uit-mac-443 lang-mns (main)$ e Ленинградский|hfst-ospell -S -n 5 tools/spellcheckers/mns.zhfst 
"Ленинградский" is in the lexicon...
uit-mac-443 lang-mns (main)$ e Ленинградский|divvunspell suggest -a tools/spellcheckers/mns.zhfst 
Reading from stdin...
Input: Ленинградский		[INCORRECT]

Thus, hfst-lookup and hfst-ospell behave as expected, but divvunspell does not: it does not recognise the correct form, and thus also does not suggest it as a correction for the incorrect form.

More examples behaving in the same way:

Приобье, Сарымо-Русскинское, Сӯкыръя, Тӯрват, хо̄тпан, ва̄та̄л-хащта̄л, вертолёт, занятиен, школа, колхозыт, коньяк, ляпат

As can be seen, the examples include plain lowercase Russian Cyrillic words as well as words with capital letters and composed letters. These failed forms constitute approximately 5% of a dataset of 607 word pairs.
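For anyone who wants to tally such failures over a word list, here is a minimal Python sketch. It assumes only the command and the output lines shown in this thread (`Input: <word> [CORRECT]` or `[INCORRECT]`); the function names are hypothetical, and any other output shape is simply ignored.

```python
import re
import subprocess

# Pattern based only on the output lines shown in this thread, e.g.
# "Input: Эспоо    [INCORRECT]"; lines of any other shape are skipped.
_VERDICT = re.compile(
    r"^Input:\s*(?P<word>.+?)\s+\[(?P<status>CORRECT|INCORRECT)\]\s*$"
)


def parse_verdicts(output: str) -> dict[str, bool]:
    """Map each checked word to True (accepted) or False (rejected)."""
    verdicts = {}
    for line in output.splitlines():
        match = _VERDICT.match(line)
        if match:
            verdicts[match["word"]] = match["status"] == "CORRECT"
    return verdicts


def check_words(words: list[str], zhfst_path: str) -> dict[str, bool]:
    """Pipe all words through the command used above (needs divvunspell on PATH)."""
    result = subprocess.run(
        ["divvunspell", "suggest", "-a", zhfst_path],
        input="\n".join(words) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_verdicts(result.stdout)


def rejected(verdicts: dict[str, bool]) -> list[str]:
    """Return the rejected words in sorted order."""
    return sorted(word for word, ok in verdicts.items() if not ok)
```

Calling `rejected(check_words(words, "tools/spellcheckers/mns.zhfst"))` on the word list would then give the forms that divvunspell rejects, which can be compared against the hfst-ospell results.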

@Trondtr
Author

Trondtr commented May 29, 2024

Exactly the same issue was commented upon some years ago, in #19. The problems (capital letters and composed letters) seem unrelated, but taken together, they make spellers dysfunctional for Cyrillic-based languages and for languages with composed letters (a large part of our languages). I thus hope the issue can get some attention.

@flammie
Contributor

flammie commented May 30, 2024

This is an old problem that was probably patched over in hfst-ospell after it was used as a source for divvunspell; if the error model doesn't know the alphabet needed for the lexicon, it gets confused, cf. giellalt/lang-mns@37e7724

@Trondtr
Author

Trondtr commented Jun 4, 2024

But how come? Why doesn't the error model know the alphabet needed (i.e., I take you to mean: the capital letters?). There is nothing magic with Cyrillic letters per se, I thought, but evidently it is. Here is the situation:

  • sme: editdistance.default.regex: Latin. Contains no capital letters, only small. Still names (Trond) are recognised
  • mhr: editdistance.default.txt: Cyrillic. contains capital and small letters. Names (Ленинград) are recognised
  • mns: editdist.default.txt: Cyrillic. Contains no capital letters. Names (Ленинград) are NOT recognised

Looking hard at this, I adjusted mns in line with mhr (i.e., I added the capital letters), and now it works. I thus close the bug.

It still is a mystery how sme works without having declared capital letters. Admittedly, it uses .regex and not .txt, but I do not see how that should affect the outcome.

In any case, although I invite anyone puzzled by the remaining incongruence to delve into it, I now have a working divvunspell for Cyrillic-based languages, and close the bug.

@Trondtr Trondtr closed this as completed Jun 4, 2024
@flammie
Contributor

flammie commented Jun 4, 2024

But how come? Why doesn't the error model know the alphabet needed (i.e., I take you to mean: the capital letters?). There is nothing magic with Cyrillic letters per se, I thought, but evidently it is. Here is the situation:

* sme: editdistance.default.regex: Latin. Contains no capital letters, only small. Still names (Trond) are recognised

There's no editdistance.default.regex in https://github.com/giellalt/lang-sme/tree/main/tools/spellcheckers, but editdistance.default-new.regex has uppercasing rules, and editdist.default-old.txt also has upper-case letters. There is usually one part or another of the whole error model that happens to have upper-case letters (it may even be only in the strings part or similar).

I actually wrote an hfst tool, hfst-check-alpha, a few years ago to tackle this problem. It could be automated a bit, even though it started as a debug tool.
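As a rough illustration of that kind of check (not hfst-check-alpha itself, which works on the transducers), here is a Python sketch that compares the letters of a word with the symbols listed in an editdist-style text file. The file format is an assumption inferred from the grep output earlier in the thread: a lone field declares a symbol, and `a<TAB>b<TAB>weight` declares a pair. The `#` comment syntax and the function names are also assumptions, and Unicode normalisation differences are not handled.

```python
import unicodedata


def error_model_symbols(editdist_text: str) -> set[str]:
    """Collect the symbols mentioned in an editdist-style file (assumed format)."""
    symbols = set()
    for line in editdist_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # comment syntax is an assumption
            continue
        fields = line.split("\t")
        # "a<TAB>b<TAB>weight" names two symbols; a lone field names one.
        symbols.update(fields[:2] if len(fields) >= 3 else fields[:1])
    return symbols


def letters(word: str) -> list[str]:
    """Split a word into letters, keeping combining marks with their base."""
    result = []
    for ch in word:
        if result and unicodedata.combining(ch):
            result[-1] += ch
        else:
            result.append(ch)
    return result


def missing_symbols(word: str, symbols: set[str]) -> list[str]:
    """Letters of the word that the error model does not list, in order."""
    return list(dict.fromkeys(s for s in letters(word) if s not in symbols))
```

With only lowercase letters declared, `missing_symbols("Эспоо", symbols)` reports the capital `Э`, which is the kind of gap discussed in this issue.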

@snomos
Member

snomos commented Jun 4, 2024

But how come? Why doesn't the error model know the alphabet needed (i.e., I take you to mean: the capital letters?).

We don't deduce the alphabet from the lexical FST automatically. We did so in an early version of the speller infra, but then we instead needed to list all the symbols we did NOT want as part of the error model alphabet. That turned out to be even more confusing and counter-intuitive, so what we have now is a system where you need to explicitly (and in some cases implicitly) list all symbols/letters you want to include in the error model alphabet.

By default I always suggest to NOT include capital letters. Including them leads to a much bigger error model, and similarly slower speller. At the same time we know that capital letters are almost only found in first position, and letters in that position are (generally) rarely wrong. There is also built-in automatic processing of upper-lower case shifting in the speller code, so that it is only the lexical case that needs consideration.

All of this is to say that actual need for including upper-case letters in the error model alphabet is usually very small, so small that it is typically not worth the costs (very much bigger error model, and much slower speller, cf above).

That is, think twice before adding upper case letters, and if you need to, consider whether you need all the upper case letters, or only a select few and known problem letters, as in your example.
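To help decide which capital letters are actually worth adding, a small Python sketch like the following could count where capitals occur in a word list. The function name and the idea of splitting the counts by position are illustrative assumptions, not part of the speller infrastructure.

```python
from collections import Counter


def capital_usage(words: list[str]) -> tuple[Counter, Counter]:
    """Count upper-case letters in word-initial position and elsewhere."""
    initial, elsewhere = Counter(), Counter()
    for word in words:
        for index, ch in enumerate(word):
            if ch.isupper():
                # Word-initial capitals and capitals elsewhere are kept apart.
                (initial if index == 0 else elsewhere)[ch] += 1
    return initial, elsewhere
```

Letters that only show up in the second counter come from compounds or all-caps words, and are candidates for the "select few" rather than the whole upper-case alphabet.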

@Trondtr
Author

Trondtr commented Jun 4, 2024

Well.

Before adding the capital letters the speller did not recognise words written with capital letters (cf. the "to repeat" in my first posting). After adding the capital letters, the speller did recognise such words.

Some letters never occur word-initially, but they do occur in words written with capital letters only. I am thus hesitant to follow your advice here.

@snomos
Member

snomos commented Jun 4, 2024

Before adding the capital letters the speller did not recognise words written with capital letters (cf. the "to repeat" in my first posting). After adding the capital letters, the speller did recognise such words.

That does not make any sense. Adding letters to the error model should have no consequence for whether a word is accepted or not by the speller, since that is done by the acceptor only. The error model takes no part in this process (and should not).

If you can come up with a reproducible example demonstrating this behaviour, please add it here (or in a new issue) — that is definitely a serious bug, and should be fixed ASAP. But I don't believe it before I see it.

Some letters never occur word-initially, but they do occur in words written with capital letters only. I am thus hesitant to follow your advice here.

I wrote in my comment:

There is also built-in automatic processing of upper-lower case shifting in the speller code, so that it is only the lexical case that needs consideration.

This includes initial upper-casing and all-caps. That is, from an FST point of view, word and WORD are exactly the same in divvunspell (as long as word is defined in the FST), because WORD (as well as Word) are changed to word and checked against the acceptor before being flagged as misspellings, and if word is then ok, both Word and WORD are also ok.

When generating suggestions, all of word, Word and WORD are used as input to the error model.
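The acceptance step described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not divvunspell's actual code: the acceptor is modelled as a plain set of strings, the function name is hypothetical, and the title-case fallback for all-caps input (so that an all-caps proper noun matches its lexical form) is my assumption.

```python
def is_accepted(word: str, lexicon: set[str]) -> bool:
    """Check a word against the acceptor, falling back to lower-cased variants."""
    candidates = [word]
    if word.isupper():
        # WORD -> Word (assumed, for proper nouns) and WORD -> word
        candidates += [word.capitalize(), word.lower()]
    elif word[:1].isupper():
        # Word -> word
        candidates.append(word[:1].lower() + word[1:])
    return any(candidate in lexicon for candidate in candidates)
```

In this model the error model plays no part in acceptance, so with `lexicon = {"школа", "Эспоо"}` the forms `школа`, `Школа`, `ШКОЛА`, `Эспоо` and `ЭСПОО` are all accepted, while `эспоо` is not.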

@Trondtr
Author

Trondtr commented Jun 4, 2024

Not too exciting here, just repeating myself, but in the reversed order. I first take the git version of the mns speller, and run the word for Leningrad (capital Cyrillic L) through it. It is accepted by divvunspell:

uit-mac-443 lang-mns (main)$ e Ленинград| divvunspell suggest -a tools/spellcheckers/mns.zhfst 
Reading from stdin...
Input: Ленинград		[CORRECT]

I then remove lines 46 through 89 (= all the capital letters) from the file editdist.default.txt, and save the file (I do not check it in, but it can be repeated as just described), and recompile the speller:


uit-mac-443 lang-mns (main)$ see tools/spellcheckers/editdist.default.txt 
uit-mac-443 lang-mns (main)$ make -j

*** Compiling mns - Mansi. ***

I then repeat the same test, with the following result:

uit-mac-443 lang-mns (main)$ e Ленинград| divvunspell suggest -a tools/spellcheckers/mns.zhfst 
Reading from stdin...
Input: Ленинград		[INCORRECT]

The only difference is thus the removal of the capital letters from editdist.default.txt.

The word is recognised by the analyser:

uit-mac-443 lang-mns (main)$ e Ленинград| humns
Ленинград	Ленинград+N+Prop+Sem/Plc+Sg+Nom	0,000000

@Trondtr
Author

Trondtr commented Jun 4, 2024

There are subtle differences, though:

  • Cyrillic-based mns: proper nouns work, but common words with an initial capital or in all capitals do not
  • Latin-based sma: proper nouns work, common words with an initial capital work, but words in all capitals do not

Whether this is of any help I do not know.

@flammie
Contributor

flammie commented Jun 5, 2024

I think Trond is right here: divvunspell fails when part of the alphabet, e.g. the uppercase letters, is missing from the alphabet of the whole error model.

I have been wondering why there is an extra case handling and extra edit distance re-weighting step in divvunspell (vs hfst-ospell) that seemingly duplicates what the error model should be doing, and I now think that it might have been implemented to partially work around this problem.
