Speller suggestion issue #3
Comments
The interesting thing happens when you use divvunspell:
$ echo а̄им | divvunspell suggest -a tools/spellcheckers/mns.zhfst
Reading from stdin...
Input: а̄им [INCORRECT]
аим 17.787788
а̄тим 24.199446
а̄гим 25.03525
аким 26.155512
ам 27.999014
агим 28.65959
атым 32.826374
аюм 34.254124
аи 43.7433
аис 43.966442
For comparison, hfst-ospell-office gives:
$ echo '5 а̄им' | hfst-ospell-office -d tools/spellcheckers/mns.zhfst
@@ Loading tools/spellcheckers/mns.zhfst with args max-weight=-1.00, beam=-1.00, time-cutoff=6.00
@@ hfst-ospell-office is alive
& вим (20.06;0) аим (20.79;0) йим (21.74;0) мим (22.25;0) сым (22.52;0)
@flammie do you have comments or insights re combining diacritics and
In any case: I am not sure how much time we should spend on this, given that it works correctly using
Yeah, it already seems quite fragile at the compilation of the separate error model part:
$ echo 'а̄' | hfst-lookup .generated/strings.all.default.hfst
hfst-lookup: warning: It is not possible to perform fast lookups with OpenFST, std arc, tropical semiring format automata.
Using HFST basic transducer format and performing slow lookups
> а̄ а̄̄ 1,000000
а̄ а 2,000000
$ hfst-fst2txt .generated/strings.all.default.hfst | fgrep а
0 0 а а 0.000000
1 21 а а 0.000000
2 43 я а 0.000000
41 44 я а 0.000000
58 58 а а 0.000000
$ hfst-fst2txt .generated/strings.all.default.hfst | fgrep а̄
$ hfst-fst2txt .generated/strings.all.default.hfst | fgrep $'\u0304'
0 0 0.000000
7 8 0.000000
13 14 @0@ 0.000000
13 27 @0@ 0.000000
15 16 @0@ 0.000000
15 28 @0@ 0.000000
17 18 @0@ 0.000000
17 29 @0@ 0.000000
19 20 @0@ 0.000000
19 30 @0@ 0.000000
21 22 @0@ 0.000000
21 31 @0@ 0.000000
23 24 @0@ 0.000000
23 32 @0@ 0.000000
25 26 @0@ 0.000000
25 33 @0@ 0.000000
58 58 0.000000
$ hfst-fst2txt .generated/strings.all.default.hfst
0 0 @_IDENTITY_SYMBOL_@ @_IDENTITY_SYMBOL_@ 0.000000
0 0 0.000000
0 0 С С 0.000000
0 0 Щ Щ 0.000000
0 0 а а 0.000000
0 0 г г 0.000000
0 0 е е 0.000000
0 0 и и 0.000000
0 0 й й 0.000000
0 0 к к 0.000000
0 0 н н 0.000000
0 0 о о 0.000000
0 0 р р 0.000000
0 0 с с 0.000000
0 0 т т 0.000000
0 0 у у 0.000000
0 0 щ щ 0.000000
0 0 ы ы 0.000000
0 0 ь ь 0.000000
0 0 э э 0.000000
0 0 ю ю 0.000000
0 0 я я 0.000000
0 0 ӈ ӈ 0.000000
0 0 ӣ ӣ 0.000000
0 0 ӯ ӯ 0.000000
0 1 @0@ @0@ 0.000000
1 2 с щ 0.000000
1 4 т к 0.000000
1 9 т т 0.000000
1 13 я я 0.000000
1 15 э э 0.000000
1 17 ы ы 0.000000
1 19 ю ю 0.000000
1 21 а а 0.000000
1 23 е е 0.000000
1 25 о о 0.000000
1 34 н ӈ 0.000000
1 36 ӈ н 0.000000
1 38 г ӈ 0.000000
1 41 С Щ 0.000000
1 45 н н 0.000000
1 48 ӯ ӯ 0.000000
1 51 у у 0.000000
2 3 ь @0@ 0.000000
2 40 ю у 0.000000
2 43 я а 0.000000
3 58 @0@ @0@ 1.000000
4 5 и и 0.000000
4 6 ӣ ӣ 0.000000
4 7 е е 0.000000
5 58 @0@ @0@ 2.000000
6 58 @0@ @0@ 2.000000
7 8 0.000000
7 58 @0@ @0@ 2.000000
8 58 @0@ @0@ 2.000000
9 10 т к 0.000000
10 11 е е 0.000000
10 12 и и 0.000000
11 58 @0@ @0@ 2.000000
12 58 @0@ @0@ 2.000000
13 14 @0@ 0.000000
13 27 @0@ 0.000000
14 58 @0@ @0@ 1.000000
15 16 @0@ 0.000000
15 28 @0@ 0.000000
16 58 @0@ @0@ 1.000000
17 18 @0@ 0.000000
17 29 @0@ 0.000000
18 58 @0@ @0@ 1.000000
19 20 @0@ 0.000000
19 30 @0@ 0.000000
20 58 @0@ @0@ 1.000000
21 22 @0@ 0.000000
21 31 @0@ 0.000000
22 58 @0@ @0@ 1.000000
23 24 @0@ 0.000000
23 32 @0@ 0.000000
24 58 @0@ @0@ 1.000000
25 26 @0@ 0.000000
25 33 @0@ 0.000000
26 58 @0@ @0@ 1.000000
27 58 @0@ @0@ 2.000000
28 58 @0@ @0@ 2.000000
29 58 @0@ @0@ 2.000000
30 58 @0@ @0@ 2.000000
31 58 @0@ @0@ 2.000000
32 58 @0@ @0@ 2.000000
33 58 @0@ @0@ 2.000000
34 35 г @0@ 0.000000
35 58 @0@ @0@ 2.000000
36 37 @0@ г 0.000000
37 58 @0@ @0@ 2.000000
38 39 н н 0.000000
39 58 @0@ @0@ 3.000000
40 58 @0@ @0@ 2.000000
41 42 ю у 0.000000
41 44 я а 0.000000
42 58 @0@ @0@ 2.000000
43 58 @0@ @0@ 2.000000
44 58 @0@ @0@ 2.000000
45 46 т р 0.000000
46 47 р @0@ 0.000000
47 58 @0@ @0@ 4.000000
48 49 й и 0.000000
48 54 й ы 0.000000
49 50 и @0@ 0.000000
50 58 @0@ @0@ 4.000000
51 52 й и 0.000000
51 56 й ы 0.000000
52 53 и @0@ 0.000000
53 58 @0@ @0@ 4.000000
54 55 ы @0@ 0.000000
55 58 @0@ @0@ 4.000000
56 57 ы @0@ 0.000000
57 58 @0@ @0@ 4.000000
58 58 @_IDENTITY_SYMBOL_@ @_IDENTITY_SYMBOL_@ 0.000000
58 58 0.000000
58 58 С С 0.000000
58 58 Щ Щ 0.000000
58 58 а а 0.000000
58 58 г г 0.000000
58 58 е е 0.000000
58 58 и и 0.000000
58 58 й й 0.000000
58 58 к к 0.000000
58 58 н н 0.000000
58 58 о о 0.000000
58 58 р р 0.000000
58 58 с с 0.000000
58 58 т т 0.000000
58 58 у у 0.000000
58 58 щ щ 0.000000
58 58 ы ы 0.000000
58 58 ь ь 0.000000
58 58 э э 0.000000
58 58 ю ю 0.000000
58 58 я я 0.000000
58 58 ӈ ӈ 0.000000
58 58 ӣ ӣ 0.000000
58 58 ӯ ӯ 0.000000
58 0.000000
Do I read the above correctly, @flammie, when I find no occurrences of
If so, how can we force such a sequence to be treated as one symbol, in all contexts? The lexical FST does treat them as one symbol (as opposed to the tokeniser FST, which does the opposite on the input side). I assume the question relates to all
Mm, the strings compilation uses hfst-strings2fst without any alphabet or multichar symbol definitions, so it must treat combining characters as their own arcs in the graph. I guess that leads to a situation where the suggestion from а̄им to вим weighs а:в and the one to аим weighs
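The point about combining characters becoming their own arcs can be illustrated with a minimal Python sketch. The `tokenize` helper is hypothetical and written only for this illustration; it is not how hfst-strings2fst is implemented. It splits a string greedily into symbols, first with no multichar symbols declared and then with а̄ declared as one symbol.

```python
def tokenize(text, multichars=()):
    """Greedy longest-match split of text into symbols.

    Hypothetical helper for illustration only; not the HFST implementation.
    """
    # Try longer declared symbols first; ignore empty strings.
    symbols = sorted((s for s in multichars if s), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for sym in symbols:
            if text.startswith(sym, i):
                out.append(sym)
                i += len(sym)
                break
        else:
            # No declared symbol matches: fall back to one code point.
            out.append(text[i])
            i += 1
    return out


word = "\u0430\u0304\u0438\u043c"  # а̄им
a_macron = "\u0430\u0304"          # а + COMBINING MACRON

plain = tokenize(word)
with_multichar = tokenize(word, [a_macron])

print(len(plain), plain)                    # macron is a symbol of its own
print(len(with_multichar), with_multichar)  # а̄ is a single symbol
```

Without the declaration the word is four symbols long, so removing the macron is a separate edit on its own arc; with the declaration it is three symbols long and а̄ can be substituted as a unit.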
mm, that might be a good idea. I will have a look.
giellalt/giella-core@c73d62a fixes a bug that hindered
Ah - bummer on my part. The
So we need a new type of filter that finds all combining diacritics and the corresponding base letter(s), and then generates a filter of the following type:
and then applies this filter to all error model files being read by
@flammie feel free to continue this work 🙂
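As a rough sketch of the first half of that idea (finding the combining diacritics together with their base letters), something like the following could work. The function name `combining_sequences` is hypothetical, the sample lines are the strings.default.txt entries quoted in this issue, and the actual filter format is not shown here, so the sketch stops at collecting the sequences.

```python
import unicodedata


def combining_sequences(lines):
    """Collect each base letter + combining mark(s) sequence found in lines.

    Hypothetical helper; it only gathers candidate multichar symbols and
    does not generate any filter.
    """
    found = set()
    for line in lines:
        current = ""
        for ch in line:
            if unicodedata.combining(ch):
                if current:  # skip a mark that has no base letter before it
                    current += ch
            else:
                if len(current) > 1:
                    found.add(current)
                # Only letters may start a new sequence.
                current = ch if ch.isalpha() else ""
        if len(current) > 1:
            found.add(current)
    return found


# Error model entries as quoted in this issue (including the commented-out one).
lines = [
    "\u0430:\u0430\u0304 1",        # а:а̄ 1
    "#\u0430\u0304:\u044f 4",       # #а̄:я 4
    "\u0430\u0304:\u0430 1",        # а̄:а 1
    "\u0441\u044f:\u0449\u0430 2",  # ся:ща 2
    "\u0421\u044f:\u0429\u0430 2",  # Ся:Ща 2
]

multichars = combining_sequences(lines)
print(sorted(multichars))
```

Note that the sketch does not skip commented-out lines; whether those should contribute symbols is a design choice left open here.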
@Trondtr has reported:
Now compare this with the following:
$ grep а tools/spellcheckers/strings.default.txt
а:а̄ 1
#а̄:я 4
а̄:а 1
ся:ща 2
Ся:Ща 2
Why is ами suggested as second?
The core of the issue is that а̄ consists of a base character plus a combining macron: how well or badly does the error model handle combining diacritics when it comes to suggestions?
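To make the base-character-plus-combining-macron point concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It lists the code points of the input word and checks whether NFC normalisation can merge а and the macron into a single code point. It says nothing about how divvunspell or hfst-ospell handle the word internally.

```python
import unicodedata

word = "\u0430\u0304\u0438\u043c"  # а̄им: а + COMBINING MACRON + и + м

# Show every code point and its canonical combining class (0 = base character).
for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} combining={unicodedata.combining(ch)}")

# NFC composes a base + mark pair only if Unicode has a precomposed character.
nfc = unicodedata.normalize("NFC", word)
print(len(word), len(nfc))

# For contrast: и + macron does have a precomposed form (ӣ).
i_macron = unicodedata.normalize("NFC", "\u0438\u0304")
print(len(i_macron), f"U+{ord(i_macron):04X}")
```

Since Unicode has no precomposed Cyrillic а with macron, normalising the input does not remove the problem for this letter: а̄ stays two code points, so any component that works per code point sees the macron as a separate symbol.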