Variation in GFLang orthography definitions #7
Comments
It looks like these non-precomposed pairs in the ortho list cause the following error when running shaperglot.
|
OK, while we are waiting for gflang to be better, I'll improve the parsing of decomposed glyphs. |
I've changed things so that we can test whether either precomposed or decomposed glyphs are shapable by the font. To do this, I've run the glyph-or-glyphs through Harfbuzz and looked for any .notdefs - this is actually a better test than going through the cmap table, so this is a nice improvement. |
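The check described above can be sketched roughly as follows. This is a minimal illustration, not shaperglot's actual code: the `shape` callable is a hypothetical stand-in for a real HarfBuzz shaping call that returns glyph IDs, and `make_toy_shaper` is a toy substitute that only does a per-character lookup so the example runs without a font. The one thing taken as fact is that glyph ID 0 is `.notdef` in OpenType fonts.

```python
from typing import Callable, Dict, Iterable, List

NOTDEF_GID = 0  # OpenType reserves glyph ID 0 for .notdef

# Hypothetical interface: text in, list of glyph IDs out.
Shaper = Callable[[str], List[int]]


def is_shapable(text: str, shape: Shaper) -> bool:
    """True if shaping the text produces no .notdef glyphs."""
    return NOTDEF_GID not in shape(text)


def any_form_shapable(forms: Iterable[str], shape: Shaper) -> bool:
    """True if at least one form (precomposed or decomposed) shapes cleanly."""
    return any(is_shapable(form, shape) for form in forms)


def make_toy_shaper(cmap: Dict[str, int]) -> Shaper:
    """Toy stand-in for a shaper: one glyph per character, 0 if unmapped."""
    def shape(text: str) -> List[int]:
        return [cmap.get(ch, NOTDEF_GID) for ch in text]
    return shape


# A toy "font" that has e and combining acute, but no precomposed e-acute.
toy = make_toy_shaper({"e": 1, "\u0301": 2})
print(any_form_shapable(["\u00e9", "e\u0301"], toy))
```

A real shaper may also compose or decompose characters on its own, which the toy lookup does not model.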
Thanks, I'll give it a try. That does sound more robust. I guess it's more like how you do it in the orphaned marks test. |
@simoncozens Does this glyph-or-glyphs do the same as https://github.com/enabling-languages/python-i18n/wiki/icu.CanonicalIterator ? |
I don't believe it does check all permutations, but that would be nice rather than having to include all permutations in the exemplar characters. |
I guess we really don't need anything as complicated as
In any case, I don't think we want all the permutations in gflang. We already have the subsets of all multiple mark combinations in the exemplar_chars list. We should be able to handle all the combinations either directly in the shaperglot check or enumerate them in the shaperglot language profile. |
Yeah, I'm being silly. We definitely don't want all permutations in gflang, because if we want decomposition we can handle it ourselves. I don't even know if it's worthwhile to test all decomposition permutations, given that the first thing Harfbuzz is going to do when it sees the text is normalize it. |
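To illustrate why normalization makes most permutations redundant, here is a small sketch using Python's `unicodedata`. It uses Unicode NFC as a stand-in, on the assumption that it approximates what the shaper's own normalization does; HarfBuzz's normalizer is a separate implementation and is not called here. The helper name `normalized_orderings` is made up for this example.

```python
import itertools
import unicodedata


def normalized_orderings(base: str, marks: list) -> set:
    """Return the distinct NFC strings over every ordering of the marks."""
    results = set()
    for order in itertools.permutations(marks):
        results.add(unicodedata.normalize("NFC", base + "".join(order)))
    return results


# Dot below (U+0323) and acute (U+0301) have different combining classes,
# so both orderings collapse to a single normalized sequence.
different_classes = normalized_orderings("e", ["\u0323", "\u0301"])

# Acute (U+0301) and grave (U+0300) share a combining class, so their
# order is significant and the two orderings stay distinct.
same_class = normalized_orderings("a", ["\u0301", "\u0300"])

print(len(different_classes), len(same_class))
```

Marks in the same combining class are never reordered, so their orderings are genuinely different text and only some of them may be orthographically valid.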
I think I need to do a couple tests to be sure. If I recall correctly, for Yoruba |
That sort of thing is certainly true for syllabic scripts like Myanmar: which is precisely why we don’t want to be throwing every permutation at the shaper - not all of them will be orthographically correct. |
That probably means that permutations are best handled specifically in the shaperglot language profile when it is orthographically appropriate and not automatically within the shaperglot check. |
@simon, correct me if I am wrong, but the way Shaperglot instantiates HarfBuzz, there is no font fallback, so we are only looking at the specific fonts being tested. |
Ok, I ran this test through shaperglot and printed out the buffers. Test1 is
The output is definitely normalized.
I think that settles the question about permutations. |
* Update checker.py
  Added mark2base test that uses the serialized buffer to see if a mark has a GPOS shift if placed after a target base mark.
* Use shaper to check whether glyphs exist, see googlefonts#7
* Add youseedee to requirements
* Fix some lints
* Read your own config file, pylint
* More pylint fixes
* Pin protobuf dependency
* Further poetry dependency fixes
* Cache shaping
* Fix error message
* Implement an "unknown" state
* Implement the "report" option
* Speed up the mark checker
* Don't GSUB closure on pathological fonts
* Make pylint happier
* Make result status machine readable
* A new test for unencoded glyph variants. Fixes googlefonts#8
* Use the language tag from the language we're checking
* Skip tests based on certain conditions (missing features), fixes googlefonts#11
* Make linter happier
* Update orthographies check to include auxiliary chars
  There is probably a more elegant way to implement this but I have merged auxiliary characters into the bases for the orthographies check. For the purposes of language support testing base and auxiliary characters need to be included to ensure loan words, names and place names can all be typed for a given language.
* Improve error messages
* Add Neil's work
* Pylint stuff
* Update shaping_differs.py
  Fixed Type Error caused by trying to concat YAML to str
* Make non-verbose less verbose
* Transfer IP to Google
---------
Co-authored-by: Simon Cozens <[email protected]>
Co-authored-by: Dave Crossland <[email protected]>
@simoncozens @moyogo I have been reviewing the GFLang dataset and I am seeing a mix in the way character sets are catalogued.
Here are two examples:
bas_Latn
bin_Latn
bas_Latn has base and auxiliary characters broken out, whereas bin_Latn does not. Interestingly, bas_Latn also maps out the base/mark pairs that are not precomposed. I really like having that data right in GFLang. As we update GFLang we could make this a consistent practice. Then we could run the no_orphaned_marks check by default and not have to manually add it to a shaperglot profile.
Secondly, we should probably also have the orthographies check look for an auxiliary category and test for those glyphs as well.
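As a rough sketch of the two suggestions, the snippet below merges base and auxiliary exemplars into one list to test, and picks out the entries that have no precomposed form and so would need a mark-attachment check. The data layout is an assumption made for illustration (a dict with `base` and `auxiliary` keys holding space-separated entries, with multi-character entries wrapped in braces), and the function names are hypothetical; this is not how GFLang or shaperglot is known to expose the data.

```python
import unicodedata


def needs_mark_attachment(entry: str) -> bool:
    """True if the entry still contains a combining mark after NFC,
    i.e. it has no fully precomposed form."""
    nfc = unicodedata.normalize("NFC", entry)
    return len(nfc) > 1 and any(unicodedata.combining(ch) for ch in nfc)


def split_exemplars(exemplars: dict):
    """Return (all entries to test, entries needing a mark check).

    Assumed layout: 'base' and 'auxiliary' are space-separated strings.
    """
    entries = []
    for key in ("base", "auxiliary"):
        entries.extend(exemplars.get(key, "").split())
    # Drop the assumed brace wrapping and de-duplicate, keeping order.
    entries = list(dict.fromkeys(e.strip("{}") for e in entries))
    pairs = [e for e in entries if needs_mark_attachment(e)]
    return entries, pairs


# Made-up sample data, not taken from the real bas_Latn definition.
sample = {"base": "a b {\u025b\u0300}", "auxiliary": "\u00e9 q"}
to_check, mark_pairs = split_exemplars(sample)
print(to_check, mark_pairs)
```

With something like this, the orphaned-marks check could be driven from the orthography data itself rather than being added to each language profile by hand.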