Variation in GFLang orthography definitions #7

NeilSureshPatel · 2022-10-20T20:15:57Z

@simoncozens @moyogo I have been reviewing the GFLang dataset and I am seeing a mix in the way character sets are catalogued.

Here are two examples:

bas_Latn

exemplar_chars {
  base: "a á à â ǎ ā {a᷆}{a᷇} b ɓ c d e é è ê ě ē {e᷆}{e᷇} ɛ {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆}{ɛ᷇} f g h i í ì î ǐ ī {i᷆}{i᷇} j k l m n ń ǹ ŋ o ó ò ô ǒ ō {o᷆}{o᷇} ɔ {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆}{ɔ᷇} p r s t u ú ù û ǔ ū {u᷆}{u᷇} v w y z {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇}"
  auxiliary: "q x"
  marks: "◌̀ ◌́ ◌̂ ◌̄ ◌̌ ◌᷆ ◌᷇"
  numerals: "  - ‑ , % ‰ + 0 1 2 3 4 5 6 7 8 9"
  index: "A B Ɓ C D E Ɛ F G H I J K L M N Ŋ O Ɔ P R S T U V W Y Z"
}

bin_Latn

exemplar_chars {
  base: "A B D E F G H I K L M N O P R S T U V W Y Z Á É È Ẹ Í Ó Ò Ọ Ú a b d e f g h i k l m n o p r s t u v w y z á é è ẹ í ó ò ọ ú \'"
  marks: "◌̀ ◌́ ◌̣"
}

bas_Latn has base and auxiliary characters broken out, whereas bin_Latn does not. Interestingly, bas_Latn also lists the base/mark pairs that have no precomposed form. I really like having that data right in GFLang. As we update GFLang, we could make this a consistent practice. Then we could run the no_orphaned_marks check by default instead of adding it manually to a shaperglot profile.

Secondly, we probably should also have the orthographies check look for an auxiliary category and test for those glyphs as well.
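To make the two suggestions above concrete, here is a minimal sketch of how an exemplar string could be tokenized and how base and auxiliary characters could be merged for checking. It assumes that a braced group such as {ɛ́} denotes one multi-codepoint sequence; the helper names parse_exemplars and characters_to_check are hypothetical and are not part of gflang or shaperglot.

```python
import re

# Assumption: "{...}" wraps one multi-codepoint sequence; anything else is a
# whitespace-separated token. Adjacent groups such as "{a᷆}{a᷇}" are two tokens.
TOKEN = re.compile(r"\{([^{}]+)\}|([^\s{}]+)")


def parse_exemplars(text):
    """Return exemplar tokens in order, braces stripped, duplicates dropped."""
    seen = {}
    for braced, bare in TOKEN.findall(text):
        seen.setdefault(braced or bare, None)
    return list(seen)


def characters_to_check(exemplar_chars):
    """Merge base and auxiliary tokens (hypothetical dict-shaped input)."""
    merged = parse_exemplars(exemplar_chars.get("base", ""))
    for token in parse_exemplars(exemplar_chars.get("auxiliary", "")):
        if token not in merged:
            merged.append(token)
    return merged


if __name__ == "__main__":
    bas = {"base": "a {a\u1dc6}{a\u1dc7} b {a\u1dc6}", "auxiliary": "q x"}
    print(characters_to_check(bas))
```

Dropping duplicates at parse time also sidesteps repeated entries in the exemplar list.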


NeilSureshPatel · 2022-10-28T20:01:01Z

It looks like these non-precomposed pairs in the ortho list cause the following error when running shaperglot.

  File "/home/neilspatel/.local/bin/shaperglot", line 8, in <module>
    sys.exit(main())
  File "/home/neilspatel/.local/lib/python3.9/site-packages/shaperglot/cli.py", line 97, in main
    options.func(options)
  File "/home/neilspatel/.local/lib/python3.9/site-packages/shaperglot/cli.py", line 47, in check
    results = checker.check(langs[lang])
  File "/home/neilspatel/.local/lib/python3.9/site-packages/shaperglot/checker.py", line 32, in check
    check_object.execute(self)
  File "/home/neilspatel/.local/lib/python3.9/site-packages/shaperglot/checks/orthographies.py", line 30, in execute
    missing = [x for x in self.bases if ord(x) not in checker.cmap]
  File "/home/neilspatel/.local/lib/python3.9/site-packages/shaperglot/checks/orthographies.py", line 30, in <listcomp>
    missing = [x for x in self.bases if ord(x) not in checker.cmap]
TypeError: ord() expected a character, but string of length 4 found

If we decide to include non-precomposed base/mark pairs in the ortho we need to filter them out for the orthographies check.
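A minimal sketch of that filtering, assuming the check receives the raw exemplar tokens and a cmap keyed by integer codepoint. The reported length of 4 would be consistent with a token like {ɛ́} reaching ord() with its braces still attached (two braces, a base letter, one combining mark), though that is an inference from the traceback. The function name is hypothetical.

```python
def missing_single_codepoints(bases, cmap):
    """Return (missing, deferred).

    missing:  single-codepoint tokens whose codepoint is not in cmap.
    deferred: multi-codepoint sequences, which ord() cannot handle and which
              need a different test (for example, shaping them).
    """
    missing, deferred = [], []
    for token in bases:
        text = token.strip("{}")  # drop the braces around sequences
        if len(text) == 1:
            if ord(text) not in cmap:
                missing.append(text)
        elif text:
            deferred.append(text)
    return missing, deferred


if __name__ == "__main__":
    fake_cmap = {ord("a"): "a", ord("b"): "b"}  # stand-in for a font cmap
    print(missing_single_codepoints(["a", "\u0253", "{\u025b\u0301}"], fake_cmap))
```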

simoncozens · 2022-10-31T11:44:53Z

OK, while we are waiting for gflang to be better, I'll improve the parsing of decomposed glyphs.

simoncozens · 2022-10-31T13:35:30Z

I've changed things so that we can test whether either precomposed or decomposed glyphs are shapable by the font. To do this, I've run the glyph-or-glyphs through Harfbuzz and looked for any .notdefs - this is actually a better test than going through the cmap table, so this is a nice improvement.
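For readers who want to try the idea, here is a rough sketch of a shaping-based existence test. It is not the code from the commit; it assumes the third-party uharfbuzz bindings are installed, and the function names and font path argument are hypothetical. It relies on .notdef being glyph ID 0, as it is in OpenType fonts.

```python
def has_notdef(glyph_ids):
    """True if any shaped glyph is .notdef (glyph ID 0)."""
    return any(gid == 0 for gid in glyph_ids)


def shape_to_glyph_ids(font_path, text):
    """Shape text with HarfBuzz and return the resulting glyph IDs."""
    import uharfbuzz as hb  # imported lazily so has_notdef works without it

    blob = hb.Blob.from_file_path(font_path)
    font = hb.Font(hb.Face(blob))
    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()
    hb.shape(font, buf)
    # After shaping, glyph_infos[].codepoint holds glyph IDs, not Unicode values.
    return [info.codepoint for info in buf.glyph_infos]


def unshapable(font_path, exemplars):
    """Return the exemplars (single characters or sequences) that hit .notdef."""
    return [t for t in exemplars if has_notdef(shape_to_glyph_ids(font_path, t))]
```

Because the shaper composes or decomposes as the font allows, the same call covers both precomposed characters and base-plus-mark sequences.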

NeilSureshPatel · 2022-10-31T14:49:15Z

Thanks, I'll give it a try. That does sound more robust. I guess it's more like how you do it in the orphaned marks test.

moyogo · 2022-10-31T18:37:14Z

@simoncozens Does this glyph-or-glyphs do the same as https://github.com/enabling-languages/python-i18n/wiki/icu.CanonicalIterator ?

NeilSureshPatel · 2022-10-31T18:53:44Z

I don't believe it checks all permutations, but that would be nicer than having to include every permutation in the exemplar characters.

NeilSureshPatel · 2022-11-01T21:32:48Z

I guess we really don't need anything as complicated as icu.CanonicalIterator, since HarfBuzz is going to collapse what it can into precomposed glyphs. All that is required is getting all permutations of the mark sequence following the base. I just built something in the script I am using to generate the no_orphaned_marks tests that seems to work.

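    import itertools

    new_basemarks = []
    # basemarks: base+mark strings taken from the exemplar list
    for basemark in basemarks:
        # Only sequences with two or more marks have more than one ordering.
        if len(basemark) > 2:
            base_only = basemark[0]
            marks_only = basemark[1:]
            for marks in itertools.permutations(marks_only):
                # Keep the base first, then append the reordered marks.
                new_basemarks.append(base_only + "".join(marks))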

In any case, I don't think we want all the permutations in gflang. We already have the subsets of all multiple-mark combinations in the exemplar_chars list. We should be able to handle all the combinations either directly in the shaperglot check or by enumerating them in the shaperglot language profile.

simoncozens · 2022-11-01T21:37:28Z

Yeah, I'm being silly. We definitely don't want all permutations in gflang, because if we want decomposition we can handle it ourselves.

I don't even know if it's worthwhile to test all decomposition permutations, given that the first thing Harfbuzz is going to do when it sees the text is normalize it.

NeilSureshPatel · 2022-11-01T21:54:55Z

I think I need to do a couple of tests to be sure. If I recall correctly, for Yoruba e\u0323\u0300 and e\u0300\u0323 have different cluster behavior, which can result in one sequence having an orphaned mark and the other not. Let me confirm this.

simoncozens · 2022-11-01T22:09:01Z

That sort of thing is certainly true for syllabic scripts like Myanmar: which is precisely why we don’t want to be throwing every permutation at the shaper - not all of them will be orthographically correct.

NeilSureshPatel · 2022-11-01T22:38:05Z

That probably means that permutations are best handled specifically in the shaperglot language profile when it is orthographically appropriate and not automatically within the shaperglot check.

NeilSureshPatel · 2022-11-02T14:54:30Z

@simoncozens, correct me if I am wrong, but the way Shaperglot instantiates HarfBuzz, there is no font fallback, so we are only looking at the specific fonts being tested.

NeilSureshPatel · 2022-11-02T20:57:27Z

> I think I need to do a couple of tests to be sure. If I recall correctly, for Yoruba e\u0323\u0300 and e\u0300\u0323 have different cluster behavior, which can result in one sequence having an orphaned mark and the other not. Let me confirm this.

OK, I ran this test through shaperglot and printed out the buffers. Test1 compares e\u0323\u0301 against e\u0301\u0323, and Test2 compares e\u0323\u0301 against é\u0323.

- check: shaping_differs
  inputs:
    - text: "ẹ́"
    - text: "ẹ́"
      language: "ro"
  differs:
    - cluster: 0
      glyph: 0
    - cluster: 0
      glyph: 0
  rationale: "in Yoruba"
- check: shaping_differs
  inputs:
    - text: "ẹ́"
    - text: "ẹ́"
      language: "ro"
  differs:
    - cluster: 0
      glyph: 0
    - cluster: 0
      glyph: 0
  rationale: "in Yoruba"

The output is definitely normalized.

Test1
uni1EB9=0+506|acutecomb=0@-317,0+0 uni1EB9=0+506|acutecomb=0@-317,0+0
uni1EB9=0+506|acutecomb=0@-317,0+0 uni1EB9=0+506|acutecomb=0@-317,0+0

Test2
uni1EB9=0+506|acutecomb=0@-317,0+0 uni1EB9=0+506|acutecomb=0@-317,0+0
uni1EB9=0+506|acutecomb=0@-317,0+0 uni1EB9=0+506|acutecomb=0@-317,0+0

I think that settles the question about permutations.
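The same convergence can be seen without a font by applying Unicode NFC normalization to the three input sequences. This is only an analogy: HarfBuzz's own normalizer is not identical to NFC, but both apply canonical reordering by combining class (dot below is class 220, acute is class 230). The labels and helper name below are illustrative.

```python
import unicodedata

SEQUENCES = {
    "dot below, then acute": "e\u0323\u0301",
    "acute, then dot below": "e\u0301\u0323",
    "precomposed e-acute + dot below": "\u00e9\u0323",
}


def nfc_codepoints(text):
    """Return the NFC form of text as a list of U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", text)]


normalized = {label: nfc_codepoints(seq) for label, seq in SEQUENCES.items()}
for label, codepoints in normalized.items():
    print(f"{label}: {' '.join(codepoints)}")
```

All three inputs normalize to U+1EB9 U+0301, which lines up with the uni1EB9 and acutecomb glyphs in the buffers above.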

* Update checker.py
  Added mark2base test that uses the serialized buffer to see if a mark has a GPOS shift if placed after a target base mark.
* Use shaper to check whether glyphs exist, see googlefonts#7
* Add youseedee to requirements
* Fix some lints
* Read your own config file, pylint
* More pylint fixes
* Pin protobuf dependency
* Further poetry dependency fixes
* Cache shaping
* Fix error message
* Implement an "unknown" state
* Implement the "report" option
* Speed up the mark checker
* Don't GSUB closure on pathological fonts
* Make pylint happier
* Make result status machine readable
* A new test for unencoded glyph variants. Fixes googlefonts#8
* Use the language tag from the language we're checking
* Skip tests based on certain conditions (missing features), fixes googlefonts#11
* Make linter happier
* Update orthographies check to include auxiliary chars
  There is probably a more elegant way to implement this but I have merged auxiliary characters into the bases for the orthographies check. For the purposes of language support testing base and auxiliary characters need to be included to ensure loan words, names and place names can all be typed for a given language.
* Improve error messages
* Add Neil's work
* Pylint stuff
* Update shaping_differs.py
  Fixed Type Error caused by trying to concat YAML to str
* Make non-verbose less verbose
* Transfer IP to Google

Co-authored-by: Simon Cozens
Co-authored-by: Dave Crossland

simoncozens added a commit that referenced this issue Oct 31, 2022

Use shaper to check whether glyphs exist, see #7

74a1bb5

simoncozens mentioned this issue Oct 31, 2022

Remove duplicates from languages exemplar_chars googlefonts/lang#18

Merged

moyogo mentioned this issue Nov 1, 2022

Test languages exemplars canonical duplicates googlefonts/lang#41

Merged
