Variation in GFLang orthography definitions #7

Open

NeilSureshPatel opened this issue Oct 20, 2022 · 13 comments

Comments

@NeilSureshPatel
Contributor

@simoncozens @moyogo I have been reviewing the GFLang dataset and I see inconsistency in the way character sets are catalogued.

Here are two examples:

bas_Latn

exemplar_chars {
  base: "a á à â ǎ ā {a᷆}{a᷇} b ɓ c d e é è ê ě ē {e᷆}{e᷇} ɛ {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆}{ɛ᷇} f g h i í ì î ǐ ī {i᷆}{i᷇} j k l m n ń ǹ ŋ o ó ò ô ǒ ō {o᷆}{o᷇} ɔ {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆}{ɔ᷇} p r s t u ú ù û ǔ ū {u᷆}{u᷇} v w y z {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} 
{ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇}"
  auxiliary: "q x"
  marks: "◌̀ ◌́ ◌̂ ◌̄ ◌̌ ◌᷆ ◌᷇"
  numerals: "  - ‑ , % ‰ + 0 1 2 3 4 5 6 7 8 9"
  index: "A B Ɓ C D E Ɛ F G H I J K L M N Ŋ O Ɔ P R S T U V W Y Z"
}

bin_Latn

exemplar_chars {
  base: "A B D E F G H I K L M N O P R S T U V W Y Z Á É È Ẹ Í Ó Ò Ọ Ú a b d e f g h i k l m n o p r s t u v w y z á é è ẹ í ó ò ọ ú \'"
  marks: "◌̀ ◌́ ◌̣"
}

bas_Latn has base and auxiliary characters broken out, whereas bin_Latn does not. Also interesting: bas_Latn maps out the base/mark pairs that are not precomposed. I really like having that data right in GFLang. As we update GFLang we could make this a consistent practice. Then we could run the no_orphaned_marks check by default and not have to add it manually to a shaperglot profile.
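For illustration, splitting an exemplar string like the ones above into single characters and braced multi-codepoint sequences could look like this. This is only a sketch, not the actual gflanguages parser; the function name is hypothetical.

```python
def parse_exemplars(exemplar_string):
    """Split a CLDR-style exemplar string into items.

    Tokens wrapped in {braces} are multi-codepoint sequences (e.g. a
    base plus combining marks); everything else is a single character.
    """
    items = []
    for token in exemplar_string.split():
        if token.startswith("{") and token.endswith("}"):
            items.append(token[1:-1])  # strip the braces, keep the sequence
        else:
            items.append(token)
    return items

# "e" followed by a combining acute stays together as one item
print(parse_exemplars("b \u0253 {e\u0301} z"))
```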

Secondly, the orthographies check should probably also look for an auxiliary category and test those glyphs as well.
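A minimal sketch of merging auxiliary characters into the set being checked, assuming a plain dict shape for the exemplar data (the real gflanguages objects are protobuf messages, so this is illustrative only):

```python
def chars_to_check(exemplar_chars):
    # Merge base and auxiliary exemplars so loan words, names and
    # place names are covered too. The dict shape is hypothetical.
    bases = exemplar_chars.get("base", "").split()
    auxiliary = exemplar_chars.get("auxiliary", "").split()
    return bases + auxiliary

print(chars_to_check({"base": "a \u0253 b", "auxiliary": "q x"}))
```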

@NeilSureshPatel
Contributor Author

It looks like these non-precomposed pairs in the ortho list cause the following error when running shaperglot.

  File "/home/neilspatel/.local/bin/shaperglot", line 8, in <module>
    sys.exit(main())
  File "/home/neilspatel/.local/lib/python3.9/site-packages/shaperglot/cli.py", line 97, in main
    options.func(options)
  File "/home/neilspatel/.local/lib/python3.9/site-packages/shaperglot/cli.py", line 47, in check
    results = checker.check(langs[lang])
  File "/home/neilspatel/.local/lib/python3.9/site-packages/shaperglot/checker.py", line 32, in check
    check_object.execute(self)
  File "/home/neilspatel/.local/lib/python3.9/site-packages/shaperglot/checks/orthographies.py", line 30, in execute
    missing = [x for x in self.bases if ord(x) not in checker.cmap]
  File "/home/neilspatel/.local/lib/python3.9/site-packages/shaperglot/checks/orthographies.py", line 30, in <listcomp>
    missing = [x for x in self.bases if ord(x) not in checker.cmap]
TypeError: ord() expected a character, but string of length 4 found

If we decide to include non-precomposed base/mark pairs in the ortho we need to filter them out for the orthographies check.
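One way to sketch that filter: only cmap-check single-codepoint entries, since `ord()` raises a `TypeError` on longer strings. The function name is hypothetical; the comprehension mirrors the line in the traceback above.

```python
def missing_bases(bases, cmap):
    # ord() only accepts single characters, so skip decomposed
    # base+mark sequences and cmap-check single codepoints only.
    return [x for x in bases if len(x) == 1 and ord(x) not in cmap]

# fontTools-style cmap: codepoint -> glyph name
cmap = {ord("a"): "a", ord("b"): "b"}
print(missing_bases(["a", "c", "\u0254\u0301"], cmap))
```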

@simoncozens
Collaborator

OK, while we are waiting for gflang to be better, I'll improve the parsing of decomposed glyphs.

@simoncozens
Collaborator

I've changed things so that we can test whether either precomposed or decomposed glyphs are shapable by the font. To do this, I run the glyph-or-glyphs through HarfBuzz and look for any .notdefs. This is actually a better test than going through the cmap table, so it's a nice improvement.
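The shape-and-look-for-.notdef idea can be sketched like this. The uharfbuzz binding and the function names are assumptions, not shaperglot's actual code; the import is deferred so the pure helper works without the package installed.

```python
def shaped_glyph_ids(fontdata, text):
    # Shape text with HarfBuzz and return the resulting glyph ids.
    # uharfbuzz is a third-party binding and an assumption here.
    import uharfbuzz as hb

    font = hb.Font(hb.Face(fontdata))
    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()
    hb.shape(font, buf)
    # After shaping, info.codepoint holds a glyph id, not a Unicode codepoint
    return [info.codepoint for info in buf.glyph_infos]

def is_shapable(glyph_ids):
    # Glyph id 0 is .notdef by definition in OpenType
    return all(gid != 0 for gid in glyph_ids)

print(is_shapable([42, 7]), is_shapable([42, 0]))
```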

@NeilSureshPatel
Contributor Author

Thanks, I'll give it a try. That does sound more robust. I guess it's more like how you do it in the orphaned marks test.

@moyogo
Contributor

moyogo commented Oct 31, 2022

@simoncozens Does this glyph-or-glyphs do the same as https://github.com/enabling-languages/python-i18n/wiki/icu.CanonicalIterator ?

@NeilSureshPatel
Contributor Author

I don't believe it checks all permutations, but that would be nicer than having to include every permutation in the exemplar characters.

@NeilSureshPatel
Contributor Author

I guess we don't really need anything as complicated as icu.CanonicalIterator, since HarfBuzz is going to collapse what it can into precomposed forms. All that is required is getting all permutations of the mark sequence following the base. I built something in the script I am using to generate the no_orphaned_marks tests that seems to work.

    for basemark in basemarks:
        # Only clusters with two or more combining marks need permuting
        if len(basemark) > 2:
            base_only = basemark[0]
            marks_only = basemark[1:]
            for perm in itertools.permutations(marks_only):
                # Base first, then the reordered marks. (str.join would
                # wrongly interleave the base between the marks.)
                new_basemarks.append(base_only + "".join(perm))
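The intended behavior of that loop can be restated as a self-contained function; the name is hypothetical and this is a sketch, not the script itself:

```python
import itertools

def expand_mark_permutations(basemarks):
    # For each base + two-or-more-marks cluster, emit every ordering
    # of the marks after the base (the base itself stays first).
    expanded = []
    for basemark in basemarks:
        if len(basemark) > 2:
            base, marks = basemark[0], basemark[1:]
            for perm in itertools.permutations(marks):
                expanded.append(base + "".join(perm))
    return expanded

# "e" + combining dot below + combining acute
print(expand_mark_permutations(["e\u0323\u0301"]))
```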

In any case, I don't think we want all the permutations in gflang. We already have the subsets of all multiple-mark combinations in the exemplar_chars list. We should be able to handle all the combinations either directly in the shaperglot check or enumerate them in the shaperglot language profile.

@simoncozens
Collaborator

Yeah, I'm being silly. We definitely don't want all permutations in gflang, because if we want decomposition we can handle it ourselves.

I don't even know if it's worthwhile to test all decomposition permutations, given that the first thing Harfbuzz is going to do when it sees the text is normalize it.

@NeilSureshPatel
Contributor Author

I think I need to do a couple of tests to be sure. If I recall correctly, for Yoruba e\u0323\u0300 and e\u0300\u0323 have different cluster behavior, which can result in one sequence having an orphaned mark and the other not. Let me confirm this.

@simoncozens
Collaborator

That sort of thing is certainly true for syllabic scripts like Myanmar, which is precisely why we don't want to be throwing every permutation at the shaper: not all of them will be orthographically correct.

@NeilSureshPatel
Contributor Author

NeilSureshPatel commented Nov 1, 2022

That probably means that permutations are best handled specifically in the shaperglot language profile when it is orthographically appropriate and not automatically within the shaperglot check.

@NeilSureshPatel
Contributor Author

@simoncozens, correct me if I am wrong, but the way shaperglot instantiates HarfBuzz there is no font fallback, so we are only looking at the specific fonts being tested.

@NeilSureshPatel
Contributor Author

NeilSureshPatel commented Nov 2, 2022

> I think I need to do a couple of tests to be sure. If I recall correctly, for Yoruba e\u0323\u0300 and e\u0300\u0323 have different cluster behavior, which can result in one sequence having an orphaned mark and the other not. Let me confirm this.

OK, I ran this test through shaperglot and printed out the buffers. Test 1 is e\u0323\u0301 against e\u0301\u0323, and test 2 is e\u0323\u0301 against é\u0323.

- check: shaping_differs
  inputs:
    - text: "ẹ́"
    - text: "ẹ́"
      language: "ro"
  differs:
    - cluster: 0
      glyph: 0
    - cluster: 0
      glyph: 0
  rationale: "in Yoruba"
- check: shaping_differs
  inputs:
    - text: "ẹ́"
    - text: "ẹ́"
      language: "ro"
  differs:
    - cluster: 0
      glyph: 0
    - cluster: 0
      glyph: 0
  rationale: "in Yoruba"

The output is definitely normalized.

Test 1
uni1EB9=0+506|acutecomb=0@-317,0+0 uni1EB9=0+506|acutecomb=0@-317,0+0
uni1EB9=0+506|acutecomb=0@-317,0+0 uni1EB9=0+506|acutecomb=0@-317,0+0

Test 2
uni1EB9=0+506|acutecomb=0@-317,0+0 uni1EB9=0+506|acutecomb=0@-317,0+0
uni1EB9=0+506|acutecomb=0@-317,0+0 uni1EB9=0+506|acutecomb=0@-317,0+0

I think that settles the question about permutations.
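The equivalence in those buffers can be reproduced with the standard library: HarfBuzz applies Unicode canonical reordering and composition before shaping, and Python's unicodedata shows the two mark orders are canonically equivalent.

```python
import unicodedata

# NFC canonically reorders combining marks (dot below, ccc 220, sorts
# before acute, ccc 230) and composes what it can.
a = unicodedata.normalize("NFC", "e\u0323\u0301")  # dot below, then acute
b = unicodedata.normalize("NFC", "e\u0301\u0323")  # acute, then dot below
print(a == b)
# The e + dot below composes into U+1EB9; the acute stays combining,
# since no precomposed "e with dot below and acute" is encoded.
print(a == "\u1eb9\u0301")
```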

NeilSureshPatel added a commit to NeilSureshPatel/shaperglot that referenced this issue Feb 6, 2023
* Update checker.py

Added mark2base test that uses the serialized buffer to see if a mark has a GPOS shift if placed after a target base mark.

* Use shaper to check whether glyphs exist, see googlefonts#7

* Add youseedee to requirements

* Fix some lints

* Read your own config file, pylint

* More pylint fixes

* Pin protobuf dependency

* Further poetry dependency fixes

* Cache shaping

* Fix error message

* Implement an "unknown" state

* Implement the "report" option

* Speed up the mark checker

* Don't GSUB closure on pathological fonts

* Make pylint happier

* Make result status machine readable

* A new test for unencoded glyph variants. Fixes googlefonts#8

* Use the language tag from the language we're checking

* Skip tests based on certain conditions (missing features), fixes googlefonts#11

* Make linter happier

* Update orthographies check to include auxiliary chars

There is probably a more elegant way to implement this but I have merged auxiliary characters into the bases for the orthographies check. For the purposes of language support testing base and auxiliary characters need to be included to ensure loan words, names and place names can all be typed for a given language.

* Improve error messages

* Add Neil's work

* Pylint stuff

* Update shaping_differs.py

Fixed Type Error caused by trying to concat YAML to str

* Make non-verbose less verbose

* Transfer IP to Google

---------

Co-authored-by: Simon Cozens <[email protected]>
Co-authored-by: Dave Crossland <[email protected]>
NeilSureshPatel added a commit to NeilSureshPatel/shaperglot that referenced this issue Mar 14, 2023