Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Arabic words look separated when broken at soft hyphen #1415

Merged
merged 3 commits into from
Nov 8, 2024

Conversation

MoamenAbdelsattar
Copy link
Contributor

This change fixes the issue I described in #1414 It works, but I'm not sure whether it's clean code 😅 sorry.

This change fixes the issue I described in vivliostyle#1414
It works, but I'm not sure whether it's clean code 😅 sorry.
Copy link

vercel bot commented Nov 6, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
vivliostyle ✅ Ready (Inspect) Visit Preview 💬 Add feedback Nov 8, 2024 3:05am

@CLAassistant
Copy link

CLAassistant commented Nov 6, 2024

CLA assistant check
All committers have signed the CLA.

@MurakamiShinyu
Copy link
Member

@MoamenAbdelsattar Thank you for your contribution!

I checked your code and found some problems.

textNode.replaceData(
viewIndex,
text.length - viewIndex,
!nodeContext.breakWord ? resolveHyphenateCharacter(nodeContext) : "",
);
let p = nodeContext.preprocessedTextContent[0][1];
nodeContext.preprocessedTextContent[0][1] = p.slice(0, viewIndex + 1) + "\u200d" + p.slice(viewIndex + 1); // insert zero-width joiner

First, the build failed as shown in https://github.com/vivliostyle/vivliostyle.js/actions/runs/11707366253/job/32629434013:

Error: /home/runner/work/vivliostyle.js/vivliostyle.js/packages/core/src/vivliostyle/layout.ts(3919,51): semantic error TS2339: Property 'slice' does not exist on type 'string | number'.

This error can be fixed by adding as string to let p = nodeContext.preprocessedTextContent[0][1], and the build will pass. However, when testing the code, I could not see any effect of the change. The Arabic words still look separated when broken at soft hyphen.

You seem to have added a zero-width joiner character \u200d after the soft hyphen character in nodeContext.preprocessedTextContent[0][1]. I think this is wrong.

I made a fixed version as follows:

  breakAfterSoftHyphen(
    textNode: Text,
    text: string,
    viewIndex: number,
    nodeContext: Vtree.NodeContext,
  ): number {
    // convert trailing soft hyphen to a real hyphen

    // if the previous character is a shaping character, such as Arabic,
    // insert a zero-width joiner before the hyphen.
    const prev = text.charCodeAt(viewIndex - 1);
    const zwj = prev >= 0x600 && prev <= 0x7af ? "\u200d" : "";
    textNode.replaceData(
      viewIndex,
      text.length - viewIndex,
      !nodeContext.breakWord
        ? zwj + resolveHyphenateCharacter(nodeContext)
        : "",
    );
    return viewIndex + 1;
  }

I tested this fixed version with some test text with Arabic letters with soft hyphens (uyghur-shy.html), and it worked as expected. Also it does not affect other languages that do not require using zero-width joiner (english-shy.html).

One more request: Please prefix the commit message with "fix: " when fixing a bug.

@MoamenAbdelsattar
Copy link
Contributor Author

MoamenAbdelsattar commented Nov 7, 2024

Hi @MurakamiShinyu
Thanks for the notes. Sorry for the typescript error, I tested my change on the built browserified javascript version and it worked for me.

The critical Zero width joiner must be added both before and after the break. The separated form of letter before the break is not critical and can be fixed by modifying hyphenate-character and adding zero-width joiner to it. The critical problem is the line after the break. The first letter in the line after the break must look connected to the letter before it.

Regarding languages that don't require zero-width joiner I think zero-width joiner will not affect shaping, so I was thinking that as an optimization this step could be ignored.

@MoamenAbdelsattar
Copy link
Contributor Author

Please note: some Arabic letters are not connected to the letter after them, these are ا أ إ آ ر ز د ذ و ء, if a soft hyphen is present after them: the first letter in the broken line should not be in the connected form. In my case I wasn't inserting any soft-hyphen after them, so I forgot to take them into account.

@MoamenAbdelsattar
Copy link
Contributor Author

MoamenAbdelsattar commented Nov 7, 2024

Unicode provides a property that tells whether a character is Joined or not, However javascript seems to not support reading this property from regex.
I tried "م".match(/\p{Joining_Type=Dual_Joining}/u) It gives: SyntaxError: Invalid regular expression: /\p{Joining_Type=Dual_Joining}/u: Invalid property name.

Zero-width joiner must be added both before and after the break, but the ZWJ after the break must be only added if the last letter (and not diacritical mark) before the break has Joining_Type=Dual_Joining

@MurakamiShinyu
Copy link
Member

If /\p{Joining_Type=Dual_Joining}/u works the following code would be fine, but not, unfortunately. The Unicode properties available in ECMAScript are limited: https://tc39.es/ecma262/multipage/text-processing.html#table-nonbinary-unicode-properties

    const prev = text.charAt(viewIndex - 1);
    const zwj = /\p{Joining_Type=Dual_Joining}/u // ERROR: "Unknown Unicode property name."
      .test(prev)
      ? "\u200d"
      : "";
    textNode.replaceData(
      viewIndex,
      text.length - viewIndex,
      !nodeContext.breakWord
        ? zwj + resolveHyphenateCharacter(nodeContext)
        : "",
    );
    if (zwj) {
      const p = nodeContext.preprocessedTextContent[0][1] as string;
      nodeContext.preprocessedTextContent[0][1] =
        p.slice(0, viewIndex + 1) + zwj + p.slice(viewIndex + 1);
    }

@MoamenAbdelsattar
Copy link
Contributor Author

MoamenAbdelsattar commented Nov 7, 2024

Please note: the previous character could be a diacritical mark, which doesn't affect joining (Joining_Type=Transparent). If the property is supported, the code must be as following:

    let i = viewIndex - 1;
    let prev;
    while(i >= 0){
        prev = text.charAt(i);
        if(prev.match(/\p{Joining_Type=Transparent}/u) i--; // skip this character
        else break;
    }
    
    
    const zwj = /\p{Joining_Type=Dual_Joining}/u // ERROR: "Unknown Unicode property name."
      .test(prev)
      ? "\u200d"
      : "";
    textNode.replaceData(
      viewIndex,
      text.length - viewIndex,
      !nodeContext.breakWord
        ? zwj + resolveHyphenateCharacter(nodeContext)
        : "",
    );
    if (zwj) {
      const p = nodeContext.preprocessedTextContent[0][1] as string;
      nodeContext.preprocessedTextContent[0][1] =
        p.slice(0, viewIndex + 1) + zwj + p.slice(viewIndex + 1);
    }

@MoamenAbdelsattar
Copy link
Contributor Author

Could https://github.com/foliojs/codepoints Help?

@MoamenAbdelsattar
Copy link
Contributor Author

Hi, I've made https://github.com/MoamenAbdelsattar/ArabicShapingJS as a library as it can be a useful component of other projects. Now the code can be as following:

    let i = viewIndex - 1;
    let prev;
    while(i >= 0){
        prev = text.charAt(i);
        if(joiningType[prev] == "T") i--; // skip this character
        else break;
    }
    
    
    const zwj = joiningType[prev] == "D"
      ? "\u200d"
      : "";
    textNode.replaceData(
      viewIndex,
      text.length - viewIndex,
      !nodeContext.breakWord
        ? zwj + resolveHyphenateCharacter(nodeContext)
        : "",
    );
    if (zwj) {
      const p = nodeContext.preprocessedTextContent[0][1] as string;
      nodeContext.preprocessedTextContent[0][1] =
        p.slice(0, viewIndex + 1) + zwj + p.slice(viewIndex + 1);
    }

This code works as expected for me.

@MurakamiShinyu
Copy link
Member

The following code may be more compact,
using character range generated with
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BJoining_Type%3DD%7D&abb=on&esc=on&g=&i=
instead of adding a library.

    const checkJoiningTypeD =
      // /\p{Joining_Type=D}\p{Mn}*$/
      /[\u0620\u0626\u0628\u062A-\u062E\u0633-\u063F\u0641-\u0647\u0649\u064A\u066E\u066F\u0678-\u0687\u069A-\u06BF\u06C1\u06C2\u06CC\u06CE\u06D0\u06D1\u06FA-\u06FC\u06FF\u0712-\u0714\u071A-\u071D\u071F-\u0727\u0729\u072B\u072D\u072E\u074E-\u0758\u075C-\u076A\u076D-\u0770\u0772\u0775-\u0777\u077A-\u077F\u07CA-\u07EA\u0841-\u0845\u0848\u084A-\u0853\u0855\u0860\u0862-\u0865\u0868\u0886\u0889-\u088D\u08A0-\u08A9\u08AF\u08B0\u08B3-\u08B8\u08BA-\u08C8\u1807\u1820-\u1878\u1887-\u18A8\u18AA\uA840-\uA871\u{10AC0}-\u{10AC4}\u{10AD3}-\u{10AD6}\u{10AD8}-\u{10ADC}\u{10ADE}-\u{10AE0}\u{10AEB}-\u{10AEE}\u{10B80}\u{10B82}\u{10B86}-\u{10B88}\u{10B8A}\u{10B8B}\u{10B8D}\u{10B90}\u{10BAD}\u{10BAE}\u{10D01}-\u{10D21}\u{10D23}\u{10EC3}\u{10EC4}\u{10F30}-\u{10F32}\u{10F34}-\u{10F44}\u{10F51}-\u{10F53}\u{10F70}-\u{10F73}\u{10F76}-\u{10F81}\u{10FB0}\u{10FB2}\u{10FB3}\u{10FB8}\u{10FBB}\u{10FBC}\u{10FBE}\u{10FBF}\u{10FC1}\u{10FC4}\u{10FCA}\u{1E900}-\u{1E943}]\p{Mn}*$/u;
    const zwj = checkJoiningTypeD.test(text.slice(0, viewIndex))
      ? "\u200D"
      : "";
    textNode.replaceData(
      viewIndex,
      text.length - viewIndex,
      !nodeContext.breakWord
        ? zwj + resolveHyphenateCharacter(nodeContext)
        : "",
    );
    if (zwj) {
      const p = nodeContext.preprocessedTextContent[0][1] as string;
      nodeContext.preprocessedTextContent[0][1] =
        p.slice(0, viewIndex + 1) + zwj + p.slice(viewIndex + 1);
    }

@MoamenAbdelsattar
Copy link
Contributor Author

Great! There is a final touch, I recently realized that zero-width joiner and Arabic Tatweel belong to \p{Joining_Type=C}, and they are dual joined, so we need to include these additional code points [\u0640\u07FA\u0883-\u0885\u180A\u200D], from https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BJoining_Type%3DC%7D&abb=on&esc=on&g=&i=

@MurakamiShinyu
Copy link
Member

Copy link
Member

@MurakamiShinyu MurakamiShinyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@MurakamiShinyu MurakamiShinyu merged commit 7553e21 into vivliostyle:master Nov 8, 2024
4 checks passed
@MoamenAbdelsattar
Copy link
Contributor Author

Thanks! It's now working as expected, I think, however, its performance can be improved for scalability, I mean this long regex pattern can take a lot of time to test, a prebuilt hashmap or a dictionary mapping a character to its joining type might be faster, but I'm not sure how much is the influence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants