Fix Arabic words look separated when broken at soft hyphen #1415

MoamenAbdelsattar · 2024-11-06T16:04:35Z

This change fixes the issue I described in #1414 It works, but I'm not sure whether it's clean code 😅 sorry.

This change fixes the issue I described in vivliostyle#1414 It works, but I'm not sure whether it's clean code 😅 sorry.

vercel · 2024-11-06T16:04:40Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
vivliostyle	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Nov 8, 2024 3:05am

CLAassistant · 2024-11-06T16:04:43Z

All committers have signed the CLA.

MurakamiShinyu · 2024-11-07T05:36:29Z

@MoamenAbdelsattar Thank you for your contribution!

I checked your code and found some problems.

vivliostyle.js/packages/core/src/vivliostyle/layout.ts

Lines 3913 to 3919 in f802e59

    
           textNode.replaceData( 
        
             viewIndex, 
        
             text.length - viewIndex, 
        
             !nodeContext.breakWord ? resolveHyphenateCharacter(nodeContext) : "", 
        
           ); 
        
           let p = nodeContext.preprocessedTextContent[0][1]; 
        
           nodeContext.preprocessedTextContent[0][1] = p.slice(0, viewIndex + 1) + "\u200d" + p.slice(viewIndex + 1); // insert zero-width joiner

First, the build failed as shown in https://github.com/vivliostyle/vivliostyle.js/actions/runs/11707366253/job/32629434013:

Error: /home/runner/work/vivliostyle.js/vivliostyle.js/packages/core/src/vivliostyle/layout.ts(3919,51): semantic error TS2339: Property 'slice' does not exist on type 'string | number'.

This error can be fixed by adding as string to let p = nodeContext.preprocessedTextContent[0][1], and the build will pass. However, when testing the code, I could not see any effect of the change. The Arabic words still look separated when broken at soft hyphen.

You seem to have added a zero-width joiner character \u200d after the soft hyphen character in nodeContext.preprocessedTextContent[0][1]. I think this is wrong.

I made a fixed version as follows:

  breakAfterSoftHyphen(
    textNode: Text,
    text: string,
    viewIndex: number,
    nodeContext: Vtree.NodeContext,
  ): number {
    // convert trailing soft hyphen to a real hyphen

    // if the previous character is a shaping character, such as Arabic,
    // insert a zero-width joiner before the hyphen.
    const prev = text.charCodeAt(viewIndex - 1);
    const zwj = prev >= 0x600 && prev <= 0x7af ? "\u200d" : "";
    textNode.replaceData(
      viewIndex,
      text.length - viewIndex,
      !nodeContext.breakWord
        ? zwj + resolveHyphenateCharacter(nodeContext)
        : "",
    );
    return viewIndex + 1;
  }

I tested this fixed version with some test text with Arabic letters with soft hyphens (uyghur-shy.html), and it worked as expected. Also it does not affect other languages that do not require using zero-width joiner (english-shy.html).

One more request: Please prefix the commit message with "fix: " when fixing a bug.

MoamenAbdelsattar · 2024-11-07T06:33:24Z

Hi @MurakamiShinyu
Thanks for the notes. Sorry for the typescript error, I tested my change on the built browserified javascript version and it worked for me.

The critical Zero width joiner must be added both before and after the break. The separated form of letter before the break is not critical and can be fixed by modifying hyphenate-character and adding zero-width joiner to it. The critical problem is the line after the break. The first letter in the line after the break must look connected to the letter before it.

Regarding languages that don't require zero-width joiner I think zero-width joiner will not affect shaping, so I was thinking that as an optimization this step could be ignored.

MoamenAbdelsattar · 2024-11-07T06:53:03Z

Please note: some Arabic letters are not connected to the letter after them, these are ا أ إ آ ر ز د ذ و ء, if a soft hyphen is present after them: the first letter in the broken line should not be in the connected form. In my case I wasn't inserting any soft-hyphen after them, so I forgot to take them into account.

MoamenAbdelsattar · 2024-11-07T07:42:40Z

Unicode provides a property that tells whether a character is Joined or not, However javascript seems to not support reading this property from regex.
I tried "م".match(/\p{Joining_Type=Dual_Joining}/u) It gives: SyntaxError: Invalid regular expression: /\p{Joining_Type=Dual_Joining}/u: Invalid property name.

Zero-width joiner must be added both before and after the break, but the ZWJ after the break must be only added if the last letter (and not diacritical mark) before the break has Joining_Type=Dual_Joining

MurakamiShinyu · 2024-11-07T07:55:28Z

If /\p{Joining_Type=Dual_Joining}/u works the following code would be fine, but not, unfortunately. The Unicode properties available in ECMAScript are limited: https://tc39.es/ecma262/multipage/text-processing.html#table-nonbinary-unicode-properties

    const prev = text.charAt(viewIndex - 1);
    const zwj = /\p{Joining_Type=Dual_Joining}/u // ERROR: "Unknown Unicode property name."
      .test(prev)
      ? "\u200d"
      : "";
    textNode.replaceData(
      viewIndex,
      text.length - viewIndex,
      !nodeContext.breakWord
        ? zwj + resolveHyphenateCharacter(nodeContext)
        : "",
    );
    if (zwj) {
      const p = nodeContext.preprocessedTextContent[0][1] as string;
      nodeContext.preprocessedTextContent[0][1] =
        p.slice(0, viewIndex + 1) + zwj + p.slice(viewIndex + 1);
    }

MoamenAbdelsattar · 2024-11-07T08:07:26Z

Please note: the previous character could be a diacritical mark, which doesn't affect joining (Joining_Type=Transparent). If the property is supported, the code must be as following:

    let i = viewIndex - 1;
    let prev;
    while(i >= 0){
        prev = text.charAt(i);
        if(prev.match(/\p{Joining_Type=Transparent}/u) i--; // skip this character
        else break;
    }
    
    
    const zwj = /\p{Joining_Type=Dual_Joining}/u // ERROR: "Unknown Unicode property name."
      .test(prev)
      ? "\u200d"
      : "";
    textNode.replaceData(
      viewIndex,
      text.length - viewIndex,
      !nodeContext.breakWord
        ? zwj + resolveHyphenateCharacter(nodeContext)
        : "",
    );
    if (zwj) {
      const p = nodeContext.preprocessedTextContent[0][1] as string;
      nodeContext.preprocessedTextContent[0][1] =
        p.slice(0, viewIndex + 1) + zwj + p.slice(viewIndex + 1);
    }

MoamenAbdelsattar · 2024-11-07T08:11:01Z

Could https://github.com/foliojs/codepoints Help?

MoamenAbdelsattar · 2024-11-07T09:37:32Z

Hi, I've made https://github.com/MoamenAbdelsattar/ArabicShapingJS as a library as it can be a useful component of other projects. Now the code can be as following:

    let i = viewIndex - 1;
    let prev;
    while(i >= 0){
        prev = text.charAt(i);
        if(joiningType[prev] == "T") i--; // skip this character
        else break;
    }
    
    
    const zwj = joiningType[prev] == "D"
      ? "\u200d"
      : "";
    textNode.replaceData(
      viewIndex,
      text.length - viewIndex,
      !nodeContext.breakWord
        ? zwj + resolveHyphenateCharacter(nodeContext)
        : "",
    );
    if (zwj) {
      const p = nodeContext.preprocessedTextContent[0][1] as string;
      nodeContext.preprocessedTextContent[0][1] =
        p.slice(0, viewIndex + 1) + zwj + p.slice(viewIndex + 1);
    }

This code works as expected for me.

MurakamiShinyu · 2024-11-07T14:16:35Z

The following code may be more compact,
using character range generated with
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BJoining_Type%3DD%7D&abb=on&esc=on&g=&i=
instead of adding a library.

    const checkJoiningTypeD =
      // /\p{Joining_Type=D}\p{Mn}*$/
      /[\u0620\u0626\u0628\u062A-\u062E\u0633-\u063F\u0641-\u0647\u0649\u064A\u066E\u066F\u0678-\u0687\u069A-\u06BF\u06C1\u06C2\u06CC\u06CE\u06D0\u06D1\u06FA-\u06FC\u06FF\u0712-\u0714\u071A-\u071D\u071F-\u0727\u0729\u072B\u072D\u072E\u074E-\u0758\u075C-\u076A\u076D-\u0770\u0772\u0775-\u0777\u077A-\u077F\u07CA-\u07EA\u0841-\u0845\u0848\u084A-\u0853\u0855\u0860\u0862-\u0865\u0868\u0886\u0889-\u088D\u08A0-\u08A9\u08AF\u08B0\u08B3-\u08B8\u08BA-\u08C8\u1807\u1820-\u1878\u1887-\u18A8\u18AA\uA840-\uA871\u{10AC0}-\u{10AC4}\u{10AD3}-\u{10AD6}\u{10AD8}-\u{10ADC}\u{10ADE}-\u{10AE0}\u{10AEB}-\u{10AEE}\u{10B80}\u{10B82}\u{10B86}-\u{10B88}\u{10B8A}\u{10B8B}\u{10B8D}\u{10B90}\u{10BAD}\u{10BAE}\u{10D01}-\u{10D21}\u{10D23}\u{10EC3}\u{10EC4}\u{10F30}-\u{10F32}\u{10F34}-\u{10F44}\u{10F51}-\u{10F53}\u{10F70}-\u{10F73}\u{10F76}-\u{10F81}\u{10FB0}\u{10FB2}\u{10FB3}\u{10FB8}\u{10FBB}\u{10FBC}\u{10FBE}\u{10FBF}\u{10FC1}\u{10FC4}\u{10FCA}\u{1E900}-\u{1E943}]\p{Mn}*$/u;
    const zwj = checkJoiningTypeD.test(text.slice(0, viewIndex))
      ? "\u200D"
      : "";
    textNode.replaceData(
      viewIndex,
      text.length - viewIndex,
      !nodeContext.breakWord
        ? zwj + resolveHyphenateCharacter(nodeContext)
        : "",
    );
    if (zwj) {
      const p = nodeContext.preprocessedTextContent[0][1] as string;
      nodeContext.preprocessedTextContent[0][1] =
        p.slice(0, viewIndex + 1) + zwj + p.slice(viewIndex + 1);
    }

MoamenAbdelsattar · 2024-11-07T14:51:44Z

Great! There is a final touch, I recently realized that zero-width joiner and Arabic Tatweel belong to \p{Joining_Type=C}, and they are dual joined, so we need to include these additional code points [\u0640\u07FA\u0883-\u0885\u180A\u200D], from https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BJoining_Type%3DC%7D&abb=on&esc=on&g=&i=

…t hyphen - fix vivliostyle#1414

MurakamiShinyu · 2024-11-08T02:20:08Z

@MoamenAbdelsattar Hi, I added a commit 03bc5ea, adding \p{Joining_Type=C} https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5Cp%7BJoining_Type%3DD%7D%5Cp%7BJoining_Type%3DC%7D%5D&abb=on&esc=on&g=&i=. Please check this.

Disable ESLint "no-misleading-character-class" check: https://eslint.org/docs/latest/rules/no-misleading-character-class

MurakamiShinyu · 2024-11-08T14:03:58Z

Test

MurakamiShinyu

Thanks!

MoamenAbdelsattar · 2024-11-08T19:59:30Z

Thanks! It's now working as expected, I think, however, its performance can be improved for scalability, I mean this long regex pattern can take a lot of time to test, a prebuilt hashmap or a dictionary mapping a character to its joining type might be faster, but I'm not sure how much is the influence.

Fix Arabic words look separated when broken at soft hyphen.

f802e59

This change fixes the issue I described in vivliostyle#1414 It works, but I'm not sure whether it's clean code 😅 sorry.

MoamenAbdelsattar requested a review from MurakamiShinyu as a code owner November 6, 2024 16:04

vercel bot had a problem deploying to Preview November 6, 2024 16:16 Failure

fix: Make Arabic letters look connected before and after break at sof…

03bc5ea

…t hyphen - fix vivliostyle#1414

vercel bot deployed to Preview November 8, 2024 02:25 View deployment

prevent ESLint error no-misleading-character-class

907bc57

Disable ESLint "no-misleading-character-class" check: https://eslint.org/docs/latest/rules/no-misleading-character-class

vercel bot deployed to Preview November 8, 2024 03:05 View deployment

MurakamiShinyu approved these changes Nov 8, 2024

View reviewed changes

MurakamiShinyu merged commit 7553e21 into vivliostyle:master Nov 8, 2024
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Arabic words look separated when broken at soft hyphen #1415

Fix Arabic words look separated when broken at soft hyphen #1415

MoamenAbdelsattar commented Nov 6, 2024

vercel bot commented Nov 6, 2024 •

edited

Loading

CLAassistant commented Nov 6, 2024 •

edited

Loading

MurakamiShinyu commented Nov 7, 2024

MoamenAbdelsattar commented Nov 7, 2024 •

edited

Loading

MoamenAbdelsattar commented Nov 7, 2024

MoamenAbdelsattar commented Nov 7, 2024 •

edited

Loading

MurakamiShinyu commented Nov 7, 2024

MoamenAbdelsattar commented Nov 7, 2024 •

edited

Loading

MoamenAbdelsattar commented Nov 7, 2024

MoamenAbdelsattar commented Nov 7, 2024

MurakamiShinyu commented Nov 7, 2024

MoamenAbdelsattar commented Nov 7, 2024

MurakamiShinyu commented Nov 8, 2024

MurakamiShinyu commented Nov 8, 2024

MurakamiShinyu left a comment

MoamenAbdelsattar commented Nov 8, 2024

Fix Arabic words look separated when broken at soft hyphen #1415

Fix Arabic words look separated when broken at soft hyphen #1415

Conversation

MoamenAbdelsattar commented Nov 6, 2024

vercel bot commented Nov 6, 2024 • edited Loading

CLAassistant commented Nov 6, 2024 • edited Loading

MurakamiShinyu commented Nov 7, 2024

MoamenAbdelsattar commented Nov 7, 2024 • edited Loading

MoamenAbdelsattar commented Nov 7, 2024

MoamenAbdelsattar commented Nov 7, 2024 • edited Loading

MurakamiShinyu commented Nov 7, 2024

MoamenAbdelsattar commented Nov 7, 2024 • edited Loading

MoamenAbdelsattar commented Nov 7, 2024

MoamenAbdelsattar commented Nov 7, 2024

MurakamiShinyu commented Nov 7, 2024

MoamenAbdelsattar commented Nov 7, 2024

MurakamiShinyu commented Nov 8, 2024

MurakamiShinyu commented Nov 8, 2024

MurakamiShinyu left a comment

Choose a reason for hiding this comment

MoamenAbdelsattar commented Nov 8, 2024

vercel bot commented Nov 6, 2024 •

edited

Loading

CLAassistant commented Nov 6, 2024 •

edited

Loading

MoamenAbdelsattar commented Nov 7, 2024 •

edited

Loading

MoamenAbdelsattar commented Nov 7, 2024 •

edited

Loading

MoamenAbdelsattar commented Nov 7, 2024 •

edited

Loading