-
-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Arabic words look separated when broken at soft hyphen #1415
Conversation
This change fixes the issue I described in vivliostyle#1414 It works, but I'm not sure whether it's clean code 😅 sorry.
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
@MoamenAbdelsattar Thank you for your contribution! I checked your code and found some problems. vivliostyle.js/packages/core/src/vivliostyle/layout.ts Lines 3913 to 3919 in f802e59
First, the build failed as shown in https://github.com/vivliostyle/vivliostyle.js/actions/runs/11707366253/job/32629434013:
This error can be fixed by adding You seem to have added a zero-width joiner character I made a fixed version as follows: breakAfterSoftHyphen(
textNode: Text,
text: string,
viewIndex: number,
nodeContext: Vtree.NodeContext,
): number {
// convert trailing soft hyphen to a real hyphen
// if the previous character is a shaping character, such as Arabic,
// insert a zero-width joiner before the hyphen.
const prev = text.charCodeAt(viewIndex - 1);
const zwj = prev >= 0x600 && prev <= 0x7af ? "\u200d" : "";
textNode.replaceData(
viewIndex,
text.length - viewIndex,
!nodeContext.breakWord
? zwj + resolveHyphenateCharacter(nodeContext)
: "",
);
return viewIndex + 1;
} I tested this fixed version with some test text with Arabic letters with soft hyphens (uyghur-shy.html), and it worked as expected. Also it does not affect other languages that do not require using zero-width joiner (english-shy.html). One more request: Please prefix the commit message with "fix: " when fixing a bug. |
Hi @MurakamiShinyu The critical Zero width joiner must be added both before and after the break. The separated form of letter before the break is not critical and can be fixed by modifying Regarding languages that don't require zero-width joiner I think zero-width joiner will not affect shaping, so I was thinking that as an optimization this step could be ignored. |
Please note: some Arabic letters are not connected to the letter after them, these are ا أ إ آ ر ز د ذ و ء, if a soft hyphen is present after them: the first letter in the broken line should not be in the connected form. In my case I wasn't inserting any soft-hyphen after them, so I forgot to take them into account. |
Unicode provides a property that tells whether a character is Joined or not, However javascript seems to not support reading this property from regex. Zero-width joiner must be added both before and after the break, but the ZWJ after the break must be only added if the last letter (and not diacritical mark) before the break has |
If const prev = text.charAt(viewIndex - 1);
const zwj = /\p{Joining_Type=Dual_Joining}/u // ERROR: "Unknown Unicode property name."
.test(prev)
? "\u200d"
: "";
textNode.replaceData(
viewIndex,
text.length - viewIndex,
!nodeContext.breakWord
? zwj + resolveHyphenateCharacter(nodeContext)
: "",
);
if (zwj) {
const p = nodeContext.preprocessedTextContent[0][1] as string;
nodeContext.preprocessedTextContent[0][1] =
p.slice(0, viewIndex + 1) + zwj + p.slice(viewIndex + 1);
} |
Please note: the previous character could be a diacritical mark, which doesn't affect joining ( let i = viewIndex - 1;
let prev;
while(i >= 0){
prev = text.charAt(i);
if(prev.match(/\p{Joining_Type=Transparent}/u) i--; // skip this character
else break;
}
const zwj = /\p{Joining_Type=Dual_Joining}/u // ERROR: "Unknown Unicode property name."
.test(prev)
? "\u200d"
: "";
textNode.replaceData(
viewIndex,
text.length - viewIndex,
!nodeContext.breakWord
? zwj + resolveHyphenateCharacter(nodeContext)
: "",
);
if (zwj) {
const p = nodeContext.preprocessedTextContent[0][1] as string;
nodeContext.preprocessedTextContent[0][1] =
p.slice(0, viewIndex + 1) + zwj + p.slice(viewIndex + 1);
} |
Could https://github.com/foliojs/codepoints Help? |
Hi, I've made https://github.com/MoamenAbdelsattar/ArabicShapingJS as a library as it can be a useful component of other projects. Now the code can be as following: let i = viewIndex - 1;
let prev;
while(i >= 0){
prev = text.charAt(i);
if(joiningType[prev] == "T") i--; // skip this character
else break;
}
const zwj = joiningType[prev] == "D"
? "\u200d"
: "";
textNode.replaceData(
viewIndex,
text.length - viewIndex,
!nodeContext.breakWord
? zwj + resolveHyphenateCharacter(nodeContext)
: "",
);
if (zwj) {
const p = nodeContext.preprocessedTextContent[0][1] as string;
nodeContext.preprocessedTextContent[0][1] =
p.slice(0, viewIndex + 1) + zwj + p.slice(viewIndex + 1);
} This code works as expected for me. |
The following code may be more compact, const checkJoiningTypeD =
// /\p{Joining_Type=D}\p{Mn}*$/
/[\u0620\u0626\u0628\u062A-\u062E\u0633-\u063F\u0641-\u0647\u0649\u064A\u066E\u066F\u0678-\u0687\u069A-\u06BF\u06C1\u06C2\u06CC\u06CE\u06D0\u06D1\u06FA-\u06FC\u06FF\u0712-\u0714\u071A-\u071D\u071F-\u0727\u0729\u072B\u072D\u072E\u074E-\u0758\u075C-\u076A\u076D-\u0770\u0772\u0775-\u0777\u077A-\u077F\u07CA-\u07EA\u0841-\u0845\u0848\u084A-\u0853\u0855\u0860\u0862-\u0865\u0868\u0886\u0889-\u088D\u08A0-\u08A9\u08AF\u08B0\u08B3-\u08B8\u08BA-\u08C8\u1807\u1820-\u1878\u1887-\u18A8\u18AA\uA840-\uA871\u{10AC0}-\u{10AC4}\u{10AD3}-\u{10AD6}\u{10AD8}-\u{10ADC}\u{10ADE}-\u{10AE0}\u{10AEB}-\u{10AEE}\u{10B80}\u{10B82}\u{10B86}-\u{10B88}\u{10B8A}\u{10B8B}\u{10B8D}\u{10B90}\u{10BAD}\u{10BAE}\u{10D01}-\u{10D21}\u{10D23}\u{10EC3}\u{10EC4}\u{10F30}-\u{10F32}\u{10F34}-\u{10F44}\u{10F51}-\u{10F53}\u{10F70}-\u{10F73}\u{10F76}-\u{10F81}\u{10FB0}\u{10FB2}\u{10FB3}\u{10FB8}\u{10FBB}\u{10FBC}\u{10FBE}\u{10FBF}\u{10FC1}\u{10FC4}\u{10FCA}\u{1E900}-\u{1E943}]\p{Mn}*$/u;
const zwj = checkJoiningTypeD.test(text.slice(0, viewIndex))
? "\u200D"
: "";
textNode.replaceData(
viewIndex,
text.length - viewIndex,
!nodeContext.breakWord
? zwj + resolveHyphenateCharacter(nodeContext)
: "",
);
if (zwj) {
const p = nodeContext.preprocessedTextContent[0][1] as string;
nodeContext.preprocessedTextContent[0][1] =
p.slice(0, viewIndex + 1) + zwj + p.slice(viewIndex + 1);
} |
Great! There is a final touch, I recently realized that zero-width joiner and Arabic Tatweel belong to |
@MoamenAbdelsattar Hi, I added a commit 03bc5ea, adding |
Disable ESLint "no-misleading-character-class" check: https://eslint.org/docs/latest/rules/no-misleading-character-class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Thanks! It's now working as expected, I think, however, its performance can be improved for scalability, I mean this long regex pattern can take a lot of time to test, a prebuilt hashmap or a dictionary mapping a character to its joining type might be faster, but I'm not sure how much is the influence. |
This change fixes the issue I described in #1414 It works, but I'm not sure whether it's clean code 😅 sorry.