Surrogate characters not working #63

LokiMidgard · 2018-08-06T11:05:36Z

Reporting an Issue Here

Surrogate characters (characters that do not fit in 2 bytes) are not drawn correctly.

Expected Behavior

Drawing a string with surrogate characters (e.g. 🅐) should draw the correct glyph.

Actual Behavior

Two unrecognizable characters are printed. The surrogate pair is interpreted as two separate characters.

Steps to Reproduce the Behavior

You can reproduce this with the minimal sample repository

LokiMidgard · 2018-08-06T11:12:52Z

I was able to create a version of PDFsharp that supports surrogate characters. I needed to change some variables to uint and added support for the format 12 cmap. You can review the changes, and if you're interested I can create a pull request (I'd need to filter out some other changes I made to support .NET Core).

For the project I'm working on it caused no problems, but I can't say I tested it thoroughly with many fonts.

ThomasHoevel · 2018-08-06T15:46:58Z

Thanks for the feedback and your effort. Not much new code and the risk of breaking existing code should be minimal. I think a pull request won't be helpful as we use GitHub for distribution only.
Looks like a nice enhancement we should include with the next release. Thanks again.

LokiMidgard · 2018-08-06T18:45:57Z

I'm glad if this will help others.

LokiMidgard · 2018-08-07T08:06:26Z

If your text has surrogate characters but the font doesn't have a format 12 table, my code currently throws an exception. The previous behavior was printing two wrong characters.

myobis · 2020-12-15T17:01:44Z

Hello 😃,
From what I see in the latest version (from GitHub), the above fix was not merged.
Is there a specific reason for this (considering that @ThomasHoevel agreed to include it)?

chrfin · 2021-08-24T12:11:59Z

Any update on this?
Is there a way to use @LokiMidgard's version in combination with MigraDocXML?

TheRealSourceSeeker · 2022-05-29T12:20:46Z

@LokiMidgard Thanks a lot for your additions! I applied them to PdfSharp source code version 1.50 beta 5, built the project in Visual Studio Community to get the DLL and added it from "src\PdfSharp\bin\Debug\PdfSharp.dll" to my project. To get emojis working in PDF, I had to use the font "Segoe UI Emoji" (I'm on Windows 10). This works fine, but two problems arise:

1. Emojis are shown in black/white without color. Are color tables (COLR/CPAL) not supported?
2. If there is more than one emoji in an XGraphics.DrawString call, only the first one gets printed to the PDF; the following ones are just blank chars!

Example:

System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);
PdfDocument document = new PdfDocument();
PdfPage page = document.AddPage();
XGraphics gfx = XGraphics.FromPdfPage(page);
XPdfFontOptions options = new XPdfFontOptions(PdfFontEncoding.Unicode, PdfFontEmbedding.Always);
XFont font = new XFont("Segoe UI Emoji", 12, XFontStyle.Regular, options);
gfx.DrawString("111😢😞💪", font, XBrushes.Black, new XRect(0, 0, page.Width, page.Height), XStringFormats.Center);
document.Save("C:\\test.pdf");

I tried to debug, but I do not understand what RawUnicodeEncoding is. For the above string it is "<00140014001425F503B20662>". The numeral 1 corresponds to "0014", but I have no clue why: the decimal ASCII value of 1 is 49, and counting glyphs manually in the font "Segoe UI Emoji", glyph 1 should be at position 19. What does "0014" represent? It is neither the decimal ASCII value nor the glyph position in the font.

LokiMidgard · 2022-05-29T19:27:36Z

It has been too long since I worked on this…

1. For color: no idea.
2. No idea either…

My assumption would have been that 0014 is the index of the glyph in the font. Maybe there is an offset.
Did you look at different chars? Maybe they are all off by the same amount.

Did you test whether the same emoji results in the same number the second time?

so 1😢1😢1 should be 001425F5001425F50014

TheRealSourceSeeker · 2022-06-12T14:40:06Z

Finally made progress!

Yes @LokiMidgard, the same emoji results in the same number the second time. The following is true:

so 1😢1😢1 should be 001425F5001425F50014

0014 is indeed the glyph index of glyph "1" in the font "Segoe UI Emoji". Counting a font's glyph positions manually was a bad idea of mine. A proper way to find the correct glyph indices of a font is to use the freeware FontForge: choose its menu "View/Goto" and enter "glyph" followed by the glyph position you want to check. In this case I entered "glyph20", confirmed with "OK", and glyph "1" got selected. That means glyph "1" is at position 20, and 0014 is simply 20 written in hexadecimal. So how does the string end up as 0014? gfx.DrawString("1... ends up in the DrawString method of the class PdfSharp.Drawing.Pdf.XGraphicsPdfRenderer, where each char of the string is processed: char '1' (decimal 49) is mapped to its glyph id, uint 20, which is then cast to (char)20, i.e. the Unicode char '\u0014'. All chars of the original string passed to gfx.DrawString are converted this way and appended to a StringBuilder.
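As a rough illustration of the mapping described above, the hex string seen in the debugger can be rebuilt from a list of glyph ids. This is only a sketch: GlyphHexSketch and ToHexString are hypothetical names, not PDFsharp code, and the ids 946 and 1634 are just the decimal readings of the hex groups 03B2 and 0662 from the string quoted earlier.

```csharp
using System.Text;

public static class GlyphHexSketch
{
    // Formats glyph ids as a hex string in angle brackets,
    // four uppercase hex digits per glyph id (ids above 0xFFFF would need more digits).
    public static string ToHexString(uint[] glyphIds)
    {
        StringBuilder sb = new StringBuilder("<");
        foreach (uint id in glyphIds)
            sb.Append(id.ToString("X4"));
        sb.Append('>');
        return sb.ToString();
    }
}
```

Calling GlyphHexSketch.ToHexString(new uint[] { 20, 20, 20, 9717, 946, 1634 }) gives "<00140014001425F503B20662>", which matches the value reported for the string "111😢😞💪".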
Emoji like the "Crying Face" 😢 (U+1F622) lie outside the Basic Multilingual Plane, so they do not fit into a single UTF-16 code unit. This is where surrogate pairs come into play: UTF-16 represents such a code point as two 16-bit code units. C# chars are 16 bit (2^16 = 65,536 values, U+0000 to U+FFFF) and therefore cannot hold a code point above U+FFFF on their own; Unicode code points range up to U+10FFFF. The emoji above, U+1F622, becomes the surrogate pair U+D83D U+DE22, so the "Crying Face" is not 1 but 2 chars long. The code from @LokiMidgard detects this and converts both chars, the one in the "high surrogate" range U+D800 to U+DBFF and the one in the "low surrogate" range U+DC00 to U+DFFF, to the correct glyph id, int 9717. That is then cast to (char)9717, i.e. char '\u25F5' (which happens to be WHITE CIRCLE WITH LOWER LEFT QUADRANT; the char only carries the glyph id here), and finally appended to the above-mentioned StringBuilder.
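The surrogate arithmetic can be checked with a few lines of C#. This is a minimal sketch, not the code from the patched PDFsharp; SurrogateSketch and its method names are made up for illustration, while char.ConvertToUtf32 is the standard .NET call.

```csharp
using System;

public static class SurrogateSketch
{
    // Splits a supplementary code point (U+10000..U+10FFFF) into its
    // UTF-16 high and low surrogate.
    public static char[] ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");
        int offset = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char)(0xD800 + (offset >> 10));    // upper 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));   // lower 10 bits
        return new[] { high, low };
    }

    // Recombines a surrogate pair into one code point using the built-in helper.
    public static int ToCodePoint(char high, char low)
    {
        return char.ConvertToUtf32(high, low);
    }
}
```

SurrogateSketch.ToSurrogatePair(0x1F622) yields the chars '\uD83D' and '\uDE22', and SurrogateSketch.ToCodePoint('\uD83D', '\uDE22') returns 0x1F622 again.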
Now for the progress: I was able to solve the second problem by adjusting the class PdfSharp.Fonts.CMapInfo to support writing multiple different emojis to PDF. The problem occurs in that class's AddChars method, at the line with the condition if (!CharacterToGlyphIndex.ContainsKey(ch)). When an emoji is processed, the char ch only ever contains the high surrogate, never the low surrogate. Is this problematic? Well, just compare the surrogate pairs of some emojis like the "Crying Face" 😢 (U+D83D U+DE22) and the "Disappointed Face" 😞 (U+D83D U+DE1E): their high surrogates are identical! As a result, further emojis with the same high surrogate are skipped and get no glyph id. To solve this, the key of the public Dictionary<char, uint> CharacterToGlyphIndex should be changed to a char array (?) or, as a cleaner solution, the Tuple type, but that requires at least .NET Framework 4.0 while the current PDFsharp is coded for .NET Framework 2.0. As a side effect, the class PdfSharp.Pdf.Advanced.PdfToUnicodeMap would also have to be adjusted, as it uses said Dictionary. Due to my limited knowledge I didn't go that way, but added a little workaround. I will add a comment to the code you kindly shared, @LokiMidgard, but for the sake of completeness I will briefly describe all my adjustments:
Changed classes: "PdfSharp.Fonts.CMapInfo.cs"

Under public void AddChars(string text), the condition if (!CharacterToGlyphIndex.ContainsKey(ch)) was changed to:

if (!CharacterToGlyphIndex.ContainsKey(ch) || char.IsHighSurrogate(ch))

Inside that block, within the nested condition if (char.IsHighSurrogate(ch)), the following was added before glyphIndex is assigned:

// If high surrogate char hasn't been added yet, add high and low surrogate chars.
// A collection initializer is used because new List<char>(c) would select the capacity overload and leave the list empty:
if (!SurrogatePairs.ContainsKey(ch))
    SurrogatePairs.Add(ch, new List<char> { text[idx + 1] });
// If high surrogate char has been added and low surrogate char hasn't been added yet, add low surrogate char:
else if (!SurrogatePairs[ch].Contains(text[idx + 1]))
    SurrogatePairs[ch].Add(text[idx + 1]);
// If high and low surrogate chars have been added, continue with next loop:
else
    continue;

After both glyphIndex assignments, the line CharacterToGlyphIndex.Add(ch, glyphIndex); was changed to:

if (!CharacterToGlyphIndex.ContainsKey(ch)) // To do (for support of reading PDF?): Surrogate pair chars with same high surrogate chars and different low surrogate chars are missing in "CharacterToGlyphIndex"!
    CharacterToGlyphIndex.Add(ch, glyphIndex);

At the very bottom, where the two public chars and the Dictionaries are declared, the following Dictionary was added:

private Dictionary<char, List<char>> SurrogatePairs = new Dictionary<char, List<char>>();
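One possible alternative to the workaround, sketched here under assumptions and not taken from PDFsharp or from the shared branch, is to key the map by the full code point (an int) instead of by char. That avoids the high-surrogate collision without needing Tuple. CodePointMapSketch and DistinctCodePoints are hypothetical names; the sketch only shows the decoding loop, not the glyph lookup.

```csharp
using System.Collections.Generic;

public static class CodePointMapSketch
{
    // Returns the distinct Unicode code points of a string in order of first appearance.
    // A valid surrogate pair is combined into a single int, so two emojis that share
    // a high surrogate still produce two different keys.
    public static List<int> DistinctCodePoints(string text)
    {
        List<int> result = new List<int>();
        Dictionary<int, bool> seen = new Dictionary<int, bool>();
        for (int idx = 0; idx < text.Length; idx++)
        {
            int codePoint;
            if (char.IsHighSurrogate(text[idx]) && idx + 1 < text.Length
                && char.IsLowSurrogate(text[idx + 1]))
            {
                codePoint = char.ConvertToUtf32(text[idx], text[idx + 1]);
                idx++; // the low surrogate has been consumed as part of the pair
            }
            else
            {
                codePoint = text[idx]; // BMP char, or an unpaired surrogate kept as-is
            }
            if (!seen.ContainsKey(codePoint))
            {
                seen.Add(codePoint, true);
                result.Add(codePoint);
            }
        }
        return result;
    }
}
```

For the string "111😢😞💪" this yields four entries (0x31, 0x1F622, 0x1F61E, 0x1F4AA), each of which could then serve as a key in something like a Dictionary<int, uint>. As noted above, any class that reads the existing char-keyed dictionary would still need adjusting.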

LokiMidgard · 2022-06-13T09:20:15Z

@TheRealSourceSeeker Thank you for your work. I hope that in the foreseeable future I get to a point where I will need this code. Since I already updated my branch for .NET 6 (I think), Tuple should not be a problem :)

TheRealSourceSeeker · 2022-06-17T00:42:53Z

@LokiMidgard Could you report back your result for printing the string "1♥️1" to PDF? (The Unicode char ♥ lies in the Basic Multilingual Plane and fits in a single UTF-16 char, so it doesn't depend on support for surrogate pairs!)

System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);
PdfDocument document = new PdfDocument();
PdfPage page = document.AddPage();
XGraphics gfx = XGraphics.FromPdfPage(page);
XPdfFontOptions options = new XPdfFontOptions(PdfFontEncoding.Unicode, PdfFontEmbedding.Always);
XFont font = new XFont("Segoe UI Emoji", 12, XFontStyle.Regular, options);
gfx.DrawString("1♥️1", font, XBrushes.Black, new XRect(0, 0, page.Width, page.Height), XStringFormats.Center);
document.Save("C:\\test.pdf");

Does your PDF show "1♥️1" or "1♥️ 1"? In my case, with the font "Segoe UI Emoji", I get the latter result. I already have another workaround that simply skips the "Variation Selector-16" (Unicode decimal: 65039, hexadecimal: FE0F), but I don't know if this could lead to bugs with other fonts. Adjustments for the workaround:
Changed classes: "PdfSharp.Drawing.Pdf.XGraphicsPdfRenderer"
Under public void DrawString(string s, XFont font, XBrush brush, XRect rect, XStringFormat format), under if (font.Unicode), under for (int idx = 0; idx < s.Length; idx++), below line char ch = s[idx]; add:

// Skip "Variation Selector-16" (Unicode decimal: 65039, hexadecimal: FE0F)
// as long as colored emojis aren't supported (only black/white "text presentation", no colored "emoji presentation"):
// Reason: Char "♥️" triggers 2 char matches, writing a visual heart and space to PDF:
// 1. "Black Heart Suit"      (hexadecimal: 2665, decimal: 9829)
// 2. "Variation Selector-16" (hexadecimal: FE0F, decimal: 65039)
if (ch == 65039)
    continue;
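If patching the renderer is not an option, the same idea could in principle be applied on the caller's side by removing the selector before the string is drawn. This is only a sketch under that assumption; VariationSelectorSketch and StripVariationSelector16 are hypothetical names, and the caveat about other fonts applies just as much.

```csharp
using System.Text;

public static class VariationSelectorSketch
{
    // "Variation Selector-16" (decimal 65039), which requests emoji presentation.
    private const char VariationSelector16 = '\uFE0F';

    // Returns a copy of the input with every U+FE0F removed; all other chars are kept.
    public static string StripVariationSelector16(string s)
    {
        StringBuilder sb = new StringBuilder(s.Length);
        foreach (char ch in s)
        {
            if (ch != VariationSelector16)
                sb.Append(ch);
        }
        return sb.ToString();
    }
}
```

The result of VariationSelectorSketch.StripVariationSelector16("1♥️1") contains only the three chars '1', '\u2665' and '1', and could then be passed to gfx.DrawString.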

ThomasHoevel added the enhancement label Aug 6, 2018

ThomasHoevel mentioned this issue Nov 7, 2019

Emoji doesn't render correctly in PDF empira/MigraDoc-1.5#29

Open
