Surrogate characters not working #63

Open
LokiMidgard opened this issue Aug 6, 2018 · 11 comments

Comments

@LokiMidgard

Reporting an Issue Here

Characters that need a surrogate pair (code points that do not fit in a single 16-bit UTF-16 code unit) are not drawn correctly.

Expected Behavior

Drawing string with surrogate characters (e.g. 🅐) should draw the correct glyph.

Actual Behavior

Two unrecognizable characters are printed. The surrogate pair is interpreted as two separate characters.

Steps to Reproduce the Behavior

You can reproduce this with the minimal sample repository
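
To make the problem concrete, here is a minimal sketch (not taken from PDFsharp; the class and method names are made up for illustration) that shows how a .NET string stores such a character as two `char` values and how the pair can be combined back into one code point:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class SurrogateDemo
{
    // Returns the Unicode code points of a UTF-16 string.
    // A valid high/low surrogate pair is combined into a single code point;
    // every other char (including a lone surrogate) is passed through as-is.
    public static int[] ToCodePoints(string text)
    {
        List<int> result = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                result.Add(char.ConvertToUtf32(text[i], text[i + 1]));
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                result.Add(text[i]);
            }
        }
        return result.ToArray();
    }
}
```

For example, `"1😢".Length` is 3, while `SurrogateDemo.ToCodePoints("1😢")` has two entries: 0x31 and 0x1F622. Code that walks the string char by char sees the two halves of the pair as separate characters, which matches the behavior described above.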

@LokiMidgard
Author

I was able to create a version of PDFsharp that supports surrogate characters. I needed to change some variables to uint and added support for the format 12 cmap. You can review the changes, and if you're interested I can create a pull request (I would need to filter out some other changes I made to support .NET Core).

It caused no problems in the project I'm working on, but I can't say I tested it thoroughly with many fonts.

@ThomasHoevel
Member

Thanks for the feedback and your effort. Not much new code and the risk of breaking existing code should be minimal. I think a pull request won't be helpful as we use GitHub for distribution only.
Looks like a nice enhancement we should include with the next release. Thanks again.

@LokiMidgard
Author

I'm glad if this will help others.

@LokiMidgard
Author

If your text contains surrogate characters but the font doesn't have a format 12 cmap table, my code currently throws an exception. The previous behavior was to print two wrong characters.
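
For readers unfamiliar with the format 12 cmap: in the OpenType specification it is a sorted list of groups, each mapping a contiguous range of 32-bit character codes to consecutive glyph IDs. The following is a rough sketch of such a lookup, not the code from my branch; `CMapGroup` and `Format12Lookup` are hypothetical names, and the groups are assumed to be already parsed from the font. Returning `false` instead of throwing would let a caller fall back to another table.

```csharp
using System;

// Hypothetical representation of one format 12 group
// (startCharCode, endCharCode, startGlyphID in the OpenType spec).
public struct CMapGroup
{
    public uint StartCharCode;
    public uint EndCharCode;
    public uint StartGlyphId;

    public CMapGroup(uint startCharCode, uint endCharCode, uint startGlyphId)
    {
        StartCharCode = startCharCode;
        EndCharCode = endCharCode;
        StartGlyphId = startGlyphId;
    }
}

public static class Format12Lookup
{
    // Binary search over groups sorted by StartCharCode (non-overlapping).
    // Returns false and glyph 0 (.notdef) if the code point is not covered.
    public static bool TryGetGlyphIndex(CMapGroup[] groups, uint codePoint, out uint glyphIndex)
    {
        int lo = 0;
        int hi = groups.Length - 1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            CMapGroup g = groups[mid];
            if (codePoint < g.StartCharCode)
                hi = mid - 1;
            else if (codePoint > g.EndCharCode)
                lo = mid + 1;
            else
            {
                // Glyph IDs are consecutive within a group.
                glyphIndex = g.StartGlyphId + (codePoint - g.StartCharCode);
                return true;
            }
        }
        glyphIndex = 0;
        return false;
    }
}
```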

@myobis

myobis commented Dec 15, 2020

Hello 😃,
From what I see in the latest version (from GitHub), the above fix was not merged.
Is there a specific reason for this (considering that @ThomasHoevel agreed to include it)?

@chrfin

chrfin commented Aug 24, 2021

Any update on this?
Is there a way to use @LokiMidgard's version in combination with MigraDocXML?

@TheRealSourceSeeker

TheRealSourceSeeker commented May 29, 2022

@LokiMidgard Thanks a lot for your additions! I applied them to PdfSharp source code version 1.50 beta 5, built the project in Visual Studio Community to get the DLL and added it from "src\PdfSharp\bin\Debug\PdfSharp.dll" to my project. To get emojis working in PDF, I had to use the font "Segoe UI Emoji" (I'm on Windows 10). This works fine, but two problems arise:

  1. Emojis are shown in black/white without color. Are color tables (COLR/CPAL) not supported?
  2. If there is more than one emoji in a XGraphics.DrawString, only the first one gets printed to PDF, the following ones are just blank chars!

Example:

System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);
PdfDocument document = new PdfDocument();
PdfPage page = document.AddPage();
XGraphics gfx = XGraphics.FromPdfPage(page);
XPdfFontOptions options = new XPdfFontOptions(PdfFontEncoding.Unicode, PdfFontEmbedding.Always);
XFont font = new XFont("Segoe UI Emoji", 12, XFontStyle.Regular, options);
gfx.DrawString("111😢😞💪", font, XBrushes.Black, new XRect(0, 0, page.Width, page.Height), XStringFormats.Center);
document.Save("C:\\test.pdf");

I tried to debug, but I do not understand what RawUnicodeEncoding is. For the above string it's "<00140014001425F503B20662>". The numeral 1 corresponds to "0014", but I have no clue why. The decimal value for ASCII 1 is 49. Manually counting glyphs in the font "Segoe UI Emoji", glyph 1 should be at position 19. What does the "0014" represent? It's neither the decimal ASCII value nor the glyph position in the font.
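
As a reading aid, the string can be split into groups of four hexadecimal digits, assuming each entry is one 16-bit value (which fits the values quoted here). The helper below is a hypothetical sketch for inspecting such a string and is not part of PDFsharp:

```csharp
using System;

// Hypothetical debugging helper, for illustration only.
public static class RawHexString
{
    // Parses a string such as "<00140014>" into its 16-bit values.
    public static ushort[] Parse(string hex)
    {
        string digits = hex.Trim('<', '>');
        if (digits.Length % 4 != 0)
            throw new FormatException("Expected groups of four hex digits.");

        ushort[] values = new ushort[digits.Length / 4];
        for (int i = 0; i < values.Length; i++)
        {
            // Base 16: "0014" is 20 in decimal, not 14.
            values[i] = Convert.ToUInt16(digits.Substring(i * 4, 4), 16);
        }
        return values;
    }
}
```

Applied to the string above, this gives six values; the first three are 20 (hex 0014) and the fourth is 9717 (hex 25F5). Note that 0014 is hexadecimal, so it stands for 20 in decimal.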

@LokiMidgard
Author

It has been too long since I worked on this…

  1. for color no idea

  2. no idea either…

My assumption would have been that 0014 is the index of the glyph in the font. Maybe there is an offset.
Did you look at different chars? Maybe they are all off by the same amount.

Did you test whether the same emoji results in the same number the second time?

so 1😢1😢1 should be 001425F5001425F50014

@TheRealSourceSeeker

Finally made progress!

  • Yes @LokiMidgard , the same emoji results in the same number the second time. Following is true:

so 1😢1😢1 should be 001425F5001425F50014

  • 0014 is indeed the glyph index of glyph "1" in the font "Segoe UI Emoji". Counting a font's glyph positions manually was a bad idea of mine. A proper way to find the correct glyph indices of a font is to use the free tool FontForge. Choose its menu "View/Goto", enter "glyph" followed by the glyph position you want to check. In this case I entered "glyph20", confirmed with "OK" and glyph "1" got selected. That means glyph 1 is at position 20, so how does 0014 correspond to it? It is simply 20 written in hexadecimal. gfx.DrawString("1... calls the method "DrawString" of class "PdfSharp.Drawing.Pdf.XGraphicsPdfRenderer.cs", where the string "1" is converted to the char '1', i.e. (char)49 in decimal. Then char '1' is converted to uint 20, its glyph id. Then uint 20 is converted to (char)20, i.e. the Unicode char '\u0014'. All chars of the original string passed to gfx.DrawString("1... are converted that way and appended to a StringBuilder.
    Emoji like the "Crying Face" (U+1F622) 😢 lie outside the Basic Multilingual Plane, so they do not fit in a single UTF-16 code unit. This is where the surrogate pairs come into play: UTF-16 represents such a code point as two 16-bit code units. C# chars are 16 bits wide (2^16 = 65,536 values, ranging from U+0000 to U+FFFF), hence a single char isn't capable of holding a code point above U+FFFF, while Unicode code points range up to U+10FFFF. Compare the code point of the above emoji, U+1F622, which as a surrogate pair is U+D83D and U+DE22. So the "Crying Face" is not 1 but 2 chars long. The code from @LokiMidgard detects this and converts both chars, the one in the "high surrogate" range U+D800 to U+DBFF and the one in the "low surrogate" range U+DC00 to U+DFFF, to the correct glyph id 9717, which is then converted to (char)9717, i.e. the char '\u25F5' (which, read as a Unicode character, would be WHITE CIRCLE WITH LOWER LEFT QUADRANT), and finally appended to the above-mentioned StringBuilder.
  • Now for the progress: I was able to solve the second problem by adjusting class "PdfSharp.Fonts.CMapInfo.cs" to support writing multiple different emojis to PDF. The problem occurs in that class's method "AddChars", at the line with the condition if (!CharacterToGlyphIndex.ContainsKey(ch)). When an emoji is converted, the char ch only ever contains the "high surrogate", never the "low surrogate". Is this problematic? Well, just compare the surrogate pairs of some emojis like the "Crying Face" 😢 (U+D83D U+DE22) and the "Disappointed Face" 😞 (U+D83D U+DE1E). Their high surrogates are identical! As a result, every further emoji with the same "high surrogate" is skipped and gets no glyph id. To solve this, the key of the public Dictionary<char, uint> CharacterToGlyphIndex should be changed to a char array (?), or a cleaner solution would be the Tuple type, but that requires at least .NET Framework 4.0 while the current PDFsharp is coded for .NET Framework 2.0. A side effect would be the need to also adjust the class "PdfSharp.Pdf.Advanced.PdfToUnicodeMap.cs", as it uses that Dictionary. Due to my limited knowledge I didn't go that way, but added a little workaround instead. I will add a comment to the code you kindly shared, @LokiMidgard, but for the sake of completeness I will briefly describe all my adjustments:
    Changed classes: "PdfSharp.Fonts.CMapInfo.cs"
  1. Under public void AddChars(string text) the if-condition if (!CharacterToGlyphIndex.ContainsKey(ch)) was changed to:
if (!CharacterToGlyphIndex.ContainsKey(ch) || char.IsHighSurrogate(ch))
  2. Inside that if-condition, inside the if-condition if (char.IsHighSurrogate(ch)), the following was added before the glyphIndex gets assigned:
// If high surrogate char hasn't been added yet, add high and low surrogate chars.
// Note: new List<char>(text[idx + 1]) would treat the char as an initial capacity
// and create an empty list, so a collection initializer is used instead.
if (!SurrogatePairs.ContainsKey(ch))
    SurrogatePairs.Add(ch, new List<char> { text[idx + 1] });
// If high surrogate char has been added and low surrogate char hasn't been added yet, add low surrogate char:
else if (!SurrogatePairs[ch].Contains(text[idx + 1]))
    SurrogatePairs[ch].Add(text[idx + 1]);
// If high and low surrogate chars have been added, continue with next loop:
else
    continue;
  3. After the assignment of both glyphIndex values, the line CharacterToGlyphIndex.Add(ch, glyphIndex); was changed to:
if (!CharacterToGlyphIndex.ContainsKey(ch)) // To do (for support of reading PDF?): Surrogate pair chars with same high surrogate chars and different low surrogate chars are missing in "CharacterToGlyphIndex"!
    CharacterToGlyphIndex.Add(ch, glyphIndex);
  4. At the very bottom, where the two public chars and Dictionaries are declared, the following Dictionary was added:
private Dictionary<char, List<char>> SurrogatePairs = new Dictionary<char, List<char>>();
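
As an alternative to a char array or Tuple key, the full code point could serve as the dictionary key, since an int key needs neither Tuple nor a newer framework. This is only a sketch of the idea and not a patch for CMapInfo: `CodePointGlyphMap` and the `GlyphLookup` delegate are hypothetical names, and the delegate stands in for whatever the font's cmap lookup really is.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the font's code-point-to-glyph lookup.
public delegate uint GlyphLookup(int codePoint);

public static class CodePointGlyphMap
{
    // Builds a map keyed by full code point, so that two emojis sharing the
    // same high surrogate (e.g. U+1F622 and U+1F61E) get separate entries.
    public static Dictionary<int, uint> Build(string text, GlyphLookup lookupGlyph)
    {
        Dictionary<int, uint> map = new Dictionary<int, uint>();
        for (int idx = 0; idx < text.Length; idx++)
        {
            int codePoint;
            if (char.IsSurrogatePair(text, idx))
            {
                codePoint = char.ConvertToUtf32(text[idx], text[idx + 1]);
                idx++; // the low surrogate is part of this code point
            }
            else
            {
                codePoint = text[idx];
            }

            if (!map.ContainsKey(codePoint))
                map.Add(codePoint, lookupGlyph(codePoint));
        }
        return map;
    }
}
```

With a map like this, 😢 and 😞 no longer collide on the shared high surrogate U+D83D. Any code that consumes the dictionary, such as PdfToUnicodeMap, would still have to be adapted to the new key type.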

@LokiMidgard
Author

@TheRealSourceSeeker Thank you for your work. I hope that in the foreseeable future I get to a point where I will need this code. Since I already updated my branch to .NET 6 (I think), Tuple should not be a problem :)

@TheRealSourceSeeker

TheRealSourceSeeker commented Jun 17, 2022

@LokiMidgard Could you report back your result for printing the string "1♥️1" to PDF? (The heart character ♥ (U+2665) lies in the Basic Multilingual Plane and fits in a single UTF-16 char, so it doesn't depend on support for surrogate pairs!)

System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);
PdfDocument document = new PdfDocument();
PdfPage page = document.AddPage();
XGraphics gfx = XGraphics.FromPdfPage(page);
XPdfFontOptions options = new XPdfFontOptions(PdfFontEncoding.Unicode, PdfFontEmbedding.Always);
XFont font = new XFont("Segoe UI Emoji", 12, XFontStyle.Regular, options);
gfx.DrawString("1♥️1", font, XBrushes.Black, new XRect(0, 0, page.Width, page.Height), XStringFormats.Center);
document.Save("C:\\test.pdf");

Does your PDF show "1♥️1" or "1♥️ 1"? In my case with font "Segoe UI Emoji" I face the latter result. I already have another workaround by just skipping the "Variation Selector-16" (Unicode decimal: 65039, hexadecimal: FE0F), but I don't know if this could lead to bugs for other fonts. Adjustments for the workaround:
Changed classes: "PdfSharp.Drawing.Pdf.XGraphicsPdfRenderer"
Under public void DrawString(string s, XFont font, XBrush brush, XRect rect, XStringFormat format), under if (font.Unicode), under for (int idx = 0; idx < s.Length; idx++), below line char ch = s[idx]; add:

// Skip "Variation Selector-16" (Unicode decimal: 65039, hexadecimal: FE0F)
// as long as colored emojis aren't supported (only black/white "text presentation", no colored "emoji presentation"):
// Reason: Char "♥️" triggers 2 char matches, writing a visual heart and space to PDF:
// 1. "Black Heart Suit"      (hexadecimal: 2665, decimal: 9829)
// 2. "Variation Selector-16" (hexadecimal: FE0F, decimal: 65039)
if (ch == 65039)
    continue;
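
If patching the renderer is not an option, the same effect could probably be achieved by filtering the string before it is passed to DrawString. This is a minimal sketch under that assumption; `VariationSelectors.StripVs16` is a hypothetical helper and not a PDFsharp API, and it has the same caveat as the workaround above regarding other fonts.

```csharp
using System.Text;

// Hypothetical helper, for illustration only.
public static class VariationSelectors
{
    // U+FE0F, "Variation Selector-16" (decimal 65039).
    public const char VariationSelector16 = '\uFE0F';

    // Returns a copy of the input with every U+FE0F removed.
    public static string StripVs16(string s)
    {
        StringBuilder sb = new StringBuilder(s.Length);
        foreach (char ch in s)
        {
            if (ch != VariationSelector16)
                sb.Append(ch);
        }
        return sb.ToString();
    }
}
```

The call would then look like gfx.DrawString(VariationSelectors.StripVs16("1♥️1"), font, ...), leaving only the digit, the heart (U+2665) and the second digit in the string.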
