Build out the name table decoding function to cover all platform/encodings #85

Open
Pomax opened this issue Sep 9, 2020 · 6 comments
Labels: enhancement, help welcome

Comments

@Pomax
Owner

Pomax commented Sep 9, 2020

Right now it's using a fairly "naive" UTF-16 decoding for anything with platformID 0 (Unicode) or 3 (Microsoft), with "ascii" byte decoding for anything else, but that glosses over a fair number of platformID/encodingID combinations, so... if someone wants to help out implementing all the various string decodings, let me know!
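
For readers following along, the behaviour described above amounts to roughly the following. This is a minimal sketch and not lib-font's actual source: `decodeNameString` and its arguments are hypothetical names, and the record's bytes are assumed to be available as a `Uint8Array`.

```javascript
// Hypothetical sketch of the "naive" strategy described above.
// Not lib-font's real code; names and signature are assumptions.
function decodeNameString(platformID, bytes) {
  let out = "";
  if (platformID === 0 || platformID === 3) {
    // Unicode / Microsoft: read the bytes as big-endian 16-bit code units.
    for (let i = 0; i + 1 < bytes.length; i += 2) {
      out += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
    }
    return out;
  }
  // Everything else: one byte becomes one character with the same code.
  for (const b of bytes) {
    out += String.fromCharCode(b);
  }
  return out;
}
```

With this approach, any byte above 0x7F on a non-Unicode platform is passed through as the code point with the same number instead of being mapped through the platform's own character set.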

Pomax added the help welcome and enhancement labels Sep 9, 2020
@RoelN
Collaborator

RoelN commented Dec 17, 2020

Is this why

Mike Abbink, Paul van der Laan, Pieter van Rosmalen, Ben Mitchell, Mark Frömberg

gets output by lib-font as

Mike Abbink, Paul van der Laan, Pieter van Rosmalen, Ben Mitchell, Mark Fr�mberg?

I'm testing this with https://github.com/IBM/plex/blob/master/IBM-Plex-Sans-Thai/fonts/complete/ttf/IBMPlexSansThai-Light.ttf

@Pomax
Owner Author

Pomax commented Dec 18, 2020

That sounds more like whatever is rendering that text is not set to render as UTF-8: where are you seeing this output? =)

@RoelN
Collaborator

RoelN commented Dec 18, 2020

I'm seeing this on the command line, but also in the Wakamai Fondue output. This is the test script I used to verify this already happened at the lib-font side (as opposed to some data kneading on the WF side):

import { Font } from "./lib-font.js";

const font = new Font("testfont");
font.src = "./IBMPlexSansThai-Light.ttf";

font.onload = (evt) => {
  let font = evt.detail.font;
  const { name } = font.opentype.tables;
  console.log(name.get(9));
};

@skyeewers

Has there been any progress on this issue? I think I'm running into the same issue as @RoelN, with non-ASCII characters just showing as the character-not-found "?" character, but I don't know enough about the encoding side of things here to try and fix this myself 😓.

@Pomax
Owner Author

Pomax commented Sep 29, 2022

The command line is notoriously bad at UTF-8, so here is a small change to make this easier to test:

import fs from "fs";
import { Font } from "./lib-font.js";

const font = new Font("testfont");
font.src = "IBMPlexSansThai-Light.ttf";

font.onload = (evt) => {
  let font = evt.detail.font;
  const { name } = font.opentype.tables;
  fs.writeFileSync(`test.out`, `name: ${name.get(9)}`, `utf-8`);
};

Yields

name: Mike Abbink, Paul van der Laan, Pieter van Rosmalen, Ben Mitchell, Mark Fr�mberg

The bad character there is a 0x009A, which is clearly wrong. Let's do some byte checks. Throwing this into an inspector:

[image: inspector screenshot]

Gives us the following data block to inspect:

4D 69 6B 65 20 41 62 62 69 6E 6B 2C 20 50 61 75
6C 20 76 61 6E 20 64 65 72 20 4C 61 61 6E 2C 20
50 69 65 74 65 72 20 76 61 6E 20 52 6F 73 6D 61
6C 65 6E 2C 20 42 65 6E 20 4D 69 74 63 68 65 6C
6C 2C 20 4D 61 72 6B 20 46 72 9A 6D 62 65 72 67

And indeed, those last few bytes are "r", 0x9A, "m", "b", "e", "r", and "g" if interpreted as ASCII... so it's not a matter of reading the bytes wrong.

We also see this uses platformID=1, platEncID=0 and langID=0, which means we should be treating this as Mac/Roman/English. Sooooo, we look up that encoding and find that 0x9A should be "ö".

So this is definitely a decoding issue.
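
To make the byte check above concrete, here is a minimal sketch of a Mac OS Roman decoder applied to the tail of that data block. It is not part of lib-font: `decodeMacRoman` and `MAC_ROMAN_80_9F` are hypothetical names, and the table deliberately covers only bytes 0x80 to 0x9F (the accented letters), so a real decoder would need the full 0x80 to 0xFF range.

```javascript
// Mac OS Roman code points for bytes 0x80..0x9F, in order:
// Ä Å Ç É Ñ Ö Ü á à â ä ã å ç é è ê ë í ì î ï ñ ó ò ô ö õ ú ù û ü
// Bytes 0xA0..0xFF are intentionally left out of this sketch.
const MAC_ROMAN_80_9F = [
  0xc4, 0xc5, 0xc7, 0xc9, 0xd1, 0xd6, 0xdc, 0xe1,
  0xe0, 0xe2, 0xe4, 0xe3, 0xe5, 0xe7, 0xe9, 0xe8,
  0xea, 0xeb, 0xed, 0xec, 0xee, 0xef, 0xf1, 0xf3,
  0xf2, 0xf4, 0xf6, 0xf5, 0xfa, 0xf9, 0xfb, 0xfc,
];

// Hypothetical helper: decode a Uint8Array of Mac OS Roman bytes.
function decodeMacRoman(bytes) {
  let out = "";
  for (const b of bytes) {
    if (b < 0x80) {
      out += String.fromCharCode(b); // ASCII range maps 1:1
    } else if (b <= 0x9f) {
      out += String.fromCharCode(MAC_ROMAN_80_9F[b - 0x80]);
    } else {
      out += "\uFFFD"; // not covered by this partial table
    }
  }
  return out;
}

// The last eight bytes of the data block: "F", "r", 0x9A, "m", "b", "e", "r", "g"
const tail = Uint8Array.of(0x46, 0x72, 0x9a, 0x6d, 0x62, 0x65, 0x72, 0x67);
console.log(decodeMacRoman(tail)); // Frömberg
```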

@Pomax
Owner Author

Pomax commented Sep 29, 2022

I'm of two minds here.

1: we add all possible decoding schemes to lib-font, blowing it up to an incredible size, but make all strings come out as well-behaved UTF8
2: make this the consuming code's responsibility, with lib-font giving you the bytes, and the information you need to know what encoding it's using, but not performing automagical conversion to UTF8

And honestly, I'm leaning heavily towards (2), because it doesn't make sense to bake string encoding conversion into this library rather than making that an "if you need it, you know better than I do how to slot that into your own code base".

That said, we could make that a separate project (if it doesn't already exist!) and do something clever like an optional decoder argument to the Font constructor, so that if there is one, strings can be magic'd, and if there isn't, you might need to do your own decoding.
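
One possible shape for that idea, purely as a sketch: nothing below exists in lib-font, and `resolveNameRecord`, `makeDecoder`, and the record fields `platformID`, `encodingID`, and `bytes` are all hypothetical names. Without a decoder the caller gets the raw bytes and IDs back (option 2); with one, a decoded string is attached.

```javascript
// Hypothetical: hand back the raw record unless the caller supplied a decoder.
function resolveNameRecord(record, decoder) {
  if (typeof decoder !== "function") {
    return record; // bytes plus platformID/encodingID, no conversion
  }
  return { ...record, string: decoder(record) };
}

// Hypothetical: build a decoder from a Map keyed by "platformID/encodingID".
// Unknown combinations yield undefined, leaving the caller to inspect the bytes.
function makeDecoder(table) {
  return ({ platformID, encodingID, bytes }) => {
    const fn = table.get(`${platformID}/${encodingID}`);
    return fn ? fn(bytes) : undefined;
  };
}

// Usage: a consumer registers only the encodings it cares about.
// This entry handles the 7-bit range only and is here just for illustration.
const table = new Map([
  ["1/0", (bytes) => String.fromCharCode(...bytes.filter((b) => b < 0x80))],
]);
const record = { platformID: 1, encodingID: 0, bytes: Uint8Array.of(0x4d, 0x69, 0x6b, 0x65) };

console.log(resolveNameRecord(record).string); // undefined, raw record only
console.log(resolveNameRecord(record, makeDecoder(table)).string); // Mike
```

Keeping the table outside the library means its size only grows for the consumers who actually need a given encoding.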
