Encoding CMaps are not actually parsed #1072

Open
dhdaines opened this issue Dec 13, 2024 · 1 comment

Comments

@dhdaines
Contributor

In theory, pdfminer.six has a CMapParser which is capable of parsing arbitrary CMaps defined in the Encoding field of a Type0 font specification.

In practice, it doesn't do that at all... it only parses ToUnicode CMaps: https://github.com/search?q=repo%3Apdfminer%2Fpdfminer.six%20CMapParser&type=code

This is a problem because some PDFs will actually define their own, more exotic mappings of byte strings to CIDs in Type0 fonts. I don't fully (read: at all) understand that section of the PDF specification so I don't know if it is really standards compliant, but this pops up in some of the pdf.js samples, such as this one: https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue7901.pdf

One can minimally support that PDF by:

  1. Recognizing UTF-16BE in the Registry and Encoding fields (these are supposed to be ASCII, but...)
  2. Recognizing Adobe-Identity-UCS as a two-byte identity CMap (but you still have to parse it to get the WMode...)
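
The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not pdfminer.six or PLAYA code: the helper names `decode_cmap_name`, `cmap_wmode` and `identity_two_byte_cids` are hypothetical, the UTF-16BE values are assumed to start with a byte-order mark, and the WMode is assumed to appear in the CMap stream as a plain `/WMode N def` entry (defaulting to 0, horizontal, when absent).

```python
import codecs
import re
from typing import Iterator


def decode_cmap_name(raw: bytes) -> str:
    """Decode a Registry/Ordering/Encoding value (hypothetical helper).

    Assumption: a UTF-16BE value starts with the BOM b"\\xfe\\xff";
    anything else is treated as ASCII, as these fields are supposed to be.
    """
    bom = codecs.BOM_UTF16_BE
    if raw.startswith(bom):
        return raw[len(bom):].decode("utf-16-be", errors="replace")
    return raw.decode("ascii", errors="replace")


def cmap_wmode(cmap_data: bytes) -> int:
    """Extract the writing mode from CMap source text (hypothetical helper).

    Assumption: the entry looks like "/WMode 1 def"; if it is missing,
    fall back to 0 (horizontal).
    """
    match = re.search(rb"/WMode\s+(\d+)\s+def", cmap_data)
    return int(match.group(1)) if match else 0


def identity_two_byte_cids(data: bytes) -> Iterator[int]:
    """Treat each big-endian byte pair as its own CID (two-byte identity).

    A trailing odd byte, if any, is ignored.
    """
    for i in range(0, len(data) - 1, 2):
        yield int.from_bytes(data[i:i + 2], "big")
```

A real fix would still need to run the embedded CMap stream through an actual parser to handle anything other than the identity case; the regular expression here only covers the WMode lookup mentioned in step 2.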

You can see how PLAYA parses these (which is not entirely correct yet either) in dhdaines/playa#27, and it can pretty easily be ported back to pdfminer.six.

@dhdaines
Contributor Author

I may submit a PR for this once I figure out the most robust way to do it (which is probably going to be "whatever pdf.js does").
