In theory `pdfminer.six` has a `CMapParser` which is capable of parsing arbitrary CMaps defined in the `Encoding` field of a Type0 font specification.

In practice, it doesn't do that at all... it only parses `ToUnicode` CMaps: https://github.com/search?q=repo%3Apdfminer%2Fpdfminer.six%20CMapParser&type=code

This is a problem because some PDFs will actually define their own, more exotic mappings of byte strings to CIDs in Type0 fonts. I don't fully (read: at all) understand that section of the PDF specification so I don't know if it is really standards compliant, but this pops up in some of the pdf.js samples, such as this one: https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue7901.pdf

One can minimally support that PDF by:

- Recognizing UTF-16BE in the `Registry` and `Encoding` fields (these are supposed to be ASCII, but...)
- Recognizing `Adobe-Identity-UCS` as a two-byte identity CMap (but you still have to parse it to get the `WMode`...)

You can see how PLAYA parses these, which is not entirely correct yet either, here, and it can pretty easily be ported back to `pdfminer.six`: dhdaines/playa#27
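For illustration, here is a rough sketch of the two "minimal support" steps in plain Python. It is not the code from PLAYA or `pdfminer.six`: the helper names (`decode_cmap_name`, `is_two_byte_identity`, `identity_cids`) are hypothetical, and it assumes the UTF-16BE strings start with a byte order mark, which may not hold for every file. It also does not cover `WMode`, which still requires parsing the embedded CMap.

```python
import codecs
from typing import Iterator


def decode_cmap_name(raw: bytes) -> str:
    """Decode a name that should be ASCII but may be UTF-16BE.

    Assumption: UTF-16BE values are marked with a leading BOM (FE FF).
    """
    bom = codecs.BOM_UTF16_BE
    if raw.startswith(bom):
        return raw[len(bom):].decode("utf-16-be", errors="replace")
    return raw.decode("ascii", errors="replace")


def is_two_byte_identity(name: bytes) -> bool:
    """Hypothetical check: treat Adobe-Identity-UCS as a two-byte identity CMap."""
    return decode_cmap_name(name) == "Adobe-Identity-UCS"


def identity_cids(data: bytes) -> Iterator[int]:
    """Yield one CID per big-endian two-byte code.

    A trailing odd byte, if any, is ignored.
    """
    for i in range(0, len(data) - 1, 2):
        yield int.from_bytes(data[i:i + 2], "big")
```

With these helpers, a string operand such as `b"\x00\x41\x01\x02"` would map to the CIDs `0x41` and `0x102` whenever `is_two_byte_identity` accepts the font's CMap name.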