Thank you for your compliments! There already is an option which causes the subset tags to be maintained in the text extraction functions ( |
-
Hey there again Jorj, I feel like I didn't communicate this well in a previous discussion I had with you, but this library is freakin fantastic. Love it.
I just recently built the package from source in order to incorporate something I need for what I'm working on.
Basically, when building the text dict, spans have the 6-character prefix of their font name stripped in JM_font_name() in helper-stext.i.
I simply changed that function from
to
I'm working on something where the exact bounding boxes of characters are absolutely necessary, so I've been extracting the font files from the PDF and computing the bounding boxes from them. (As a feature enhancement, that would be pretty freakin sweet as an additional option for extractDict(), but perhaps that's for another discussion.)
Anyway, as discussed in Issue #739, it appears the 6-character prefix was removed on purpose. If the PDF creator was careful and didn't use some jank software to create or edit the file, the prefix is presumably unnecessary. But what I've found is that, often enough, you get partial character maps per page: some PDFs have a different set of character maps on each page despite using the same font. The end result is that you need some way of resolving characters to the map they originated from, and you can't do that without the prefix.
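To make the ambiguity concrete, here is a small toy sketch in plain Python. The font names and glyph sets are made up for illustration, and `matching_fonts` is a hypothetical helper, not part of the library. It only assumes the usual PDF convention that a subset tag is six uppercase letters followed by `+`.

```python
import re

# PDF subset tags are six uppercase letters followed by "+".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")

# Made-up example: one typeface embedded as two subsets,
# each carrying a different partial set of characters.
embedded_fonts = {
    "ABCDEF+Helvetica": {"H", "e", "l", "o"},
    "GHIJKL+Helvetica": {"W", "o", "r", "d"},
}


def matching_fonts(span_font):
    """Return the embedded fonts that a span's font name could refer to."""
    if SUBSET_TAG.match(span_font):
        # Full name with prefix: at most one exact match.
        return [name for name in embedded_fonts if name == span_font]
    # Stripped name: every subset of that typeface is a candidate.
    return sorted(
        name for name in embedded_fonts
        if SUBSET_TAG.sub("", name) == span_font
    )


print(matching_fonts("Helvetica"))         # ['ABCDEF+Helvetica', 'GHIJKL+Helvetica']
print(matching_fonts("GHIJKL+Helvetica"))  # ['GHIJKL+Helvetica']
```

With the stripped name there are two candidate subsets and no way to tell which one a given span came from; with the prefix the lookup is unambiguous.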
So, what I'm wondering is whether we could add an option that gets passed all the way down from extractDict() to JM_font_name() and preserves those prefixes. Perhaps we could call it preserve_font_prefixes=True|False? I believe the prefixes were removed to reduce data structure size.
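Roughly, the behaviour I have in mind looks like the Python sketch below. This is only an illustration: the real helper is C code in helper-stext.i, the `font_name` function here is hypothetical, and I'm assuming the stripping rule is "drop everything up to and including a `+` that sits right after a 6-character tag".

```python
def font_name(name: str, preserve_font_prefixes: bool = False) -> str:
    """Hypothetical stand-in for the proposed option (not the library's code).

    Strips a 6-character subset prefix such as "ABCDEF+" unless
    preserve_font_prefixes is True.
    """
    plus = name.find("+")
    # Keep the name as-is when asked to, or when there is no "+"
    # exactly after a 6-character tag.
    if preserve_font_prefixes or plus != 6:
        return name
    return name[plus + 1:]


print(font_name("ABCDEF+Arial"))                               # Arial
print(font_name("ABCDEF+Arial", preserve_font_prefixes=True))  # ABCDEF+Arial
```

Defaulting the flag to False would keep the current output unchanged for everyone who doesn't need the prefixes.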
Would love to hear your thoughts!
Best,
Aaron