Create an extra option to extractDict() to preserve original font-names (computing bboxes by font) - IMPLEMENTED #1132

inf3rnus · 2021-07-08T20:18:28Z

Hey there again Jorj, I feel like I didn't communicate this well in a previous discussion I had with you, but this library is freakin fantastic. Love it.

I just recently built the package from source in order to incorporate something I need for what I'm working on.

Basically, when building the textDict, spans have their 6 character prefix stripped in JM_font_name() in helper-stext.i

I simply changed that function from

static const char *
JM_font_name(fz_context *ctx, fz_font *font)
{
    const char *name = fz_font_name(ctx, font);
    const char *s = strchr(name, '+');
    if (subset_fontnames || s == NULL || s-name != 6) {
        return name;
    }
    return s + 1;
}

to

static const char *
JM_font_name(fz_context *ctx, fz_font *font)
{
    const char *name = fz_font_name(ctx, font);
    return name;
}

I'm working on something where the exact bounding boxes of characters are absolutely necessary, so I've been extracting the font files from the PDF and computing the bounding boxes from them. (Which, as a feature enhancement, would be pretty freakin sweet if included as an additional option when using extractDict(); perhaps that's for another discussion.)

Anyway, as discussed in Issue #739, it appears that the 6 character prefix was removed on purpose, but what I've found is that often enough you'll get partial character maps per page in someone's PDF. If the PDF creator was careful about how they created the PDF and didn't use some jank software to edit or create it, the prefix is (presumably) unnecessary. But some PDFs have a different set of character maps per page despite using the same font. The end result is that you need some way of resolving the characters to the map from which they originated, and you can't do that without the prefix.

So, what I'm wondering is if we could add an option that gets passed all the way down to JM_font_name() from extractDict() that preserves those prefixes? Perhaps we could call it preserve_font_prefixes=True|False? I believe they've been removed to reduce data structure size.
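
For illustration, here is the stripping logic restated in Python rather than C, with the proposed switch added. This is only a sketch: `preserve_font_prefixes` is the name floated above, not an existing PyMuPDF parameter, and `font_name` is a hypothetical helper.

```python
def font_name(name: str, preserve_font_prefixes: bool = False) -> str:
    """Mirror the C logic of JM_font_name() quoted above.

    A subset font name looks like "ABCDEF+CMSY10": a 6 character tag,
    a plus sign, then the base font name.
    """
    plus = name.find("+")  # -1 when there is no "+"
    if preserve_font_prefixes or plus != 6:
        return name          # keep the name untouched
    return name[plus + 1:]   # drop the tag and the "+"


print(font_name("ABCDEF+CMSY10"))                               # tag stripped
print(font_name("ABCDEF+CMSY10", preserve_font_prefixes=True))  # tag kept
```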

Would love to hear your thoughts!

Best,
Aaron

JorjMcKie · 2021-07-08T21:51:52Z

Maintainer

Thank you for your compliments!

There already is an option which causes the subset tags to be maintained in the text extraction functions (Tools method):
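
A minimal sketch of using such a switch, assuming the option referred to is `fitz.TOOLS.set_subset_fontnames()`. The helpers `has_subset_tag` and `fonts_per_page` are hypothetical, and the pattern of six uppercase letters plus "+" is the usual PDF convention for subset tags rather than something stated in this thread.

```python
import re

# Subset tags are conventionally six uppercase letters followed by "+".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def has_subset_tag(font_name: str) -> bool:
    """True if the font name still carries its subset tag."""
    return SUBSET_TAG.match(font_name) is not None


def fonts_per_page(pdf_path: str) -> dict:
    """Map page number -> set of span font names, subset tags included."""
    import fitz  # PyMuPDF, imported lazily so the helper above works without it

    fitz.TOOLS.set_subset_fontnames(True)  # assumed to be the option meant here
    result = {}
    with fitz.open(pdf_path) as doc:
        for page in doc:
            names = set()
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):  # image blocks have no lines
                    for span in line["spans"]:
                        names.add(span["font"])
            result[page.number] = names
    return result
```

Calling `fonts_per_page()` on a document should then show whether two pages use differently tagged subsets of the same base font.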

1 reply

inf3rnus Jul 9, 2021
Author

Well, I guess this is another example of needing to check the docs better before delving down a rabbit hole 😆, thank you!

JorjMcKie · 2021-07-08T21:58:26Z

Maintainer

As for exact bounding boxes:
There is the Font class with quite a number of methods and properties of fonts, among them the font bbox and the glyph bbox.
To address potential font issues, like inexact ascender / descender values, you can also globally request to minimize glyph bboxes such that they no longer return the font's line height, which normally is (ascender - descender) * fontsize, but a smaller value which should restrict this to the visible glyph's height.
This option is:

in the Tools class.
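
A sketch of both ideas, assuming the option referred to is `fitz.TOOLS.set_small_glyph_heights()`. `default_bbox_height` is a hypothetical helper that just restates the line height formula above, and `enable_small_glyph_heights` is an illustrative wrapper.

```python
def default_bbox_height(ascender: float, descender: float, fontsize: float) -> float:
    """Line height used for glyph bboxes by default: (asc - dsc) * fontsize."""
    return (ascender - descender) * fontsize


def enable_small_glyph_heights() -> None:
    """Globally ask PyMuPDF for reduced glyph bbox heights."""
    import fitz  # PyMuPDF, imported lazily

    fitz.TOOLS.set_small_glyph_heights(True)  # assumed to be the option meant here


# With asc = 0.9 and dsc = -0.1 the default bbox height equals the font size.
print(default_bbox_height(0.9, -0.1, 12.0))
```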

5 replies

JorjMcKie Jul 8, 2021
Maintainer

The option in the previous post causes the blue rectangle to be returned instead of the red one:

inf3rnus Jul 9, 2021
Author

Hey there again Jorj,

I did learn of the tightened/reduced bboxes, but I found that they do not always work, and for some PDFs (I only have one example for the time being) they can still produce inaccurate bounding boxes. Here's an example:

https://arxiv.org/pdf/1912.03310.pdf

^ What seems to be problematic is the X symbol in the image above. (I should mention my goal right now is to enclose a sequence of characters as tightly as possible, so that no empty space is present. E.g. if a glyph only uses space above the baseline in the ascender, then the bbox encloses only that portion, excluding the descender.)

To get around this, I've been rendering the fonts in PIL, and then computing the bboxes based off of the font metrics for a given font to then get bboxes that fit around a word perfectly by doing this for each character.

Although I'm now running into problems with certain font files not having the correct mapping from codepoint to glyph, which is causing this to break down a bit. E.g. I'll take the character's string value returned from PyMuPDF and tell PIL to render that character with the font that belongs to it. However, this breaks down because for some fonts there are no codepoints (inspected with FontForge), and sometimes all of the codepoints work except for one or a few.

I've extracted the fonts from some of the PDFs I'm working with using tools other than PyMuPDF, and they all return the same results, so it's not an issue with this library as far as the actual contents of the files go.

So, what I'm now wondering is if, in theory, MuPDF returns all the information required to compute a glyph's bounding box? My mental model of things is that the PDF contains a series of instructions for drawing each character, with each character having a certain encoding (UTF-8, etc.) which is a codepoint into a font file. The PDF reader then uses the origin of a given character in combination with the information in the font file to draw the character onto the canvas representing what is to be displayed to the user.

Is there a place in PyMuPDF where the resultant bounding box of that draw instruction can be extracted, or am I off on something here?

Thanks for all your insight!

Best,
Aaron

JorjMcKie Jul 9, 2021
Maintainer

All info you can get a hold of based on MuPDF is the font and its properties and glyphs. Take a look at the Font class to see details.
So when you know the font for the "X" symbol, you can find the bbox for the respective glyph. When I compute the reduced glyph bbox height, I use the glyph quad and the glyph "origin" (the bottom-left point). Then I take the ascender / descender values to compute the bits above and below the y-component of the origin such that their sum equals font size. That is all that is possible.
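
The computation described above could be sketched as follows. This is a reading of the description, not the library's actual code. `glyph_vertical_extent` is a hypothetical name, and the sketch assumes the page coordinate system where y grows downward, so the ascender part lies at smaller y values than the origin.

```python
def glyph_vertical_extent(origin_y, ascender, descender, fontsize):
    """Return (y0, y1), the top and bottom of a glyph bbox.

    ascender is positive, descender is negative (both relative to fontsize 1).
    """
    y0 = origin_y - ascender * fontsize   # part above the baseline
    y1 = origin_y - descender * fontsize  # part below the baseline
    return y0, y1


top, bottom = glyph_vertical_extent(100.0, 0.8, -0.2, 10.0)
print(top, bottom, bottom - top)  # height equals fontsize when asc - dsc == 1
```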

Is there a place in PyMuPDF where the resultant bounding box of that draw instruction can be extracted,

There is no access to the graphical draw commands for the glyph itself. In any case, this information is not contained in the PDF but in the font's binary file.

I don't try to find out if a specific glyph actually does have anything dangling below the origin's y-value (like a "g" or a "y").
In addition, fonts often are defined like sh*t and have no individually different glyph bboxes and / or no valid ascender / descender values and / or do not tell the truth about being serifed or bold or ...

Your above "X" example may be one of those where my logic cannot get hold of a valid ascender / descender value pair. My check is this (note that descender is always negative):

If asc - dsc >= 1, then all is roses and I accept both values.
If asc < 1e-3 then I set dsc = -0.1 and asc = 0.9 (addresses a Tesseract OCR glyphless font issue).
If asc - dsc < 1, then I set asc = 1 + dsc (after possibly replacing dsc by glyph bbox.y0 value if that is smaller).
And all of this logic may still fail with some fonts and bring down the Python interpreter (with a segfault) if I try to call the MuPDF functions that deliver the asc / dsc values. I therefore introduced a separate global option to not even try this.
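
Written out as code, the three checks might look like this. It is a sketch of the rules as listed, applied in the order given; the real implementation lives in PyMuPDF's C layer and may differ in detail. `normalize_asc_dsc` and its `glyph_y0` parameter are hypothetical names.

```python
def normalize_asc_dsc(asc, dsc, glyph_y0=None):
    """Apply the ascender / descender sanity checks listed above.

    dsc is expected to be negative. glyph_y0 is the bottom of the glyph
    bbox in font units, if known.
    """
    if asc - dsc >= 1:
        return asc, dsc          # plausible pair, accept as is
    if asc < 1e-3:
        return 0.9, -0.1         # e.g. Tesseract's glyphless font
    # Remaining case: asc - dsc < 1
    if glyph_y0 is not None and glyph_y0 < dsc:
        dsc = glyph_y0           # glyph reaches lower than dsc claims
    return 1 + dsc, dsc          # stretch asc so that asc - dsc == 1
```

A pair such as asc = 0.775, dsc = -0.96 passes the first check unchanged, which is how an implausibly large descender can slip through.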

You see, the font business is one of the dirty ones.

JorjMcKie Jul 9, 2021
Maintainer

I just took a deeper look to see what is happening with this "X" symbol. It is encoded like this:

      {
       "size":9.962599754333496,
       "flags":6,
       "font":"CMSY10",
       "color":0,
       "ascender":0.7749999761581421,
       "descender":-0.9599999785423279,
       "text":"\u00d7",
       "origin":[
        136.2000274658203,
        347.8390808105469
       ],
       "bbox":[
        136.2000274658203,
        340.1180725097656,
        143.9409637451172,
        357.4031677246094
       ]
      },

You see the enormous dsc value of -0.96? Larger than the asc! That stupid font is telling us nonsense, but how are we supposed to determine this?
You could try a correction yourself and arbitrarily limit the absolute value of dsc to 20% or 25% of the asc or something ...
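
Such a correction could look like the sketch below, using the values from the span above. `clamp_descender` and the 25% default are illustrative choices, not part of PyMuPDF.

```python
def clamp_descender(asc, dsc, max_ratio=0.25):
    """Limit |dsc| to max_ratio * asc; dsc stays negative."""
    return max(dsc, -max_ratio * asc)


# Values copied from the span above.
size = 9.962599754333496
asc, dsc = 0.7749999761581421, -0.9599999785423279
origin_y = 347.8390808105469

reported_bottom = origin_y - dsc * size       # close to bbox y1 in the span
new_dsc = clamp_descender(asc, dsc)
corrected_bottom = origin_y - new_dsc * size  # much closer to the baseline
print(reported_bottom, corrected_bottom)
```

With the clamp applied, the bottom edge moves up from roughly 9.6 points below the baseline to roughly 1.9 points below it.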

inf3rnus Jul 9, 2021
Author

Ahhh, very informative. I appreciate your thoroughness. This will prove invaluable to me, and hopefully others :). I'll give the Font class a look and see if it will help meld my understanding of what I've done with what I've written and what's already cooked into PyMu.

Hope you have a great weekend!

-Aaron
