Wrong bbox location through #4115

HuJianE · 2024-12-05T09:25:55Z

Description of the bug

It seems I am getting a wrong bbox location for a particular PDF.
Please provide an email address so I can send you this PDF privately, as it is quite sensitive.
The bbox comes from get_text('rawdict').

How to reproduce the bug

```python
# coding: utf-8
# Created by hujian on 2024/12/5 17:20
# wrong bbox for this particular PDF
import fitz

file = '/Users/hujian/Downloads/07.+E-statement+september+2024_M+Zidni.pdf'
pdf = fitz.Document(file)

# get page 1 data
data = pdf[0].get_text('rawdict')
# 'rawdict' returns a dict, so data[19] presumably meant the 20th entry of data['blocks']
data['blocks'][19]

# please look into this bbox and check if it is right
```

PyMuPDF version

1.24.13

Operating system

MacOS

Python version

3.10

JorjMcKie · 2024-12-05T09:44:08Z

Please use my email [email protected]

HuJianE · 2024-12-05T10:23:46Z

Sent successfully.

JorjMcKie · 2024-12-05T11:10:08Z

Looking now

JorjMcKie · 2024-12-05T12:16:18Z

This is a weird case and apparently a problem caused by the base library. I need to open an issue there and must therefore include the problem file.
Because of the confidentiality, I need your formal consent.

HuJianE · 2024-12-05T12:28:50Z

Sorry, this is a customer file and cannot be shared with anyone.
What I can say for sure is that it was edited with Canva.
If you upload a PDF to Canva and make some random edits, you should get this kind of issue.
This is their website:
https://www.canva.com/pdf-editor/

JorjMcKie · 2024-12-05T12:53:07Z

Sorry - I tried an example and found no problem. So this is not the way to get a reproducible situation.
The MuPDF team is just as trustworthy as I am and we have ways to keep problem files hidden from the public in all situations.
So I again request that you either let me share the file with them or provide one without confidentiality concerns.

HuJianE · 2024-12-05T13:23:41Z

Please give me some time. I will hide the sensitive info and re-send a PDF to you.

JorjMcKie · 2024-12-06T08:14:23Z

Thank you for the new file. I will use it for reporting the bug in MuPDF.
test.pdf
test.json

MuPDF issue https://bugs.ghostscript.com/show_bug.cgi?id=708178

patsnap-liujin · 2024-12-12T07:18:45Z

@JorjMcKie Hello, I have the same "Wrong bbox location" problem. Here is my PDF file:
78-1.pdf
The PyMuPDF version is 1.25.1, and the test code is:

import fitz

with fitz.open('78-1.pdf') as doc:
    current_page = doc[0]
    current_textpage = current_page.get_textpage()
    # page rect (rotated) vs. textpage rect (unrotated)
    print(current_page.rect, current_textpage.rect)

and the output is:
Rect(0.0, 0.0, 792.0, 612.0) Rect(0.0, 0.0, 612.0, 792.0)
The bboxes I get from current_page.get_text('text') are all wrong because of the mismatched Rect printed above.

JorjMcKie · 2024-12-12T07:40:02Z

@patsnap-liujin

> the bboxes I got using current_page.get_text('text') are all wrong

This is no problem and also has nothing to do at all with this issue.
What you see is working as designed:
Your page is rotated. If you read the documentation carefully, you will find that all extractions, including text extraction, are based on the unrotated page. As a consequence, the textpage's rect shows the unrotated page rectangle, and text coordinates are relative to that. They are not "wrong".

If you want to convert them to the rotated coordinate system, multiply each point and rectangle by page.rotation_matrix.

If you find this too complicated, you can also de-rotate the page without impact on visual appearance using page.remove_rotation(). Then make the textpage and do your extractions.
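To make the coordinate conversion concrete, here is a minimal pure-Python sketch of what multiplying a rectangle by a rotation matrix does. The helper name `transform_rect` and the constant `ROT90` are assumptions for illustration only: `ROT90` is the value one would expect `page.rotation_matrix` to have for a 612 x 792 page rotated by 90 degrees, so print the real matrix of your page rather than relying on it.

```python
def transform_rect(rect, matrix):
    """Apply a 6-element matrix (a, b, c, d, e, f) to a rect (x0, y0, x1, y1)."""
    a, b, c, d, e, f = matrix
    x0, y0, x1, y1 = rect
    corners = [(x0, y0), (x1, y0), (x0, y1), (x1, y1)]
    # Row-vector convention: (x, y) -> (a*x + c*y + e, b*x + d*y + f)
    moved = [(a * x + c * y + e, b * x + d * y + f) for x, y in corners]
    xs = [p[0] for p in moved]
    ys = [p[1] for p in moved]
    # Re-normalize so that (x0, y0) is top-left and (x1, y1) is bottom-right
    return (min(xs), min(ys), max(xs), max(ys))


# Assumed matrix for an unrotated 612 x 792 page shown with 90 degree rotation
ROT90 = (0, 1, -1, 0, 792, 0)

# The unrotated textpage rect maps onto the rotated page rect
print(transform_rect((0, 0, 612, 792), ROT90))  # (0, 0, 792, 612)
```

With PyMuPDF objects the same conversion is `pymupdf.Rect(bbox) * page.rotation_matrix`. Calling `page.remove_rotation()` before creating the textpage avoids the conversion altogether.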

patsnap-liujin · 2024-12-12T08:07:24Z

@JorjMcKie Thanks for your prompt and professional response. I just read the PyMuPDF docs and found Page.remove_rotation. It solved my problem exactly. Thanks!

elmstedt · 2024-12-13T21:46:52Z

I am having a similar issue: pymupdf <= 1.24.14 works fine, but >= 1.25.0 gives incorrect bbox values for some glyphs.

A minimal working example is the TeX file,

\documentclass[]{article}
\begin{document}
$$\sqrt{}\sum$$
\end{document}

Using a simple script to draw bboxes around characters,

import fitz

def draw_boxes(input_pdf, output_pdf):
    """Draw a thin rectangle around every character bbox on every page."""
    doc = fitz.open(input_pdf)
    for page in doc:
        text_dict = page.get_text("rawdict")
        for block in text_dict["blocks"]:
            if block["type"] != 0:  # skip image blocks
                continue
            for line in block["lines"]:
                for span in line["spans"]:
                    for char in span["chars"]:
                        page.draw_rect(char["bbox"], width=0.1)
    doc.save(output_pdf)

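Running this on the PDF generated from the above TeX produces this in 1.24.14:

[screenshot: character bboxes drawn with 1.24.14]

But 1.25.0+ generates this:

[screenshot: character bboxes drawn with 1.25.0+]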

My guess is they were doing something upstream to fix some glyphs which were getting bounding boxes which were over-tall, but made some incorrect assumptions about how and why they were over-tall.

elmstedt · 2024-12-13T21:56:54Z

Also, while you're filing upstream issues I thought you might want to mention this,

For the same input pdf file,

import fitz
doc = fitz.open(input_pdf)  # input_pdf: path to the same PDF as above
page = doc[0]
print(page.get_texttrace())
print(page.get_text("rawdict"))

'chars': ((65533, 1, (303.4100036621094, 137.2550048828125),
              (303.4100036621094, 135.19415283203125,
              317.7959899902344, 145.15673828125)), ),
    }

If we look at the texttrace for the sum glyph, it is given a code point of 65533 when it should be 88, the code point for "X", which is how the CMEX font maps the glyph summationdisplay.
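For reference, the two code points can be checked with the standard library alone. This is only a small sanity check of the numbers quoted above, not a reproduction of what MuPDF does internally; the variable names are made up for illustration.

```python
import unicodedata

trace_codepoint = 65533    # value reported by get_texttrace() above
expected_codepoint = 88    # slot that CMEX10 uses for summationdisplay

print(hex(trace_codepoint))                    # 0xfffd
print(unicodedata.name(chr(trace_codepoint)))  # REPLACEMENT CHARACTER
print(chr(expected_codepoint))                 # X
```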

JorjMcKie · 2024-12-14T10:25:15Z

@elmstedt little can be said solely based on pictures and without having a reproducing file at hand.
But there indeed have been changes in this area. A few comments may help to understand what's going on:

Fonts need not be correct in terms of the metrics for all the glyphs they contain. When computing a character's bbox, some values are normally needed: the insertion point ("origin"), ascender, descender, width and font size. Ascender and descender are used to compute y0 and y1 of the bbox. Incorrect values lead to those crazily tall bboxes.
Font object definitions in a PDF may override the above font values. Often, but not always, we find ascender / descender values there, as well as deviating character widths. Since the most recent versions, our base library always takes the PDF overrides when found, and then ignores the font-internal values. Previously this happened only when the font values were clearly wrong or missing.
Needless to say, wrong values in the PDF object definition are also a daily experience. So this approach won't always work either.
There is a new, not yet documented text extraction flag TEXT_ACCURATE_BBOXES. It aims at recomputing the bbox from the glyph's graphics instructions.
Many fonts are incomplete in terms of providing a back-translation table "glyph-to-Unicode". When MuPDF detects this, the Unicode replacement character 0xFFFD = 65533 is returned. In such a case, PyMuPDF's standard text extraction (not get_texttrace()) by default uses the glyph number in the font's glyph table (if present), which is often helpful, but not always.

In general, the font definitions of corner case glyphs, like mathematical symbols, tend to be more sloppy than those for alphanumeric Unicodes.

If we want to take a serious look at your problem, we certainly need the original PDF page.
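As a rough sketch of how the new flag might be tried: the helpers below OR `TEXT_ACCURATE_BBOXES` onto the default `rawdict` flags and collect the character bboxes. The flag is undocumented at the time of this thread, so its exact effect is an assumption. The helper names `rawdict_flags` and `char_bboxes` are made up for illustration, and the module is passed in explicitly so that the code also works with versions that lack the flag.

```python
def rawdict_flags(pymupdf_module):
    """Default rawdict flags, plus TEXT_ACCURATE_BBOXES when the module has it."""
    base = pymupdf_module.TEXTFLAGS_RAWDICT
    # Older PyMuPDF versions do not define the flag; fall back to 0 (no change).
    accurate = getattr(pymupdf_module, "TEXT_ACCURATE_BBOXES", 0)
    return base | accurate


def char_bboxes(page, flags):
    """Return (character, bbox) pairs for every character on the page."""
    raw = page.get_text("rawdict", flags=flags)
    return [
        (char["c"], char["bbox"])
        for block in raw["blocks"]
        if block["type"] == 0  # text blocks only
        for line in block["lines"]
        for span in line["spans"]
        for char in span["chars"]
    ]
```

Typical use would be `import pymupdf`, then `char_bboxes(pymupdf.open("input.pdf")[0], rawdict_flags(pymupdf))`, where `input.pdf` is a hypothetical path. Comparing the result against a call with plain `pymupdf.TEXTFLAGS_RAWDICT` shows whether the flag changes anything for a given file.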

elmstedt · 2024-12-14T14:22:58Z

@JorjMcKie

> @elmstedt little can be said solely based on pictures and without having a reproducing file at hand.

Did you not see this?

> A minimal working example is the TeX file,
>
> \documentclass[]{article}
> \begin{document}
> $$\sqrt{}\sum$$
> \end{document}

That is the reproducing file—it's the first thing I wrote.

> If we want to take a serious look at your problem, we certainly need the original PDF page.

Can you not compile the above? I thought providing the raw .tex file would be more convenient for you, I guess not...

mwe.pdf

JorjMcKie · 2024-12-14T22:41:07Z

@elmstedt Looked at the file, thanks for that.
It turns out that MuPDF returns correct values in both cases, pre-v1.25.0 and now. As I wrote:
In versions 1.24.* the font binary was inspected to find ascender / descender values.

pprint(page.get_fonts())
[(13, 'pfa', 'Type1', 'YQJSDJ+CMEX10', 'F21', ''),  # <=== this is the font in question!
 (14, 'pfa', 'Type1', 'SDXKYB+CMR10', 'F28', ''),
 (12, 'pfa', 'Type1', 'FKFMOI+CMSY10', 'F34', '')]

# in v1.24.* the font binary was used:
ff=doc.extract_font(13)
font = pymupdf.Font(fontbuffer=ff[-1])
font.name
'Computer Modern Medium'
font.ascender
0.7720000147819519
font.descender  # this value is nonsense and causes a giant bbox height:
-2.9600000381469727

The character bbox height computed using these values is (0.772 + 2.96)*fontsize. For font size 10, a character bbox hence has a height of 37.32, of which about 80% lies below the baseline. This corresponds to your first picture.
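The arithmetic in the previous paragraph can be reproduced in a few lines. This is only a sketch of the simple model described there, bbox height = (ascender - descender) * fontsize; the function name `bbox_extent` is made up for illustration.

```python
def bbox_extent(ascender, descender, fontsize):
    """Split a character bbox height into the parts above and below the baseline."""
    above = ascender * fontsize    # distance from the baseline up to the bbox top
    below = -descender * fontsize  # descender is negative, so flip the sign
    return above, below, above + below


# Values read from the CMEX10 font binary above, at font size 10
above, below, height = bbox_extent(0.772, -2.96, 10.0)
print(round(height, 2))          # 37.32
print(round(below / height, 3))  # 0.793
```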

Now, in version 1.25.0 MuPDF looks at the font definition as a PDF object, where we find completely different values:

print(doc.xref_object(13))
<<
  /BaseFont /YQJSDJ+CMEX10
  /FirstChar 88
  /FontDescriptor 19 0 R
  /LastChar 88
  /Subtype /Type1
  /ToUnicode 8 0 R
  /Type /Font
  /Widths 17 0 R
>>
print(doc.xref_object(19))
<<
  /Ascent 40
  /CapHeight 0
  /CharSet (/summationdisplay)
  /Descent -600
  /Flags 4
  /FontBBox [ -24 -2960 1454 772 ]
  /FontFile 5 0 R
  /FontName /YQJSDJ+CMEX10
  /ItalicAngle 0
  /StemV 47
  /Type /FontDescriptor
  /XHeight 431
>>

The /Ascent and /Descent values must be divided by 1000 to arrive at analogous dimensions, giving us these correct values in PyMuPDF:

 {'ascender': 0.03999999910593033,  # = 40/1000
  'bbox': (303.4100036621094, 136.63233947753906, 317.80596923828125, 146.59494018554688),
  'color': 0,
  'descender': -0.6000000238418579,  # = -600/1000
  'flags': 5,
  'font': 'CMEX10',
  'origin': (303.4100036621094, 137.2550048828125),
  'size': 9.962599754333496,
  'text': 'X'},

This corresponds to your second picture. The bbox still looks crazy enough, but that is all that can be done.
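A quick check of the span values above, using nothing but arithmetic. One observation, which is an inference from these numbers and not something stated in this thread: the reported bbox height equals the font size instead of (ascender - descender) * size, which would be consistent with ascender and descender being rescaled so that together they span exactly one font size.

```python
# Values copied from the span dictionary shown above
span = {
    "ascender": 0.03999999910593033,
    "descender": -0.6000000238418579,
    "size": 9.962599754333496,
    "origin": (303.4100036621094, 137.2550048828125),
    "bbox": (303.4100036621094, 136.63233947753906,
             317.80596923828125, 146.59494018554688),
}

asc, desc, size = span["ascender"], span["descender"], span["size"]
baseline_y = span["origin"][1]
_, y0, _, y1 = span["bbox"]

reported_height = y1 - y0
unscaled_height = (asc - desc) * size       # simple model: about 6.38
rescaled_above = size * asc / (asc - desc)  # ascender share of one font size

# Compare reported height with the font size, and the reported part
# above the baseline with the rescaled ascender
print(round(reported_height, 4), round(size, 4))
print(round(baseline_y - y0, 4), round(rescaled_above, 4))
```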

elmstedt · 2024-12-15T00:14:06Z

@JorjMcKie I appreciate you taking the time to look into it further. I've been needing tight glyph bounding boxes for a while now, so I've already been using my own solution using a generated AFM file from the extracted fonts.

I've been going a very roundabout way of getting there though. Using qpdf to pull the individual streams from the PDF, then fontforge to generate a .pfa version of the embedded font just for the side-effect of creating the .afm file I want in order to easily get the glyph specific bounding boxes I need. Then I have been merging that back into the output I have been getting from page.get_text("rawdict").

It has been... tedious, so I have been periodically looking for other/better solutions. I just happened to notice that the bbox value coming out of page.get_text() had changed between 1.24.14 and 1.25.0, so I thought I would chime in with my experience and a bit more information. I am VERY interested in this TEXT_ACCURATE_BBOXES you mentioned and I would love to learn more about it. If that works as intended, I could almost do away with needing any tools other than pymupdf for text extraction in my current workflow.

JorjMcKie added the "example required" (Waiting for information) label on Dec 5, 2024

JorjMcKie added the "upstream bug" (bug outside this package) label and removed the "example required" (Waiting for information) label on Dec 6, 2024
