Wrong bbox location through #4115

Open
HuJianE opened this issue Dec 5, 2024 · 17 comments
Labels
upstream bug bug outside this package

Comments

@HuJianE

HuJianE commented Dec 5, 2024

Description of the bug

I seem to get a wrong bbox location for one particular PDF when calling get_text('rawdict').
Please provide an email address so I can send you this PDF, as it is quite sensitive.

How to reproduce the bug

```
# coding: utf-8
# Created by hujian on 2024/12/5 17:20
# wrong bbox for this particular PDF

import fitz

file = '/Users/hujian/Downloads/07.+E-statement+september+2024_M+Zidni.pdf'
pdf = fitz.Document(file)

# get page 1 data
data = pdf[0].get_text('rawdict')
data['blocks'][19]  # rawdict returns a dict; its blocks are under 'blocks'

# please look into this bbox and check if it is right
```

PyMuPDF version

1.24.13

Operating system

MacOS

Python version

3.10

@JorjMcKie
Collaborator

Please use my email [email protected]

@HuJianE
Author

HuJianE commented Dec 5, 2024

Sent successfully.

@JorjMcKie
Collaborator

Looking now

@JorjMcKie
Collaborator

This is a weird case and apparently a problem caused by the base library. I need to open an issue there and thus must include the problem file.
Because of the confidentiality, I need your formal consent.

@HuJianE
Author

HuJianE commented Dec 5, 2024

Sorry, this is a customer file and cannot be shared with anyone.
But what I can say for sure is that it was edited with Canva.
If you upload a PDF to Canva and make some random edits, you should get this kind of issue.
This is their website:
https://www.canva.com/pdf-editor/

@JorjMcKie
Collaborator

Sorry - I tried an example and found no problem. So this is not the way to get a reproducible situation.
The MuPDF team is just as trustworthy as I am and we have ways to keep problem files hidden from the public in all situations.
So I again request that you either let me share the file with them or provide one without confidentiality concerns.

@HuJianE
Author

HuJianE commented Dec 5, 2024

Give me some time; I will hide the sensitive info and re-send a PDF to you.

@JorjMcKie
Collaborator

JorjMcKie commented Dec 6, 2024

Thank you for the new file. I will use it for reporting the bug in MuPDF.
test.pdf
test.json

MuPDF issue https://bugs.ghostscript.com/show_bug.cgi?id=708178

@patsnap-liujin

@JorjMcKie Hello, I have the same "Wrong bbox location" problem. Here is my PDF file:
78-1.pdf
My PyMuPDF version is 1.25.1, and the test code is:

import fitz

with fitz.open('78-1.pdf') as doc:
    current_page = doc[0]
    current_textpage = current_page.get_textpage()
    print(current_page.rect, current_textpage.rect)

and the output is:
Rect(0.0, 0.0, 792.0, 612.0) Rect(0.0, 0.0, 612.0, 792.0)
the bboxes I got using current_page.get_text('text') are all wrong because of the mismatched Rect printed above.

@JorjMcKie
Collaborator

@patsnap-liujin

the bboxes I got using current_page.get_text('text') are all wrong

This is no problem and also has nothing to do at all with this issue.
What you see is working as designed:
Your page is rotated. If you read the documentation carefully, you will find that all extractions, including text extraction, are based on the unrotated page. As a consequence, the textpage's rect shows the unrotated page rectangle, and text coordinates are relative to that. They are not "wrong".

If you want to convert them to the rotated coordinate system, multiply each point and rectangle by page.rotation_matrix.

If you find this too complicated, you can also de-rotate the page without impact on visual appearance using page.remove_rotation(). Then make the textpage and do your extractions.

@patsnap-liujin

patsnap-liujin commented Dec 12, 2024

@JorjMcKie Thanks for your prompt and professional responses. I just read the PyMuPDF docs and found Page.remove_rotation. It solved my problem exactly. Thanks!

@elmstedt

elmstedt commented Dec 13, 2024

I am having a similar issue: pymupdf <= 1.24.14 works fine, but >= 1.25.0 gives incorrect bbox values for some glyphs.

A minimal working example is the TeX file,

\documentclass[]{article}
\begin{document}
$$\sqrt{}\sum$$
\end{document}

Using a simple script to draw bboxes around characters,

import fitz

def draw_boxes(input_pdf, output_pdf):
    """Draw a thin rectangle around every character on every page."""
    doc = fitz.open(input_pdf)
    for page in doc:
        text_dict = page.get_text("rawdict")
        for block in text_dict["blocks"]:
            if block["type"] != 0:  # skip image blocks
                continue
            for line in block["lines"]:
                for span in line["spans"]:
                    for char in span["chars"]:
                        page.draw_rect(char["bbox"], width=0.1)
    doc.save(output_pdf)

Running it on the PDF generated from the above TeX produces this in 1.24.14:

[image]

But 1.25.0+ generates this:

[image: Screenshot from 2024-12-13 13-43-40]

My guess is that something was changed upstream to fix glyphs that were getting over-tall bounding boxes, but with some incorrect assumptions about how and why they were over-tall.

@elmstedt

Also, while you're filing upstream issues, I thought you might want to mention this.

For the same input PDF file,

doc = fitz.open(input_pdf)
page = doc[0]
print(page.get_texttrace())
print(page.get_text("rawdict"))

the texttrace output contains (excerpt):

    'chars': ((65533, 1, (303.4100036621094, 137.2550048828125),
               (303.4100036621094, 135.19415283203125,
                317.7959899902344, 145.15673828125)), ),
    }

Looking at the texttrace for the sum glyph, it is given a code point of 65533 when it should be 88, the code point for "X", which is how the CMEX font maps the glyph summationdisplay.
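To spot such characters programmatically, one could scan the texttrace for the Unicode replacement character. A minimal sketch, assuming the char tuple layout seen in the excerpt above (code point, glyph id, origin, bbox); the helper name `unmapped_chars` is made up for illustration.

```python
REPLACEMENT = 0xFFFD  # 65533, U+FFFD REPLACEMENT CHARACTER

def unmapped_chars(trace):
    """Yield (glyph, bbox) for every char reported without a Unicode mapping.

    `trace` is a list of span dicts as returned by page.get_texttrace().
    """
    for span in trace:
        for code, glyph, origin, bbox in span["chars"]:
            if code == REPLACEMENT:
                yield glyph, bbox

# Sample data: the 'chars' excerpt quoted above, wrapped in a one-span list
sample = [{
    "chars": ((65533, 1, (303.4100036621094, 137.2550048828125),
               (303.4100036621094, 135.19415283203125,
                317.7959899902344, 145.15673828125)),),
}]
hits = list(unmapped_chars(sample))
```

On a real page, `unmapped_chars(page.get_texttrace())` would list the glyphs whose code point could not be determined.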

@JorjMcKie
Collaborator

@elmstedt little can be said solely based on pictures and without having a reproducing file at hand.
But there indeed have been changes in this area. A few comments may help to understand what's going on:

  • Fonts need not be correct in terms of the metrics for all the glyphs they contain. When computing a character's bbox, some values are normally needed: the insertion point ("origin"), ascender, descender, width and font size. Ascender and descender are used to compute y0 and y1 of the bbox. Incorrect values lead to those crazily tall bboxes.
  • Font object definitions in a PDF may override the above font values. Often, but not always, we find ascender / descender values there, as well as deviating character widths. In the most recent versions, our base library always takes the PDF overrides when found and then ignores the font-internal values. Previously, this happened only when the font values were clearly wrong or missing.
    Needless to say, wrong values in the PDF object definition are also a daily experience, so this approach won't always work either.
  • There is a new, not yet documented text extraction flag TEXT_ACCURATE_BBOXES. This aims at recomputing the bbox from the glyph graphics instructions.
  • Many fonts are incomplete in terms of providing a backtranslation table "glyph-to-Unicode". When this is determined by MuPDF, the Unicode Replacement 0xFFFD = 65533 is returned. PyMuPDF's standard text extraction (not get_texttrace()) by default uses the glyph number in the font's glyph table in such a case (if present) - which is often helpful, but not always either.

In general, the font definitions of corner case glyphs, like mathematical symbols, tend to be more sloppy than those for alphanumeric Unicodes.

If we want to take a serious look at your problem, we certainly need the original PDF page.

@elmstedt

@JorjMcKie

@elmstedt little can be said solely based on pictures and without having a reproducing file at hand.

Did you not see this?

A minimal working example is the TeX file,

\documentclass[]{article}
\begin{document}
$$\sqrt{}\sum$$
\end{document}

That is the reproducing file—it's the first thing I wrote.

If we want to take a serious look at your problem, we certainly need the original PDF page.

Can you not compile the above? I thought providing the raw .tex file would be more convenient for you, I guess not...

mwe.pdf

@JorjMcKie
Collaborator

JorjMcKie commented Dec 14, 2024

@elmstedt Looked at the file, thanks for that.
It turns out that MuPDF returns correct values in both cases, pre-v1.25.0 and now. As I wrote:
In versions 1.24.*, the font binary was inspected to find ascender / descender values.

pprint(page.get_fonts())
[(13, 'pfa', 'Type1', 'YQJSDJ+CMEX10', 'F21', ''),  # <=== this is the font in question!
 (14, 'pfa', 'Type1', 'SDXKYB+CMR10', 'F28', ''),
 (12, 'pfa', 'Type1', 'FKFMOI+CMSY10', 'F34', '')]

# in v1.24.* the font binary was used:
ff=doc.extract_font(13)
font = pymupdf.Font(fontbuffer=ff[-1])
font.name
'Computer Modern Medium'
font.ascender
0.7720000147819519
font.descender  # this value is nonsense and causes a giant bbox height:
-2.9600000381469727

The character bbox height computed using these values is (0.772 + 2.96) * fontsize. For font size 10, a character bbox hence has a height of 37.32, with about 80% of the bbox below the baseline. This corresponds to your first picture.
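The arithmetic can be reproduced in a few lines. This sketch only restates the formula above with the ascender and descender read from the font binary; the function name is made up.

```python
def bbox_height(ascender: float, descender: float, fontsize: float) -> float:
    """Height of a character bbox derived from font metrics."""
    return (ascender - descender) * fontsize

ascender, descender, fontsize = 0.772, -2.96, 10.0
height = bbox_height(ascender, descender, fontsize)

# Share of the bbox that lies below the baseline
below_baseline = -descender * fontsize / height

print(f"height = {height:.2f}, below baseline = {below_baseline:.0%}")
```

This prints a height of 37.32 and a share of about 79%, which matches the roughly 80% quoted above.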

Now, in version 1.25.0 MuPDF looks at the font definition as a PDF object, where we find completely different values:

print(doc.xref_object(13))
<<
  /BaseFont /YQJSDJ+CMEX10
  /FirstChar 88
  /FontDescriptor 19 0 R
  /LastChar 88
  /Subtype /Type1
  /ToUnicode 8 0 R
  /Type /Font
  /Widths 17 0 R
>>
print(doc.xref_object(19))
<<
  /Ascent 40
  /CapHeight 0
  /CharSet (/summationdisplay)
  /Descent -600
  /Flags 4
  /FontBBox [ -24 -2960 1454 772 ]
  /FontFile 5 0 R
  /FontName /YQJSDJ+CMEX10
  /ItalicAngle 0
  /StemV 47
  /Type /FontDescriptor
  /XHeight 431
>>

The /Ascent and /Descent values must be divided by 1000 to arrive at analogous dimensions, giving us these correct values in PyMuPDF:

 {'ascender': 0.03999999910593033,  # = 40/1000
  'bbox': (303.4100036621094, 136.63233947753906, 317.80596923828125, 146.59494018554688),
  'color': 0,
  'descender': -0.6000000238418579,  # = -600/1000
  'flags': 5,
  'font': 'CMEX10',
  'origin': (303.4100036621094, 137.2550048828125),
  'size': 9.962599754333496,
  'text': 'X'},

This corresponds to your second picture. The bbox still looks crazy enough, but that is all that can be done.
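As a side observation on the span above (derived only from the numbers shown, not a statement about MuPDF internals): the reported bbox is one font size tall, which is what results if ascender and descender are rescaled proportionally so that together they span 1. A quick check:

```python
ascender, descender = 40 / 1000, -600 / 1000
size = 9.962599754333496
origin_y = 137.2550048828125
y0, y1 = 136.63233947753906, 146.59494018554688   # from the span's bbox

span = ascender - descender             # 0.64
above = ascender / span * size          # expected part above the baseline
below = -descender / span * size        # expected part below the baseline

print(y1 - y0, size)                    # bbox height vs. font size
print(origin_y - y0, above)             # measured vs. expected, above baseline
print(y1 - origin_y, below)             # measured vs. expected, below baseline
```

Each pair agrees to within float32 rounding, so most of this bbox still lies below the baseline, just at a height of one font size.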

@elmstedt

@JorjMcKie I appreciate you taking the time to look into it further. I've been needing tight glyph bounding boxes for a while now, so I've already been using my own solution using a generated AFM file from the extracted fonts.

I've been going a very roundabout way of getting there though. Using qpdf to pull the individual streams from the PDF, then fontforge to generate a .pfa version of the embedded font just for the side-effect of creating the .afm file I want in order to easily get the glyph specific bounding boxes I need. Then I have been merging that back into the output I have been getting from page.get_text("rawdict").
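For anyone curious, the final merging step might look roughly like the sketch below. It assumes the standard AFM character-metric line format (`C code ; WX width ; N name ; B llx lly urx ury ;`), 1000 glyph-space units per em, and an unrotated, unskewed line of text. The sample AFM line uses illustrative numbers that have not been verified against the real CMEX10 metrics, and the function names are hypothetical.

```python
import re

AFM_LINE = re.compile(
    r"C\s+(-?\d+)\s*;.*?N\s+(\S+)\s*;.*?"
    r"B\s+(-?[\d.]+)\s+(-?[\d.]+)\s+(-?[\d.]+)\s+(-?[\d.]+)"
)

def parse_afm_bboxes(afm_text):
    """Map glyph name -> (llx, lly, urx, ury) in glyph space (1/1000 em)."""
    boxes = {}
    for line in afm_text.splitlines():
        m = AFM_LINE.match(line.strip())
        if m:
            boxes[m.group(2)] = tuple(float(v) for v in m.groups()[2:])
    return boxes

def tight_bbox(origin, size, glyph_box):
    """Place a glyph-space box at a rawdict origin.

    Glyph space has y pointing up; PyMuPDF page coordinates have y pointing down.
    """
    ox, oy = origin
    llx, lly, urx, ury = (v * size / 1000 for v in glyph_box)
    return (ox + llx, oy - ury, ox + urx, oy - lly)

# Illustrative sample line, not verified font data
afm = "C 88 ; WX 1444 ; N summationdisplay ; B 56 -1400 1387 0 ;"
boxes = parse_afm_bboxes(afm)
bbox = tight_bbox((303.41, 137.255), 10.0, boxes["summationdisplay"])
```

The resulting tuple has the same (x0, y0, x1, y1) layout as the char bboxes in rawdict, so it can replace them directly.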

It has been... tedious, so I have been periodically looking for other/better solutions. I just happened to notice the bbox value coming out of page.get_text() had changed between 1.24.14 and 1.25.0, so I thought I would chime in with my experience and a bit more information. I am VERY interested in this TEXT_ACCURATE_BBOXES you mentioned and I would love to learn more about it. If that works as intended I could almost do away with needing to use any other tools other than pymupdf for text extraction in my current workflow.
