Wrong bbox location through #4115
Comments
Please use my email [email protected] |
Sent successfully |
Looking now |
This is a weird case and apparently a problem caused by the base library. I need to open an issue there and thus must include the problem file. |
Sorry, this is a customer file and cannot be shared with anyone. |
Sorry - I tried an example and found no problem. So this is not the way to get a reproducible situation. |
So give me some time; I will hide some sensitive info and re-send a PDF to you. |
Thank you for the new file. I will use it for reporting the bug in MuPDF. MuPDF issue https://bugs.ghostscript.com/show_bug.cgi?id=708178 |
@JorjMcKie Hello, I have the same problem as "Wrong bbox location". Here is my PDF file.
with fitz.open('78-1.pdf') as doc:
    current_page = doc[0]
    current_textpage = current_page.get_textpage()
    print(current_page.rect, current_textpage.rect)
and the output is: |
This is no problem and also has nothing at all to do with this issue. If you want to convert them to the rotated coordinate system, multiply each point and rectangle with the If you find this too complicated, you can also de-rotate the page without impact on visual appearance using |
@JorjMcKie Thanks for your prompt and professional responses. I just read the pymupdf doc and found Page.remove_rotation. It solved my problem exactly. Thanks! |
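The de-rotation mentioned in this exchange can be applied to every page of a document before extracting text. This is only a minimal sketch: the helper name `derotate_pages` is hypothetical, `doc` is assumed to be an open PyMuPDF document, and the only page members used are `Page.rotation` and `Page.remove_rotation()`.

```python
def derotate_pages(doc):
    """Remove the rotation of every rotated page; return the affected page indices.

    `doc` is assumed to be an open PyMuPDF document (or any iterable of
    page objects that offer `rotation` and `remove_rotation()`).
    """
    changed = []
    for index, page in enumerate(doc):
        if page.rotation != 0:
            # Page.remove_rotation() sets the page rotation to 0 while
            # keeping the visual appearance of the page unchanged.
            page.remove_rotation()
            changed.append(index)
    return changed
```

After such a call, `page.rect` and the rectangle of the text page should refer to the same unrotated coordinate system, so no per-point matrix multiplication is needed.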
I am having a similar issue: pymupdf <= 1.24.14 works fine, >= 1.25.0 gives incorrect bbox values for some glyphs. A minimal working example is the TeX file,
\documentclass[]{article}
\begin{document}
$$\sqrt{}\sum$$
\end{document}
Using a simple script to draw bboxes around characters,
import fitz

def draw_boxes(input_pdf, output_pdf):
    doc = fitz.open(input_pdf)
    for page_number, page in enumerate(doc):
        text_dict = page.get_text("rawdict")
        for block_number, block in enumerate(text_dict["blocks"]):
            if block["type"] == 0:  # text blocks only
                for line_number, line in enumerate(block["lines"]):
                    for span_number, span in enumerate(line["spans"]):
                        for char_number, char in enumerate(span["chars"]):
                            page.draw_rect(char["bbox"], width=0.1)
    doc.save(output_pdf)
The PDF generated from the above TeX produces this in 1.24.14, but 1.25.0+ generates this. My guess is that they were doing something upstream to fix glyphs whose bounding boxes were over-tall, but made some incorrect assumptions about how and why they were over-tall. |
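The nested loops in the script above can be flattened into a small generator, which makes it easier to compare character bboxes across PyMuPDF versions. This is a sketch, not part of the original report: the name `iter_char_bboxes` is hypothetical, and the input is assumed to be the dictionary returned by `page.get_text("rawdict")`, where each character entry carries the keys `"c"` and `"bbox"`.

```python
def iter_char_bboxes(text_dict):
    """Yield (character, bbox) pairs from a "rawdict" text extraction result."""
    for block in text_dict["blocks"]:
        if block["type"] != 0:
            # Non-text blocks (for example images) have no "lines" entry.
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                for char in span["chars"]:
                    yield char["c"], tuple(char["bbox"])
```

With a real page, the drawing loop would then reduce to iterating over `iter_char_bboxes(page.get_text("rawdict"))` and passing each bbox to `page.draw_rect(bbox, width=0.1)`.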
Also, while you're filing upstream issues I thought you might want to mention this. For the same input PDF file,
doc = fitz.open(input)
page = doc[0]
print(page.get_texttrace())
print(page.get_text("rawdict"))
'chars': ((65533, 1, (303.4100036621094, 137.2550048828125),
(303.4100036621094, 135.19415283203125,
317.7959899902344, 145.15673828125)), ),
} If we look at the |
@elmstedt little can be said solely based on pictures and without having a reproducing file at hand.
In general, the font definitions of corner case glyphs, like mathematical symbols, tend to be more sloppy than those for alphanumeric Unicodes. If we want to take a serious look at your problem, we certainly need the original PDF page. |
Did you not see this?
That is the reproducing file—it's the first thing I wrote.
Can you not compile the above? I thought providing the raw .tex file would be more convenient for you, I guess not... |
@elmstedt Looked at the file, thanks for that.
pprint(page.get_fonts())
[(13, 'pfa', 'Type1', 'YQJSDJ+CMEX10', 'F21', ''), # <=== this is the font in question!
(14, 'pfa', 'Type1', 'SDXKYB+CMR10', 'F28', ''),
(12, 'pfa', 'Type1', 'FKFMOI+CMSY10', 'F34', '')]
# in v1.24.* the font binary was used:
ff=doc.extract_font(13)
font = pymupdf.Font(fontbuffer=ff[-1])
font.name
'Computer Modern Medium'
font.ascender
0.7720000147819519
font.descender # this value is nonsense and causes a giant bbox height:
-2.9600000381469727
The character bbox height computed using these values is (0.772 + 2.96)*fontsize. For font size 10, a character bbox hence has a height of 37.32, where about 80% of the bbox is below the baseline coordinate. This corresponds to your first picture. Now, in version 1.25.0 MuPDF looks at the font definition as a PDF object, where we find completely different values:
print(doc.xref_object(13))
<<
/BaseFont /YQJSDJ+CMEX10
/FirstChar 88
/FontDescriptor 19 0 R
/LastChar 88
/Subtype /Type1
/ToUnicode 8 0 R
/Type /Font
/Widths 17 0 R
>>
print(doc.xref_object(19))
<<
/Ascent 40
/CapHeight 0
/CharSet (/summationdisplay)
/Descent -600
/Flags 4
/FontBBox [ -24 -2960 1454 772 ]
/FontFile 5 0 R
/FontName /YQJSDJ+CMEX10
/ItalicAngle 0
/StemV 47
/Type /FontDescriptor
/XHeight 431
>>
The {'ascender': 0.03999999910593033, # = 40/1000
'bbox': (303.4100036621094, 136.63233947753906, 317.80596923828125, 146.59494018554688),
'color': 0,
'descender': -0.6000000238418579, # = -600/1000
'flags': 5,
'font': 'CMEX10',
'origin': (303.4100036621094, 137.2550048828125),
'size': 9.962599754333496,
'text': 'X'},
This corresponds to your second picture. The bbox still looks crazy enough, but that is all that can be done. |
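The arithmetic in this comment can be checked with a few lines of Python. The sketch below uses only numbers quoted in the thread; the helper names `bbox_height` and `share_below_baseline` are hypothetical, and the formula (ascender minus descender, times font size) is the one stated in the comment, not a description of MuPDF internals.

```python
def bbox_height(ascender, descender, fontsize):
    """Height of a character bbox that spans from descender to ascender."""
    return (ascender - descender) * fontsize


def share_below_baseline(ascender, descender):
    """Fraction of the bbox height that lies below the baseline."""
    return -descender / (ascender - descender)


# Values read from the font binary (the v1.24.* behaviour quoted above).
old_height = bbox_height(0.772, -2.96, 10)
old_below = share_below_baseline(0.772, -2.96)

# /Ascent and /Descent from the font descriptor are given in 1/1000 units.
new_ascender = 40 / 1000
new_descender = -600 / 1000

# Height of the bbox reported in the quoted v1.25.0 span (y1 - y0).
span_height = 146.59494018554688 - 136.63233947753906

print(round(old_height, 2), round(old_below, 3))
print(new_ascender, new_descender, round(span_height, 4))
```

In the quoted span, the bbox height works out to roughly one font size (about 9.96), compared with the 37.32 computed from the font binary values at font size 10.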
@JorjMcKie I appreciate you taking the time to look into it further. I've been needing tight glyph bounding boxes for a while now, so I've already been using my own solution using a generated AFM file from the extracted fonts. I've been going a very roundabout way of getting there though. Using qpdf to pull the individual streams from the PDF, then fontforge to generate a It has been... tedious, so I have been periodically looking for other/better solutions. I just happened to notice the |
Description of the bug
Seems I got a wrong bbox location for a particular PDF
Please provide an email so I can send this particular PDF, as it is quite sensitive.
get_text('rawdict')
How to reproduce the bug
`
# coding: utf-8
# Created by hujian on 2024/12/5 17:20
# wrong bbox for this particular PDF
import fitz
file = '/Users/hujian/Downloads/07.+E-statement+september+2024_M+Zidni.pdf'
pdf = fitz.Document(file)
# get page 1 data
data = pdf[0].get_text('rawdict')
data[19]
# please look into this bbox and check if it is right
`
PyMuPDF version
1.24.13
Operating system
MacOS
Python version
3.10