Text coordinate extraction error #4182

Number18-tong · 2024-12-27T04:09:32Z

Description of the bug

Thanks for your great work; PDF parsing has become simpler and more convenient.
I have been using PyMuPDF for a while, and I found that text coordinate extraction returns wrong values for some PDFs.

I am really hoping to find a way to solve this problem. Again, thanks for your great work!

How to reproduce the bug

The code:

import pymupdf


def getblock_lines_dict(fitz_dict):
    linelist = []
    # Collect the text of every line on the page
    for block in fitz_dict["blocks"]:
        if block['type'] == 0:  # block type 0 is a text block
            paranum = block['number']
            if 'lines' in block:  # the text block has content
                for line in block['lines']:  # treat each line as one row of text
                    for span in line['spans']:
                        if span['text'].strip():
                            linelist.append([paranum, span['bbox'], span['text']])
    return linelist


if __name__ == "__main__":
    doc = pymupdf.open("test.pdf")  # open a document
    for page in doc:  # iterate the document pages
        page_dict = page.get_text("dict")  # avoid shadowing the built-in dict
        linelist = getblock_lines_dict(page_dict)
        print(linelist)

I drew a picture of the results: the position coordinates of basically all the numbers are wrong.

Here is the test PDF:
number_bbox_error.pdf

PyMuPDF version

1.25.1

Operating system

Linux

Python version

3.10

JorjMcKie · 2024-12-28T13:46:38Z

This is a duplicate of issue #4180:
Font object definitions in the PDF specify (wrong) positive values for "descender". Since recent versions, MuPDF prefers using PDF-provided values over those provided in the embedded font binary (where they are correct in this case).
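
To illustrate why a positive descender distorts the reported boxes, here is a minimal sketch under a simplified model. It assumes the box runs from `origin_y - ascender * fontsize` to `origin_y - descender * fontsize` in page coordinates where y grows downward. It is not MuPDF's actual code. Both function names are hypothetical helpers. The second one relies on the `descender` key that `page.get_text("dict")` reports per span, and it assumes that this key reflects the PDF-provided value.

```python
def vertical_extent(origin_y, fontsize, ascender, descender):
    """Return (top, bottom) of a glyph box, with y growing downward.

    Simplified model (an assumption, not MuPDF's implementation):
    ascender and descender are fractions of the font size, and a
    well-formed font has a negative descender.
    """
    top = origin_y - ascender * fontsize
    bottom = origin_y - descender * fontsize
    return top, bottom


def spans_with_positive_descender(page_dict):
    """List (font, text, bbox) for text spans whose descender is positive.

    Expects a dictionary shaped like the result of page.get_text("dict").
    """
    flagged = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # skip non-text blocks
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                if span.get("descender", 0.0) > 0:
                    flagged.append(
                        (span.get("font"), span.get("text"), span.get("bbox"))
                    )
    return flagged


# Correct font: the box extends below the baseline at y = 100.
good = vertical_extent(100.0, 10.0, 0.75, -0.25)
# Wrong positive descender: the bottom edge ends up above the baseline.
bad = vertical_extent(100.0, 10.0, 0.75, 0.25)
```

In this model the second call yields a bottom edge above the baseline, so the box no longer covers the lower part of the glyphs. Running `spans_with_positive_descender(page.get_text("dict"))` on the attached file may help confirm which fonts are affected.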

JorjMcKie added duplicate upstream bug bug outside this package labels Dec 28, 2024
