Text coordinate extraction error #4182

Open
Number18-tong opened this issue Dec 27, 2024 · 1 comment
Labels
duplicate upstream bug bug outside this package

Comments

@Number18-tong

Description of the bug

Thanks for your great work; PDF parsing has become simpler and more convenient.
I have been using PyMuPDF for a while, and I have found that the extracted text coordinates are wrong in some PDFs.

I am really hoping to find a way to solve this problem. Again, thanks for your great work!

How to reproduce the bug

The code:

import pymupdf


def getblock_lines_dict(fitz_dict):
    """Collect [block number, bbox, text] for every non-empty span on a page."""
    linelist = []
    # Get every line of text on the page
    for block in fitz_dict["blocks"]:
        if block["type"] == 0:  # block type 0 is a text block
            paranum = block["number"]
            if "lines" in block:  # the text block has content
                for line in block["lines"]:  # treat each line as one row of text
                    for span in line["spans"]:
                        if span["text"].strip():
                            linelist.append([paranum, span["bbox"], span["text"]])
    return linelist


if __name__ == "__main__":
    doc = pymupdf.open("test.pdf")  # open a document
    for page in doc:  # iterate the document pages
        page_dict = page.get_text("dict")  # renamed from dict to avoid shadowing the builtin
        linelist = getblock_lines_dict(page_dict)
        print(linelist)

I drew the results onto a picture; the position coordinates of basically all the numbers are wrong.
[Image: 企业微信截图_17352721963259 (WeCom screenshot)]

Here is the test PDF:
number_bbox_error.pdf

PyMuPDF version

1.25.1

Operating system

Linux

Python version

3.10

@JorjMcKie JorjMcKie added duplicate upstream bug bug outside this package labels Dec 28, 2024
@JorjMcKie
Collaborator

This is a duplicate of issue #4180:
Font object definitions in the PDF specify (wrong) positive values for "descender". Since recent versions, MuPDF prefers using PDF-provided values over those provided in the embedded font binary (where they are correct in this case).
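
For readers who want to check whether a file is affected, here is a minimal sketch, not an official fix. It relies on the "ascender", "descender", "origin" and "size" keys that spans carry in the output of page.get_text("dict"). The helper names find_suspicious_spans and estimate_bbox, the fallback descender of -0.2, and the sample data are all hypothetical, and the bbox estimate assumes horizontal, unrotated text.

```python
# Assumed fallback used when the reported descender is implausible (hypothetical value).
FALLBACK_DESCENDER = -0.2


def find_suspicious_spans(page_dict):
    """Return non-empty text spans whose reported descender is not negative.

    This is only a heuristic: a descender normally lies below the baseline,
    so a value >= 0 hints at a wrong font definition in the PDF.
    """
    suspicious = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # only text blocks carry lines and spans
            continue
        for line in block.get("lines", []):
            for span in line["spans"]:
                if span["text"].strip() and span["descender"] >= 0:
                    suspicious.append(span)
    return suspicious


def estimate_bbox(span, descender=FALLBACK_DESCENDER):
    """Rebuild the vertical extent of a span's bbox from its baseline.

    Assumes horizontal, unrotated text:
    top = baseline - ascender * size, bottom = baseline - descender * size.
    The horizontal extent is taken unchanged from the reported bbox.
    """
    x0, _, x1, _ = span["bbox"]
    _, baseline_y = span["origin"]
    size = span["size"]
    top = baseline_y - span["ascender"] * size
    bottom = baseline_y - descender * size
    return (x0, top, x1, bottom)


# Hand-built sample in the shape of page.get_text("dict"); all numbers are made up.
sample = {
    "blocks": [
        {
            "type": 0,
            "number": 0,
            "lines": [
                {
                    "spans": [
                        {"text": "2024", "size": 10, "ascender": 0.9, "descender": 0.3,
                         "origin": (100, 200), "bbox": (100, 191, 120, 197)},
                        {"text": "Total", "size": 10, "ascender": 0.9, "descender": -0.2,
                         "origin": (130, 200), "bbox": (130, 191, 160, 202)},
                    ]
                }
            ],
        }
    ]
}

for span in find_suspicious_spans(sample):
    print(span["text"], span["descender"], estimate_bbox(span))
```

With a real document, page.get_text("dict") would be passed in place of sample. The estimate is only as good as the assumed fallback descender, so it is a stopgap until the upstream behaviour described in issue #4180 is resolved.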
