This is a duplicate of issue #4180:
Font object definitions in the PDF specify (wrong) positive values for "descender". Since recent versions, MuPDF prefers using PDF-provided values over those provided in the embedded font binary (where they are correct in this case).
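To check whether a given page is affected by this, one option is to look at the `descender` value that each span carries in the `page.get_text("dict")` output. The sketch below is only an illustration: the helper name `spans_with_positive_descender` and the sample data are made up for this example, and it assumes that spans expose a `descender` key and that a healthy font reports a negative value there.

```python
def spans_with_positive_descender(page_dict):
    """Return (text, descender, bbox) for text spans whose descender is > 0."""
    suspicious = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # only text blocks carry lines and spans
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                descender = span.get("descender")
                # A descender lies below the baseline, so it is normally negative.
                if descender is not None and descender > 0:
                    suspicious.append((span["text"], descender, span["bbox"]))
    return suspicious


# Hand-made sample data (hypothetical, not taken from the attached PDF).
sample = {
    "blocks": [
        {
            "type": 0,
            "number": 0,
            "lines": [
                {
                    "spans": [
                        {"text": "Total", "descender": -0.2,
                         "bbox": (10.0, 100.0, 40.0, 112.0)},
                        {"text": "123", "descender": 0.3,
                         "bbox": (50.0, 100.0, 70.0, 112.0)},
                    ]
                }
            ],
        },
        {"type": 1, "number": 1},  # image block, skipped
    ]
}

print(spans_with_positive_descender(sample))
```

With a real document, the same helper can be called on `page.get_text("dict")` for each page. Any span it reports is a candidate for the wrong bounding boxes described in this issue.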
Description of the bug
Thanks for your great work; PDF parsing has become simpler and more convenient.
I have been using PyMuPDF for a while, and I have found that text coordinates are extracted incorrectly for some PDFs.
I really hope there is a way to solve this problem. Again, thanks for your great work!
How to reproduce the bug
The code:
import pymupdf


def getblock_lines_dict(fitz_dict):
    linelist = []
    # Get every line of text on the page
    for block in fitz_dict["blocks"]:
        if block['type'] == 0:  # block type 0 is a text block
            paranum = block['number']
            if 'lines' in block:  # the text block has content
                for line in block['lines']:  # treat each line as one line of text
                    for span in line['spans']:
                        if span['text'].strip():
                            linelist.append([paranum, span['bbox'], span['text']])
    return linelist


if __name__ == "__main__":
    doc = pymupdf.open("test.pdf")  # open a document
    for page in doc:  # iterate the document pages
        page_dict = page.get_text("dict")  # avoid shadowing the built-in dict
        linelist = getblock_lines_dict(page_dict)
        print(linelist)
I drew a picture of the results; basically, the position coordinates of all the numbers are wrong.
Here is the test PDF:
number_bbox_error.pdf
PyMuPDF version
1.25.1
Operating system
Linux
Python version
3.10