para/para_split_v3.py has a problem #1401

Open
xcvil opened this issue Jan 3, 2025 · 3 comments
Labels
bug Something isn't working

Comments

xcvil commented Jan 3, 2025

Description of the bug

"""
Traceback (most recent call last):
...
pipe_result = (infer_result.pipe_ocr_mode(image_writer)
File "/home/.../conda/envs/MinerU2/lib/python3.10/site-packages/magic_pdf/model/operators.py", line 180, in pipe_ocr_mode
res = self.apply(
File "/home/.../conda/envs/MinerU2/lib/python3.10/site-packages/magic_pdf/model/operators.py", line 72, in apply
return proc(copy.deepcopy(self._infer_res), *args, **kwargs)
File "/home/.../conda/envs/MinerU2/lib/python3.10/site-packages/magic_pdf/model/operators.py", line 173, in proc
res = pdf_parse_union(*args, **kwargs)
File "/home/.../conda/envs/MinerU2/lib/python3.10/site-packages/magic_pdf/pdf_parse_union_core_v2.py", line 820, in pdf_parse_union
para_split(pdf_info_dict)
File "/home/.../conda/envs/MinerU2/lib/python3.10/site-packages/magic_pdf/para/para_split_v3.py", line 378, in para_split
__para_merge_page(all_blocks)
File "/home/.../conda/envs/MinerU2/lib/python3.10/site-packages/magic_pdf/para/para_split_v3.py", line 355, in __para_merge_page
__merge_2_text_blocks(current_block, prev_block)
File "/home/.../conda/envs/MinerU2/lib/python3.10/site-packages/magic_pdf/para/para_split_v3.py", line 288, in __merge_2_text_blocks
and not last_span['content'].endswith(LINE_STOP_FLAG)
KeyError: 'content'
"""
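
The final frame indexes `last_span['content']` directly, so the error means that the last span of one of the blocks being merged is a dict with no `content` key. A minimal sketch of that failure mode, and of a defensive lookup, might look like the following. The span dictionaries, the value of `LINE_STOP_FLAG`, and the helper name `ends_with_stop_flag` are illustrative assumptions, not the actual magic_pdf data or code.

```python
# Hypothetical stand-ins: the real LINE_STOP_FLAG value and the span layout
# inside magic_pdf are not shown in the traceback and may differ.
LINE_STOP_FLAG = ('.', '!', '?', '。', '!', '?')


def ends_with_stop_flag(span: dict) -> bool:
    """Return True if the span's text ends with a stop character.

    Uses dict.get so that a span without a 'content' key is treated as
    empty text instead of raising KeyError.
    """
    return span.get('content', '').endswith(LINE_STOP_FLAG)


text_span = {'type': 'text', 'content': 'This sentence continues'}
no_content_span = {'type': 'image'}  # assumed shape of a span lacking 'content'

# Direct indexing, as in the traceback, raises on the second span.
try:
    no_content_span['content'].endswith(LINE_STOP_FLAG)
except KeyError as exc:
    print(f'direct indexing fails: KeyError {exc}')

# The defensive lookup handles both spans without raising.
print(ends_with_stop_flag(text_span))
print(ends_with_stop_flag(no_content_span))
```

Whether treating a missing `content` as empty text is the right merge behavior depends on what kind of span is actually involved, which only the offending PDF can show.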

How to reproduce the bug

import logging
import os

from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.config.enums import SupportedPdfParseMethod

pdf_path = ...
doc_output_dir = ...
image_dir = ...

os.makedirs(image_dir, exist_ok=True)

# Initialize writers
image_writer = FileBasedDataWriter(image_dir)
md_writer = FileBasedDataWriter(doc_output_dir)

# Read PDF content
reader = FileBasedDataReader("")
pdf_bytes = reader.read(pdf_path)

# Process PDF
ds = PymuDocDataset(pdf_bytes)
# Classify once and reuse the result for both inference and output
use_ocr = ds.classify() == SupportedPdfParseMethod.OCR
infer_result = ds.apply(doc_analyze, ocr=use_ocr)

# Generate output
pipe_result = (infer_result.pipe_ocr_mode(image_writer)
              if use_ocr
              else infer_result.pipe_txt_mode(image_writer))

pipe_result.dump_md(
    md_writer, 
    f"{...}.md",
    os.path.basename(image_dir)
)
logging.info(f"Successfully processed {pdf_path}")

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.10.x

Device mode

cuda

xcvil added the bug label Jan 3, 2025
Collaborator

myhloli commented Jan 4, 2025

Could you upload the PDF that triggers the problem?

Author

xcvil commented Jan 4, 2025

> Could you upload the PDF that triggers the problem?

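I haven't figured out which PDF it is yet. May I ask whether this is caused by the PDF itself? If so, I can track down the file and try to debug it on my side. At first I was worried the problem was in my own pipeline.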
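
One way to locate the offending file is to run each PDF through the pipeline separately and record which inputs raise. The sketch below assumes a hypothetical `process_pdf(path)` callable that wraps the reproduction steps from the issue description; `find_failing_pdfs` and the file names are likewise illustrative, not part of magic_pdf.

```python
import logging
import traceback
from pathlib import Path
from typing import Callable, Dict, Iterable


def find_failing_pdfs(
    pdf_paths: Iterable[Path],
    process_pdf: Callable[[Path], None],
) -> Dict[Path, str]:
    """Run process_pdf on each path and collect the traceback of every failure."""
    failures: Dict[Path, str] = {}
    for path in pdf_paths:
        try:
            process_pdf(path)
        except Exception:
            # Keep going so that one bad file does not hide the others.
            failures[path] = traceback.format_exc()
            logging.error('Failed on %s', path)
    return failures


if __name__ == '__main__':
    def fake_process(path: Path) -> None:
        # Stand-in for the real pipeline: fails on one made-up file name.
        if path.name == 'bad.pdf':
            raise KeyError('content')

    result = find_failing_pdfs([Path('good.pdf'), Path('bad.pdf')], fake_process)
    for path, tb in result.items():
        # Print the failing path and the last line of its traceback.
        print(path, tb.splitlines()[-1])
```

Any file this reports with `KeyError: 'content'` would be a candidate to attach to the issue.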

Collaborator

myhloli commented Jan 4, 2025

It's not clear yet where the problem is; we need the PDF file in order to reproduce it.
