Why is the order of extracting the contents in the table cells wrong? #1069

xielulu1994 · 2023-12-08T02:54:17Z

Describe the bug

I use extract_table() to extract table content, and I find that the order of the extracted text within individual cells does not match the order in the original PDF.

PDF table:

Code to reproduce the problem

from typing import List

import pdfplumber

# file_path is assumed to be defined elsewhere; the original post does not show it.
table_text_items: List[tuple] = []
with pdfplumber.open(file_path) as pdf:
    for page in pdf.pages:
        table = page.extract_table()
        lines: List[str] = []
        if table:
            for row in table:
                for cell in row:
                    # Skip empty cells (None or ""), then split multi-line cell text.
                    if cell:
                        lines.extend(cell.split("\n"))
        if lines:
            table_text_items.append((page.page_number, lines))

PDF file

Please attach any PDFs necessary to reproduce the problem.

If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.

jsvine · 2023-12-21T20:18:30Z

Maintainer

Hi @xielulu1994, and thanks for your interest in pdfplumber. Without access to the PDF itself, it is difficult to provide tested advice. However, from looking at the screenshots, my guess is that the bounding boxes of the characters in the cd \d[...] text are a bit above the rest of the text. Try this:

page.extract_table({ "text_y_tolerance": 5 })

... although you may have to try adjusting that 5 value higher or lower, depending on the specifics of the PDF.
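
Since the right value depends on the PDF, one way to tune it is to run the extraction with several candidate tolerances and compare the results by eye. The sketch below is only an illustration: the helper name extract_with_tolerances and the candidate values are made up for this example and are not part of pdfplumber. The only pdfplumber call it relies on is page.extract_table() with a "text_y_tolerance" setting, as shown above.

```python
from typing import Dict, Iterable, List, Optional

# Shape of what extract_table() returns: rows of cell strings (or None).
Table = Optional[List[List[Optional[str]]]]


def extract_with_tolerances(
    page, tolerances: Iterable[float] = (3, 5, 8)
) -> Dict[float, Table]:
    """Run page.extract_table() once per candidate text_y_tolerance.

    `page` is expected to be a pdfplumber Page. The helper name and the
    default candidate values are hypothetical, chosen for illustration.
    """
    results: Dict[float, Table] = {}
    for tol in tolerances:
        results[tol] = page.extract_table({"text_y_tolerance": tol})
    return results


# Usage sketch (assumes a local file "example.pdf", which is hypothetical):
#
# import pdfplumber
# with pdfplumber.open("example.pdf") as pdf:
#     for tol, table in extract_with_tolerances(pdf.pages[0]).items():
#         print(tol, table[0] if table else None)
```

Printing one known-problematic row for each tolerance should make it easier to see at which value the cell text comes out in the expected order.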
