Extracting Table structure from PDF? #1021

vs-777 · 2021-04-20T11:28:43Z

I am working with tables within PDF documents. As an example, I have attached a PDF (Excel-table.pdf) of an Excel table that I exported as a PDF. If you highlight the orange cells, only the words within the cell are selected. In the second PDF (Tesseract-table.pdf), which is simply a text layer beneath an image, trying to select the same two-line cell highlights the whole row.

Is there a way to create this table structure through PyMuPDF, as in the first document? Or is there a way to extract the metadata to understand how it separates each cell?

Thanks.

Edit: If you open the first document in Adobe Reader, copy the table and paste it into Word, it creates a table. Is this functionality possible?

Excel-table.pdf
Tesseract-table.pdf

JorjMcKie · 2021-04-20T12:15:26Z

Maintainer

> simply a text layer beneath an image, ...

Of course. OCR software cannot detect high-level structures like tables, so we can take the OCR-ed version out of the discussion right away.

(Py)MuPDF does not (yet) support PDF's table feature. If you read the PDF specification on this, you will soon see that it is an extremely complex topic, involving many dozens of interconnected PDF objects that describe table meta-information.

PyMuPDF does, however, let you extract low-level PDF object information (doc.xref_get_key(xref, ...)). If you know the PDF structures used for specifying tables, you can access literally everything.
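As a rough starting point for that kind of low-level inspection, here is a minimal sketch. It assumes the table information, if present at all, hangs off the catalog's /StructTreeRoot entry, which is where tagged PDFs keep their logical structure. The helper names describe_struct_tree_root and describe_file are made up for illustration; only pdf_catalog(), xref_get_key() and xref_get_keys() are PyMuPDF calls. It only lists the root's direct keys and does not walk the tree.

```python
def describe_struct_tree_root(doc):
    """Return the direct keys of the /StructTreeRoot object, or None.

    `doc` is expected to be an open PyMuPDF document of a PDF file.
    """
    catalog_xref = doc.pdf_catalog()  # xref number of the PDF catalog
    kind, value = doc.xref_get_key(catalog_xref, "StructTreeRoot")
    if kind != "xref":
        # No indirect reference found: the file is presumably not tagged.
        return None
    root_xref = int(value.split()[0])  # value looks like "12 0 R"
    # Map every key of the root object to its (type, value) tuple.
    return {
        key: doc.xref_get_key(root_xref, key)
        for key in doc.xref_get_keys(root_xref)
    }


def describe_file(path):
    """Hypothetical convenience wrapper: open `path` and inspect it."""
    import fitz  # PyMuPDF, imported here so the helper above stays dependency-free

    doc = fitz.open(path)
    return describe_struct_tree_root(doc)
```

Calling describe_file("Excel-table.pdf") should return either None or a dictionary of entries such as "K" that you can follow further with the same calls. Whether the Excel export actually contains such a tree is not something this thread confirms.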

> paste it in Word, it creates a table format.

This is due to TAB and other control characters contained in the clipboard data, which are obviously sufficient for Word to create a table from them. I found that Foxit Reader also supports this. Nitro PDF, SumatraPDF and most others don't.
PyMuPDF has no such support. You can, however, suppress the conversion of white space characters to spaces via the flags parameter of page.get_text(), and then interpret whatever you find in the spans of the output of get_text("dict", ...).
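To illustrate what "interpreting the spans" could look like, here is a minimal sketch. It is a plain heuristic, not a PyMuPDF feature: spans whose top edges lie within a small tolerance are treated as one row and ordered left to right. The function names spans_to_rows and extract_rows and the y_tol parameter are invented for this example. The flag used is fitz.TEXT_PRESERVE_WHITESPACE, and the dictionary layout (blocks, lines, spans, bbox, text) follows the output of get_text("dict").

```python
def spans_to_rows(page_dict, y_tol=3.0):
    """Group text spans of a get_text("dict") result into rows of strings.

    Heuristic only: a span starts a new row when its top edge lies more
    than `y_tol` points below the top edge of the current row's first span.
    """
    spans = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # skip image blocks, keep text blocks
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span["text"].strip()
                if text:
                    x0, y0 = span["bbox"][:2]
                    spans.append((y0, x0, text))
    spans.sort()  # top to bottom, then left to right

    rows = []
    row_y = None
    for y0, x0, text in spans:
        if row_y is None or y0 - row_y > y_tol:
            rows.append([])  # start a new row
            row_y = y0
        rows[-1].append((x0, text))
    # Order each row left to right and keep only the text.
    return [[text for _, text in sorted(row)] for row in rows]


def extract_rows(path, page_number=0):
    """Hypothetical wrapper: read one page and return its guessed rows."""
    import fitz  # PyMuPDF, imported here so spans_to_rows stays dependency-free

    doc = fitz.open(path)
    page = doc[page_number]
    page_dict = page.get_text("dict", flags=fitz.TEXT_PRESERVE_WHITESPACE)
    return spans_to_rows(page_dict)
```

Note that a cell containing two lines of text, like the orange one in the example file, would be split across two rows by this approach. Handling that would need additional information, such as the cell borders drawn on the page.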
