Extracting Table structure from PDF? #1021
Replies: 1 comment
-
Of course. OCR software cannot detect high-level structures like tables, so we can take the OCR-ed version out of the discussion right away. (Py-) MuPDF does not (yet) support PDF's table structure feature. If you read the PDF spec on this, you will soon see that this is an extremely complex topic, involving many dozens of interconnected PDF objects that describe table meta information. There does, however, exist the option to extract low-level PDF object information in PyMuPDF (
This is due to TAB and other control characters contained in the clipboard data, which are obviously sufficient for Word to create a table from them. I found that Foxit Reader also supports this. Nitro PDF, SumatraPDF and most others don't.
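To illustrate why TAB characters are enough for a receiving application to rebuild a table, here is a minimal sketch. It assumes the clipboard text puts one table row per line and separates cells with TABs; the function name `clipboard_text_to_table` and the sample data are made up for illustration and are not taken from any viewer's actual clipboard output.

```python
def clipboard_text_to_table(text: str) -> list[list[str]]:
    """Split clipboard-style text into rows (line breaks) and cells (TABs).

    Assumption: one table row per line, cells separated by TAB characters.
    """
    rows = []
    for line in text.splitlines():  # handles \n, \r\n and \r line endings
        if not line.strip():
            continue  # skip lines that are empty or whitespace only
        rows.append([cell.strip() for cell in line.split("\t")])
    return rows


# Hypothetical clipboard content, as a viewer that emits TABs might produce it.
sample = "Name\tQty\tPrice\r\nApple\t3\t1.20\r\nPear\t5\t0.80\r\n"

table = clipboard_text_to_table(sample)
for row in table:
    print(row)
```

A viewer that copies the same cells without TABs would give the receiving application only plain runs of text, and no comparable cell boundaries could be recovered this way.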
-
Hi @JorjMcKie
I am working with tables within PDF documents. As an example, I have attached a PDF (Excel-table.pdf) that is an Excel table that I have exported as a PDF. If you highlight the orange cells, it only selects the words within the cell. On the 2nd PDF (Tesseract-table.pdf), however, which is simply a text layer beneath an image, if you try to select the same cell with 2 lines, it will highlight the whole row.
Is there a way to create this table structure through PyMuPDF as in the first document, or is there maybe a way to extract the metadata to understand how it is separating each cell?
Thanks.
Edit: If you open the first document in Adobe Reader, copy the table and paste it into Word, it creates a table. Is this functionality possible with PyMuPDF?
Excel-table.pdf
Tesseract-table.pdf