Replies: 1 comment
No, there is not. Even (Py)MuPDF's grouping of characters into spans, lines, blocks, or words is already a deliberate application of heuristics, done in the hope that the result will make sense. But often enough, even this seemingly obvious approach fails! Give me a PDF page of text, and I can construct one with identical appearance in a PDF viewer that nevertheless refuses to deliver any meaningful extracted text, unless you re-arrange each individual character based on its geometric position. Look at the example files textmaker.pdf and textmaker2.pdf. Try to copy and paste text from the second one with any viewer and you will see what I mean. So the bottom line is: I won't go beyond what I have done with layout preservation.
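To illustrate what "re-arranging each individual character based on its geometric position" could look like, here is a minimal sketch in plain Python. It is not how MuPDF works internally. The input format `(char, x0, x1, y)`, the function name `rebuild_text`, and the tolerances `line_tol` and `gap` are all assumptions made for this example. It also assumes y grows downward on the page. In PyMuPDF, per-character positions of this kind should be obtainable from `page.get_text("rawdict")`.

```python
from typing import Iterable, List, Tuple

# Hypothetical input format: (character, left x, right x, baseline y).
Char = Tuple[str, float, float, float]


def rebuild_text(chars: Iterable[Char], line_tol: float = 2.0, gap: float = 3.0) -> str:
    """Reorder characters by position: top to bottom, then left to right."""
    # Assumption: y increases downward, so ascending y means top to bottom.
    ordered = sorted(chars, key=lambda c: (c[3], c[1]))

    # Group characters whose baselines lie within line_tol of the line's first one.
    lines: List[List[Char]] = []
    for ch in ordered:
        if lines and abs(ch[3] - lines[-1][0][3]) <= line_tol:
            lines[-1].append(ch)
        else:
            lines.append([ch])

    result = []
    for line in lines:
        line.sort(key=lambda c: c[1])  # left to right within the line
        text = line[0][0]
        for prev, cur in zip(line, line[1:]):
            # A horizontal gap wider than `gap` is treated as a word break.
            if cur[1] - prev[2] > gap:
                text += " "
            text += cur[0]
        result.append(text)
    return "\n".join(result)


# Characters stored in scrambled order, as a hostile PDF might do.
scrambled = [
    ("o", 16, 22, 112),
    ("H", 10, 16, 100),
    ("a", 25, 31, 100),
    ("y", 10, 16, 112),
    ("i", 16, 19, 100),
]
print(rebuild_text(scrambled))
```

Even this sketch depends on two guessed thresholds, which is exactly the kind of heuristic described above: it works for simple single-column text and breaks down for rotated text, multiple columns, or unusual spacing.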
Currently, PyMuPDF only supports layout-based extraction from PDFs. It does not extract different semantic structures (tables, sections, metadata, references, lists, headers, footers) separately.
Incorporating state-of-the-art pre-trained models could improve information extraction from PDFs.
Is there any plan to support the extraction of such semantic structures in the future?