Extracting text from LibreOffice PDFs #810
Thank you and the community for all the hard work, and for the great documentation and examples! I'd like to use PyMuPDF to extract text and images from PDFs created by Docsplit (that is, by LibreOffice) when converting MS Office documents to PDF. I'm grateful for any tips!
Replies: 1 comment 1 reply
Just so your question doesn't feel so lonely, here is a non-answer:
I haven't analyzed LibreOffice's PDF output yet, but I have looked at some PDFs produced by MS Word.
As expected, there was (almost) nothing special: the text reading sequence can be expected to be normal. Of course, you need to take care if your pages contain multi-column text, but that issue is independent of Word / LibreOffice.
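To illustrate the multi-column caveat, here is a minimal sketch rather than a definitive recipe. It assumes a simple layout with equal-width columns and that blocks come as tuples shaped like the output of `page.get_text("blocks")` in recent PyMuPDF versions (older releases spell it `getText`). The helper names `order_blocks_by_column` and `extract_page_texts` are hypothetical, made up for this example.

```python
def order_blocks_by_column(blocks, page_width, n_columns=2):
    """Sort text blocks column by column, then top to bottom.

    Each block is assumed to be a tuple shaped like PyMuPDF's
    page.get_text("blocks") output:
    (x0, y0, x1, y1, text, block_no, block_type).
    Assumption: columns have equal width and no block spans two columns.
    """
    column_width = page_width / n_columns

    def sort_key(block):
        x0, y0 = block[0], block[1]
        # Clamp the column index so stray coordinates stay in range.
        column = max(0, min(int(x0 // column_width), n_columns - 1))
        return (column, y0, x0)

    # block_type 0 marks text blocks; image blocks (type 1) are skipped.
    text_blocks = [b for b in blocks if b[6] == 0]
    return sorted(text_blocks, key=sort_key)


def extract_page_texts(pdf_path, n_columns=2):
    """Return one string per page, with blocks reordered by column."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependency

    doc = fitz.open(pdf_path)
    try:
        texts = []
        for page in doc:
            blocks = page.get_text("blocks")
            ordered = order_blocks_by_column(blocks, page.rect.width, n_columns)
            texts.append("\n".join(b[4].strip() for b in ordered))
        return texts
    finally:
        doc.close()
```

For single-column pages, plain `page.get_text()` is probably enough; the reordering only matters when the stored block order does not match the visual reading order.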
Some text glyphs did not yield the expected characters, though that may be a Word peculiarity.
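One way to spot such glyph problems in extracted text is sketched below. It rests on an assumption that is not confirmed in this thread: glyphs without a usable Unicode mapping often show up as the replacement character U+FFFD or as private-use code points (for example, symbol-font bullets). The function name `find_suspicious_chars` is hypothetical.

```python
import unicodedata


def find_suspicious_chars(text):
    """Count characters that hint at a failed glyph-to-Unicode mapping.

    Assumption: U+FFFD and private-use code points (category "Co")
    are the typical symptoms; other failure modes are not covered.
    """
    suspicious = {}
    for ch in text:
        if ch == "\ufffd" or unicodedata.category(ch) == "Co":
            suspicious[ch] = suspicious.get(ch, 0) + 1
    return suspicious
```

Running this over the text of each page, for instance the strings returned by `page.get_text()`, gives a rough idea of which pages or fonts deserve a closer look.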