Extracting text from libreoffice PDFs #810

shueffner · 2021-01-05T21:06:37Z

Thank you and the community for all the hard work, and the great documentation and examples!

I'd like to use PyMuPDF to extract text and images from PDFs that Docsplit (so, LibreOffice) creates when converting MS Office documents to PDF.
Is there anything I need to be aware of? Does LibreOffice generally generate "nice", well-formed PDFs? Would the extracted text naturally be in "reading order"?

I'm grateful for any tips!

Answered by JorjMcKie

JorjMcKie · 2021-01-06T08:23:02Z

Maintainer

Just so your question doesn't feel so lonely, here is something of a non-answer:
I haven't analyzed LibreOffice's PDF output yet, but I have looked at some PDFs produced by MS Word.
As expected, there was (almost) nothing special. The text reading sequence can be expected to be normal. Of course you need to take care if your pages contain multi-column text, but that issue is independent of Word / LibreOffice.
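
For the multi-column case, here is a minimal sketch of one possible approach, not a recommended or official method. It assumes a simple layout of two equal-width columns with no full-width headings, and it relies on `page.get_text("blocks")` returning tuples of the form `(x0, y0, x1, y1, text, block_no, block_type)` (older PyMuPDF versions spell the method `getText`). The helper names `two_column_order` and `page_text_two_columns` are made up for this illustration.

```python
def two_column_order(blocks, page_width):
    """Order text blocks: left column top to bottom, then right column.

    Assumes exactly two columns separated at the horizontal midline.
    """
    mid = page_width / 2

    def key(block):
        x0, y0, x1, _y1 = block[:4]
        # A block belongs to the column that contains its horizontal center.
        column = 0 if (x0 + x1) / 2 < mid else 1
        return (column, y0, x0)

    # block_type 0 is text, 1 is an image block; keep text only.
    return sorted((b for b in blocks if b[6] == 0), key=key)


def page_text_two_columns(path, page_number=0):
    """Return the text of one page, read column by column."""
    import fitz  # PyMuPDF; imported here so the sorter above has no dependency

    with fitz.open(path) as doc:
        page = doc[page_number]
        blocks = page.get_text("blocks")
        ordered = two_column_order(blocks, page.rect.width)
    return "\n".join(b[4] for b in ordered)
```

Real documents often mix single-column and multi-column regions, so a midline split like this is only a starting point.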

Some glyphs did not yield the expected characters, though, but that may be a Word peculiarity.
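
One rough way to spot such glyph problems in your own files is to count characters that commonly indicate a failed glyph-to-Unicode mapping. The sketch below assumes that unmapped glyphs surface as U+FFFD (the Unicode replacement character) or as private-use code points; that is a heuristic, not a guarantee, and the function names are invented for this example.

```python
import unicodedata
from collections import Counter

REPLACEMENT = "\ufffd"  # Unicode replacement character


def suspicious_chars(text):
    """Count characters that often signal a glyph without a usable Unicode value."""
    counts = Counter()
    for ch in text:
        # "Co" is the Unicode general category for private-use code points.
        if ch == REPLACEMENT or unicodedata.category(ch) == "Co":
            counts[ch] += 1
    return counts


def report_suspicious(path):
    """Map page number to suspicious-character counts for pages that have any."""
    import fitz  # PyMuPDF; imported here so suspicious_chars has no dependency

    result = {}
    with fitz.open(path) as doc:
        for page in doc:
            counts = suspicious_chars(page.get_text())
            if counts:
                result[page.number] = counts
    return result
```

An empty result does not prove the text is correct; a glyph can also map to a valid but wrong character, which only a comparison against the source document would reveal.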

shueffner Jan 6, 2021
Author

That's not a non-answer, no news is good news :-)
Thank you, I appreciate it!
