Pymupdf output different in google colab vs jupyter notebook #773
I've had great success running pymupdf to extract text from academic articles on my local machine (running jupyter notebook in windows). The process I use starts like this:
More text processing steps follow, but nothing else that uses pymupdf. I am trying to migrate to Google Colab, but when I process the same pdfs on there using the same functions, instead of text output, I get lots of the following (or similar):
I'm guessing this is a missing library or dependency and not a fault of PyMuPDF - has anyone else run into this when using PyMuPDF on Colab? Do I need to use the PyMuPDF image processing steps instead of getText, for some reason? Thanks for any ideas!
Replies: 1 comment 3 replies
The text lines you mention represent image metadata, which this getText output variant produces for each image encountered on the page.
This can happen only if the flags parameter of page.getText() requests that any images of the page should be included, i.e. flags & fitz.TEXT_PRESERVE_IMAGES is nonzero. This option is set by default for "blocks".
A different behaviour must therefore go back to either a different flags setting or to documents that are in fact not equal.
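As a rough illustration of the point about flags, here is a minimal sketch of two ways to check for and drop the image entries. It assumes that each item of the "blocks" output is a seven-item tuple ending in a block type (0 for text, 1 for image), which may not hold for every PyMuPDF version. The helper names text_blocks_only and images_requested, the sample data, and the file name paper.pdf are made up for the example and are not part of PyMuPDF.

```python
def text_blocks_only(blocks):
    """Keep only text blocks from page.getText("blocks") style output.

    Assumes each block is a tuple
    (x0, y0, x1, y1, text, block_no, block_type),
    where block_type 0 means text and 1 means image.
    """
    return [b for b in blocks if b[6] == 0]


def images_requested(flags, image_bit):
    """Return True if the image bit is set in a flags bitmask."""
    return bool(flags & image_bit)


# Hypothetical sample that mimics the shape of "blocks" output.
sample = [
    (72.0, 72.0, 300.0, 90.0, "Abstract\n", 0, 0),
    (72.0, 100.0, 300.0, 250.0, "<image: DeviceRGB, width 511, height 379, bpc 8>", 1, 1),
]
clean = text_blocks_only(sample)

# With PyMuPDF installed, the same idea on a real page might look like this:
#   import fitz
#   page = fitz.open("paper.pdf")[0]  # placeholder file name
#   flags = fitz.TEXT_PRESERVE_LIGATURES | fitz.TEXT_PRESERVE_WHITESPACE
#   print(images_requested(flags, fitz.TEXT_PRESERVE_IMAGES))
#   blocks = text_blocks_only(page.getText("blocks", flags=flags))
```

Passing explicit flags on both machines makes the output independent of whatever default each installation uses. Filtering on the block type is a fallback for when the flags cannot be controlled.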