Pymupdf output different in google colab vs jupyter notebook #773

JSchoonmaker · 2020-12-16T18:50:18Z

I've had great success running PyMuPDF to extract text from academic articles on my local machine (running Jupyter Notebook on Windows). The process I use starts like this:

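import fitz  # PyMuPDF

# file_list: a list of PDF file paths, defined elsewhere
for path in file_list:
    try:
        doc = fitz.open(path)
        blocks = []
        for page in doc:
            # each block is a tuple; item 4 holds the block's text
            blocks += page.getText("blocks")

        for block in blocks:
            text = block[4]
            print(text)
    except Exception as exc:  # the excerpt omits its except clause; this one is a placeholder
        print(f"Could not process {path}: {exc}")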

More text processing steps follow, but nothing else that uses pymupdf.

I am trying to migrate to Google Colab, but when I process the same PDFs there using the same functions, instead of text output I get lots of the following (or similar):

<image: ICCBased(Gray,Generic Gray Gamma 2.2 Profile), width 1694, height 2200, bpc 1>
<image: ICCBased(Gray,Generic Gray Gamma 2.2 Profile), width 1694, height 2200, bpc 1>
<image: ICCBased(Gray,Generic Gray Gamma 2.2 Profile), width 1694, height 2200, bpc 1>

I'm guessing this is a missing library or dependency and not a fault of PyMuPDF - has anyone else run into this when using PyMuPDF on Colab? Do I need to use the PyMuPDF image processing steps instead of getText, for some reason?

Thanks for any ideas!

JorjMcKie · 2020-12-16T22:05:47Z

Maintainer

The text lines you mention represent image metadata, which this getText output variant produces for each image encountered on the page.
This can happen only if

- the page indeed has images, and
- the flags parameter of page.getText() requests that images on the page be included, i.e. flags & fitz.TEXT_PRESERVE_IMAGES is non-zero. This option is set by default for "blocks".

Different behaviour must therefore stem from either a different flags setting or documents that are in fact not identical.
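
To illustrate the point about image entries, they can also be filtered out after extraction. This is only a minimal sketch: it assumes a PyMuPDF version whose "blocks" items are 7-item tuples of the form (x0, y0, x1, y1, text, block_no, block_type), with block_type 1 for image blocks and 0 for text blocks. The helper name text_only and the sample data are made up for the example.

```python
def text_only(blocks):
    """Return the text of all non-image blocks.

    Assumes 7-item block tuples whose last item is block_type
    (0 = text block, 1 = image block).
    """
    return [b[4] for b in blocks if b[6] == 0]


# Hand-made sample data standing in for page.getText("blocks") output.
sample_blocks = [
    (72.0, 72.0, 300.0, 90.0, "Introduction\n", 0, 0),
    (0.0, 0.0, 612.0, 792.0,
     "<image: ICCBased(Gray,Generic Gray Gamma 2.2 Profile), "
     "width 1694, height 2200, bpc 1>", 1, 1),
]

print(text_only(sample_blocks))
```

If a page yields nothing but image entries, as in the output quoted in the question, this returns an empty list, which would suggest that the page holds a scanned image rather than extractable text.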

3 replies

JorjMcKie Dec 16, 2020
Maintainer

To make sure you do not see image information like this, explicitly set flags=0. This also speeds up processing significantly if many or large images are present.
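
A sketch of what that suggestion could look like in code. The helper name page_text_blocks is hypothetical; the getattr fallback is there because newer PyMuPDF releases name the method get_text, while the version used in this thread has getText.

```python
def page_text_blocks(page):
    """Return the block texts of one page with image output disabled."""
    # Use get_text if the page object has it, otherwise the older getText.
    get_text = getattr(page, "get_text", None) or page.getText
    # flags=0 clears all extraction flags, including TEXT_PRESERVE_IMAGES.
    return [block[4] for block in get_text("blocks", flags=0)]


# Usage with PyMuPDF (not executed here; the file name is a placeholder):
#   import fitz
#   doc = fitz.open("article.pdf")
#   for page in doc:
#       for text in page_text_blocks(page):
#           print(text)
```

Note that flags=0 also switches off the other flag bits, such as ligature and whitespace preservation. To drop only the images, passing flags=fitz.TEXT_PRESERVE_LIGATURES | fitz.TEXT_PRESERVE_WHITESPACE should work as well.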

JSchoonmaker Dec 17, 2020
Author

Thank you for the quick reply!

Setting flags=0 did get rid of the image information, but didn't change the amount of text extracted, as you seemed to expect.

The only difference I know of between the source pdf files is that one is stored locally and the other was uploaded to my google drive just before running pymupdf on it. I haven't been able to find anything indicating that cloud vs local storage makes a difference in pdf formatting - I assumed one is an exact copy of the other, but perhaps not.

I had hoped to use google colab w/my team but will just avoid it until I can tackle this further. Thanks again!
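
One way to test the "exact copy" assumption is to compute a checksum of the same PDF in both environments and compare the results. The following is a minimal sketch using only the Python standard library; the function name sha256_of is made up for the example.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (the path is a placeholder): run this locally and in Colab,
# then compare the two printed digests.
#   print(sha256_of("article.pdf"))
```

If the digests differ, the file was changed somewhere between the local disk and the Colab session; if they match, the documents are byte-identical and the cause has to be elsewhere.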

JorjMcKie Dec 17, 2020
Maintainer

> I assumed one is an exact copy of the other, but perhaps not.

Weird - never heard of something like that. But maybe PDFs get converted during the upload to make them conform to some standard for online storage.
What if you do not upload the PDFs directly, but pack them locally in a ZIP file and instruct your users to unzip them before processing?
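
The ZIP idea could be sketched as follows with the standard library's zipfile module. The function names pack_pdfs and unpack_pdfs are hypothetical, and the sketch assumes that all file names are distinct, because only the base name of each file is stored in the archive.

```python
import zipfile
from pathlib import Path


def pack_pdfs(pdf_paths, archive_path):
    """Store the given files in a ZIP archive under their base names."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in pdf_paths:
            zf.write(p, arcname=Path(p).name)


def unpack_pdfs(archive_path, target_dir):
    """Extract the archive into target_dir and return the extracted paths."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target_dir)
        return [Path(target_dir) / name for name in zf.namelist()]
```

Packing would happen on the local machine and unpacking in the Colab session, after which the extracted paths can be handed to fitz.open() as before.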
