Incorrectly parsing OCMDs leads to incomplete renderings of the page #1172

ffc01 · 2021-04-20T13:02:44Z

ffc01
Apr 20, 2021

Describe the bug (mandatory)

PyMuPDF is not extracting all the text on the PDF. Some parts are neither recognized nor extracted. I have already checked whether the unrecognized text is just a picture or something other than text, but I concluded that it is text, because I can copy it and paste it elsewhere when I open the PDF in a PDF reader. The text/objects of that part are shown in Foxit Reader and Acrobat Reader DC but are not recognized by MuPDF.

To Reproduce (mandatory)

I have executed this code:

import fitz

pdf_document = "mypdf.pdf"
doc = fitz.open(pdf_document)

page1 = doc.loadPage(0)

page1text = page1.getText("text", flags=0)

print(page1text)

(If you need the PDF and/or the output, please tell me so I can send it to you via email or DM.)

Expected behavior (optional)

All the text should have been extracted. But only a part of the pdf gets extracted (see screenshot below).

Screenshots (optional)

An example of what gets recognized/extracted in the PDF (red box) and the part that is not extracted/recognized/shown by MuPDF (I cannot publish the PDF):

Your configuration (mandatory)

PyMuPDF 1.18.12: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-04-10 04:00:00.
Built for Python 3.8 on win32 (64-bit).

Additional context (optional)

I am not exactly sure why this is happening, but my guess is that the text is defined in the wrong part of the PDF, so the PDF itself may have some issues.

Thanks in advance

JorjMcKie · 2021-04-20T13:16:29Z

JorjMcKie
Apr 20, 2021
Maintainer

Cannot accept this as a bug, because:

1. Cannot reproduce without the file.
2. What did you do to ensure the blue part really is extractable text at all, and not e.g. an image? Do viewers like Adobe offer this part as selectable with the mouse?

ffc01 · 2021-04-20T13:39:48Z

ffc01
Apr 20, 2021
Author

Thank you for your reply.

To 1: Please tell me how I can send you the file (e.g. via email or DM), as I cannot make it public for privacy reasons.
To 2: Yes, the text is selectable and copyable (I checked that I could paste the copied text into a text file).

JorjMcKie · 2021-04-20T13:44:11Z

JorjMcKie
Apr 20, 2021
Maintainer

Ok, then best use my e-mail address shown on the home page.

JorjMcKie · 2021-04-22T08:06:37Z

JorjMcKie
Apr 22, 2021
Maintainer

I still have not received your PDF ... e-mail: [email protected]

ffc01 · 2021-04-22T08:35:02Z

ffc01
Apr 22, 2021
Author

I have already sent it to you on 04-20-2021 15:52. Maybe it got filtered as spam?
I have just sent the pdf again.

JorjMcKie · 2021-04-22T08:54:19Z

JorjMcKie
Apr 22, 2021
Maintainer

Ah, you are right - was treated as spam.
Thanks!

JorjMcKie · 2021-04-22T10:20:56Z

JorjMcKie
Apr 22, 2021
Maintainer

Found the problem! Congratulations - you detected an upstream error!

Background:

The blue part of the page is the content of a so-called "Form XObject". The page does not show this content itself, but instead invokes the xobject to do this.
This xobject can be set ON or OFF via the Optional Content mechanism, which is controlled by the xobject's PDF key /OC.
In this case, /OC points to an /OCMD (optional content membership dictionary), which in turn points to one OCG (optional content group). The /OCMD object referenced by the xobject is defined as follows:

>>> print(doc.xref_object(36))
<<
  /OCGs 6 0 R
  /P /AllOn
  /Type /OCMD
>>
>>>

OCMD objects allow formulating complex logical conditions, under which something is shown or hidden. In this case, the logic is: "If all OCGs are ON, return ON."
As you can see, this is rather redundant here, because there is only one OCG: object xref 6 0 R. The PDF creator's intention could have been satisfied by directly using OCG 6 0 R in the form xobject's /OC key.
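
To make the "/P /AllOn" logic concrete, here is a minimal sketch of how a visibility policy could be evaluated over the ON/OFF states of the referenced OCGs. The function name `ocmd_visible` is made up for illustration; it is neither MuPDF nor PyMuPDF code, and it ignores the optional /VE visibility expressions. The four policy names and the /AnyOn default are taken from the PDF specification.

```python
from typing import Iterable


def ocmd_visible(ocg_states: Iterable[bool], policy: str = "AnyOn") -> bool:
    """Evaluate an OCMD visibility policy (hypothetical helper, not a library API).

    ocg_states: ON (True) / OFF (False) state of each OCG listed under /OCGs.
    policy:     value of the /P key without the slash; the default is AnyOn.
    """
    states = list(ocg_states)
    if not states:
        # No OCGs listed: the membership dictionary does not affect visibility.
        return True
    if policy == "AllOn":
        return all(states)
    if policy == "AnyOn":
        return any(states)
    if policy == "AnyOff":
        return not all(states)
    if policy == "AllOff":
        return not any(states)
    raise ValueError(f"unknown /P policy: {policy}")


# The OCMD from this thread: one OCG (6 0 R) with policy /AllOn.
print(ocmd_visible([True], "AllOn"))
```

With a single OCG, every policy collapses to "ON if that OCG is ON" (/AllOn, /AnyOn) or its negation (/AnyOff, /AllOff), which is why the OCMD is redundant in this file.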

Now the problem:
The /OCGs key of an OCMD can be a single dictionary (as in this case) or an array of dictionaries. MuPDF, however, only makes the right decision if this entry is an array, and ignores it otherwise.
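
The two shapes of /OCGs can be illustrated with a small sketch that accepts both and always returns a list. The helper name `ocg_xrefs` is hypothetical. It works on the textual form of the value (as printed by doc.xref_object above) and assumes the OCGs are given as indirect references like "6 0 R", not as inline dictionaries.

```python
import re
from typing import List

# Matches an indirect reference such as "6 0 R" (object number, generation, R).
_REF = re.compile(r"(\d+)\s+(\d+)\s+R")


def ocg_xrefs(ocgs_value: str) -> List[int]:
    """Return the OCG object numbers from an /OCGs value (hypothetical helper).

    Handles both a single reference ("6 0 R") and an array ("[6 0 R 7 0 R]"),
    so callers never need to care which of the two forms the PDF uses.
    """
    return [int(match.group(1)) for match in _REF.finditer(ocgs_value)]


print(ocg_xrefs("6 0 R"))          # single reference, as in this PDF
print(ocg_xrefs("[6 0 R 7 0 R]"))  # array form
```

Normalizing to a list first means the policy evaluation only has to deal with one case, which is the step that appears to be missing for the single-reference form.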

You can check this out by trying one of these modifications to the PDF (alternatives):

1. Change the OCMD so that its /OCGs key contains an array
2. Set the /OC key of the XObject to point to 6 0 R instead of 36 0 R

Option 1 works like so:

doc.xref_set_key(36, "OCGs", "[6 0 R]")  # wrap the reference in brackets

Option 2 works like this:

doc.xref_set_key(20, "OC", "6 0 R")
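
If the xref numbers are not known in advance, the first fix could be applied to every affected OCMD in a file. The following is only a sketch under assumptions: `wrap_single_ocgs` is a hypothetical helper, `doc` is expected to be an opened PyMuPDF document, and it relies on doc.xref_get_key returning a (type, value) pair such as ("xref", "6 0 R"), or ("null", "null") when the key is absent.

```python
def wrap_single_ocgs(doc) -> list:
    """Wrap single-reference /OCGs entries of all OCMDs in an array.

    Hypothetical helper: 'doc' is assumed to offer PyMuPDF's xref_length(),
    xref_get_key() and xref_set_key(). Returns the xrefs that were changed.
    """
    fixed = []
    # Valid xref numbers run from 1 to xref_length() - 1.
    for xref in range(1, doc.xref_length()):
        _, type_value = doc.xref_get_key(xref, "Type")
        if type_value != "/OCMD":
            continue  # not an optional content membership dictionary
        kind, value = doc.xref_get_key(xref, "OCGs")
        if kind == "xref":  # a bare reference like "6 0 R", not an array
            doc.xref_set_key(xref, "OCGs", f"[{value}]")
            fixed.append(xref)
    return fixed


# Usage (assumes PyMuPDF is installed and "mypdf.pdf" exists):
#   import fitz
#   doc = fitz.open("mypdf.pdf")
#   print(wrap_single_ocgs(doc))  # xrefs of the OCMDs that were rewritten
```

The change only affects the in-memory document unless it is saved afterwards, so it can be used as a preparation step right before text extraction.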

Of course, you could also get rid of all the optional content fuss by (temporarily) removing it before you extract text:

doc.xref_set_key(doc.pdf_catalog(), "OCProperties", "null")

This error is worth reporting to MuPDF here: https://bugs.ghostscript.com/enter_bug.cgi
If you do, you must be ready to include an example PDF. I am absolutely sure they will treat it confidentially.

If you authorize me, I can do that for you, too.

JorjMcKie · 2021-04-22T11:31:46Z

JorjMcKie
Apr 22, 2021
Maintainer

> If you authorize me, I can do that for you, too.

I will report the bug. I have created a neutral, non-confidential file to support the case.

JorjMcKie · 2021-04-22T11:49:05Z

JorjMcKie
Apr 22, 2021
Maintainer

This is the submitted bug report: https://bugs.ghostscript.com/show_bug.cgi?id=703798

ffc01 · 2021-04-22T11:55:44Z

ffc01
Apr 22, 2021
Author

Thank you for the detailed explanation and for taking care of this bug.
Meanwhile, I have also found another workaround (I do not know for sure why it works or whether it is advisable): I create a new PDF by converting the current one to PDF/A-3 with Ghostscript like this:

gs -dPDFA=3 -dBATCH -dNOPAUSE -sColorConversionStrategy=RGB -sDEVICE=pdfwrite -sOutputFile=mynewdocument.pdf mycurrentdocument.pdf

JorjMcKie · 2021-04-22T12:35:57Z

JorjMcKie
Apr 22, 2021
Maintainer

I am not very familiar with Ghostscript, but you converted the PDF to some PDF/A standard format. I recall that at least some of those standards do not allow optional content, so its removal may take place here.
This is equivalent to my above suggestion doc.xref_set_key(doc.pdf_catalog(), "OCProperties", "null"), which does not need an extra intermediate file, but only "prepares" the file for subsequent complete text extraction.
