Incorrectly parsing OCMDs leads to incomplete renderings of the page #1172
Replies: 11 comments
-
Cannot accept this as a bug, because:
-
Thank you for your reply. To 1: Please tell me how I can send you the file (e.g. via e-mail or DM), as I cannot make it public (for privacy reasons).
-
OK, then it is best to use my e-mail address shown on the home page.
-
I still have not received your PDF ... e-mail: [email protected]
-
I have already sent it to you on 04-20-2021 15:52. Maybe it got filtered as spam?
-
Ah, you are right: it was treated as spam.
-
Found the problem! Congratulations - you detected an upstream error! Background:
OCMD objects allow formulating complex logical conditions, under which something is shown or hidden. In this case, the logic is: "If all OCGs are ON, return ON." Now the problem: You can check this out by trying one of these modifications to the PDF (alternatives):
Option 1 works like so:
doc.xref_set_key(36, "OCGs", "[6 0 R]")  # put brackets around
Option 2 works like this:
doc.xref_set_key(20, "OC", "6 0 R")
Of course, you could also get rid of all the optional content "Brimborium" by (temporarily) removing it before you extract text:
doc.xref_set_key(doc.pdf_catalog(), "OCProperties", "null")
This error is worth reporting to MuPDF here: https://bugs.ghostscript.com/enter_bug.cgi
If you authorize me, I can do that for you, too.
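For readers unfamiliar with OCMDs, the visibility rule described in this comment can be sketched in plain Python. This is only an illustration and not MuPDF's implementation. The policy names (AllOn, AnyOn, AnyOff, AllOff, with AnyOn as the default) follow my reading of the PDF specification. The helper names `as_ocg_list` and `ocmd_visible` are made up for this sketch. The idea that /OCGs may be given as a single reference instead of an array is an assumption inferred from Option 1.

```python
# Illustrative sketch only: evaluates an OCMD visibility policy in plain Python.
# Policy names follow the /P entry of an OCMD dictionary (assumed default: AnyOn).
POLICIES = {
    "AllOn": lambda states: all(states),       # visible if every OCG is ON
    "AnyOn": lambda states: any(states),       # visible if at least one OCG is ON
    "AnyOff": lambda states: not all(states),  # visible if at least one OCG is OFF
    "AllOff": lambda states: not any(states),  # visible if every OCG is OFF
}


def as_ocg_list(ocgs):
    """Normalize /OCGs, which may be a single reference or an array of them."""
    if ocgs is None:
        return []
    if isinstance(ocgs, (list, tuple)):
        return list(ocgs)
    return [ocgs]  # single reference, e.g. "6 0 R" instead of ["6 0 R"]


def ocmd_visible(ocgs, ocg_states, policy="AnyOn"):
    """Return True if content controlled by this OCMD should be shown.

    ocgs:       one OCG reference or a list of references (hypothetical strings)
    ocg_states: mapping of OCG reference -> True (ON) / False (OFF)
    """
    # Unknown references are ignored; with no usable OCGs the OCMD has no effect.
    states = [ocg_states[ref] for ref in as_ocg_list(ocgs) if ref in ocg_states]
    if not states:
        return True
    return POLICIES[policy](states)
```

With this normalization, a single reference and a one-element array give the same answer under "AllOn", which is the behavior the two PDF modifications above are meant to confirm.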
-
I will report the bug. I have created a neutral, non-confidential file to support the case.
-
This is the submitted bug report: https://bugs.ghostscript.com/show_bug.cgi?id=703798
-
Thank you for the detailed explanation and for taking care of this bug.
-
I am not very familiar with Ghostscript, but you converted the PDF to some PDF/A standard format. I recall that at least some of those standards do not allow optional content, so it may be removed during that conversion.
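To check whether a conversion step dropped the optional content, one could compare the PDF catalog before and after the conversion. A minimal sketch, assuming a PyMuPDF version that provides `xref_get_key` and `pdf_catalog`, and assuming that `xref_get_key` reports the type "null" for a missing key. The function name `has_optional_content` is hypothetical.

```python
def has_optional_content(doc):
    """Return True if the PDF catalog carries a non-null /OCProperties entry.

    `doc` is expected to be an open PyMuPDF document (assumption: it offers
    pdf_catalog() and xref_get_key(), as used elsewhere in this thread).
    """
    catalog_xref = doc.pdf_catalog()
    # xref_get_key returns a (type, value) pair; type "null" means "no such key".
    key_type, _value = doc.xref_get_key(catalog_xref, "OCProperties")
    return key_type != "null"
```

With PyMuPDF installed, this would be called as `has_optional_content(fitz.open("some.pdf"))`, where the file name is a placeholder. Running it on the original and on the converted file shows whether the conversion removed the entry.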
-
Please provide all mandatory information!
Describe the bug (mandatory)
PyMuPDF does not extract all the text in the PDF. Some parts are neither recognized nor extracted. I checked whether the unrecognized text is actually an image or some other non-text object, and concluded that it is real text, because I can copy it and paste it elsewhere when I open the PDF in a PDF reader. The text/objects in that part are shown in Foxit Reader and Acrobat Reader DC but are not recognized by MuPDF.
To Reproduce (mandatory)
I have executed this code:
(If you need the PDF and/or the output, please tell me so I can send it to you via e-mail or DM.)
Expected behavior (optional)
All the text should have been extracted, but only part of the PDF gets extracted (see screenshot below).
Screenshots (optional)
An example of what gets recognized/extracted in the PDF (red box) and the part that is not extracted/recognized/shown by MuPDF (I cannot publish the PDF):
Your configuration (mandatory)
PyMuPDF 1.18.12: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-04-10 04:00:00.
Built for Python 3.8 on win32 (64-bit).
Additional context (optional)
I am not exactly sure why this is happening, but my guess is that the text is defined in the wrong part of the PDF, so the PDF itself may have some issues.
Thanks in advance