clean_contents with sanitize=True removes page contents #4130
Comments
There has been a related bug in the base library, which should have been fixed though. I'm going to take another look in any case. But your motivation for using page clean at all seems to relate to misplaced insertion positions. Am I guessing correctly?
Hm, MuPDF's CLI tool mutool indeed does not do it correctly (so PyMuPDF has no chance to do this any better).
Here is MuPDF's issue link: https://bugs.ghostscript.com/show_bug.cgi?id=708186
So bottom line:
Thanks so much!
Unfortunately I added this to our code over six months ago, and I don't remember the details. The commit message says "Run In that same commit I also updated PyMuPDF from 1.23.26 to 1.24.2, and hopefully I reproduced the issue with that newer version, but I can't make any guarantees. I'm going to use
After I made the change to set If I do I could maybe do It looks like MuPDF already made a release with a fix for the latter issue. Safe to assume that's coming to PyMuPDF soon?
Let me restate what happens or does not happen in the 3 alternatives:
I get If I use I was able to make (by grabbing a PII-free page from a problematic doc) an example that reproduces this problem: FzErrorArgument_single_page.pdf Happy to open a new issue, if you'd like, since this is starting to feel like a separate topic.
Interesting! I get the error when I'm using a TextWriter, but I can use

import pymupdf

# Open the single-page example and print the text it already contains.
original_pdf = pymupdf.open("FzErrorArgument_single_page.pdf")
page = original_pdf[0]
print("original text = ", page.get_text("text"))
# Queue "Hallo" at (100, 100), then write it to the page in opaque red.
text_writer = pymupdf.TextWriter(page.rect)
text_writer.append((100, 100), "Hallo", fontsize=20)
text_writer.write_text(page, color=(1, 0, 0), opacity=1)

Edit - I replicated your result successfully. I'll switch to using
Second edit - it looks like
In the meantime I detected that this is a separate bug. The peculiarity of your page is that it has no
Description of the bug
We've encountered a number of PDFs recently (all from the same source, so I suspect this is specific to a quirk of their format) where calling clean_contents() removes all visible page content. I found that setting sanitize=False causes the content to be retained.
Our process requires adding text to the PDFs, and we call clean_contents() because we've found that without it, text sometimes isn't successfully added.
I'm happy to add sanitize=False to our code if this isn't a bug. Thanks for taking a look!
How to reproduce the bug
single_page.pdf
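A minimal sketch of how the comparison described above might be scripted, assuming the attached single_page.pdf is in the working directory. The helper names text_before_and_after_clean and compare_clean_modes are hypothetical and not part of PyMuPDF. Extracted text is only a rough proxy for "visible page content", so an empty result hints at the problem without proving it.

```python
def text_before_and_after_clean(page, sanitize):
    """Return the page text before and after page.clean_contents(sanitize=...)."""
    before = page.get_text("text")
    page.clean_contents(sanitize=sanitize)
    after = page.get_text("text")
    return before, after


def compare_clean_modes(path, page_number=0):
    """Hypothetical helper: run the check once per sanitize setting.

    The file is reopened for each mode so that one cleaning pass
    cannot influence the other.
    """
    import pymupdf  # imported lazily; requires PyMuPDF to be installed

    results = {}
    for sanitize in (True, False):
        doc = pymupdf.open(path)
        try:
            before, after = text_before_and_after_clean(doc[page_number], sanitize)
        finally:
            doc.close()
        # Store text lengths only, to keep the report short.
        results[sanitize] = (len(before), len(after))
    return results
```

Calling compare_clean_modes("single_page.pdf") returns a dict keyed by the sanitize value. If the behaviour reported here applies, the second length under True would drop sharply while the pair under False stays equal, but that depends on the PyMuPDF and MuPDF versions in use.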
PyMuPDF version
1.25.0
Operating system
MacOS
Python version
3.11