Find and remove watermarks in PDF file #1855
-
I am currently trying to use PyMuPDF to remove watermarks in PDF files. For example, I have a file like this:

The code I used for extracting is like this:

    document = fitz.open(self.input)
    for each_page in document:
        image_list = each_page.getImageList()
        for image_info in image_list:
            pix = fitz.Pixmap(document, image_info[0])
            png = pix.tobytes()  # return picture in png format
            if png == watermark_image:
                document._deleteObject(image_info[0])
    document.save(out_filename)

Also I tried to check the page's other attributes: no annotations, no links, no widgets. I have no idea how the mark is stored.
-
The watermarks in your example file are stored as so-called marked-content /Artifacts.

1. Determine presence of marked-content watermarks

First standardize the page's /Contents objects, then confirm the presence of this watermark type:

    page.clean_contents()  # standardize the page's /Contents
    xref = page.get_contents()[0]  # get xref of resulting /Contents object
    cont = bytearray(page.read_contents())  # read the contents source as a (modifiable) bytearray
    if cont.find(b"/Subtype/Watermark") > 0:  # this will confirm a marked-content watermark is present
        print("marked-content watermark present")

2. Remove marked-content watermarks

After confirmation in the previous step, we "edit" the source and remove all such definitions. Because of source standardization, we can rely on a predictable layout. Every watermark in your example looks like this:

    q
    /Artifact <</Subtype/Watermark/Type/Pagination>> BDC
    .573 .816 .314 rg
    /Fm1 Do
    Q
    EMC

"Fm1" is the first of those 10 Chinese characters in the green diagonal text. The green color is coded as .573 .816 .314 rg.

    while True:
        i1 = cont.find(b"/Artifact")  # start of definition
        if i1 < 0:
            break  # none more left: done
        i2 = cont.find(b"EMC", i1)  # end of definition
        cont[i1-2 : i2+3] = b""  # remove the full definition source "q ... EMC"
    doc.update_stream(xref, cont)  # replace the original source
    doc.ez_save("x.pdf")  # save to new file
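The removal step can also be tried out on a plain byte string before touching a real file. The following is only a sketch under assumptions: `strip_watermark_blocks` and `SAMPLE` are hypothetical names that are not part of PyMuPDF, and the input is assumed to be already standardized so that each watermark has the `q ... /Artifact ... EMC` layout shown above. Unlike the loop above, it skips /Artifact blocks that are not watermarks and stops if no closing `EMC` is found.

```python
def strip_watermark_blocks(cont: bytes) -> tuple[bytes, int]:
    """Remove every 'q ... /Artifact ... /Subtype/Watermark ... EMC' block.

    Hypothetical helper (not a PyMuPDF function). Returns the edited
    source and the number of blocks removed.
    """
    buf = bytearray(cont)
    removed = 0
    start = 0
    while True:
        i1 = buf.find(b"/Artifact", start)  # start of a marked-content block
        if i1 < 0:
            break  # no more blocks
        i2 = buf.find(b"EMC", i1)  # end of that block
        if i2 < 0:
            break  # unterminated block: leave the rest untouched
        if b"/Subtype/Watermark" not in buf[i1:i2]:
            start = i2 + 3  # some other artifact: skip it
            continue
        # Also drop the "q\n" that precedes the block, if it is there.
        begin = i1 - 2 if i1 >= 2 and buf[i1 - 2:i1] == b"q\n" else i1
        del buf[begin:i2 + 3]
        removed += 1
        start = begin
    return bytes(buf), removed


# Synthetic content stream: one watermark block followed by ordinary text.
SAMPLE = (
    b"q\n/Artifact <</Subtype/Watermark/Type/Pagination>> BDC\n"
    b".573 .816 .314 rg\n/Fm1 Do\nQ\nEMC\n"
    b"BT\n/F1 12 Tf\n(Hello) Tj\nET\n"
)
cleaned, count = strip_watermark_blocks(SAMPLE)
print(count, cleaned)
```

With a real page, the input would come from `page.read_contents()` after `page.clean_contents()`, and the result would be written back with `doc.update_stream(xref, cleaned)` as in the snippet above.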
-
This new example is indeed not a watermark at all. Technically it is so-called "line art": elementary drawings of lines and curves forming Chinese characters. Your previous example also had these things, but there the drawings were coded inside separate PDF objects (Form XObjects) and then referenced by the Do operator. Here, the drawings are made directly on the page. You can extract them (via
-
@Jason-XII Could you share your complete code to remove a watermark?
-
I have tried the following code, which is revised a little bit from previous code in this discussion. It works ok on current pymupdf release.

    import pymupdf


    def process_page(page: pymupdf.Page):
        """Process one page."""
        doc = page.parent  # the page's owning document
        # page.clean_contents()  # clean page painting syntax
        xref = page.get_contents()[0]  # get xref of resulting /Contents
        changed = 0  # this will be returned
        # read sanitized contents, split by line breaks
        cont_lines = page.read_contents().splitlines()
        print(len(cont_lines))
        # print(cont_lines)
        for i in range(len(cont_lines)):  # iterate over the lines
            line = cont_lines[i]
            # print(line)
            if not (line.startswith(b"/Artifact") and b"/Watermark" in line):
                continue  # this was not for us
            # line number i starts the definition, j ends it:
            print(line)
            j = cont_lines.index(b"EMC", i)
            for k in range(i, j):
                # look for image / xobject invocations in this line range
                do_line = cont_lines[k]
                if do_line.endswith(b"Do"):  # this invokes an image / xobject
                    cont_lines[k] = b""  # remove / empty this line
                    changed += 1
        if changed > 0:  # if we did anything, write back modified /Contents
            doc.update_stream(xref, b"\n".join(cont_lines))
        return changed


    fpath = 'your_pdf_file_path/file_name.pdf'
    doc = pymupdf.open(fpath)
    changed = 0  # indicates successful removals
    for page in doc:
        changed += process_page(page)  # increase number of changes
    if changed > 0:
        x = "s" if doc.page_count > 1 else ""
        print(f"{changed} watermarks have been removed on {doc.page_count} page{x}.")
        doc.ez_save(doc.name.replace(".pdf", "-nowm.pdf"))
    else:
        print("Nothing to change")
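
The line-scanning logic of this script can be dry-run on a list of byte strings, without opening a PDF. This is a minimal sketch, not part of the posted code: `blank_watermark_invocations` and `sample_lines` are hypothetical names, and the sample assumes one operator per line, as the script itself does. The only behavioral difference is that a missing `EMC` line ends the scan instead of raising `ValueError`.

```python
def blank_watermark_invocations(lines: list[bytes]) -> int:
    """Blank every '... Do' line inside an /Artifact watermark block.

    Hypothetical helper mirroring the loop above; modifies `lines`
    in place and returns the number of blanked lines.
    """
    changed = 0
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(b"/Artifact") and b"/Watermark" in line:
            try:
                j = lines.index(b"EMC", i)  # line that ends the block
            except ValueError:
                break  # no closing EMC: stop rather than guess
            for k in range(i, j):
                if lines[k].endswith(b"Do"):  # image / XObject invocation
                    lines[k] = b""
                    changed += 1
            i = j  # continue after the block
        i += 1
    return changed


# Synthetic content: one watermark block, then an unrelated image invocation.
sample_lines = [
    b"q",
    b"/Artifact <</Subtype/Watermark/Type/Pagination>> BDC",
    b".573 .816 .314 rg",
    b"/Fm1 Do",
    b"Q",
    b"EMC",
    b"/Im0 Do",
]
count = blank_watermark_invocations(sample_lines)
print(count, sample_lines)
```

The `/Im0 Do` line outside the marked-content block is left alone, which is the point of restricting the search to the range between `/Artifact` and `EMC`.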
The watermarks in your example file are stored as so-called marked-content /Artifacts. There is no direct, dedicated high-level function in PyMuPDF to deal with these object types. But you can use PyMuPDF's low-level interface to locate and remove them if you follow a strict procedure.
1. Determine presence of marked-content watermarks
First standardize the page's /Contents objects. This will produce a predictable source code structure and also repair any potential issues. Only one such object will be left over. Then confirm the presence of this watermark type.