Remove a background text which is overlapped with other texts. #2823
-
I have 100 PDFs where "Confidential" is written at a 45-degree angle across the middle of each page. This text is selectable, so when I try to extract the main text it gets in the way and messes up my tables. I have tried to use
But nothing has helped so far. This is the code I have used so far:
Please also find the attached PDF page to reproduce this issue. [Original image] [Redacted image]
-
This is a Discussions item, so let me transfer it first.
-
If you think the same watermarking approach is always used in the 100 PDFs, you can avoid the complicated analysis above and simply hunt and destroy any Form XObject that writes "Confidential":
import pymupdf
doc = pymupdf.open("input.pdf")  # placeholder path, replace with your own file
for xref in range(1, doc.xref_length()):  # loop over all objects in PDF
    if doc.xref_get_key(xref, "Subtype")[1] != "/Form":  # only look at Form XObjects
        continue
    stream = doc.xref_stream(xref)  # read stream of object
    # check if it writes text (BT / ET are present)
    if b"Confidential" in stream and b"BT" in stream and b"ET" in stream:
        doc.update_stream(xref, b" ")  # replace the stream with a single space
doc.ez_save("cleand2.pdf")
This also does the job.
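Since blanking a stream is destructive, it may help to list the matching objects first and overwrite them only after checking the list. The following is a minimal sketch of that two-step variant, not the author's code: the helper names `find_watermark_forms` and `blank_forms` are hypothetical, and it assumes, as the snippet above does, that the watermark word appears literally in the decompressed stream.

```python
def find_watermark_forms(doc, needle=b"Confidential"):
    """Return xrefs of Form XObjects whose stream appears to write `needle` as text.

    `doc` is expected to be an opened pymupdf.Document (assumption).
    """
    hits = []
    for xref in range(1, doc.xref_length()):  # loop over all objects in the PDF
        # xref_get_key returns a (type, value) pair; we only want Form XObjects
        if doc.xref_get_key(xref, "Subtype")[1] != "/Form":
            continue
        stream = doc.xref_stream(xref)  # decompressed stream bytes
        if not stream:  # defensive: skip empty or missing streams
            continue
        # same heuristic as above: the word plus the text operators BT / ET
        if needle in stream and b"BT" in stream and b"ET" in stream:
            hits.append(xref)
    return hits


def blank_forms(doc, xrefs):
    """Overwrite each listed stream with a single space; return how many were changed."""
    for xref in xrefs:
        doc.update_stream(xref, b" ")
    return len(xrefs)


# Usage sketch (file names are placeholders):
#   import pymupdf
#   doc = pymupdf.open("input.pdf")
#   xrefs = find_watermark_forms(doc)
#   print(xrefs)              # inspect before changing anything
#   blank_forms(doc, xrefs)
#   doc.ez_save("output.pdf")
```

Running only the first function on a few of the 100 files is a cheap way to confirm that they all use the same watermarking approach before anything is modified.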
-
I have tried the following code, which is slightly revised from the earlier code in discussion #1855. It works OK on the current pymupdf release.
import pymupdf

def process_page(page: pymupdf.Page):
    """Process one page."""
    doc = page.parent  # the page's owning document (needed for update_stream below)
    # page.clean_contents()  # clean page painting syntax (disabled in this revision;
    # if a page has several /Contents streams, re-enable it so they are merged into one)
    xref = page.get_contents()[0]  # get xref of the (first) /Contents stream
    changed = 0  # this will be returned
    # read contents, split by line breaks
    cont_lines = page.read_contents().splitlines()
    print(len(cont_lines))
    # print(cont_lines)
    for i in range(len(cont_lines)):  # iterate over the lines
        line = cont_lines[i]
        # print(line)
        if not (line.startswith(b"/Artifact") and b"/Watermark" in line):
            continue  # this was not for us
        print(line)
        # line number i starts the definition, j ends it:
        j = cont_lines.index(b"EMC", i)
        for k in range(i, j):
            # look for image / xobject invocations in this line range
            do_line = cont_lines[k]
            if do_line.endswith(b"Do"):  # this invokes an image / xobject
                cont_lines[k] = b""  # remove / empty this line
                changed += 1
    if changed > 0:  # if we did anything, write back modified /Contents
        doc.update_stream(xref, b"\n".join(cont_lines))
    return changed

fpath = 'your_pdf_file_path/file_name.pdf'
doc = pymupdf.open(fpath)
changed = 0  # indicates successful removals
for page in doc:
    changed += process_page(page)  # increase number of changes
if changed > 0:
    x = "s" if doc.page_count > 1 else ""
    print(f"{changed} watermarks have been removed on {doc.page_count} page{x}.")
    doc.ez_save(doc.name.replace(".pdf", "-nowm.pdf"))
else:
    print("Nothing to change")
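The line scan in `process_page` can also be factored into a pure function over the content-stream bytes, which makes it possible to try the logic on a sample stream without opening a PDF. This is only a sketch under assumptions: the function name `strip_watermark_invocations` is hypothetical, the stream is assumed to have one operator per line (the original code in #1855 called `page.clean_contents()` before reading the lines), and an `/Artifact` section without a closing `EMC` line is left untouched instead of raising `ValueError`.

```python
from typing import Tuple


def strip_watermark_invocations(content: bytes) -> Tuple[bytes, int]:
    """Blank every `Do` line inside /Artifact ... /Watermark ... EMC sections.

    Returns the modified stream and the number of blanked lines.
    """
    lines = content.splitlines()
    removed = 0
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(b"/Artifact") and b"/Watermark" in line:
            try:
                j = lines.index(b"EMC", i)  # end of this marked-content section
            except ValueError:
                break  # unterminated section: leave the remaining lines as they are
            for k in range(i, j):
                if lines[k].endswith(b"Do"):  # image / Form XObject invocation
                    lines[k] = b""
                    removed += 1
            i = j  # continue scanning after the EMC line
        i += 1
    return b"\n".join(lines), removed


# Usage sketch with a made-up content stream:
sample = (
    b"q\n"
    b"/Artifact <</Subtype /Watermark /Type /Pagination>> BDC\n"
    b"q\n"
    b"/Fm0 Do\n"
    b"Q\n"
    b"EMC\n"
    b"/Im1 Do\n"
    b"Q"
)
cleaned, count = strip_watermark_invocations(sample)
```

In this sample only the `/Fm0 Do` line sits inside the watermark section, so the `/Im1 Do` invocation after `EMC` is kept. Inside `process_page`, the result could be written back with `doc.update_stream(xref, cleaned)` when `count > 0`.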
I am trying to be cautious not t…