Remove a background text which is overlapped with other texts. #2823

Soumadip-Saha · 2023-11-20T10:14:04Z

Soumadip-Saha
Nov 20, 2023

I have 100 PDFs where "Confidential" is written at a 45-degree angle across the middle of each page. This text is selectable, so when I try to extract the main text it gets mixed in and messes up my tables. I tried page.add_redact_annot with the rectangle covering "Confidential", but that removes the foreground text as well. I have attached screenshots of the original PDF page and of the redacted PDF.
Please help, I have been stuck on this problem forever. Any kind of help is really appreciated. So far I have tried:

PDF to DOC method
Other watermark removal techniques, as mentioned in "Question: How to remove a word water_mark from PDF?" #468

But nothing has helped so far.

This is the code I have used so far:

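import fitz  # PyMuPDF

pdf = fitz.open(r"page.pdf")
page = pdf[0]
# the keyword is "quads"; it returns quads instead of rectangles
rect = page.search_for("Confidential", quads=True)
print(rect)
page.add_redact_annot(rect[0])  # redact the first hit
page.apply_redactions()  # also removes any other text overlapping that area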

Please also find the attached PDF page for recreation of this issue.

page.pdf

Original Image:

Redacted Image:

Answered by JorjMcKie

Nov 20, 2023

JorjMcKie · 2023-11-20T10:57:47Z

JorjMcKie
Nov 20, 2023
Maintainer

This is a Discussions item, so let me transfer it first.

0 replies

JorjMcKie · 2023-11-20T12:16:44Z

JorjMcKie
Nov 20, 2023
Maintainer

This is one of the zillion ways to "watermark" pages - too many to name them all.
Your case can only be resolved by quite a hacky approach.
My initial suspicion after your detailed description was that we have to look for a text object that can be reused between pages or even between many PDFs.
In this example, that hypothesis was confirmed. We look for text blocks containing a single line with a single span whose text is "Confidential":

spans = []
for b in page.get_text("dict",flags=fitz.TEXTFLAGS_TEXT)["blocks"]:  # only extract text
    if len(b["lines"]) != 1:  # only look for 1-line blocks
        continue
    line = b["lines"][0]
    if len(line["spans"]) != 1:  # only look for 1-span lines
        continue
    span = line["spans"][0]
    if span["text"] != "Confidential":  # span text must be "Confidential"
        continue
    spans.append((line["dir"], span))  # store writing direction and the span

len(spans)  # indeed: exactly one such span:
1

Looking at the details, we get this:

wdir, span = spans[0]
wdir  # the writing direction: cosine, sine of the angle
(0.6156609058380127, -0.7880111932754517)
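As a quick sanity check, the direction vector can be converted to an angle. This is a minimal sketch using only the standard library. It assumes page coordinates with y pointing down (as in MuPDF), so the second component is negated to get the counter-clockwise angle as seen on the page. The variable names are illustrative.

```python
import math

# line["dir"] as shown in the output above
wdir = (0.6156609058380127, -0.7880111932754517)

# Negate the second component, assuming y points down in page coordinates,
# to obtain the counter-clockwise angle as it appears on the page.
angle = math.degrees(math.atan2(-wdir[1], wdir[0]))
print(round(angle, 2))
```

For this page the result is close to 52 degrees rather than exactly 45, which is one reason to read the value from the file instead of assuming it.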

Next, I reformatted the page appearance commands to make parsing easier: page.clean_contents(sanitize=False). This puts each command on its own line and standardizes the syntax in general.
The idea is to look for a PDF rotation matrix with the cosine and sine values from above. Because MuPDF standardizes the source, float values no longer have a zero before the decimal point and have at most 5 digits after it. So I formatted the cosine and sine as bytes values accordingly:

cosine = round(wdir[0],5)
sine = round(wdir[1], 5)
cos_text = str(cosine).replace("0.",".").encode()
sin_text = str(sine).replace("0.", ".").encode()
sin_text  # should be a PDF matrix parameter:
b'-.78801'
cos_text  # should be a PDF matrix parameter:
b'.61566'

Now loop over the commands of the reformatted page appearance source:

cont_lines = page.read_contents().splitlines()
for i,line in enumerate(cont_lines):
    if not line.endswith(b" cm"):  # a PDF matrix command
        continue
    if sin_text not in line or cos_text not in line:
        continue
    print(f"line {i}, {line}")
    break

line 3242, b'.61566 .78801 -.78801 .61566 289.99 223 cm'
len(cont_lines)  # total number of appearance commands:
3251

So we landed close to the end of the commands. Let's see what the following lines look like:

from pprint import pprint
pprint(cont_lines[3242:])
[b'.61566 .78801 -.78801 .61566 289.99 223 cm',
 b'/Xi57 Do',
 b'Q',
 b'/Xi58 gs',
 b'q',
 b'1 0 0 1 242 10 cm',
 b'/Xi59 Do',
 b'Q',
 b'Q']
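The pattern visible in this output (a rotation matrix line immediately followed by a "Do" line) can be searched for in one step. The following is only a sketch that works on a plain list of bytes lines, such as the result of page.read_contents().splitlines() after cleaning. The function name is hypothetical, and it assumes one command per line as produced by clean_contents.

```python
def find_rotated_invocations(cont_lines, cos_text, sin_text):
    """Return (line index, XObject name) pairs for 'cm' lines whose first four
    operands contain both values and which are directly followed by a 'Do' line."""
    hits = []
    for i in range(len(cont_lines) - 1):
        tokens = cont_lines[i].split()
        if len(tokens) != 7 or tokens[-1] != b"cm":  # six numbers plus the operator
            continue
        rotation = tokens[:4]  # a b c d of the matrix
        if cos_text not in rotation or sin_text not in rotation:
            continue
        nxt = cont_lines[i + 1].split()
        if len(nxt) == 2 and nxt[1] == b"Do":  # "/Name Do"
            hits.append((i, nxt[0]))
    return hits


# the tail of the content stream shown above
sample = [
    b".61566 .78801 -.78801 .61566 289.99 223 cm",
    b"/Xi57 Do",
    b"Q",
    b"/Xi58 gs",
    b"q",
    b"1 0 0 1 242 10 cm",
    b"/Xi59 Do",
    b"Q",
    b"Q",
]
print(find_rotated_invocations(sample, b".61566", b"-.78801"))
```

Comparing whole tokens instead of substrings avoids accidental matches such as ".61566" inside "1.61566".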

So we see: after the rotation matrix something named "/Xi57" is being invoked. Looking at the page definition, we also detect that object there:

print(doc.xref_object(page.xref))
<<
  /Type /Page
  /MediaBox [ 0 0 612 792 ]
  /Resources <<
    /Font <<
      /F1 7 0 R
      /F2 10 0 R
      /F3 13 0 R
      /F9 21 0 R
      /F6 24 0 R
    >>
    /ExtGState <<
      /GS7 32 0 R
      /GS8 33 0 R
      /Xi54 34 0 R
      /Xi56 35 0 R
      /Xi58 36 0 R
    >>
    /XObject <<
      /Image16 37 0 R
      /Xi55 40 0 R
      /Xi57 42 0 R     % <=== that's the guy!
      /Xi59 43 0 R
    >>
    /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ]
  >>
  /Contents 133 0 R
  /Group <<
    /Type /Group
    /S /Transparency
    /CS /DeviceRGB
  >>
  /Tabs /S
  /StructParents 0
  /Parent 2 0 R
>>
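Instead of reading the xref off the dump by eye, the name can be looked up in the text returned by doc.xref_object(page.xref). This is a rough sketch with a hypothetical helper name. It assumes that /Resources and /XObject are written inline in the page object, as in the dump above; if they are stored as indirect references, it simply returns None.

```python
import re


def xobject_xref(page_object: str, name: str):
    """Return the xref number that `name` maps to inside the /XObject
    dictionary of a page object dump, or None if it cannot be found."""
    xobjects = re.search(r"/XObject\s*<<(.*?)>>", page_object, re.DOTALL)
    if xobjects is None:
        return None
    entry = re.search(
        r"/" + re.escape(name) + r"\s+(\d+)\s+\d+\s+R", xobjects.group(1)
    )
    return int(entry.group(1)) if entry else None


# shortened version of the page object shown above
sample = """<<
  /Type /Page
  /Resources <<
    /ExtGState <<
      /Xi58 36 0 R
    >>
    /XObject <<
      /Image16 37 0 R
      /Xi55 40 0 R
      /Xi57 42 0 R
      /Xi59 43 0 R
    >>
  >>
>>"""
print(xobject_xref(sample, "Xi57"))
```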

It is a "Form XObject" stored in xref 42. These objects have a stream, so we may be lucky to find the text writing commands for "Confidential" in there:

cont42 = doc.xref_stream(42)
b"Confidential" in cont42
True
# GOT YOU!

The rest is comparatively simple: just replace that stream with an empty one:

doc.update_stream(42, b" ")
doc.ez_save("cleaned.pdf")

This works for the example you supplied.

Apologies for being so verbose, but I warned you it would be hacky.
I also hope it became obvious that this approach is by no means guaranteed to succeed.
But maybe it is enough material for you to develop something that works for your 100 PDFs.

1 reply

JorjMcKie Nov 20, 2023
Maintainer

BTW if the PDF has multiple pages watermarked with the same object (xref 42), then they all will profit from the hack on one page.

JorjMcKie · 2023-11-20T12:47:12Z

JorjMcKie
Nov 20, 2023
Maintainer

If you think that the same watermarking approach is always used in the 100 PDFs, you can avoid the complicated analysis above and simply hunt and destroy a Form XObject that writes "Confidential":

import fitz  # PyMuPDF

doc = fitz.open("page.pdf")  # the example file from the question
for xref in range(1, doc.xref_length()):  # loop over all objects in PDF
    if doc.xref_get_key(xref, "Subtype")[1] != "/Form":  # only look at Form XObjects
        continue
    stream = doc.xref_stream(xref)  # read stream of object
    # check if it writes text (BT / ET are present)
    if b"Confidential" in stream and b"BT" in stream and b"ET" in stream:
        doc.update_stream(xref, b" ")  # replace the stream with an (almost) empty one

doc.ez_save("cleand2.pdf")

This also does the job.
I am trying to be cautious not to remove too much of other content - maybe you have to refine the sub-selection above.
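One possible refinement of that sub-selection is sketched below: it only accepts a stream if the word appears as a complete literal string inside a BT ... ET text object. The function name is hypothetical, and the check is an assumption about how the watermark is written. It will miss text stored as hex strings or split across a TJ array.

```python
import re

# a BT ... ET text object, with BT and ET as standalone tokens
TEXT_OBJECT = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)


def writes_word(stream: bytes, word: bytes) -> bool:
    """True if `word` is a complete literal string operand, e.g. (Confidential),
    inside some BT ... ET text object of the stream."""
    literal = b"(" + word + b")"
    return any(literal in m.group(1) for m in TEXT_OBJECT.finditer(stream))


print(writes_word(b"q BT /F1 48 Tf (Confidential) Tj ET Q", b"Confidential"))
print(writes_word(b"BT (Not Confidential at all) Tj ET", b"Confidential"))
```

In the loop above, the condition on the stream could then be replaced by writes_word(stream, b"Confidential"), so that a form which merely mentions the word inside longer text is left alone.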

2 replies

Soumadip-Saha Nov 20, 2023
Author

Thanks a lot. It worked like a charm. I have been trying to use xref_stream for a long time but was unable to understand its functionality properly. Really, thanks a lot for the help. I am quite new to streams. Appreciate it a lot.

JorjMcKie Nov 20, 2023
Maintainer

Glad it works for you!

firezym · 2024-12-04T02:21:25Z

firezym
Dec 4, 2024

I have tried the following code, slightly revised from the code in discussion #1855. It works OK with the current PyMuPDF release.
To emphasize: don't use page.clean_contents(), because if you do, you will not be able to split the lines.

pip install PyMuPDF

import pymupdf

def process_page(page: pymupdf.Page):
    """Process one page."""
    doc = page.parent  # the page's owning document
    # page.clean_contents()  # deliberately not used, see the note above
    # NOTE: read_contents() joins all /Contents streams of the page, but only
    # the first one is rewritten below, so this assumes a single stream.
    xref = page.get_contents()[0]  # xref of the (first) /Contents stream
    changed = 0  # this will be returned
    # read the contents, split by line breaks
    cont_lines = page.read_contents().splitlines()
    print(len(cont_lines))
    for i in range(len(cont_lines)):  # iterate over the lines
        line = cont_lines[i]
        if not (line.startswith(b"/Artifact") and b"/Watermark" in line):
            continue  # this was not for us
        # line number i starts the definition, j ends it:
        print(line)
        j = cont_lines.index(b"EMC", i)
        for k in range(i, j):
            # look for image / xobject invocations in this line range
            do_line = cont_lines[k]
            if do_line.endswith(b"Do"):  # this invokes an image / xobject
                cont_lines[k] = b""  # remove / empty this line
                changed += 1
    if changed > 0:  # if we did anything, write back modified /Contents
        doc.update_stream(xref, b"\n".join(cont_lines))
    return changed

fpath = 'your_pdf_file_path/file_name.pdf'
doc = pymupdf.open(fpath)
changed = 0  # indicates successful removals
for page in doc:
    changed += process_page(page)  # increase number of changes
if changed > 0:
    x = "s" if doc.page_count > 1 else ""
    print(f"{changed} watermarks have been removed on {doc.page_count} page{x}.")
    doc.ez_save(doc.name.replace(".pdf", "-nowm.pdf"))
else:
    print("Nothing to change")
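The line filtering in process_page can also be tried out without a PDF at hand. The sketch below repeats the same idea on a plain list of bytes lines and, unlike the code above, stops quietly if a closing EMC is missing instead of raising ValueError. The function name and the sample lines are made up for illustration.

```python
def blank_watermark_invocations(cont_lines):
    """Blank every 'Do' line between an '/Artifact ... /Watermark' line and the
    next 'EMC' line. Returns (new_lines, number_of_blanked_lines); the input
    list is not modified."""
    lines = list(cont_lines)
    changed = 0
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(b"/Artifact") and b"/Watermark" in line:
            try:
                end = lines.index(b"EMC", i)
            except ValueError:
                break  # no closing EMC: leave the rest untouched
            for k in range(i, end):
                if lines[k].endswith(b"Do"):  # image / xobject invocation
                    lines[k] = b""
                    changed += 1
            i = end
        i += 1
    return lines, changed


# made-up content lines: one watermark section, one unrelated image
sample = [
    b"q",
    b"/Artifact <</Subtype /Watermark /Type /Pagination>> BDC",
    b"q",
    b"/Fm0 Do",
    b"Q",
    b"EMC",
    b"/Im1 Do",
    b"Q",
]
new_lines, count = blank_watermark_invocations(sample)
print(count, new_lines[3], new_lines[6])
```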

0 replies
