This repository has been archived by the owner on Apr 15, 2024. It is now read-only.

pdfminer fails to extract text and co-ordinates from fields in a non-editable (i.e. flattened) PDF form #307

Open
AIMLAPP opened this issue Apr 29, 2021 · 1 comment

AIMLAPP commented Apr 29, 2021

Hi,

I am trying to use pdfminer to extract all words/text, along with the co-ordinates of each word, from filled-in PDF forms that are no longer editable (i.e. they are flattened and are NOT AcroForms). I can only extract text and co-ordinates that lie outside the fields. E.g. in the attached image, "... CAPITAL LETTERS or tick ✓ as necessary." can be extracted, but "Disneyland", "Mickey", etc. cannot.

As a result, the code I am using extracts exactly the same words and co-ordinates from a blank form, a filled-in AcroForm, and a non-editable PDF form.

Is there any way to resolve this using pdfminer, or with an alternative package if pdfminer cannot handle it?

The sample PDF can be found here: https://drive.google.com/file/d/1HroGrPqADRQ0_ccsIP6wHmqof0ghTdVZ/view

Here is the code:

from pdfminer.layout import LAParams, LTTextBox, LTChar, LTAnno
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.converter import PDFPageAggregator

manager = PDFResourceManager()
laparams = LAParams()
dev = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, dev)

x_list, y_list, x1_list, y1_list, text_list = [], [], [], [], []

def save_word(x, y, x1, y1, text):
    # Record one finished word: (x, y) is its top-left corner, (x1, y1) its bottom-right.
    print('At %r is text: %s' % ((x, y, x1, y1), text))
    x_list.append(x)
    y_list.append(y)
    x1_list.append(x1)
    y1_list.append(y1)
    text_list.append(text)

with open('sample.pdf', 'rb') as fp:
    for page in PDFPage.get_pages(fp):
        print('--- Processing Page ---')
        interpreter.process_page(page)
        layout = dev.get_result()
        x, y, x1, y1, text = -1, -1, -1, -1, ''
        for textbox in layout:
            # Only text boxes are walked; figures, lines and images are skipped.
            if not isinstance(textbox, LTTextBox):
                continue
            for line in textbox:
                for char in line:
                    if isinstance(char, LTAnno) or char.get_text() == ' ':
                        # A space or a layout-inserted separator ends the current word.
                        if x != -1:
                            save_word(x, y, x1, y1, text)
                        x, y, x1, y1, text = -1, -1, -1, -1, ''
                    elif isinstance(char, LTChar):
                        text += char.get_text()
                        if x == -1:
                            x, y, x1, y1 = char.bbox[0], char.bbox[3], char.bbox[2], char.bbox[1]
                        else:
                            # Grow the box so it covers the whole word, not just its first character.
                            y, x1, y1 = max(y, char.bbox[3]), max(x1, char.bbox[2]), min(y1, char.bbox[1])
        # Flush a word still pending at the end of the page.
        if x != -1:
            save_word(x, y, x1, y1, text)

[Attached image: the filled-in sample form]

@pokotylo

It is possible to extract the form fields with pdfminer.six:
https://pdfminersix.readthedocs.io/en/develop/howto/acro_forms.html

By the way, your sample file is not available any more.
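
Since the sample file is gone, the cause cannot be confirmed. One common reason for this symptom, though, is that flattening moved the field text into Form XObjects, which pdfminer reports as LTFigure objects. With default LAParams the text inside figures is not grouped into text boxes, so a loop that only visits top-level text boxes never sees it. Below is a minimal sketch under that assumption: it passes all_texts=True and walks every container recursively. The names Word, iter_leaves, group_words and extract_words are made up for this example and are not part of pdfminer. Boxes are kept in pdfminer's own order (x0, y0, x1, y1), with y0 as the bottom edge.

```python
from collections import namedtuple

# Hypothetical record type for this sketch: a word plus its bounding box.
Word = namedtuple("Word", "x0 y0 x1 y1 text")


def iter_leaves(obj):
    """Yield character-like leaves under obj, descending into every container.

    Duck-typed on purpose: pdfminer containers (pages, text boxes, text
    lines, figures) are iterable, while LTChar and LTAnno expose get_text().
    Lines, rectangles and images have neither and are silently skipped.
    """
    if hasattr(obj, "__iter__"):
        for child in obj:
            yield from iter_leaves(child)
    elif hasattr(obj, "get_text"):
        yield obj


def group_words(leaves):
    """Merge consecutive characters into words with a union bounding box."""
    text, box = "", None
    for leaf in leaves:
        ch = leaf.get_text()
        bbox = getattr(leaf, "bbox", None)  # LTAnno separators carry no bbox
        if bbox is None or ch.isspace():
            # A separator or whitespace glyph closes the current word.
            if text:
                yield Word(*box, text)
            text, box = "", None
            continue
        text += ch
        if box is None:
            box = list(bbox)
        else:
            box = [min(box[0], bbox[0]), min(box[1], bbox[1]),
                   max(box[2], bbox[2]), max(box[3], bbox[3])]
    if text:
        yield Word(*box, text)


def extract_words(path):
    """Yield (page_number, Word) for every word on every page of path."""
    # Imported lazily so the helpers above can be used without pdfminer.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    manager = PDFResourceManager()
    # all_texts=True asks layout analysis to also group text inside figures.
    device = PDFPageAggregator(manager, laparams=LAParams(all_texts=True))
    interpreter = PDFPageInterpreter(manager, device)
    with open(path, "rb") as fp:
        for number, page in enumerate(PDFPage.get_pages(fp), start=1):
            interpreter.process_page(page)
            for word in group_words(iter_leaves(device.get_result())):
                yield number, word
```

Calling `for page_no, word in extract_words('sample.pdf'): print(page_no, word)` would list every word it finds. If the field values still do not appear, the flattening tool may have converted them to vector outlines or a bitmap, in which case no text-extraction library can recover them and OCR would be needed.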
