pdfminer fails to extract text and co-ordinates from fields in a non-editable (i.e. flattened) PDF form #307

AIMLAPP · 2021-04-29T00:51:37Z

Hi,

I am trying to use pdfminer to extract every word, together with its co-ordinates, from filled-in PDF forms that are no longer editable (i.e. they are flattened and are NOT AcroForms). I can only extract the text and co-ordinates that lie outside the fields. E.g. in the attached image, "... CAPITAL LETTERS or tick ✓ as necessary." can be extracted, but "Disneyland", "Mickey" etc. can't.

As a result, with the code I am using, the words and co-ordinates extracted from a blank form, a filled-in AcroForm, and a non-editable PDF form are exactly the same.

Is there any way to resolve this using pdfminer, or an alternative package if pdfminer cannot do it?

The sample PDF can be found here: https://drive.google.com/file/d/1HroGrPqADRQ0_ccsIP6wHmqof0ghTdVZ/view

Here is the code:

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTAnno, LTChar, LTTextBox
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

manager = PDFResourceManager()
laparams = LAParams()
dev = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, dev)

x_list, y_list, x1_list, y1_list, text_list = [], [], [], [], []


def save_word(x, y, x1, y1, text):
    # Record one finished word and its bounding box.
    print('At %r is text: %s' % ((x, y, x1, y1), text))
    x_list.append(x)
    y_list.append(y)
    x1_list.append(x1)
    y1_list.append(y1)
    text_list.append(text)


# Open the file in a context manager so it is closed after processing.
with open('sample.pdf', 'rb') as fp:
    for page in PDFPage.get_pages(fp):
        print('--- Processing Page ---')
        interpreter.process_page(page)
        layout = dev.get_result()
        x, y, x1, y1, text = -1, -1, -1, -1, ''
        for textbox in layout:
            if not isinstance(textbox, LTTextBox):
                continue
            for line in textbox:
                for char in line:
                    if isinstance(char, LTAnno) or char.get_text() == ' ':
                        # A space or a layout-inserted break ends the current word.
                        if x != -1:
                            save_word(x, y, x1, y1, text)
                        x, y, x1, y1, text = -1, -1, -1, -1, ''
                    elif isinstance(char, LTChar):
                        text += char.get_text()
                        if x == -1:
                            # Left, top and bottom come from the first character.
                            x, y, y1 = char.bbox[0], char.bbox[3], char.bbox[1]
                        # The right edge follows the most recent character.
                        x1 = char.bbox[2]
        # Flush a word that is still pending at the end of the page.
        if x != -1:
            save_word(x, y, x1, y1, text)
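
The word-grouping step in the loop above can also be isolated so that it is easy to check without a PDF. The following is only a sketch: `group_words` and the `(text, bbox)` tuple format are hypothetical names introduced here for illustration, with `bbox=None` standing in for an `LTAnno` item, which carries text but no position. Unlike the code in the question, it takes the union of all character boxes in a word.

```python
from typing import Iterable, List, Optional, Tuple

# (x0, y0, x1, y1) in PDF-style co-ordinates, where y0 is the bottom edge.
BBox = Tuple[float, float, float, float]


def group_words(chars: Iterable[Tuple[str, Optional[BBox]]]) -> List[Tuple[str, BBox]]:
    """Merge a stream of (text, bbox) characters into (word, bbox) pairs.

    Whitespace, or an item whose bbox is None, ends the current word.
    """
    words: List[Tuple[str, BBox]] = []
    text, box = '', None
    for ch, bbox in chars:
        if bbox is None or ch.isspace():
            if box is not None:
                words.append((text, box))
            text, box = '', None
            continue
        text += ch
        if box is None:
            box = bbox
        else:
            # Grow the word box to cover the new character.
            box = (min(box[0], bbox[0]), min(box[1], bbox[1]),
                   max(box[2], bbox[2]), max(box[3], bbox[3]))
    if box is not None:
        words.append((text, box))
    return words


# Hand-made sample data, not taken from any real PDF.
sample = [
    ('H', (10.0, 100.0, 16.0, 110.0)),
    ('i', (16.0, 100.0, 19.0, 111.0)),
    (' ', (19.0, 100.0, 22.0, 110.0)),
    ('y', (22.0, 97.0, 27.0, 107.0)),
    ('o', (27.0, 100.0, 32.0, 107.0)),
    ('\n', None),
]
for word, bbox in group_words(sample):
    print(word, bbox)
```

With pdfminer, the input could be built from each text line as `(c.get_text(), c.bbox if isinstance(c, LTChar) else None)`. This only changes how words are grouped; it does not by itself make text inside the flattened fields appear.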

pokotylo · 2022-08-10T15:30:19Z

It is possible to extract the form fields with pdfminer.six
https://pdfminersix.readthedocs.io/en/develop/howto/acro_forms.html

By the way, your sample file is not available any more.
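
For reference, the AcroForm approach linked above could look roughly like the sketch below. It follows the same idea as the linked how-to (resolve the catalog's `AcroForm` entry and read each field's `T` and `V` keys), but `collect_fields` and `read_acroform` are hypothetical helper names, and joining nested field names with a dot is an assumption based on how the PDF specification describes fully qualified field names. Note that this only helps while the file still contains an `AcroForm` dictionary; a flattened form usually does not, in which case `read_acroform` returns an empty dict.

```python
from typing import Any, Callable, Dict, Iterable


def collect_fields(fields: Iterable[Any],
                   resolve: Callable[[Any], Any] = lambda obj: obj,
                   decode: Callable[[Any], Any] = lambda val: val,
                   prefix: str = '') -> Dict[str, Any]:
    """Walk a field tree and return {dotted field name: value}."""
    out: Dict[str, Any] = {}
    for ref in fields:
        field = resolve(ref)
        name = prefix
        partial = field.get('T')
        if partial is not None:
            partial = decode(partial)
            name = f'{prefix}.{partial}' if prefix else partial
        if 'V' in field and name:
            out[name] = decode(resolve(field['V']))
        kids = resolve(field['Kids']) if 'Kids' in field else None
        if kids:
            # Child fields extend the parent's name; unnamed kids are skipped.
            out.update(collect_fields(kids, resolve, decode, name))
    return out


def read_acroform(path: str) -> Dict[str, Any]:
    """Read field values from a PDF that still has an AcroForm (needs pdfminer.six)."""
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdftypes import resolve1
    from pdfminer.psparser import PSKeyword, PSLiteral
    from pdfminer.utils import decode_text

    def decode(val: Any) -> Any:
        # Turn PDF names and byte strings into plain Python strings.
        if isinstance(val, (PSLiteral, PSKeyword)):
            val = val.name
        if isinstance(val, bytes):
            val = decode_text(val)
        return val

    with open(path, 'rb') as fp:
        doc = PDFDocument(PDFParser(fp))
        catalog = resolve1(doc.catalog)
        if 'AcroForm' not in catalog:
            return {}  # no interactive form, e.g. a flattened PDF
        acroform = resolve1(catalog['AcroForm'])
        fields = resolve1(acroform.get('Fields', []))
        return collect_fields(fields, resolve1, decode)


# Plain-dict stand-in for a field tree, so the walk can be tried without a PDF.
demo = [
    {'T': 'name', 'V': 'Mickey'},
    {'T': 'address', 'Kids': [{'T': 'city', 'V': 'Disneyland'},
                              {'Subtype': 'Widget'}]},
]
print(collect_fields(demo))
```

Calling `read_acroform('sample.pdf')` on the blank form, the filled AcroForm and the flattened file would be one way to confirm which of them still carry field data.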
