This repository has been archived by the owner on Apr 15, 2024. It is now read-only.
Hello, I want to split a PDF document with the regex (?: *\n){2,} and get the coordinates of each resulting block of text.
I'm using this code now:
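import pdfminer.layout
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

def read_pdf(self, document_path):
    # Keep the file open while pages are processed, then close it.
    with open(document_path, 'rb') as fp:
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed
        rsrcmgr = PDFResourceManager()
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        doc = {}

        def parse_obj(lt_objs, i, page):
            for obj in lt_objs:
                if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
                    # Key: page number, then the box corners with y flipped to a top-left origin.
                    key = "%d %d %d %d %d" % (i, obj.bbox[0], page.mediabox[3] - obj.bbox[3], obj.bbox[2], page.mediabox[3] - obj.bbox[1])
                    doc[key] = obj.get_text().replace('\n', '')
                elif isinstance(obj, pdfminer.layout.LTFigure):
                    # Figures can hold nested text boxes, so recurse into them.
                    parse_obj(obj._objs, i, page)

        i = 1
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            layout = device.get_result()
            parse_obj(layout._objs, i, page)
            i += 1
    return doc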
But these blocks of text are not a good fit for me. I want to get a dict {"page x0 y0 x1 y1": text, ...} where each text is a piece of the PDF text split by the (?: *\n){2,} regex.
How can I improve the code above to achieve the desired result?
Thanks:)
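To make the desired output concrete, here is a minimal sketch of the splitting step on its own, independent of pdfminer. It assumes the page text is already available as a reading-order list of (text, bbox) pairs, one per line, where blank separator lines appear as whitespace-only entries with bbox set to None. The names split_blocks and BLOCK_SEP are hypothetical, and how such a list would be built from the layout objects is left open. Each block's box is taken as the union of the boxes of the lines that contribute text to it, and the block text is stripped of surrounding whitespace.

```python
import re

# The separator from the question: two or more newlines, each optionally
# preceded by spaces.
BLOCK_SEP = re.compile(r'(?: *\n){2,}')


def split_blocks(page_no, lines):
    """Group lines into blocks separated by BLOCK_SEP.

    lines: list of (text, bbox) in reading order, bbox = (x0, y0, x1, y1)
    or None for whitespace-only separator entries (an assumed input shape).
    Returns {"page x0 y0 x1 y1": block_text}.
    """
    # Concatenate the text and remember which character span each line covers.
    spans = []
    pos = 0
    for text, bbox in lines:
        spans.append((pos, pos + len(text), bbox))
        pos += len(text)
    full = ''.join(text for text, _ in lines)

    # Block boundaries are the separator matches plus the end of the text.
    bounds = [(m.start(), m.end()) for m in BLOCK_SEP.finditer(full)]
    bounds.append((len(full), len(full)))

    result = {}
    start = 0
    for sep_start, sep_end in bounds:
        block_text = full[start:sep_start].strip()
        # Keep the boxes of lines that overlap this block with visible text.
        boxes = [
            bbox for s, e, bbox in spans
            if bbox is not None
            and s < sep_start and e > start
            and full[max(s, start):min(e, sep_start)].strip()
        ]
        if block_text and boxes:
            x0 = min(b[0] for b in boxes)
            y0 = min(b[1] for b in boxes)
            x1 = max(b[2] for b in boxes)
            y1 = max(b[3] for b in boxes)
            key = "%d %d %d %d %d" % (page_no, x0, y0, x1, y1)
            result[key] = block_text
        start = sep_end
    return result


sample = [
    ("First line\n", (10, 20, 110, 32)),
    ("second line\n", (10, 34, 150, 46)),
    ("  \n", None),  # blank separator line
    ("Next block\n", (12, 80, 90, 92)),
]
blocks = split_blocks(1, sample)
```

With the sample above, the first two lines should end up in one block whose box spans both, and the last line in a block of its own. The coordinates are used as given, so any flip to a top-left origin would need to happen before the lines are passed in.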