Split pdf documents with coordinates #297

vlkem · 2020-10-04T22:15:44Z

Hello, I want to split a PDF document using the regex (?: *\n){2,} and get the coordinates of each resulting block of text.
I'm using this code now:

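# Imports assume the pdfminer.six package layout.
import pdfminer.layout
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

def read_pdf(self, document_path):
	doc = {}

	def parse_obj(lt_objs, i, page):
		# Walk the layout tree and record every horizontal text box.
		for obj in lt_objs:
			if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
				# Key: page number plus the box with y measured from the top of the page.
				key = "%d %d %d %d %d" % (i, obj.bbox[0], page.mediabox[3] - obj.bbox[3], obj.bbox[2], page.mediabox[3] - obj.bbox[1])
				doc[key] = obj.get_text().replace('\n', '')
			elif isinstance(obj, pdfminer.layout.LTFigure):
				# Figures can contain nested text, so recurse into them.
				parse_obj(obj, i, page)

	# Keep the file open while pages are processed; it is closed on exit.
	with open(document_path, 'rb') as fp:
		parser = PDFParser(fp)
		document = PDFDocument(parser)
		if not document.is_extractable:
			raise PDFTextExtractionNotAllowed
		rsrcmgr = PDFResourceManager()
		laparams = LAParams()
		device = PDFPageAggregator(rsrcmgr, laparams=laparams)
		interpreter = PDFPageInterpreter(rsrcmgr, device)
		# Pages are numbered from 1 in the keys.
		for i, page in enumerate(PDFPage.create_pages(document), start=1):
			interpreter.process_page(page)
			layout = device.get_result()
			parse_obj(layout, i, page)
	return doc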

But these blocks of text are not what I need. I want to get a dict {"page x0 y0 x1 y1": text, ...} where each text is a piece of the PDF split by the (?: *\n){2,} regex.
How can I improve the code above to achieve the desired result?
Thanks:)
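
To make the desired result concrete, here is a minimal sketch of the splitting step on its own, using only the standard library. It assumes the text has already been collected as a list of (text, bbox) pairs in reading order, each text ending with a newline; the names split_with_coordinates and merge_bbox are hypothetical helpers, not part of pdfminer. It joins the text, splits it with the regex from the question, and gives each piece the union of the boxes of the lines it covers.

```python
import re

# The separator from the question: two or more (optionally space-padded) newlines.
SEPARATOR = re.compile(r"(?: *\n){2,}")


def merge_bbox(a, b):
    """Smallest box (x0, y0, x1, y1) covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def split_with_coordinates(lines, page_number):
    """Split lines at SEPARATOR and return {"page x0 y0 x1 y1": text}.

    `lines` is an assumed input format: (text, bbox) pairs in reading
    order, where each text ends with a newline character.
    """
    full_text = "".join(text for text, _ in lines)

    # Character span of every line inside full_text.
    spans, offset = [], 0
    for text, bbox in lines:
        spans.append((offset, offset + len(text), bbox))
        offset += len(text)

    # Character spans of the chunks between separator matches.
    chunks, start = [], 0
    for match in SEPARATOR.finditer(full_text):
        chunks.append((start, match.start()))
        start = match.end()
    chunks.append((start, len(full_text)))

    result = {}
    for chunk_start, chunk_end in chunks:
        chunk_text = full_text[chunk_start:chunk_end].strip()
        if not chunk_text:
            continue  # skip empty chunks, e.g. before a leading separator
        box = None
        for line_start, line_end, bbox in spans:
            # A line belongs to the chunk if their character spans overlap.
            if line_start < chunk_end and line_end > chunk_start:
                box = bbox if box is None else merge_bbox(box, bbox)
        key = "%d %d %d %d %d" % (page_number, *box)
        result[key] = chunk_text
    return result
```

With pdfminer, the (text, bbox) pairs could presumably be gathered from LTTextLine objects via get_text() and bbox. Whether blank lines show up between them depends on the layout analysis, so the caller may need to add a bare newline entry between text boxes for the regex to find anything to split on. The boxes are used as given, so any flipping of the y axis has to happen before the pairs are passed in.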
