This repository has been archived by the owner on Apr 15, 2024. It is now read-only.

Split pdf documents with coordinates #297

Open
vlkem opened this issue Oct 4, 2020 · 0 comments

vlkem commented Oct 4, 2020

Hello, I want to split a PDF document using the regex (?: *\n){2,} and get the coordinates of each resulting block of text.
I'm currently using this code:

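import pdfminer.layout
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

def read_pdf(self, document_path):
    doc = {}

    def parse_obj(lt_objs, i, page):
        for obj in lt_objs:
            if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
                # Key is "page x0 y0 x1 y1"; y is flipped so the origin is the top-left corner.
                key = "%d %d %d %d %d" % (i, obj.bbox[0], page.mediabox[3] - obj.bbox[3], obj.bbox[2], page.mediabox[3] - obj.bbox[1])
                doc[key] = obj.get_text().replace('\n', '')
            elif isinstance(obj, pdfminer.layout.LTFigure):
                # Figures may contain nested text boxes, so recurse into them.
                parse_obj(obj, i, page)

    # Keep the file open while pages are processed, and close it afterwards.
    with open(document_path, 'rb') as fp:
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for i, page in enumerate(PDFPage.create_pages(document), start=1):
            interpreter.process_page(page)
            parse_obj(device.get_result(), i, page)
    return doc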

But these blocks of text are not what I need. I want to receive a dict {"page x0 y0 x1 y1": text, ...} where each text is a piece of the PDF text split by the (?: *\n){2,} regex.
How can I improve my code above to achieve the desired result?
Thanks :)
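One possible direction, sketched here under assumptions rather than as a confirmed pdfminer recipe: collect the individual text lines of a page together with their bounding boxes, join their text, split the joined text with the regex, and give each piece the union of the boxes of the lines it covers. The helper name split_with_bboxes and its input format are hypothetical. The sketch assumes the caller builds `lines` from the LTTextLine objects inside each LTTextBoxHorizontal (their get_text() and bbox), and adds an empty line with bbox None between boxes so that the regex has something to match.

```python
import re

# The separator from the question: two or more line breaks, each optionally
# preceded by spaces.
BLOCK_SEP = re.compile(r"(?: *\n){2,}")


def split_with_bboxes(page_no, page_height, lines):
    """Split one page's text on BLOCK_SEP and attach a bounding box to each piece.

    `lines` is a list of (text, bbox) pairs in reading order. Each text ends
    with a newline; bbox is (x0, y0, x1, y1) in PDF coordinates, or None for
    an empty separator line. This input format is an assumption of the sketch.
    """
    # Record which character range of the joined text each line covers.
    spans = []
    pos = 0
    for text, bbox in lines:
        spans.append((pos, pos + len(text), bbox))
        pos += len(text)
    full_text = "".join(text for text, _ in lines)

    # Segment boundaries: every separator match, plus the end of the text.
    cuts = [(m.start(), m.end()) for m in BLOCK_SEP.finditer(full_text)]
    cuts.append((len(full_text), len(full_text)))

    result = {}
    start = 0
    for sep_start, sep_end in cuts:
        segment = full_text[start:sep_start]
        # Boxes of the lines whose character range overlaps this segment.
        boxes = [b for s, e, b in spans
                 if b is not None and s < sep_start and e > start]
        if segment.strip() and boxes:
            x0 = min(b[0] for b in boxes)
            y0 = min(b[1] for b in boxes)
            x1 = max(b[2] for b in boxes)
            y1 = max(b[3] for b in boxes)
            # Same key layout as the code in the question: y is flipped so
            # that the origin is the top-left corner of the page.
            key = "%d %d %d %d %d" % (page_no, x0, page_height - y1,
                                      x1, page_height - y0)
            # Newlines are dropped, mirroring the question's replace() call.
            result[key] = segment.replace("\n", "")
        start = sep_end
    return result
```

Inside the page loop, page.mediabox[3] would be passed as page_height, and the per-page results merged into doc with doc.update(...). Whether the regex then yields different blocks from LTTextBoxHorizontal depends on how the separator lines are inserted, so that part likely needs tuning for the documents at hand.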
