extract jointly body paragraphs and the table in the pdf #1005
I want to extract the body text and the tables in a PDF jointly, so that the final result keeps the original order of the body paragraphs and the tables. Do you have any good suggestions? Pseudo-code:

    pdf page1:
        text1
        table1
        text2
    pdf page2:
        table2
        text3
        text4

Replies: 2 comments
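It usually helps if you could provide an example file.

I made one here: table.pdf (using fpdf2)

If you use .find_tables() you get the actual table objects, which allows you to access their coords/positional values.

You could then use .outside_bbox() to filter out the tables from the page with this information.

.extract_text_lines() gives you the text line objects with coords/positional values.

With separate table and line objects you could sort based on their position in the page.

Something like:

    import pdfplumber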
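    from operator import itemgetter

    page = pdfplumber.open("table.pdf").pages[0]

    # Locate the tables, then filter them out of the page so their text is not extracted twice.
    tables = page.find_tables()
    page_without_tables = page
    for table in tables:
        page_without_tables = page_without_tables.outside_bbox(table.bbox)

    # Collect the body text lines together with their vertical position.
    lines = []
    for line in page_without_tables.extract_text_lines():
        lines.append({'top': line['top'], 'text': line['text']})

    # Collect the table rows; every row uses the table's top edge (bbox[1]) as its position.
    for table in tables:
        for row in table.extract():
            lines.append({'top': table.bbox[1], 'text': row})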

    # sorted() is stable, so rows of the same table keep their original order.
    print(sorted(lines, key=itemgetter('top')))

Output:

    [{'top': 28.44799999999998, 'text': 'Hello'},
     {'top': 42.44799999999998, 'text': 'world'},
     {'top': 84.35000000000002,
      'text': ['First name', 'Last name', 'Age', 'City']},
     {'top': 84.35000000000002, 'text': ['Jules', 'Smith', '34', 'San Juan']},
     {'top': 84.35000000000002, 'text': ['Mary', 'Ramos', '45', 'Orlando']},
     {'top': 84.35000000000002, 'text': ['Carlson', 'Banks', '19', 'Los Angeles']},
     {'top': 84.35000000000002,
      'text': ['Lucas', 'Cimon', '31', 'Saint-Mahturin-sur-Loire']},
     {'top': 204.44799999999998, 'text': 'Some more'},
     {'top': 218.44799999999998, 'text': 'text.'}]
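
The question also asks about several pages. A minimal sketch of how the same idea might extend to a whole document is to tag every item with its page number and sort on (page, top). The helper names `collect_page_items`, `in_reading_order` and `extract_document`, and the `page`/`kind`/`content` keys, are made up for this illustration and are not part of pdfplumber; the sketch also assumes a single-column layout, where vertical position alone determines reading order.

```python
from operator import itemgetter


def collect_page_items(page, page_number):
    """Return body lines and tables of one pdfplumber page as plain dicts."""
    tables = page.find_tables()

    # Filter out the table regions so their text is not extracted a second time.
    body = page
    for table in tables:
        body = body.outside_bbox(table.bbox)

    items = []
    for line in body.extract_text_lines():
        items.append({"page": page_number, "top": line["top"],
                      "kind": "text", "content": line["text"]})
    for table in tables:
        # One item per table, positioned at the table's top edge (bbox[1]).
        items.append({"page": page_number, "top": table.bbox[1],
                      "kind": "table", "content": table.extract()})
    return items


def in_reading_order(items):
    """Sort by page first, then by vertical position within the page."""
    return sorted(items, key=itemgetter("page", "top"))


def extract_document(path):
    """Open a PDF and return all text lines and tables in reading order."""
    import pdfplumber  # imported here so the helpers above work without it

    items = []
    with pdfplumber.open(path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            items.extend(collect_page_items(page, page_number))
    return in_reading_order(items)
```

Calling `extract_document("table.pdf")` would then give one list covering all pages, with each table kept together as a single item instead of one item per row.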
Thank you so much for your help. Using the vertical positions of the body text and the tables to sort them and keep the original order was something I hadn't thought of. Fantastic! For PDF documents with Chinese text, .extract_text_lines() cuts a complete paragraph into multiple strings, but I was able to use the coordinate information in its results to determine where each paragraph ends. Your answer has inspired me, thanks again!
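
The exact rule used to detect paragraph ends is not given in the thread. One possible heuristic, shown here only as a sketch, is to start a new paragraph whenever the vertical gap between two consecutive lines is clearly larger than normal line spacing. The function name `group_into_paragraphs` and the `gap_ratio` threshold are assumptions for illustration; the `top`, `bottom` and `text` keys are the ones found in the dicts returned by .extract_text_lines().

```python
def group_into_paragraphs(lines, gap_ratio=0.6, joiner=""):
    """Merge consecutive text lines into paragraphs.

    A new paragraph starts when the vertical gap to the previous line
    exceeds gap_ratio times the previous line's height. The default
    joiner is empty, which suits Chinese text; use " " for English.
    """
    paragraphs = []
    current = []
    previous = None
    for line in sorted(lines, key=lambda item: item["top"]):
        if previous is not None:
            gap = line["top"] - previous["bottom"]
            height = previous["bottom"] - previous["top"]
            if gap > gap_ratio * height:
                # Large gap: close the current paragraph and start a new one.
                paragraphs.append(joiner.join(current))
                current = []
        current.append(line["text"])
        previous = line
    if current:
        paragraphs.append(joiner.join(current))
    return paragraphs


# Hand-written sample data in the shape of .extract_text_lines() results.
sample = [
    {"top": 10.0, "bottom": 22.0, "text": "第一段的第一行"},
    {"top": 24.0, "bottom": 36.0, "text": "第一段的第二行"},
    {"top": 50.0, "bottom": 62.0, "text": "第二段"},
]
print(group_into_paragraphs(sample))
```

The threshold probably needs tuning per document, and other signals (for example a line ending well short of the right margin) could be combined with it.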