extract jointly body paragraphs and the table in the pdf #1005

zzflybird · 2023-10-09T06:26:33Z


I want to extract both the body text and the tables from a PDF, and have the final result keep the body paragraphs and tables in their original order. Do you have any suggestions?
page.extract_text() also returns the text inside tables, but the tables come out poorly formatted, so we want to use page.extract_tables() for the tables instead. However, it is not clear how to keep each table at its original position in the body, because a table may sit between body paragraphs.

Pseudo-code:

text = page.extract_text()
tables = page.extract_tables()

Desired order of the output:

pdf page1:
text1
table1
text2

pdf page2:
table2
text3
text4


cmdlineluser · 2023-10-09T16:23:47Z


It usually helps if you can provide an example file.

I made one here: table.pdf (using fpdf2)

If you use .find_tables() you get the actual table objects, which lets you access their coordinates (their bounding boxes).

You can then pass each table's bounding box to .outside_bbox() to filter the tables out of the page.

.extract_text_lines() gives you the text lines as objects that also carry coordinates.

With separate table and line objects, you can sort everything by its vertical position on the page.

Something like:

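import pdfplumber
from operator import itemgetter

with pdfplumber.open("table.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.find_tables()

    # Filter every table region out of the page so only body text remains.
    page_without_tables = page
    for table in tables:
        page_without_tables = page_without_tables.outside_bbox(table.bbox)

    lines = []

    # Body text lines, keyed by their vertical position on the page.
    for line in page_without_tables.extract_text_lines():
        lines.append({'top': line['top'], 'text': line['text']})

    # All rows of a table share the table's top edge (bbox[1]); sorted() is
    # stable, so the rows keep their original order within the table.
    for table in tables:
        for row in table.extract():
            lines.append({'top': table.bbox[1], 'text': row})

print(sorted(lines, key=itemgetter('top')))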

[{'top': 28.44799999999998, 'text': 'Hello'},
 {'top': 42.44799999999998, 'text': 'world'},
 {'top': 84.35000000000002,
  'text': ['First name', 'Last name', 'Age', 'City']},
 {'top': 84.35000000000002, 'text': ['Jules', 'Smith', '34', 'San Juan']},
 {'top': 84.35000000000002, 'text': ['Mary', 'Ramos', '45', 'Orlando']},
 {'top': 84.35000000000002, 'text': ['Carlson', 'Banks', '19', 'Los Angeles']},
 {'top': 84.35000000000002,
  'text': ['Lucas', 'Cimon', '31', 'Saint-Mahturin-sur-Loire']},
 {'top': 204.44799999999998, 'text': 'Some more'},
 {'top': 218.44799999999998, 'text': 'text.'}]
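
The question also shows a second page, so here is a rough sketch of how the same idea might extend to a whole document: tag each block with its page number and sort on (page, top). The helper names `order_blocks` and `collect_blocks` and the `kind`/`content` keys are made up for illustration and are not part of pdfplumber. Each table is kept as a single block here instead of one entry per row.

```python
from operator import itemgetter


def order_blocks(blocks):
    """Sort blocks by page number, then by vertical position on the page."""
    return sorted(blocks, key=itemgetter("page", "top"))


def collect_blocks(pdf):
    """Collect body lines and tables from every page of an opened pdfplumber PDF.

    `pdf` is assumed to be the object returned by pdfplumber.open(...).
    """
    blocks = []
    for page in pdf.pages:
        tables = page.find_tables()

        # Remove the table regions so body text is extracted on its own.
        body = page
        for table in tables:
            body = body.outside_bbox(table.bbox)

        for line in body.extract_text_lines():
            blocks.append({
                "page": page.page_number,
                "top": line["top"],
                "kind": "text",
                "content": line["text"],
            })

        # One block per table, positioned at the table's top edge.
        for table in tables:
            blocks.append({
                "page": page.page_number,
                "top": table.bbox[1],
                "kind": "table",
                "content": table.extract(),
            })
    return order_blocks(blocks)
```

To use it, open the file with pdfplumber.open("table.pdf") as before and pass the resulting object to collect_blocks(); the returned list can then be walked in reading order, handling "text" and "table" blocks differently.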


zzflybird · 2023-10-10T07:20:52Z


Thank you so much for your help. Sorting the body text and the tables by their vertical position to preserve the order was something I hadn't thought of. Fantastic.

For PDFs with Chinese text, .extract_text_lines() splits a complete paragraph into multiple strings, but I was able to use the coordinate information in its results to determine where each paragraph ends.

Your answer has inspired me, thanks again!
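
For anyone with the same problem, here is a minimal sketch of one way the paragraph-end check could work. It is an assumption, not the exact rule used above: a line is treated as the last line of a paragraph when its right edge (`x1`) stops noticeably short of the right text margin. The function name `merge_lines`, the `right_margin` parameter and the `tolerance` default are hypothetical. Lines are joined without spaces, which suits Chinese text but not English.

```python
def merge_lines(lines, right_margin, tolerance=5.0):
    """Join consecutive text lines into paragraphs.

    `lines` is a list of dicts with at least 'text' and 'x1' keys, in reading
    order. A line whose right edge falls short of `right_margin` by more than
    `tolerance` is assumed to end a paragraph.
    """
    paragraphs = []
    current = []
    for line in lines:
        current.append(line["text"])
        if right_margin - line["x1"] > tolerance:
            # Short line: close the current paragraph.
            paragraphs.append("".join(current))
            current = []
    if current:
        # Trailing full-width lines with no short closing line.
        paragraphs.append("".join(current))
    return paragraphs
```

The value for right_margin could be taken as the largest x1 among the body lines on the page. Documents with ragged right edges or indented first lines would need a different rule, for example one based on the vertical gap between lines.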
