Incorrectly recognize borderlines with different weights #941

bdthanh · 2023-07-21T16:48:31Z

Hi, it's me again. Thanks for helping me with the previous problem. I currently have this PDF, where the table lines have different weights. If I add this code to remove invisible lines:

def reject_2d_rects(obj):
    is_rect = obj["object_type"] == "rect"
    is_thin = obj["width"] < 1 or obj["height"] < 1
    return not (is_rect and not is_thin)

then the table extractor cannot detect the lines with heavier weights; see the following image:

Here is the pdf file that I used:
ID.pdf

If I remove those 4 lines, then the result looks like this:

The code I used:

import pdfplumber

def keep_visible_lines(obj):
    if obj['object_type'] == 'rect':
        return obj['non_stroking_color'] is None
    return True
  
# This function removes the invisible lines
def reject_2d_rects(obj):
    is_rect = obj["object_type"] == "rect"
    is_thin = obj["width"] < 1 or obj["height"] < 1
    return not (is_rect and not is_thin) 

pdf = pdfplumber.open("ID.pdf")
page0 = pdf.pages[0]
page0 = page0.filter(reject_2d_rects)

ts = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}

# This code saves the debug visual output.
im = page0.to_image(resolution=200)
im.reset().debug_tablefinder(ts)
im.save("image.png", format="PNG")

# Extract the tables.
tables = page0.extract_tables(table_settings=ts)
for table in tables:
    print()
    for row in table:
        print(row)

Thanks for your attention!
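
One way to see why the heavier borders disappear is to run the same predicate on a few hand-made objects. This is only a sketch: the dictionaries below are hypothetical stand-ins for pdfplumber objects, and it assumes the heavy borders in ID.pdf are drawn as rects at least 1 pt thick, which the thread does not confirm.

```python
def reject_2d_rects(obj):
    # Same predicate as in the post: keep non-rects and rects thinner than 1 pt.
    is_rect = obj["object_type"] == "rect"
    is_thin = obj["width"] < 1 or obj["height"] < 1
    return not (is_rect and not is_thin)

# Hypothetical objects; the sizes are made up for illustration.
samples = {
    "thin_border": {"object_type": "rect", "width": 200.0, "height": 0.5},
    "heavy_border": {"object_type": "rect", "width": 200.0, "height": 1.5},
    "cell_fill": {"object_type": "rect", "width": 200.0, "height": 20.0},
    "char": {"object_type": "char", "width": 5.0, "height": 10.0},
}

# True means the object survives page.filter(reject_2d_rects).
kept = {name: reject_2d_rects(obj) for name, obj in samples.items()}
for name, is_kept in kept.items():
    print(f"{name}: {'kept' if is_kept else 'rejected'}")
```

Under that assumption, a 1.5 pt border is treated the same way as a filled cell background, because the filter only looks at the 1 pt threshold.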

cmdlineluser · 2023-07-21T17:55:38Z

Experimenting with snap_tolerance - it seems 5 extracts the table "cleanly":

>>> pdf.pages[0].extract_table(table_settings={"snap_tolerance": 5})
[['ID', 'Scenario\n<Description>', 'Likelihood', 'Impact\n<Impact>'],
 ['1', 'Description Description\n1', 'Likelihood 1', 'Impact 1'],
 ['2', 'Description Description\n2', 'Likelihood 2', 'Impact 2'],
 ['3', 'Description 3', 'Likelihood 3', 'Impact 3']]
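
As a rough illustration of why a larger snap_tolerance can help here, the sketch below groups nearby parallel line positions and moves each group to a common position (the mean, in this sketch). It is a simplified stand-in, not pdfplumber's actual implementation; `snap_positions` is a hypothetical helper and the coordinates are made up.

```python
def snap_positions(positions, tolerance):
    """Group sorted positions whose neighbours differ by at most
    `tolerance`, and replace each group with its mean."""
    if not positions:
        return []
    ordered = sorted(positions)
    clusters = [[ordered[0]]]
    for pos in ordered[1:]:
        if pos - clusters[-1][-1] <= tolerance:
            clusters[-1].append(pos)  # close enough: same cluster
        else:
            clusters.append([pos])    # too far: start a new cluster
    return [sum(cluster) / len(cluster) for cluster in clusters]

# Hypothetical y-coordinates: the two edges of a heavy border (100 and 104)
# and a separate thin line at 130.
edges = [100.0, 104.0, 130.0]

print(snap_positions(edges, tolerance=3))  # the two edges stay separate
print(snap_positions(edges, tolerance=5))  # the two edges merge into one line
```

With the smaller tolerance, the two edges of the heavy border count as separate lines; with the larger one they collapse into a single line.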

4 replies

bdthanh Jul 22, 2023
Author

Thanks for your response. Although this solution fixes that issue, it does not apply to my other problem (invisible lines,...). Is there a solution that works for both of these documents? Here is an example PDF:
Sample 7 (header).pdf

If I use snap_tolerance and remove the reject_2d_rects function (mentioned above), then the first problem is solved but not the second. But if I keep both, then the second issue is solved but not the first. I want to find a way to solve both of them. Thanks for your help.

cmdlineluser Jul 22, 2023

Yeah, that is one of the issues, unfortunately: some settings break other tables.

That is one of the reasons I was experimenting with trying to find a more general approach.

The approach from #931 also seems to work for both examples here:

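import pandas as pd

# `pdf` is an open pdfplumber PDF; remove_nested_rects() is the helper
# from #931 and is not defined in this thread.
tables = []
for page in pdf.pages:
    filtered_page = remove_nested_rects(page)

    for table in filtered_page.find_tables():
        # Crop to the detected table and add its bounding box as explicit
        # outer lines, so the outermost rows and columns are closed.
        table = filtered_page.crop(table.bbox).extract_table(dict(
            explicit_horizontal_lines=[table.bbox[1], table.bbox[3]],
            explicit_vertical_lines=[table.bbox[0], table.bbox[2]]
        ))
        tables.append(pd.DataFrame(table))

df = pd.concat(tables, ignore_index=True)

# combine multi-page rows: a row whose first column is empty is treated
# as a continuation of the previous row
(df.groupby((df[0] != '').cumsum(), as_index=False, sort=False)
   .agg(dict.fromkeys(df.columns, ' '.join))
)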

     0                                                  1                                        2   ...                                                 12                             13              14
0  I\nD                                Business\nObjective                           Risk\nScenario  ...                    Status of\nMitigations\nActions  Risk\nRe-\nevaluat\nion\ndate  Targ\net\nRisk
1     1                                       IT\nSecurity                             Data\nBreach  ...  a) Impleme\nnting of\nDAM is\nin\nprogress,\nt...                     Jul-\n2020             Low
2    2                                       IT\nSecurity                  Unauthor\nised\nchanges   ...  a) Impleme\nnting of\nEDR is in\nprogress,\nta...                    Jul-\n2020             Low
3    3                                       IT\nSecurity                          Insider\nThreat   ...  a) Evaluatio\nn of user\nbehavior\nanalytic\nt...                    Jul-\n2020             Low
4    4                                       IT\nSecurity      Unauthor\nised\nLateral\nmovemen\nt   ...  a) Impleme\nnting of\nEDR is in\nprogress,\nta...                    Jul-\n2020             Low
5    5          Ensure\n24*7\navailabilit\ny of the\nsite   Malware\npropagati\non in the\nnetwork   ...  a) NGFW\nhas been\ncomplet\ned in 3Q\n2019\nb)...                    Jul-\n2020             Low
6     6  To ensure\ntimely\ndetection\nand\neffective\n...                Cyber\nIncident\nHandling  ...  a) Conduct\nMSOC\nreadines\ns\nassessm\nent by...                     Jul-\n2020             Low
7    7                                 Risk\nManage\nment     Vulnerabi\nlity & Risk\nManagem\nent   ...  a) Updating\nof asset\ninventor\ny list as\npe...                    Jul-\n2020             Low

[8 rows x 15 columns]
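
The "combine multi-page rows" step can be hard to read at first. The sketch below reproduces the same idea without pandas, so that it is easy to step through: a row whose first cell is empty is treated as the continuation of the previous row, and the cells are joined with a space. `merge_continuation_rows` is a hypothetical helper, and the sample rows are made up rather than taken from the PDFs in this thread.

```python
def merge_continuation_rows(rows):
    """Merge each row whose first cell is empty into the row before it,
    joining the cells column by column with a single space."""
    merged = []
    for row in rows:
        if row[0] != "" or not merged:
            merged.append(list(row))  # a new logical row starts here
        else:
            # Continuation row: append its cells to the previous row.
            merged[-1] = [" ".join(pair) for pair in zip(merged[-1], row)]
    return merged

# Hypothetical rows: the second row continues the first, for example
# after a page break.
rows = [
    ["1", "Data", "Low"],
    ["", "Breach", ""],
    ["2", "Insider", "Low"],
]

for row in merge_continuation_rows(rows):
    print(row)
```

As with the `' '.join` aggregation above, joining with an empty cell leaves a trailing space, and cells that are `None` instead of strings would raise a `TypeError`, so some cleanup before or after merging may be needed.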

bdthanh Jul 24, 2023
Author

It seems to work. I need to test it with more cases and will get back to you.

bdthanh Aug 2, 2023
Author

Yes, it does work. Thanks
