Concatenating cropped page objects #1133

Phylanxy · 2024-04-25T17:20:01Z

Hi there,

I want to extract only the main text from a fairly large PDF, leaving out the headers and footers. I've actually achieved that goal thanks to this post on Stack Overflow.
But, of course, I've run into a new problem. That example only reads one page, and when I tried to loop over several pages, I realized I couldn't save the cropped page object in a variable. I was only able to save the text as a string, but I'd like to retain the information of the page object (such as char["size"]) to use for filtering later.

Is there a way to do this or do I need to construct a new container for this info?

Here is my code:

import collections
import pdfplumber

def find_text_parts_on_page(page):
    """
    Idea: separate text by font sizes, rank them by popularity.
    The most popular text size is most likely the main text.
    The second most popular text size is most likely the footnote.
    However, we check which of the two most popular text sizes is larger (by font size).
    We pick the larger one as the main text and the smaller one as the footnote.
    We could also use the vertical position of the bounding box to determine that.
    """

    font_sizes = collections.Counter()
    bounding_boxes = {}

    for char in page.chars:
        size_key = char["size"]
        font_sizes[size_key] += 1

        if size_key not in bounding_boxes:
            bounding_boxes[size_key] = [char["x0"], char["top"], char["x1"], char["bottom"]]
        else:
            if char["x0"] < bounding_boxes[size_key][0]:
                bounding_boxes[size_key][0] = char["x0"]
            if char["top"] < bounding_boxes[size_key][1]:
                bounding_boxes[size_key][1] = char["top"]
            if char["x1"] > bounding_boxes[size_key][2]:
                bounding_boxes[size_key][2] = char["x1"]
            if char["bottom"] > bounding_boxes[size_key][3]:
                bounding_boxes[size_key][3] = char["bottom"]

    most_common_sizes = font_sizes.most_common(2)

    # The main box has larger text size than the footnote box
    # (the indexing below assumes at least two distinct font sizes on the page)
    first = most_common_sizes[0][0], bounding_boxes[most_common_sizes[0][0]]
    second = most_common_sizes[1][0], bounding_boxes[most_common_sizes[1][0]]

    if first[0] > second[0]:
        return first, second
    else:
        return second, first
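
To illustrate the idea in the docstring without needing a PDF, here is a minimal sketch that runs the same ranking on hand-made character dicts. The names `rank_font_sizes`, `make_char` and `sample_chars` are hypothetical and not part of pdfplumber; the dict keys are the ones used in the function above. Unlike the original, this variant returns a list and tolerates pages with only one font size.

```python
import collections


def rank_font_sizes(chars):
    """Return (size, bbox) pairs for up to the two most common font sizes,
    larger font size first. `chars` is any iterable of dicts with the keys
    "size", "x0", "top", "x1" and "bottom"."""
    counts = collections.Counter()
    boxes = {}
    for char in chars:
        size = char["size"]
        counts[size] += 1
        if size not in boxes:
            boxes[size] = [char["x0"], char["top"], char["x1"], char["bottom"]]
        else:
            # Grow the box so that it also covers this character.
            box = boxes[size]
            box[0] = min(box[0], char["x0"])
            box[1] = min(box[1], char["top"])
            box[2] = max(box[2], char["x1"])
            box[3] = max(box[3], char["bottom"])
    top_two = [(size, boxes[size]) for size, _ in counts.most_common(2)]
    return sorted(top_two, key=lambda pair: pair[0], reverse=True)


def make_char(size, x0, top, width=5):
    """Build a made-up character dict (illustration only)."""
    return {"size": size, "x0": x0, "top": top,
            "x1": x0 + width, "bottom": top + size}


sample_chars = (
    [make_char(11, 50 + 6 * i, 100) for i in range(20)]    # body, line 1
    + [make_char(11, 50 + 6 * i, 115) for i in range(20)]  # body, line 2
    + [make_char(8, 50 + 5 * i, 700) for i in range(10)]   # footnote
    + [make_char(16, 50, 40)]                              # header
)

ranked = rank_font_sizes(sample_chars)
main_size, main_box = ranked[0]
print(main_size, main_box)
```

The header character is ignored because its size is only the third most common, which is the behavior the docstring describes.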

This is the working version that extracts the main text only:

import pprint
with pdfplumber.open("test.pdf") as pdf:
    main_part = []
    for page in pdf.pages:
        [main_size, main_box], [footnote_size, footnote_box] = find_text_parts_on_page(page)
        main_part.append(page.within_bbox(main_box).extract_text())
pprint.pprint(main_part)

This is my try at saving the page object for later processing:

with pdfplumber.open("test.pdf") as pdf:
    main_part = {}
    secondary_part = {}
    for page in pdf.pages:
        [main_size, main_box], [footnote_size, footnote_box] = find_text_parts_on_page(page)
        main_part[page] = page.within_bbox(main_box)
        secondary_part[page] = page.within_bbox(footnote_box)
        print("--------original-------- \n", page.extract_text())
        print("--------cropped-------- \n", main_part[page].extract_text())
for part in main_part:
    print(part.extract_text())

This is the test file I'm working with; it contains only a few pages of the original document: test.pdf

Is there a way to do this, or is this due to the fact that pdfplumber isn't built to modify PDF files?

Thanks in advance :)

jsvine (Maintainer) · 2024-05-15T17:50:25Z

Hi @Phylanxy, and thanks for your interest in pdfplumber. You can, indeed, retain a reference to the cropped page, as you do with main_part[page] = page.within_bbox(main_box) in your last code block. And that code block seems to run without error. So perhaps I'm misunderstanding: What is the problem you're trying to solve?
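
One possible reading of the question, offered as a guess rather than a diagnosis: the final loop, `for part in main_part:`, iterates over the dictionary's keys, which are the original pages, so it prints the uncropped text. The sketch below keys the cropped pages by page number, iterates with `.items()`, and also flattens the cropped characters into one list so that fields such as `char["size"]` stay available for later filtering. `collect_cropped`, `FakePage`, `make_char` and `body_box` are hypothetical names used only for illustration; `FakePage` is a stand-in so the example runs without a PDF, and it only imitates the `within_bbox`, `chars` and `page_number` attributes that the sketch relies on.

```python
def collect_cropped(pages, find_box):
    """Crop each page to find_box(page). Return the crops keyed by page
    number, plus one flat list of their chars tagged with the page number."""
    cropped_by_number = {}
    all_chars = []
    for page in pages:
        cropped = page.within_bbox(find_box(page))
        cropped_by_number[page.page_number] = cropped
        for char in cropped.chars:
            # Set the page number explicitly so the stand-in below works too.
            all_chars.append({**char, "page_number": page.page_number})
    return cropped_by_number, all_chars


class FakePage:
    """Hypothetical stand-in that imitates the page attributes used above."""

    def __init__(self, page_number, chars):
        self.page_number = page_number
        self.chars = chars

    def within_bbox(self, bbox):
        x0, top, x1, bottom = bbox
        kept = [c for c in self.chars
                if c["x0"] >= x0 and c["top"] >= top
                and c["x1"] <= x1 and c["bottom"] <= bottom]
        return FakePage(self.page_number, kept)


def make_char(text, size, top):
    """Build a made-up character dict (illustration only)."""
    return {"text": text, "size": size, "x0": 50, "x1": 55,
            "top": top, "bottom": top + size}


def body_box(page):
    """Assumed fixed body area; a real script would compute this per page."""
    return (0, 90, 600, 700)


pages = [
    FakePage(1, [make_char("H", 16, 40), make_char("a", 11, 100),
                 make_char("b", 11, 115)]),
    FakePage(2, [make_char("H", 16, 40), make_char("c", 11, 100)]),
]

cropped, chars = collect_cropped(pages, body_box)

# Iterate over items (or values), not over the bare dict, to reach the crops.
for number, crop in cropped.items():
    print(number, [c["text"] for c in crop.chars])

# The flat list keeps per-character fields for later filtering.
body_chars = [c for c in chars if c["size"] == 11]
```

With pdfplumber itself, one would presumably pass `pdf.pages` and a function that returns the main box for a page, inside the `with pdfplumber.open(...)` block. Since the collected characters are plain dicts, the flat list should remain usable after the file is closed.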
