Replies: 1 comment
-
Hi @Phylanxy, and thanks for your interest in |
-
Hi there,
I want to extract only the main text from a fairly large PDF with headers and footers (which I don't want). I've actually achieved that goal thanks to this post on Stack Overflow.
But I've run into a new problem. The example there only reads one page, and when I tried to loop over several pages, I realized I couldn't save the cropped page object in a variable. I was only able to save the text as a string, but I'd like to retain the information of the page object (such as char["size"]) to use for filtering later.
Is there a way to do this or do I need to construct a new container for this info?
Here is my code:
This is the working version that extracts the main text only:
This is my try at saving the page object for later processing:
This is the test file I'm working with, it only contains a few pages of the original document: test.pdf
Is there a way to do this, or is this due to the fact that pdfplumber isn't built to modify PDF files?
Thanks in advance :)
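
For reference, here is a minimal sketch of the "new container" idea, not a confirmed answer from the maintainers. It assumes the headers and footers sit in fixed-height bands at the top and bottom of every page. The helper names (`crop_box`, `collect_main_chars`, `filter_by_size`) and the 50-point band heights are made up for illustration. It copies each cropped page's `chars` into a plain list while the PDF is still open, so the per-character data (including `size`) stays available for filtering afterwards.

```python
def crop_box(page_bbox, header_height, footer_height):
    """Return (x0, top, x1, bottom) for the area between header and footer."""
    x0, top, x1, bottom = page_bbox
    return (x0, top + header_height, x1, bottom - footer_height)


def collect_main_chars(pdf_path, header_height=50, footer_height=50):
    """Collect the char dicts of the main text area for every page.

    The band heights are placeholder values; adjust them to the document.
    """
    import pdfplumber  # imported here so the helpers above work without it

    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            cropped = page.crop(crop_box(page.bbox, header_height, footer_height))
            # Copy the char dicts while the file is still open.
            pages.append({
                "page_number": page.page_number,
                "chars": list(cropped.chars),
            })
    return pages


def filter_by_size(chars, min_size):
    """Keep only chars whose font size is at least min_size."""
    return [c for c in chars if c["size"] >= min_size]
```

Something like `pages = collect_main_chars("test.pdf")` followed by `filter_by_size(pages[0]["chars"], 9)` would then give the characters of the first page's main text at or above that size, and `"".join(c["text"] for c in ...)` turns them back into a rough string.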