Open Parse seems to be missing some blocks within PDF files #40
Comments
The default processing pipeline skips small blocks. I'm using this pipeline with good results:
Thank you Ingr! It looks fine now.
@DinoLiww max area controls the maximum size of an element. In this case it's based on a percentage of page size. If you don't have elements that are getting dropped, then it's because there are no elements that take up more than 10% (0.1) of a page. This is mostly used to filter out pages that have massive text (like a title) that isn't helpful in a RAG pipeline. As for your second point, can you elaborate?
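To make the max-area idea concrete, here is a minimal, self-contained sketch of that kind of filter. It does not use open-parse itself: the `Box` class, the `drop_oversized` function, and the `max_area_pct` parameter name are all hypothetical names chosen for illustration, and the real library's implementation may differ.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical element bounding box, in page units (e.g. points)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        # Clamp to zero so malformed boxes never yield a negative area.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def drop_oversized(boxes, page_width, page_height, max_area_pct=0.1):
    """Keep only elements that cover at most max_area_pct of the page."""
    page_area = page_width * page_height
    return [b for b in boxes if b.area / page_area <= max_area_pct]


# Assumed page size: US Letter in points (612 x 792).
page_w, page_h = 612.0, 792.0
boxes = [
    Box("Huge title", 36, 500, 576, 700),    # covers roughly 22% of the page
    Box("PO line item", 36, 400, 576, 420),  # covers roughly 2% of the page
]

kept = drop_oversized(boxes, page_w, page_h)            # 10% threshold
kept_loose = drop_oversized(boxes, page_w, page_h, 0.5)  # 50% threshold
print([b.text for b in kept])
print([b.text for b in kept_loose])
```

With the 10% threshold only the small line item survives; raising the threshold to 50% keeps both elements. If nothing in a document exceeds the threshold, the filter is a no-op, which matches the behaviour described above.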
Initial Checks
Description
Hi there,
First of all, thanks for Open Parse. It works well most of the time.
But when I bring my real-world tasks into Open Parse, some problems come up.
When I run openparse_quickstart.ipynb to parse the attached PDF files (POs, actually),
Open Parse seems to miss some blocks within the PDF files.
Please let me know how to proceed.
Thanks!
Dino
ASE.PDF
Amkor.PDF
Example Code
Python, open-parse & OS Version