Feat/pymupdf experiment #368

Closed
wants to merge 13 commits into from

Conversation

adamjanovsky
Collaborator

This closes #364

@adamjanovsky
Collaborator Author

adamjanovsky commented Oct 18, 2023

@J08nY The datasets computed with pdftotext and pymupdf are available on Aura at /var/tmp/xjanovsk/certs/sec-certs/dataset/toy_dataset_100_certs; you should have read access there.

Performance-wise, processing seems slower with pymupdf, but I guess we don't care that much.

⚠️ EDIT: There's apparently a bug in the pymupdf processing; please don't investigate the comparison until @dmacko232 fixes it.
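
For reference, a comparison of the two extractors on a single PDF could be sketched roughly as below. This is only an illustration, not the code used in this PR: the helper names (`extract_with_pdftotext`, `extract_with_pymupdf`, `similarity`) are hypothetical, calling the `pdftotext` command-line tool is an assumption (the project may use other bindings), and token-level similarity is just one possible way to quantify how far the outputs diverge.

```python
import difflib
import shutil
import subprocess
from pathlib import Path


def extract_with_pdftotext(pdf_path: Path) -> str:
    """Extract text with the poppler `pdftotext` CLI (assumed to be on PATH)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext binary not found on PATH")
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def extract_with_pymupdf(pdf_path: Path) -> str:
    """Extract text page by page with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the other helpers work without it

    with fitz.open(str(pdf_path)) as doc:
        return "\n".join(page.get_text() for page in doc)


def similarity(text_a: str, text_b: str) -> float:
    """Return a whitespace-insensitive similarity ratio in [0, 1]."""
    # Comparing token lists ignores differences in spacing and line breaks.
    tokens_a, tokens_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    return matcher.ratio()
```

Calling `similarity(extract_with_pdftotext(p), extract_with_pymupdf(p))` for each certificate `p` in the toy dataset would give a rough per-document score; low scores point at the documents worth inspecting by hand.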

@adamjanovsky
Collaborator Author

@dmacko232 Do we know what the internal dependencies of the pymupdf package are? Could we drop the dependency on poppler if we make the switch?

Also, we're scanning some tables in FIPS documents with a Java tool. Could we get rid of the Java dependency as well?

@dmacko232
Collaborator

@adamjanovsky With pymupdf, Poppler is not a dependency. The Java tool should not be a dependency either, I guess.

@adamjanovsky
Collaborator Author

Closing this. The details of what it would take for pymupdf to surpass pdftotext in output quality are described in #364.

@adamjanovsky adamjanovsky deleted the feat/pymupdf_experiment branch November 14, 2023 13:44

Successfully merging this pull request may close these issues.

Evaluate possible switch from pdftotext to PyMuPDF
3 participants