feat: `PyPDFToDocument` - add new customization parameters #8574

anakin87 · 2024-11-22T17:13:15Z

Related Issues

part of PyPDFToDocument: make conversion customization easier for users #8553

Proposed Changes:

Add new initialization parameters to PyPDFToDocument to customize the text extraction process from PDF files.
These parameters won't be used if a custom converter is provided (the converter init parameter will be deprecated in chore: PyPDFToDocument - deprecate converter init parameter #8569).
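
A minimal sketch of the behavior described above, not the actual Haystack implementation: explicit init parameters are stored on the component and forwarded to the text-extraction call, unless a custom converter is supplied, in which case they are ignored. The names `SketchPDFConverter` and `FakePage` are hypothetical, and the parameter names are assumptions loosely modeled on the keyword arguments of pypdf's `extract_text`.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class FakePage:
    """Hypothetical stand-in for a PDF page; it records the options it receives."""

    def __init__(self) -> None:
        self.last_kwargs: Dict[str, Any] = {}

    def extract_text(self, **kwargs: Any) -> str:
        # A real page object would parse the PDF content; this one only remembers the options.
        self.last_kwargs = kwargs
        return "page text"


class SketchPDFConverter:
    """Hypothetical converter whose init parameters customize text extraction."""

    def __init__(
        self,
        converter: Optional[Callable[[FakePage], str]] = None,
        extraction_mode: str = "plain",
        plain_mode_orientations: Tuple[int, ...] = (0, 90, 180, 270),
        plain_mode_space_width: float = 200.0,
    ) -> None:
        if extraction_mode not in ("plain", "layout"):
            raise ValueError(f"Unknown extraction mode: {extraction_mode}")
        self.converter = converter
        self.extraction_mode = extraction_mode
        self.plain_mode_orientations = plain_mode_orientations
        self.plain_mode_space_width = plain_mode_space_width

    def _extraction_kwargs(self) -> Dict[str, Any]:
        # Map the explicit init parameters to the keyword arguments of the extraction call.
        return {
            "extraction_mode": self.extraction_mode,
            "orientations": self.plain_mode_orientations,
            "space_width": self.plain_mode_space_width,
        }

    def convert(self, page: FakePage) -> str:
        # A custom converter takes precedence; the customization parameters are not used.
        if self.converter is not None:
            return self.converter(page)
        return page.extract_text(**self._extraction_kwargs())
```

With this shape, `SketchPDFConverter(plain_mode_space_width=150.0).convert(page)` passes `space_width=150.0` to the page, while `SketchPDFConverter(converter=my_function)` bypasses the stored options entirely.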

How did you test it?

CI, new test

Notes for the reviewer

I don't particularly like the addition of all of these new init parameters, but we have already discussed and discarded the idea of having a single extraction_kwargs dict.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

coveralls · 2024-11-26T13:55:39Z

Pull Request Test Coverage Report for Build 12033758738

Details

0 of 0 changed or added relevant lines in 0 files are covered.
6 unchanged lines in 1 file lost coverage.
Overall coverage increased (+0.03%) to 90.349%

Files with Coverage Reduction	New Missed Lines	%
components/converters/pypdf.py	6	93.18%

Totals
Change from base Build 12031967571:	0.03%
Covered Lines:	8014
Relevant Lines:	8870

💛 - Coveralls

vblagoje

Seems fine, only minor nitpicking comments.
Maintaining compatibility on this one is going to be hard with so many fields. Is there any other way to deal with these in a less breaking manner? Perhaps some config dictionary or something....

haystack/components/converters/pypdf.py

anakin87 · 2024-11-26T15:30:30Z

Perhaps some config dictionary or something....

This was my original idea, but then we discarded it to keep more control over the accepted parameters and serialization.
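
The trade-off can be sketched as follows. This is an illustration under stated assumptions, not Haystack code: `ExplicitParams` and `KwargsDictParams` are hypothetical classes, and the two option names are placeholders. Explicit parameters give a fixed set of fields that round-trips through serialization and rejects misspelled options, whereas a free-form dict accepts anything.

```python
import json
from typing import Any, Dict


class ExplicitParams:
    """Hypothetical component with a fixed, explicit set of init parameters."""

    def __init__(self, extraction_mode: str = "plain", space_width: float = 200.0) -> None:
        self.extraction_mode = extraction_mode
        self.space_width = space_width

    def to_dict(self) -> Dict[str, Any]:
        # Every field is known in advance, so the serialized form is predictable.
        return {"extraction_mode": self.extraction_mode, "space_width": self.space_width}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ExplicitParams":
        return cls(**data)


class KwargsDictParams:
    """Hypothetical alternative: a single free-form dict of extraction options."""

    def __init__(self, extraction_kwargs: Dict[str, Any]) -> None:
        # Nothing checks the keys, or whether the values can be serialized.
        self.extraction_kwargs = extraction_kwargs


def rejects_unknown_option() -> bool:
    """Return True if the explicit variant refuses a misspelled option name."""
    try:
        ExplicitParams(**{"extraction_mod": "layout"})  # typo in the option name
    except TypeError:
        return True
    return False


# Round trip through JSON: the explicit variant restores to an equivalent object.
serialized = json.dumps(ExplicitParams(extraction_mode="layout").to_dict())
restored = ExplicitParams.from_dict(json.loads(serialized))
```

In this sketch the misspelled `extraction_mod` fails immediately with the explicit variant, while `KwargsDictParams({"extraction_mod": "layout"})` would be accepted silently and only fail (or be ignored) later.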

anakin87 added 8 commits November 22, 2024 13:08

deprecat converter in pypdf

9a6d709

fix linting of MetaFieldGroupingRanker

438638e

linting

88e7200

Merge branch 'fix-lint-meta-field-grouping-ranker' into pypdf-deprecate-custom-converter

466eafd

Merge branch 'main' into pypdf-deprecate-custom-converter

1ca6163

Merge branch 'main' into pypdf-deprecate-custom-converter

25aafe5

Merge branch 'pypdf-deprecate-custom-converter' of https://github.com/deepset-ai/haystack into pypdf-deprecate-custom-converter

c9b5072

pypdftodocument: add customization params

03e2f6f

github-actions bot added the type:documentation and topic:tests labels Nov 22, 2024

fix mypy

f6636e2

anakin87 marked this pull request as ready for review November 22, 2024 17:31

anakin87 requested review from a team as code owners November 22, 2024 17:31

anakin87 requested review from dfokina and vblagoje and removed request for a team November 22, 2024 17:31

anakin87 added this to the 2.8.0 milestone Nov 22, 2024

Base automatically changed from pypdf-deprecate-custom-converter to main November 26, 2024 13:47

Merge branch 'main' into pypdf-add-customization-params

5b0507a

vblagoje approved these changes Nov 26, 2024

incorporate feedback

9116031

anakin87 merged commit fb42c03 into main Nov 26, 2024
18 checks passed

anakin87 deleted the pypdf-add-customization-params branch November 26, 2024 15:38

anakin87 mentioned this pull request Nov 26, 2024

PyPDFToDocument: make conversion customization easier for users #8553

Closed
