`PyPDFToDocument`: make conversion customization easier for users #8553

anakin87 · 2024-11-18T12:26:46Z

Is your feature request related to a problem? Please describe.
This stemmed from deepset-ai/haystack-tutorials#362.

Currently, to customize the PDF conversion process, the user has to provide a custom Converter (adhering to the PyPDFConverter protocol).

While this allows great flexibility, it requires considerable effort for users who wish to customize only one extraction parameter (for example, extraction_mode). See the PyPDF extraction parameters for the available options.
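
To illustrate the effort involved, here is a minimal, dependency-free sketch of a custom converter whose only purpose is to change extraction_mode. It assumes the protocol asks for convert, to_dict, and from_dict; `ConverterProtocol` is a local stand-in rather than an import from Haystack, `LayoutConverter`, `FakePage`, and `FakeReader` are hypothetical names, and convert returns plain text here where a real converter would build a Document.

```python
from typing import Any, Dict, List, Protocol


class ConverterProtocol(Protocol):
    """Local stand-in for the PyPDFConverter protocol (assumed shape)."""

    def convert(self, reader: Any) -> str: ...

    def to_dict(self) -> Dict[str, Any]: ...

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ConverterProtocol": ...


class LayoutConverter:
    """Hypothetical converter that only overrides the extraction mode."""

    def __init__(self, extraction_mode: str = "layout"):
        self.extraction_mode = extraction_mode

    def convert(self, reader: Any) -> str:
        # Join pages with a form feed; a real converter would wrap this in a Document.
        return "\f".join(
            page.extract_text(extraction_mode=self.extraction_mode)
            for page in reader.pages
        )

    def to_dict(self) -> Dict[str, Any]:
        # Serialization has to be written by hand for every custom converter.
        return {"init_parameters": {"extraction_mode": self.extraction_mode}}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "LayoutConverter":
        return cls(**data["init_parameters"])


class FakePage:
    """Test double standing in for a PDF page object."""

    def __init__(self, text: str):
        self.text = text

    def extract_text(self, extraction_mode: str = "plain") -> str:
        return f"{extraction_mode}:{self.text}"


class FakeReader:
    """Test double standing in for a PDF reader with a `pages` list."""

    def __init__(self, pages: List[FakePage]):
        self.pages = pages


converter: ConverterProtocol = LayoutConverter()
text = converter.convert(FakeReader([FakePage("a"), FakePage("b")]))
```

Three methods plus hand-written serialization for a single keyword argument is the overhead this issue is about.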

Describe the solution you'd like
It would be nice to provide an easier way to do simple customizations.

Initially, I thought of allowing users to pass extraction_kwargs in __init__ and also including them in the PyPDFConverter protocol.

@silvanocerza proposed another idea: create something like a CustomConverter implementation (adhering to PyPDFConverter protocol) and make it possible for users to use it in a simple way.
Something like:

from haystack.components.converters.pypdf import PyPDFToDocument, CustomConverter

custom_converter = CustomConverter(extraction_mode="layout")

pypdf_to_document = PyPDFToDocument(converter=custom_converter)

pypdf_to_document.run(...)

We would like to get @shadeMe's opinion on this...


anakin87 · 2024-11-22T11:43:15Z

Passing a converter allows almost unlimited customization, but I think this is rarely used, and I would argue that it is easier for the user to create a custom component.

After an internal discussion, we decided to do the following:

Deprecate the ability to pass a converter object at init time (and tell users to create a custom component if they need very specific customization). Remove this feature in the future.
Expand the customizability of PyPDFToDocument. We will follow the approach of PDFMinerToDocument, exposing a selected list of PyPDF parameters, chosen and serialized by us.
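
The second point can be sketched roughly as follows: a component that accepts a fixed set of extraction parameters as init arguments, serializes them itself, and forwards them to each page's extract_text call. This is only an illustration under stated assumptions, not the code merged for this issue: `ExplicitParamsExtractor`, `ExtractionMode`, and `StubPage` are hypothetical names, and the two parameters shown are just a sample of what could be exposed (they mirror the extraction_mode and layout_mode_space_vertically keyword arguments of pypdf's extract_text).

```python
from enum import Enum
from typing import Any, Dict, Iterable, Union


class ExtractionMode(Enum):
    """Hypothetical enum for the supported extraction modes."""

    PLAIN = "plain"
    LAYOUT = "layout"


class ExplicitParamsExtractor:
    """Hypothetical component exposing a selected list of extraction parameters."""

    def __init__(
        self,
        extraction_mode: Union[str, ExtractionMode] = ExtractionMode.PLAIN,
        layout_mode_space_vertically: bool = True,
    ):
        # Accept plain strings for convenience; invalid values raise ValueError.
        if isinstance(extraction_mode, str):
            extraction_mode = ExtractionMode(extraction_mode)
        self.extraction_mode = extraction_mode
        self.layout_mode_space_vertically = layout_mode_space_vertically

    def to_dict(self) -> Dict[str, Any]:
        # The component owns serialization, so users never have to write it.
        return {
            "extraction_mode": self.extraction_mode.value,
            "layout_mode_space_vertically": self.layout_mode_space_vertically,
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ExplicitParamsExtractor":
        return cls(**data)

    def run(self, pages: Iterable[Any]) -> str:
        # Forward the stored parameters to every page.
        return "\f".join(
            page.extract_text(
                extraction_mode=self.extraction_mode.value,
                layout_mode_space_vertically=self.layout_mode_space_vertically,
            )
            for page in pages
        )


class StubPage:
    """Test double standing in for a PDF page object."""

    def __init__(self, text: str):
        self.text = text

    def extract_text(self, extraction_mode: str = "plain", **kwargs: Any) -> str:
        return f"[{extraction_mode}] {self.text}"


extractor = ExplicitParamsExtractor(extraction_mode="layout")
output = extractor.run([StubPage("x")])
```

Compared with passing a converter object, the user changes one keyword argument, and serialization keeps working because every parameter is a plain value.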

anakin87 · 2024-11-26T15:39:35Z

done in #8569 and #8574

anakin87 added topic:preprocessing pdf labels Nov 18, 2024

anakin87 mentioned this issue Nov 18, 2024

Document joiner tutorial - input prompt to LLM with no whitespaces and mixed contents deepset-ai/haystack-tutorials#362

Closed

anakin87 self-assigned this Nov 21, 2024

anakin87 added the P2 Medium priority, add to the next sprint if no P1 available label Nov 21, 2024

This was referenced Nov 22, 2024

chore:PyPDFToDocument - deprecate converter init parameter #8569

Merged

feat: PyPDFToDocument - add new customization parameters #8574

Merged

anakin87 closed this as completed Nov 26, 2024

anakin87 mentioned this issue Nov 26, 2024

PyPDFToDocument - remove deprecated converter init parameter #8586

Open
