PyPDFToDocument: make conversion customization easier for users #8553

Closed
anakin87 opened this issue Nov 18, 2024 · 2 comments
Labels
P2 Medium priority, add to the next sprint if no P1 available pdf topic:preprocessing

Comments

@anakin87
Member

Is your feature request related to a problem? Please describe.
This stemmed from deepset-ai/haystack-tutorials#362.

Currently, to customize the PDF conversion process, the user has to provide a custom Converter (adhering to the PyPDFConverter protocol).

While this allows great flexibility, it requires considerable effort for users who wish to customize only one of the PyPDF extraction parameters (for example, extraction_mode).
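To make the effort concrete, here is a minimal sketch of what "provide a custom Converter" tends to involve. Everything in it is an assumption for illustration: `ConverterProtocol`, its `convert`/`to_dict`/`from_dict` methods, and the `FakeReader`/`FakePage` stand-ins are hypothetical and are not the actual Haystack or PyPDF classes. The point is only that changing a single argument means writing and serializing a whole class.

```python
from typing import Any, Dict, List, Protocol


class FakePage:
    """Stand-in for a PDF page object (hypothetical, not the pypdf class)."""

    def __init__(self, text: str) -> None:
        self._text = text

    def extract_text(self, extraction_mode: str = "plain") -> str:
        # A real page would change its layout handling; here we only tag the mode.
        return f"[{extraction_mode}] {self._text}"


class FakeReader:
    """Stand-in for a PDF reader exposing a list of pages."""

    def __init__(self, pages: List[FakePage]) -> None:
        self.pages = pages


class ConverterProtocol(Protocol):
    """Assumed shape of a converter protocol: convert plus (de)serialization."""

    def convert(self, reader: FakeReader) -> str: ...

    def to_dict(self) -> Dict[str, Any]: ...

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ConverterProtocol": ...


class LayoutConverter:
    """All of this exists only to change one argument: extraction_mode."""

    def convert(self, reader: FakeReader) -> str:
        # Join pages with a form feed, a common page separator.
        return "\f".join(
            page.extract_text(extraction_mode="layout") for page in reader.pages
        )

    def to_dict(self) -> Dict[str, Any]:
        return {"type": "LayoutConverter", "init_parameters": {}}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "LayoutConverter":
        return cls()


reader = FakeReader([FakePage("page one"), FakePage("page two")])
text = LayoutConverter().convert(reader)
```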

Describe the solution you'd like
It would be nice to provide an easier way to do simple customizations.

  • Initially, I thought of allowing users to pass extraction_kwargs in __init__ and also including them in the PyPDFConverter protocol.
  • @silvanocerza proposed another idea: create something like a CustomConverter implementation (adhering to the PyPDFConverter protocol) and make it possible for users to use it in a simple way.
    Something like:
    from haystack.components.converters.pypdf import PyPDFToDocument, CustomConverter
    
    custom_converter = CustomConverter(extraction_mode="layout")
    
    pypdf_to_document = PyPDFToDocument(converter=custom_converter)
    
    pypdf_to_document.run(...)
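For comparison, the first option (extraction_kwargs in __init__) could look roughly like the sketch below. `KwargsPdfConverter` and `StubPage` are hypothetical stand-ins written for this illustration, not the real PyPDFToDocument or a PyPDF page; the sketch only shows user-supplied keyword arguments being forwarded unchanged to each page's text extraction call.

```python
from typing import Any, Dict, List, Optional


class StubPage:
    """Stand-in for a PDF page; records the keyword arguments it receives."""

    def __init__(self, text: str) -> None:
        self._text = text
        self.seen_kwargs: Dict[str, Any] = {}

    def extract_text(self, **kwargs: Any) -> str:
        self.seen_kwargs = kwargs
        return self._text


class KwargsPdfConverter:
    """Hypothetical component, not the real PyPDFToDocument."""

    def __init__(self, extraction_kwargs: Optional[Dict[str, Any]] = None) -> None:
        # Copy so later changes to the caller's dict do not leak in.
        self.extraction_kwargs = dict(extraction_kwargs or {})

    def run(self, pages: List[StubPage]) -> str:
        # Every page gets the same user-supplied keyword arguments.
        return "\n".join(
            page.extract_text(**self.extraction_kwargs) for page in pages
        )


pages = [StubPage("first"), StubPage("second")]
converter = KwargsPdfConverter(extraction_kwargs={"extraction_mode": "layout"})
result = converter.run(pages)
```

One likely trade-off of this shape is that an open-ended dictionary accepts values the component cannot validate or reliably serialize.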

We would like to get @shadeMe's opinion on this...

@anakin87
Member Author

Passing a converter allows almost unlimited customization, but I think this is rarely used, and I would argue that it is easier for the user to create a custom component.

After an internal discussion, we decided to do the following:

  • Deprecate the ability to pass a converter object at init time (and tell users to create a custom component if they need very specific customization). Remove this feature in the future.
  • Expand the customizability of PyPDFToDocument. We will follow the approach of PDFMinerToDocument, exposing a selected list of PyPDF parameters, chosen and serialized by us.
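
A minimal sketch of the "selected parameters, chosen and serialized by us" idea, under stated assumptions: `CuratedPdfConverter` and `ExtractionMode` are hypothetical names, and the two options shown are only examples of parameters that might be exposed, not the list that was actually chosen. It illustrates why a fixed set of typed parameters is easy to round-trip through a dictionary.

```python
from enum import Enum
from typing import Any, Dict


class ExtractionMode(Enum):
    """Hypothetical enum for the extraction mode values."""

    PLAIN = "plain"
    LAYOUT = "layout"


class CuratedPdfConverter:
    """Hypothetical component exposing a fixed, serializable set of options."""

    def __init__(
        self,
        extraction_mode: ExtractionMode = ExtractionMode.PLAIN,
        layout_mode_space_vertically: bool = True,
    ) -> None:
        self.extraction_mode = extraction_mode
        self.layout_mode_space_vertically = layout_mode_space_vertically

    def extraction_arguments(self) -> Dict[str, Any]:
        # Keyword arguments that would be forwarded to the text extraction call.
        return {
            "extraction_mode": self.extraction_mode.value,
            "layout_mode_space_vertically": self.layout_mode_space_vertically,
        }

    def to_dict(self) -> Dict[str, Any]:
        # Only plain values are stored, so the result is safe to dump as JSON/YAML.
        return {
            "type": "CuratedPdfConverter",
            "init_parameters": self.extraction_arguments(),
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "CuratedPdfConverter":
        params = dict(data["init_parameters"])
        # Turn the stored string back into the enum member.
        params["extraction_mode"] = ExtractionMode(params["extraction_mode"])
        return cls(**params)


original = CuratedPdfConverter(extraction_mode=ExtractionMode.LAYOUT)
restored = CuratedPdfConverter.from_dict(original.to_dict())
```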

@anakin87
Member Author

Done in #8569 and #8574.
