Document joiner tutorial - input prompt to LLM with no whitespaces and mixed contents #362

Closed
aleflabo opened this issue Nov 13, 2024 · 3 comments
Assignees
Labels
bug Something isn't working

Comments

@aleflabo

I'm looking at the prompt that is sent to the LLM in the document joiner tutorial, and many of its words are run together with no whitespace between them. Does this affect the LLM's output? Do you know how I can solve this? I also tried the Ollama embedder, and the output had the same problem.
The documents used are the ones downloaded in the tutorial: one .txt, one .pdf and one .md.

image

@aleflabo aleflabo added the bug Something isn't working label Nov 13, 2024
@anakin87 anakin87 self-assigned this Nov 14, 2024
@anakin87
Member

Thanks for reporting the issue.
I'll have a look in the next few days.

@anakin87
Member

anakin87 commented Nov 18, 2024

This PDF is somewhat strange, so currently the only way to properly extract the text is via a custom Converter.

from pypdf import PdfReader
from haystack import Document, default_to_dict, default_from_dict
from haystack.components.converters import PyPDFToDocument

class CustomConverter:
    def convert(self, reader: PdfReader) -> Document:
        """Extract text from the PDF and return a Document object with the text content."""
        # extraction_mode="layout" asks pypdf to follow the page layout when extracting text.
        # Pages are joined with a form feed ("\f") so page boundaries stay visible.
        text = "\f".join(page.extract_text(extraction_mode="layout") for page in reader.pages)
        return Document(content=text)

    def to_dict(self):
        """Serialize the converter to a dictionary."""
        return default_to_dict(self)

    @classmethod
    def from_dict(cls, data):
        """Deserialize the converter from a dictionary."""
        return default_from_dict(cls, data)

# Hand the custom converter to the PDF converter component used in the pipeline.
pdf_converter = PyPDFToDocument(converter=CustomConverter())
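
To check whether a given extraction still has the "words glued together" problem, a rough sanity check on the extracted text may help. The sketch below is only a heuristic and is not part of Haystack or pypdf: the function names and both thresholds are assumptions chosen for illustration, and long tokens such as URLs can trigger false positives.

```python
def longest_token_length(text: str) -> int:
    """Return the length of the longest whitespace-delimited token."""
    return max((len(token) for token in text.split()), default=0)


def whitespace_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are whitespace."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def looks_glued(text: str, max_token_len: int = 30, min_ws_ratio: float = 0.08) -> bool:
    """Guess whether `text` lost the spaces between its words.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    if not text.strip():
        # Empty or whitespace-only text carries no signal either way.
        return False
    return (
        longest_token_length(text) > max_token_len
        or whitespace_ratio(text) < min_ws_ratio
    )


glued = "Thisisasentencewithnowhitespacebetweenthewordsatall"
normal = "This is a sentence with normal spacing between the words."
print(looks_glued(glued), looks_glued(normal))
```

Running it on the `content` of the Document produced with and without the custom converter would give a quick indication of whether the layout-mode extraction fixed the spacing for this PDF.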

I will do the following:

@anakin87
Member

I'm closing this issue and moving the discussion of simpler ways for users to customize PDF conversion to deepset-ai/haystack#8553.
