Document joiner tutorial - input prompt to LLM with no whitespaces and mixed contents #362

aleflabo · 2024-11-13T16:48:03Z

I'm looking at the prompt that is passed to the LLM in the document joiner tutorial, and it contains many words with no whitespace between them. Does this affect the LLM's output? Do you know how I can fix it? I also tried the Ollama embedder, and the output had the same problem.
The documents used are the ones downloaded in the tutorial: one .txt, one .pdf, and one .md.

anakin87 · 2024-11-14T09:03:11Z

Thanks for reporting the issue.
I'll have a look in the next few days.

anakin87 · 2024-11-18T11:57:40Z

The PDF used in the tutorial is somewhat unusual, so currently the only way to extract its text properly is with a custom converter:

from pypdf import PdfReader
from haystack import Document, default_to_dict, default_from_dict
from haystack.components.converters import PyPDFToDocument

class CustomConverter:
    def convert(self, reader: PdfReader) -> Document:
        """Extract text from the PDF and return a Document object with the text content."""
        # "layout" mode keeps the visual spacing between words; pages are joined with a form feed.
        text = "\f".join(page.extract_text(extraction_mode="layout") for page in reader.pages)
        return Document(content=text)

    def to_dict(self):
        """Serialize the converter to a dictionary."""
        return default_to_dict(self)

    @classmethod
    def from_dict(cls, data):
        """Deserialize the converter from a dictionary."""
        return default_from_dict(cls, data)

# Pass the custom converter to the PDF component via the `converter` parameter.
pdf_converter = PyPDFToDocument(converter=CustomConverter())
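
To check whether a given extraction mode actually fixes the run-together words, one option is to look at how long the whitespace-separated tokens are. The following is only a rough sketch and not part of Haystack or pypdf: `avg_token_length`, `looks_run_together`, `compare_modes`, and the threshold of 12 characters are hypothetical names and values chosen for illustration.

```python
def avg_token_length(text: str) -> float:
    """Return the mean length of whitespace-separated tokens (0.0 for empty text)."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(len(token) for token in tokens) / len(tokens)


def looks_run_together(text: str, threshold: float = 12.0) -> bool:
    """Heuristic: very long average tokens suggest missing whitespace.

    The default threshold is an assumption, not a tuned value.
    """
    return avg_token_length(text) > threshold


def compare_modes(pdf_path: str) -> dict:
    """Compare pypdf's "plain" and "layout" extraction modes on one PDF."""
    from pypdf import PdfReader  # imported here so the helpers above need no dependency

    reader = PdfReader(pdf_path)
    results = {}
    for mode in ("plain", "layout"):
        text = "\f".join(
            page.extract_text(extraction_mode=mode) for page in reader.pages
        )
        results[mode] = avg_token_length(text)
    return results
```

Calling `compare_modes` on the tutorial's PDF (the path is whatever you saved it as) gives one average per mode; a much lower value for "layout" would suggest that mode is restoring the missing spaces.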

I will do the following:

- use a better PDF in the tutorial
- investigate simpler ways for users to customize PDF conversion (deepset-ai/haystack#8553, "PyPDFToDocument: make conversion customization easier for users")

anakin87 · 2024-11-18T12:29:05Z

I'm closing this issue and moving the discussion of simpler ways for users to customize PDF conversion to deepset-ai/haystack#8553.

aleflabo added the bug label Nov 13, 2024

anakin87 self-assigned this Nov 14, 2024

anakin87 mentioned this issue Nov 18, 2024

PyPDFToDocument: make conversion customization easier for users deepset-ai/haystack#8553

Closed

anakin87 closed this as completed Nov 18, 2024
