[Gen-Ai-Orchestrator] Add title to embedded documents for better retrieval #1639

GuirriecP
Contributor

@GuirriecP GuirriecP commented Jun 18, 2024

Adds the 'Title' entry of the input CSV to the embedded documents:

  • during retrieval, the list of retrieved documents is more deterministic
  • it seems to slightly improve the quality of the generated answer
  • it fixes the 'Text: ' prefix present in the first chunks of source documents.
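
As a rough illustration of the idea (not the PR's actual code), the sketch below prepends a chunk's title to its text before embedding. The `Document` class is a hypothetical stand-in for a LangChain-style document, and the `title` metadata key and the `add_title_to_chunks` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a LangChain-style document chunk."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def add_title_to_chunks(chunks):
    """Prepend each chunk's metadata title (assumed key: 'title') to its text."""
    for doc in chunks:
        title = doc.metadata.get("title")
        if title:  # chunks without a title are left unchanged
            doc.page_content = f"{title}\n\n{doc.page_content}"
    return chunks


chunks = [
    Document("Open the app and choose 'Reset password'.", {"title": "Password reset"}),
    Document("Then confirm with the code sent by SMS.", {"title": "Password reset"}),
]
for doc in add_title_to_chunks(chunks):
    print(repr(doc.page_content))
```

Because every chunk of a document then carries the same title text, chunks from the same source should embed closer to title-like queries, which is the effect the description attributes to this change.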

Fixes #1638

@vsct-jburet vsct-jburet added this to the 24.3.2 milestone Jun 18, 2024
@vsct-jburet
Contributor

@GuirriecP is it still a draft?

@GuirriecP
Contributor Author

@vsct-jburet Yes, it is. We will use draft PRs to tell the Arkéa team members that the work is done, pending review. We will only push final, mergeable PRs to the repo once they have been reviewed.

@vsct-jburet vsct-jburet modified the milestones: 24.3.2, 24.3.3 Jun 18, 2024
@vsct-jburet vsct-jburet modified the milestones: 24.3.3, 24.3.4 Jun 27, 2024
"""Add 'title' from metadata to Document's page_content for better semantic search."""
for doc in splitted_docs:
    # Store the original page_content in the metadata
    doc.metadata['original_text'] = doc.page_content
Member

This will double the size of the database compared with the previous implementation. For a small DB that is not an issue, but for bigger ones it will have a significant storage (and cost) impact.
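
To make the storage concern concrete, here is a minimal sketch that compares the stored text size of one chunk with and without the `original_text` copy in its metadata. It counts UTF-8 bytes of text only and ignores embedding vectors and index overhead. The `Document` stand-in and the `stored_text_bytes` and `with_title` helpers are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a LangChain-style document chunk."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def stored_text_bytes(doc):
    """UTF-8 bytes of page_content plus all string metadata values."""
    size = len(doc.page_content.encode("utf-8"))
    size += sum(
        len(value.encode("utf-8"))
        for value in doc.metadata.values()
        if isinstance(value, str)
    )
    return size


def with_title(doc, keep_original):
    """Return a copy with the title prepended, optionally keeping the original text."""
    meta = dict(doc.metadata)
    if keep_original:
        meta["original_text"] = doc.page_content
    return Document(f"{meta['title']}\n\n{doc.page_content}", meta)


chunk = Document("x" * 1000, {"title": "Password reset"})
baseline = stored_text_bytes(chunk)
kept = stored_text_bytes(with_title(chunk, keep_original=True))
dropped = stored_text_bytes(with_title(chunk, keep_original=False))
print(f"baseline={baseline} kept={kept} dropped={dropped}")
print(f"overhead with original_text: x{kept / baseline:.2f}")
```

In this toy case the text payload roughly doubles when `original_text` is kept, while prepending the title alone adds only a few bytes. The real impact depends on how much the vector store adds per chunk for embeddings and indexes.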

@vsct-jburet vsct-jburet modified the milestones: 24.3.4, 24.3.5 Jul 11, 2024
@vsct-jburet vsct-jburet modified the milestones: 24.3.5, 23.4.6 Aug 20, 2024
@Benvii
Member

Benvii commented Sep 2, 2024

@assouktim where did you fix this PR? Is there a new branch somewhere? Thanks in advance.

@assouktim
Contributor

Replaced by #1732

@assouktim assouktim closed this Sep 2, 2024
Development

Successfully merging this pull request may close these issues.

[Gen-Ai-Orchestrator] Documents' 'Title' shall be used for semantic retrieval
4 participants