Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: return page number of pdf documents upon retrieval #7749

Merged
merged 3 commits into from
Sep 5, 2024
Merged

feat: return page number of pdf documents upon retrieval #7749

merged 3 commits into from
Sep 5, 2024

Conversation

jasonkang14
Copy link
Contributor

@jasonkang14 jasonkang14 commented Aug 28, 2024

Checklist:

Important

Please review the checklist below before submitting your pull request.

  • Please open an issue before creating a PR or link to an existing issue
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

Description

Fixes

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update, included: Dify Document
  • Improvement, including but not limited to code refactoring, performance optimization, and UI/UX improvement
  • Dependency upgrade

Testing Instructions

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Tested on different local files besides pdf and it works
  • page number is returned when querying a pdf file
Screenshot 2024-08-28 at 4 00 40 PM

@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. 🐍 python 💪 enhancement New feature or request labels Aug 28, 2024
@crazywoola crazywoola self-requested a review August 28, 2024 08:49
@crazywoola crazywoola self-assigned this Aug 28, 2024
@crazywoola
Copy link
Member

I tried several PDFs but always get the page number as null.

@jasonkang14
Copy link
Contributor Author

@crazywoola i will check again.

@jasonkang14
Copy link
Contributor Author

jasonkang14 commented Aug 29, 2024

@crazywoola i think its because not all vector stores save page numbers. can i change those in this PR as well?

@jasonkang14
Copy link
Contributor Author

@crazywoola it's a bug with weaviate. i have tried other vector stores and it retrieves page numbers. and i have found a bug with milvus so will be making another PR for each

@crazywoola crazywoola mentioned this pull request Sep 4, 2024
5 tasks
@crazywoola
Copy link
Member

@crazywoola it's a bug with weaviate. i have tried other vector stores and it retrieves page numbers. and i have found a bug with milvus so will be making another PR for each

@jasonkang14 You can fix the weaviate db in this PR, and you can open another one to fix others. :)

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Sep 5, 2024
@jasonkang14
Copy link
Contributor Author

@crazywoola I have found a better way to resolve this. It turned out that other vector stores were actually returning page which was not included in the attributes in vector_factory.py now the "weaviate bug" has been fixed

Screenshot 2024-09-05 at 2 52 42 PM

Copy link
Member

@crazywoola crazywoola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I got the page number from the doc. :)

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Sep 5, 2024
@crazywoola crazywoola merged commit d489b8b into langgenius:main Sep 5, 2024
6 checks passed
@jasonkang14 jasonkang14 deleted the feature/add_page_number_to_retrieval branch September 5, 2024 09:15
mehrajagdish pushed a commit to Sbazar-GmbH/dify that referenced this pull request Sep 6, 2024
@ifsheldon
Copy link

ifsheldon commented Sep 10, 2024

@crazywoola @jasonkang14 This silently breaks my existing Dify that is built from the older source code. My Dify was in use prior to this commit, so the existing weaviate DB does not have page field.

When I updated the source code which includes this commit and built and run it again, I got a weird error stating something like

werkzeug.exceptions.InternalServerError: 500 Internal Server Error: Error during query: [{'locations': [{'column': 155, 'line': 1}], 'message': 'Cannot query field "page" on type "Vector_index_50fde6d5_97eb_4201_8048_c9311f9b8183_Node".', 'path': None}];
Error during query: [{'locations': [{'column': 17082, 'line': 1}], 'message': 'Cannot query field "page" on type "Vector_index_50fde6d5_97eb_4201_8048_c9311f9b8183_Node".', 'path': None}]

The error is caught high up in the stack, so it took me a day to locate the exception.

The exception is caught here:

The breaking change is here:

attributes = ['doc_id', 'dataset_id', 'document_id', 'doc_hash', 'page']

The exception is raised here:

I think this is actually a breaking change. Probably you should advise on how to do a DB migration?

@jasonkang14
Copy link
Contributor Author

@ifsheldon On my end, a database migration was not necessary. Dify automatically stores the extracted text from PDFs into the vector store. The recent update simply ensures that the page is now included in the retrieval process. check out the source code below

metadata = {"source": blob.source, "page": page_number}

@ifsheldon
Copy link

ifsheldon commented Sep 10, 2024

@jasonkang14 The problem, I guess, is if the content of a knowledgebase comes from pure text, then the metadata will not contain page field. However, you changed the default attributes to search, adding page into attributes. Then, the knowledgebase backed by weaviate will throw an exception when performing a search for these attributes(including page). You can try simply creating a knowledgebase from a txt file and making a hit test.

@crazywoola crazywoola mentioned this pull request Sep 10, 2024
5 tasks
lau-td pushed a commit to heydevs-io/dify that referenced this pull request Oct 23, 2024
idonotknow pushed a commit to AceDataCloud/Dify that referenced this pull request Nov 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💪 enhancement New feature or request lgtm This PR has been approved by a maintainer size:S This PR changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add page number to the segments Include Original Document Page Numbers in Retrieval Results
3 participants