feat: return page number of pdf documents upon retrieval #7749
Conversation
I tried several PDFs but always get the page number as
@crazywoola I will check again.
@crazywoola I think it's because not all vector stores save page numbers. Can I change those in this PR as well?
@crazywoola It's a bug with Weaviate. I have tried other vector stores, and they do retrieve page numbers. I also found a bug with Milvus, so I will be making another PR for each.
@jasonkang14 You can fix the Weaviate DB in this PR, and you can open another one to fix the others. :)
@crazywoola I have found a better way to resolve this. It turned out that other vector stores were actually returning
Cool, I got the page number from the doc. :)
@crazywoola @jasonkang14 This silently breaks my existing Dify deployment, which was built from older source code. My Dify was in use prior to this commit, so the existing Weaviate DB does not have When I updated to the source code that includes this commit, then built and ran it again, I got a weird error stating something like
The error is caught high up in the stack, so it took me a day to locate the exception. The exception is caught here:
The breaking change is here:
The exception is raised here:
I think this is actually a breaking change. Could you advise on how to do a DB migration?
@ifsheldon On my end, a database migration was not necessary. Dify automatically stores the extracted text from PDFs in the vector store. The recent update simply ensures that the page is now included in the retrieval process. Check out the source code below:
@jasonkang14 The problem, I guess, is that if the content of a knowledge base comes from pure text, then the metadata will not contain
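To illustrate the concern raised in this thread, here is a minimal sketch of a defensive metadata lookup that tolerates chunks indexed before page numbers were stored, or chunks that came from plain text. It is not Dify's actual code: the function name `get_page_number` and the sample metadata dictionaries are hypothetical, and the key name `page` is assumed from the PR description.

```python
from typing import Any, Dict, Optional


def get_page_number(metadata: Optional[Dict[str, Any]]) -> Optional[int]:
    """Return the page number from a chunk's metadata, or None if it is absent.

    Hypothetical helper: the "page" key is an assumption, not a confirmed
    field name in any particular vector store schema.
    """
    if not metadata:
        # No metadata at all (e.g. a legacy record): there is no page to report.
        return None
    page = metadata.get("page")  # .get avoids a KeyError when the key is missing
    if page is None:
        return None
    try:
        # Some stores may hand back the value as a string or float.
        return int(page)
    except (TypeError, ValueError):
        # The value is present but not usable as a page number.
        return None


# Illustrative inputs only; real metadata will contain other fields.
pdf_chunk = {"source": "report.pdf", "page": 3}
text_chunk = {"source": "notes.txt"}

print(get_page_number(pdf_chunk))
print(get_page_number(text_chunk))
```

With a lookup like this, older knowledge bases and pure-text documents would simply report no page rather than raising during retrieval, which may make a migration unnecessary for read paths.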
Checklist:
Important
Please review the checklist below before submitting your pull request.
dev/reformat (backend) and cd web && npx lint-staged (frontend) to appease the lint gods
Description
None if page is not in the knowledge base
Fixes
Type of Change
Testing Instructions
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration