feat: return page number of pdf documents upon retrieval #7749

jasonkang14 · 2024-08-28T07:05:27Z

Checklist:

Important

Please review the checklist below before submitting your pull request.

Please open an issue before creating a PR or link to an existing issue
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

Description

Return the original page number of an uploaded PDF file when retrieving knowledge.
- The value will be None if no page is stored in the knowledge base.
Closes #7738 (Include Original Document Page Numbers in Retrieval Results)
Closes #7944 (Add page number to the segments)
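
The None fallback described above can be pictured with a minimal sketch. It assumes retrieved segments expose their metadata as a plain dict; the helper name page_of and the sample records are hypothetical and are not taken from the Dify code base.

```python
from typing import Any, Optional


def page_of(metadata: dict[str, Any]) -> Optional[int]:
    """Return the stored page number, or None when the record has no page."""
    # dict.get returns None for a missing key, which mirrors the behaviour
    # described in this PR for documents without page information.
    return metadata.get("page")


# Hypothetical metadata records, for illustration only.
pdf_segment = {"doc_id": "a1", "source": "report.pdf", "page": 3}
txt_segment = {"doc_id": "b2", "source": "notes.txt"}

print(page_of(pdf_segment))
print(page_of(txt_segment))
```

Callers that display a page reference would need to handle the None case, for example by omitting the page label for non-PDF sources.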

Fixes

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update, included: Dify Document
Improvement, including but not limited to code refactoring, performance optimization, and UI/UX improvement
Dependency upgrade

Testing Instructions

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

Tested on local files other than PDFs, and retrieval still works.
The page number is returned when querying a PDF file.

crazywoola · 2024-08-29T05:51:43Z

I tried several PDFs but always get the page number as null.

jasonkang14 · 2024-08-29T05:56:09Z

@crazywoola I will check again.

jasonkang14 · 2024-08-29T06:10:14Z

@crazywoola I think it's because not all vector stores save page numbers. Can I change those in this PR as well?

jasonkang14 · 2024-08-29T07:25:22Z

@crazywoola It's a bug with Weaviate. I have tried other vector stores and they retrieve page numbers. I have also found a bug with Milvus, so I will open a separate PR for each.

crazywoola · 2024-09-04T01:02:53Z

> @crazywoola It's a bug with Weaviate. I have tried other vector stores and they retrieve page numbers. I have also found a bug with Milvus, so I will open a separate PR for each.

@jasonkang14 You can fix the Weaviate issue in this PR and open another one to fix the others. :)

jasonkang14 · 2024-09-05T05:54:41Z

@crazywoola I have found a better way to resolve this. It turned out that the other vector stores were actually returning page even though it was not included in the attributes in vector_factory.py. With page added to those attributes, the "weaviate bug" is now fixed.

crazywoola

Cool, I got the page number from the doc. :)


ifsheldon · 2024-09-10T06:46:39Z

@crazywoola @jasonkang14 This silently breaks my existing Dify deployment, which was built from older source code. It was in use prior to this commit, so the existing Weaviate DB does not have a page field.

When I updated to source code that includes this commit, then built and ran it again, I got a confusing error like the following:

werkzeug.exceptions.InternalServerError: 500 Internal Server Error: Error during query: [{'locations': [{'column': 155, 'line': 1}], 'message': 'Cannot query field "page" on type "Vector_index_50fde6d5_97eb_4201_8048_c9311f9b8183_Node".', 'path': None}];
Error during query: [{'locations': [{'column': 17082, 'line': 1}], 'message': 'Cannot query field "page" on type "Vector_index_50fde6d5_97eb_4201_8048_c9311f9b8183_Node".', 'path': None}]

The error is caught high up in the stack, so it took me a day to locate the exception.

The exception is caught here:

dify/api/controllers/console/datasets/hit_testing.py

Line 81 in 86f7f24

except Exception as e:

The breaking change is here:

dify/api/core/rag/datasource/vdb/vector_factory.py

Line 33 in 86f7f24

attributes = ['doc_id', 'dataset_id', 'document_id', 'doc_hash', 'page']

The exception is raised here:

dify/api/core/rag/datasource/vdb/weaviate/weaviate_vector.py

Line 197 in 86f7f24

result = (

I think this is actually a breaking change. Probably you should advise on how to do a DB migration?

jasonkang14 · 2024-09-10T06:59:34Z

@ifsheldon On my end, a database migration was not necessary. Dify automatically stores the text extracted from PDFs in the vector store. The recent update simply ensures that the page is now included in the retrieval process. Check out the source code below:

dify/api/core/rag/extractor/pdf_extractor.py

Line 69 in af92f19

metadata = {"source": blob.source, "page": page_number}
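
As a rough illustration of the line referenced above, the sketch below attaches a page number to each page's metadata while building documents. Only the metadata shape (a dict with source and page keys) comes from the referenced line; the Document dataclass, the pages_to_documents helper, and the zero-based numbering from enumerate are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    """Hypothetical stand-in for an extracted chunk of text."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def pages_to_documents(source: str, page_texts: list[str]) -> list[Document]:
    """Build one Document per page, recording the page index in metadata."""
    documents = []
    for page_number, text in enumerate(page_texts):
        # Same metadata shape as the referenced line in pdf_extractor.py.
        metadata = {"source": source, "page": page_number}
        documents.append(Document(page_content=text, metadata=metadata))
    return documents


docs = pages_to_documents("report.pdf", ["first page", "second page"])
print([d.metadata["page"] for d in docs])
```

Documents built from sources without pages (for example a txt file) would simply carry no page key in this sketch.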

ifsheldon · 2024-09-10T07:06:18Z

@jasonkang14 The problem, I guess, is that if the content of a knowledge base comes from plain text, the metadata will not contain a page field. However, you changed the default attributes to search by adding page to attributes. A knowledge base backed by Weaviate will then throw an exception when a search requests these attributes (including page). You can reproduce this by creating a knowledge base from a txt file and running a hit test.
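
One way to picture a defensive workaround for the failure described above is to request only the attributes an index actually defines. This is a sketch under stated assumptions, not the fix adopted in Dify: the function queryable_attributes and the example property sets are hypothetical, and a real implementation would have to read the property names from the vector store's schema.

```python
# Attribute list as quoted from vector_factory.py earlier in this thread.
DEFAULT_ATTRIBUTES = ["doc_id", "dataset_id", "document_id", "doc_hash", "page"]


def queryable_attributes(requested: list[str], schema_properties: set[str]) -> list[str]:
    """Keep only the requested attributes that the index schema defines."""
    # Preserve the requested order so query construction stays deterministic.
    return [name for name in requested if name in schema_properties]


# Hypothetical index created before the page attribute was introduced.
old_index_properties = {"text", "doc_id", "dataset_id", "document_id", "doc_hash"}
# Hypothetical index whose schema includes page.
new_index_properties = old_index_properties | {"page"}

print(queryable_attributes(DEFAULT_ATTRIBUTES, old_index_properties))
print(queryable_attributes(DEFAULT_ATTRIBUTES, new_index_properties))
```

With this approach, an attribute missing from the schema is never sent to the store, so the caller would see no page value for that index instead of a query error.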


jasonkang14 added 2 commits August 28, 2024 16:01

add page number of original documents for knowledge retrieval

7b64cb9

fix: lint

7ea3809

dosubot bot added the labels size:XS (This PR changes 0-9 lines, ignoring generated files), 🐍 python, and 💪 enhancement (New feature or request) on Aug 28, 2024

crazywoola self-requested a review August 28, 2024 08:49

crazywoola self-assigned this Aug 28, 2024

crazywoola mentioned this pull request Sep 4, 2024

Add page number to the segments #7944

Closed

5 tasks

feat: add page to vector factory metadata attribute

baec2b8

dosubot bot added the size:S label (This PR changes 10-29 lines, ignoring generated files) and removed the size:XS label (This PR changes 0-9 lines, ignoring generated files) on Sep 5, 2024

crazywoola approved these changes Sep 5, 2024

dosubot bot added the lgtm label (This PR has been approved by a maintainer) on Sep 5, 2024

crazywoola merged commit d489b8b into langgenius:main Sep 5, 2024
6 checks passed

jasonkang14 deleted the feature/add_page_number_to_retrieval branch September 5, 2024 09:15

mehrajagdish pushed a commit to Sbazar-GmbH/dify that referenced this pull request Sep 6, 2024

feat: return page number of pdf documents upon retrieval (langgenius#…

e3ad044

…7749)

crazywoola mentioned this pull request Sep 10, 2024

datasets query error #8211

Closed

5 tasks

fniu mentioned this pull request Sep 17, 2024

PDF page number is absent in knowledge retrieval #8502

Open

5 tasks

cuiks pushed a commit to cuiks/dify that referenced this pull request Sep 26, 2024

feat: return page number of pdf documents upon retrieval (langgenius#…

30e4d17

…7749)

lau-td pushed a commit to heydevs-io/dify that referenced this pull request Oct 23, 2024

feat: return page number of pdf documents upon retrieval (langgenius#…

8e04b42

…7749)

idonotknow pushed a commit to AceDataCloud/Dify that referenced this pull request Nov 16, 2024

feat: return page number of pdf documents upon retrieval (langgenius#…

df5bdb3

…7749)
