
Improved indexing #3594

Merged
merged 10 commits into from
Jan 6, 2025

Conversation

pablonyx
Contributor

@pablonyx pablonyx commented Jan 4, 2025

Description

For indexing:

  • Document: new field chunk_count
  • If chunk_count is absent, assume the document uses the old UUID scheme (including non-tenant setups):
    • If large chunks are enabled, generate the UUIDs from the old UUID logic plus the large-chunk reference IDs
  • If chunk_count is present, we can infer the exact UUIDs to delete

How Has This Been Tested?

In both multi-tenant and single-tenant scenarios:

  • Index various docs
  • Increase size of doc
  • Decrease size of doc
  • Ensure proper doc chunk IDs are deleted in Vespa
  • Do the above specifically when migrating from main to this branch

Backporting (check the box to trigger backport action)

Note: verify that the backport action passes; otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)


vercel bot commented Jan 4, 2025


Name            Status    Updated (UTC)
internal-search ✅ Ready   Jan 5, 2025 11:00pm

document_id_to_current_chunks_indexed: dict[str, int],
db_session: Session,
) -> None:
documents_to_update = (
Contributor

are document_ids and document_id_to_current_chunks_indexed the same docs? what's the difference?


def _check_for_chunk_existence(vespa_chunk_id: uuid.UUID, index_name: str) -> bool:
vespa_url = f"{DOCUMENT_ID_ENDPOINT.format(index_name=index_name)}/{vespa_chunk_id}"
# vesp aurl would be http://localhost:8081/document/v1/default/danswer_chunk_nomic_ai_nomic_embed_text_v1/docid/
Contributor

typo

vespa_url = f"{DOCUMENT_ID_ENDPOINT.format(index_name=index_name)}/{vespa_chunk_id}"
# vesp aurl would be http://localhost:8081/document/v1/default/danswer_chunk_nomic_ai_nomic_embed_text_v1/docid/
try:
with get_vespa_http_client() as http_client:
Contributor

is this retried anywhere above this call? we generally need a retry strategy for anything hitting vespa
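One way to address this reviewer's point is a small retry helper wrapping the existence check. This is a stdlib sketch under the assumption that exponential backoff on any exception is acceptable; the repo appears to use a `@retry(tries=3, delay=1, backoff=2)` decorator elsewhere, so `with_retries` here is a hypothetical stand-in, not existing code.

```python
# Hypothetical stdlib retry helper for calls that hit Vespa.
# Mirrors the tries/delay/backoff parameters used elsewhere in the repo.
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T], tries: int = 3, delay: float = 1.0, backoff: float = 2.0
) -> T:
    """Call fn, retrying on any exception with exponential backoff.

    Re-raises the last exception once tries are exhausted.
    """
    attempt, wait = 0, delay
    while True:
        try:
            return fn()
        except Exception:
            attempt += 1
            if attempt >= tries:
                raise
            time.sleep(wait)
            wait *= backoff
```

Usage would look like `with_retries(lambda: _check_for_chunk_existence(chunk_id, index_name))`, so the transport-level flakiness is absorbed without touching the check itself.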

index_names = [self.index_name]
if self.secondary_index_name:
index_names.append(self.secondary_index_name)
# TODO: incorporate
Contributor

looks like WIP?

doc_infos.append(doc_chunk_info_with_index)

# Now, for each doc, we know exactly where to start and end our deletion
# So let's genrate the chunk IDs for each chunk to delete
Contributor

typo
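The "know exactly where to start and end our deletion" idea in the snippet above can be sketched as follows. The function name and UUID construction are illustrative assumptions, not the PR's code; the point is that when a doc shrinks on re-index, only the tail range of chunk IDs is stale.

```python
# Illustrative sketch: when a doc is re-indexed with fewer chunks, chunks
# [0, new_chunk_count) are overwritten in place, so only the tail range
# [new_chunk_count, old_chunk_count) needs explicit deletion.
# Names and UUID scheme here are assumptions for the example.
import uuid


def stale_chunk_ids(
    doc_id: str, old_chunk_count: int, new_chunk_count: int, index_name: str
) -> list[uuid.UUID]:
    """Chunk UUIDs from the previous index pass that the new pass
    no longer overwrites."""
    return [
        uuid.uuid5(uuid.NAMESPACE_X500, f"{doc_id}_{idx}_{index_name}")
        for idx in range(new_chunk_count, old_chunk_count)
    ]
```

When the doc grows instead of shrinking, the range is empty and nothing is deleted, which matches the "increase size of doc / decrease size of doc" cases in the test plan.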

Contributor

@Weves Weves left a comment

Can't wait to get this one in 🙏

backend/onyx/document_index/vespa/index.py (resolved)
@@ -23,46 +23,42 @@ def _retryable_http_delete(http_client: httpx.Client, url: str) -> None:


@retry(tries=3, delay=1, backoff=2)
Contributor

we should probably remove this retry since the _retryable_http_delete already has retries built in
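The issue the reviewer flags is that stacked retries multiply. A minimal stdlib sketch (this local `retry` decorator only imitates the `tries` parameter of the decorator used in the repo, it is not the real library):

```python
# Sketch of why stacking retries is a problem: with an outer @retry(tries=3)
# around a helper that itself retries 3 times, a persistent failure is
# attempted 3 * 3 = 9 times. Local imitation decorator, not the real library.
from functools import wraps


def retry(tries: int):
    def deco(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(tries):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == tries - 1:
                        raise
        return wrapper
    return deco


calls = {"n": 0}


@retry(tries=3)  # outer retry, the one the review suggests removing
@retry(tries=3)  # inner retry, like _retryable_http_delete's built-in retries
def always_fails() -> None:
    calls["n"] += 1
    raise RuntimeError("vespa unavailable")
```

With backoff delays included, the multiplied attempts also multiply total wall-clock time spent before the failure surfaces, which is why keeping retries at a single layer is the cleaner design.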

backend/onyx/indexing/indexing_pipeline.py (outdated, resolved)
backend/onyx/indexing/indexing_pipeline.py (outdated, resolved)
backend/onyx/document_index/vespa/index.py (resolved)
index_name: str,
http_client: httpx.Client,
executor: concurrent.futures.ThreadPoolExecutor | None = None,
) -> None:
if not _does_doc_chunk_exist(doc_chunk_ids[0], index_name, http_client):
Contributor

nit: can we add a comment as to why we check first rather than just always deleting?

Contributor Author

This was more for my sake during testing and we shouldn't need it unless something goes wrong with the chunk UUID scheme (i.e. we don't reach the logic where we update the database with the chunk count, but do modify the number of chunks)

Contributor

@Weves Weves left a comment

lgtm 🦺

@Weves Weves enabled auto-merge January 5, 2025 23:18
@Weves Weves added this pull request to the merge queue Jan 5, 2025
Merged via the queue into main with commit ddec239 Jan 6, 2025
13 checks passed
@Weves Weves deleted the indexing_improved branch January 6, 2025 01:54
3 participants