
Improved indexing #3594

Merged
merged 10 commits into from
Jan 6, 2025

Conversation

pablonyx
Contributor

@pablonyx pablonyx commented Jan 4, 2025

Description

For indexing:

  • Document: new field chunk_count
  • If chunk_count is absent, assume the document uses the old UUID scheme (including non-tenant setups):
    • If large chunks are enabled, generate the UUIDs from the old UUID logic plus the large-chunk reference IDs
  • If chunk_count is present, we can infer the exact UUIDs to delete

How Has This Been Tested?

In both multi-tenant and single-tenant scenarios:

  • Index various docs
  • Increase size of doc
  • Decrease size of doc
  • Ensure proper doc chunk IDs are deleted in Vespa
  • Do the above specifically when migrating from main to this branch

Backporting (check the box to trigger backport action)

Note: verify that the backport action passes; otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)


vercel bot commented Jan 4, 2025


Name            Status    Updated (UTC)
internal-search ✅ Ready   Jan 5, 2025 11:00pm

document_id_to_current_chunks_indexed: dict[str, int],
db_session: Session,
) -> None:
documents_to_update = (
Contributor

are document_ids and document_id_to_current_chunks_indexed the same docs? what's the difference?


def _check_for_chunk_existence(vespa_chunk_id: uuid.UUID, index_name: str) -> bool:
vespa_url = f"{DOCUMENT_ID_ENDPOINT.format(index_name=index_name)}/{vespa_chunk_id}"
# vesp aurl would be http://localhost:8081/document/v1/default/danswer_chunk_nomic_ai_nomic_embed_text_v1/docid/
Contributor

typo

vespa_url = f"{DOCUMENT_ID_ENDPOINT.format(index_name=index_name)}/{vespa_chunk_id}"
# vesp aurl would be http://localhost:8081/document/v1/default/danswer_chunk_nomic_ai_nomic_embed_text_v1/docid/
try:
with get_vespa_http_client() as http_client:
Contributor

is this retried anywhere above this call? we generally need a retry strategy for anything hitting vespa
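One way to address this reviewer's point is a small retry helper wrapping the existence check. This is a stdlib sketch under the assumption that exponential backoff on any exception is acceptable; the repo appears to use a `@retry(tries=3, delay=1, backoff=2)` decorator elsewhere, so `with_retries` here is a hypothetical stand-in, not existing code.

```python
# Hypothetical stdlib retry helper for calls that hit Vespa.
# Mirrors the tries/delay/backoff parameters used elsewhere in the repo.
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T], tries: int = 3, delay: float = 1.0, backoff: float = 2.0
) -> T:
    """Call fn, retrying on any exception with exponential backoff.

    Re-raises the last exception once tries are exhausted.
    """
    attempt, wait = 0, delay
    while True:
        try:
            return fn()
        except Exception:
            attempt += 1
            if attempt >= tries:
                raise
            time.sleep(wait)
            wait *= backoff
```

Usage would look like `with_retries(lambda: _check_for_chunk_existence(chunk_id, index_name))`, so the transport-level flakiness is absorbed without touching the check itself.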

index_names = [self.index_name]
if self.secondary_index_name:
index_names.append(self.secondary_index_name)
# TODO: incorporate
Contributor

looks like WIP?

doc_infos.append(doc_chunk_info_with_index)

# Now, for each doc, we know exactly where to start and end our deletion
# So let's genrate the chunk IDs for each chunk to delete
Contributor

typo
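The "know exactly where to start and end our deletion" idea in the snippet above can be sketched as follows. The function name and UUID construction are illustrative assumptions, not the PR's code; the point is that when a doc shrinks on re-index, only the tail range of chunk IDs is stale.

```python
# Illustrative sketch: when a doc is re-indexed with fewer chunks, chunks
# [0, new_chunk_count) are overwritten in place, so only the tail range
# [new_chunk_count, old_chunk_count) needs explicit deletion.
# Names and UUID scheme here are assumptions for the example.
import uuid


def stale_chunk_ids(
    doc_id: str, old_chunk_count: int, new_chunk_count: int, index_name: str
) -> list[uuid.UUID]:
    """Chunk UUIDs from the previous index pass that the new pass
    no longer overwrites."""
    return [
        uuid.uuid5(uuid.NAMESPACE_X500, f"{doc_id}_{idx}_{index_name}")
        for idx in range(new_chunk_count, old_chunk_count)
    ]
```

When the doc grows instead of shrinking, the range is empty and nothing is deleted, which matches the "increase size of doc / decrease size of doc" cases in the test plan.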

Contributor

@Weves Weves left a comment

Can't wait to get this one in 🙏

backend/onyx/document_index/vespa/index.py (resolved)
@@ -23,46 +23,42 @@ def _retryable_http_delete(http_client: httpx.Client, url: str) -> None:


@retry(tries=3, delay=1, backoff=2)
Contributor

we should probably remove this retry since the _retryable_http_delete already has retries built in
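The issue the reviewer flags is that stacked retries multiply. A minimal stdlib sketch (this local `retry` decorator only imitates the `tries` parameter of the decorator used in the repo, it is not the real library):

```python
# Sketch of why stacking retries is a problem: with an outer @retry(tries=3)
# around a helper that itself retries 3 times, a persistent failure is
# attempted 3 * 3 = 9 times. Local imitation decorator, not the real library.
from functools import wraps


def retry(tries: int):
    def deco(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(tries):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == tries - 1:
                        raise
        return wrapper
    return deco


calls = {"n": 0}


@retry(tries=3)  # outer retry, the one the review suggests removing
@retry(tries=3)  # inner retry, like _retryable_http_delete's built-in retries
def always_fails() -> None:
    calls["n"] += 1
    raise RuntimeError("vespa unavailable")
```

With backoff delays included, the multiplied attempts also multiply total wall-clock time spent before the failure surfaces, which is why keeping retries at a single layer is the cleaner design.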

backend/onyx/indexing/indexing_pipeline.py (outdated, resolved)
backend/onyx/indexing/indexing_pipeline.py (outdated, resolved)
backend/onyx/document_index/vespa/index.py (resolved)
index_name: str,
http_client: httpx.Client,
executor: concurrent.futures.ThreadPoolExecutor | None = None,
) -> None:
if not _does_doc_chunk_exist(doc_chunk_ids[0], index_name, http_client):
Contributor

nit: can we add a comment as to why we check first rather than just always deleting?

Contributor Author

This was more for my sake during testing and we shouldn't need it unless something goes wrong with the chunk UUID scheme (i.e. we don't reach the logic where we update the database with the chunk count, but do modify the number of chunks)

Contributor

@Weves Weves left a comment

lgtm 🦺

@Weves Weves enabled auto-merge January 5, 2025 23:18
@Weves Weves added this pull request to the merge queue Jan 5, 2025
Merged via the queue into main with commit ddec239 Jan 6, 2025
13 checks passed
@Weves Weves deleted the indexing_improved branch January 6, 2025 01:54
3 participants