Improved indexing #3594
Conversation
    document_id_to_current_chunks_indexed: dict[str, int],
    db_session: Session,
) -> None:
    documents_to_update = (
are document_ids and document_id_to_current_chunks_indexed the same docs? what's the difference?
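For illustration, a hedged sketch of how the two parameters could relate (the helper name and body are hypothetical, not the PR's actual code): the list names *which* documents to update, while the dict supplies *how many* chunks were just indexed for each — and since the keys could in principle be derived from the list, that overlap may be the reviewer's point.

```python
def build_chunk_count_updates(
    document_ids: list[str],
    document_id_to_current_chunks_indexed: dict[str, int],
) -> dict[str, int]:
    # Hypothetical helper: only documents named in document_ids that also
    # have a freshly indexed chunk count receive an update.
    return {
        doc_id: document_id_to_current_chunks_indexed[doc_id]
        for doc_id in document_ids
        if doc_id in document_id_to_current_chunks_indexed
    }
```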
def _check_for_chunk_existence(vespa_chunk_id: uuid.UUID, index_name: str) -> bool:
    vespa_url = f"{DOCUMENT_ID_ENDPOINT.format(index_name=index_name)}/{vespa_chunk_id}"
    # vesp aurl would be http://localhost:8081/document/v1/default/danswer_chunk_nomic_ai_nomic_embed_text_v1/docid/
typo
    vespa_url = f"{DOCUMENT_ID_ENDPOINT.format(index_name=index_name)}/{vespa_chunk_id}"
    # vesp aurl would be http://localhost:8081/document/v1/default/danswer_chunk_nomic_ai_nomic_embed_text_v1/docid/
    try:
        with get_vespa_http_client() as http_client:
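A self-contained sketch of what such an existence check might look like, using stdlib urllib in place of the project's `get_vespa_http_client()` httpx client; the endpoint template mirrors the comment in the diff and is an assumption, as is the function body:

```python
import uuid
import urllib.error
import urllib.request

# Assumed template, mirroring the "docid" URL in the diff's comment.
DOCUMENT_ID_ENDPOINT = "http://localhost:8081/document/v1/default/{index_name}/docid"

def build_chunk_url(vespa_chunk_id: uuid.UUID, index_name: str) -> str:
    return f"{DOCUMENT_ID_ENDPOINT.format(index_name=index_name)}/{vespa_chunk_id}"

def check_for_chunk_existence(vespa_chunk_id: uuid.UUID, index_name: str) -> bool:
    # Vespa's Document V1 API returns 200 when the document exists
    # and 404 when it does not.
    try:
        with urllib.request.urlopen(build_chunk_url(vespa_chunk_id, index_name)) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return False
        raise
    except urllib.error.URLError:
        return False
```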
is this retried anywhere above this call? we generally need a retry strategy for anything hitting vespa
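A minimal sketch of the kind of retry strategy the reviewer is asking for, written with the stdlib as a stand-in for the `retry` package decorator that appears later in this diff (`@retry(tries=3, delay=1, backoff=2)`); names and defaults are illustrative:

```python
import time
from functools import wraps

def retryable(tries: int = 3, delay: float = 1.0, backoff: float = 2.0):
    # Stand-in for the retry package's @retry(tries=3, delay=1, backoff=2):
    # any exception triggers a retry, with exponentially growing waits,
    # until the attempts run out and the last exception propagates.
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, tries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == tries:
                        raise
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator
```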
index_names = [self.index_name]
if self.secondary_index_name:
    index_names.append(self.secondary_index_name)
# TODO: incorporate
looks like WIP?
doc_infos.append(doc_chunk_info_with_index)

# Now, for each doc, we know exactly where to start and end our deletion
# So let's genrate the chunk IDs for each chunk to delete
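A hedged sketch of the deletion-range idea this comment describes: if chunk IDs derive deterministically from the document ID and chunk index (the `uuid5`-over-`"<doc_id>_<index>"` scheme below is an assumption for illustration, not the PR's real one), then a document that shrank from its old chunk count to a new one has exactly the trailing indices to delete:

```python
import uuid

def generate_chunk_ids_for_deletion(
    document_id: str, old_chunk_count: int, new_chunk_count: int
) -> list[uuid.UUID]:
    # Assumed scheme: each chunk's UUID is derived from "<document_id>_<index>",
    # so the stale chunks are the indices in [new_chunk_count, old_chunk_count).
    return [
        uuid.uuid5(uuid.NAMESPACE_X500, f"{document_id}_{index}")
        for index in range(new_chunk_count, old_chunk_count)
    ]
```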
typo
Can't wait to get this one in 🙏
@@ -23,46 +23,42 @@ def _retryable_http_delete(http_client: httpx.Client, url: str) -> None:
@retry(tries=3, delay=1, backoff=2) |
we should probably remove this retry, since _retryable_http_delete already has retries built in
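The reviewer's point can be made concrete: stacking a decorator-level retry on top of a helper that already retries multiplies the attempts (3 outer x 3 inner = 9 underlying calls before the failure surfaces). A minimal self-contained demo, using a toy `retry` decorator in place of the real package and sketch functions in place of the PR's:

```python
from functools import wraps

def retry(tries: int):
    # Toy stand-in for the retry package decorator (no delay, for the demo).
    def deco(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(tries):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == tries - 1:
                        raise
        return wrapper
    return deco

calls = {"n": 0}

@retry(tries=3)  # inner retries, like _retryable_http_delete
def _retryable_http_delete_sketch():
    calls["n"] += 1
    raise RuntimeError("vespa unavailable")

@retry(tries=3)  # outer retries, like the decorator flagged above
def delete_vespa_doc_sketch():
    _retryable_http_delete_sketch()
```

Every outer attempt re-runs the inner function's full retry budget, which is why dropping one of the two layers is the cleaner fix.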
    index_name: str,
    http_client: httpx.Client,
    executor: concurrent.futures.ThreadPoolExecutor | None = None,
) -> None:
    if not _does_doc_chunk_exist(doc_chunk_ids[0], index_name, http_client):
nit: can we add a comment as to why we check first rather than just always deleting?
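A hypothetical sketch of the guarded delete being discussed (function and parameter names are illustrative, not the PR's): probing the first chunk ID up front lets the code skip a document whose chunks were written under a different UUID scheme, rather than issuing deletes that silently hit nothing (Vespa deletes are idempotent either way):

```python
from typing import Callable

def delete_doc_chunks(
    doc_chunk_ids: list[str],
    exists_fn: Callable[[str], bool],
    delete_fn: Callable[[str], None],
) -> int:
    # If the first computed chunk ID does not exist, assume the UUID scheme
    # does not match this document's chunks and skip the whole batch.
    if not doc_chunk_ids or not exists_fn(doc_chunk_ids[0]):
        return 0
    for chunk_id in doc_chunk_ids:
        delete_fn(chunk_id)
    return len(doc_chunk_ids)
```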
This was more for my sake during testing, and we shouldn't need it unless something goes wrong with the chunk UUID scheme (i.e. we don't reach the logic where we update the database with the chunk count, but do modify the number of chunks)
lgtm 🦺
Description

For indexing:
- chunk_count
- chunk_count, we can just assume that it has the old UUID system (including non-tenancy)
- chunk_count, can infer UUIDs + exact values for deletion

How Has This Been Tested?

In multi-tenant and single-tenant scenarios:
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes; otherwise, resolve the conflicts manually and tag the patches.