Support for multiple indexes in a vector store #16972
base: main
Conversation
@@ -235,36 +240,46 @@ def from_params(
    def class_name(cls) -> str:
        return "ChromaVectorStore"

    def move_nodes(self, from_index_id: str, to_index_id: str):
        all_nodes = self._collection.get(where={"__index_id__": from_index_id})
this would bring every node into memory. Not ideal tbh for large datasets
First of all, thank you for getting back to me on this.
I agree with your comment on the ChromaDB implementation. Not ideal!
Maybe something like this is better:
all_metadatas = self._collection.get(where={"__index_id__": from_index_id}, include=[IncludeEnum.metadatas])
ids = all_metadatas["ids"]
updated_metadatas = [{**m, "__index_id__": to_index_id} for m in all_metadatas["metadatas"]]
self._collection.update(ids, metadatas=updated_metadatas)
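Since the concern above is memory, a batched variant could page through the matches instead of fetching them all at once. The sketch below runs against a fake in-memory collection so it is self-contained; the real chromadb `Collection.get` does accept `limit` and `offset`, but `FakeCollection`, `BATCH`, and this `move_nodes` are illustrative names of mine, not code from the PR:

```python
BATCH = 2  # tiny batch size so the example pages more than once

class FakeCollection:
    """Minimal in-memory stand-in for a chromadb Collection."""

    def __init__(self, records):
        self._records = records  # id -> metadata dict

    def get(self, where, include=None, limit=None, offset=0):
        # Supports a single equality clause, e.g. {"__index_id__": "a"}.
        (key, value), = where.items()
        matching = [(i, m) for i, m in sorted(self._records.items())
                    if m.get(key) == value]
        end = offset + limit if limit is not None else None
        page = matching[offset:end]
        return {"ids": [i for i, _ in page],
                "metadatas": [m for _, m in page]}

    def update(self, ids, metadatas):
        for i, m in zip(ids, metadatas):
            self._records[i] = m

def move_nodes(collection, from_index_id, to_index_id, batch=BATCH):
    """Re-tag nodes one page at a time instead of materialising them all."""
    while True:
        page = collection.get(
            where={"__index_id__": from_index_id},
            include=["metadatas"],
            limit=batch,
            # offset stays 0: rows we just re-tagged no longer match the filter
            offset=0,
        )
        if not page["ids"]:
            break
        updated = [{**m, "__index_id__": to_index_id}
                   for m in page["metadatas"]]
        collection.update(page["ids"], metadatas=updated)
```

The loop terminates because each update shrinks the set of nodes still tagged with `from_index_id`; only one page of ids and metadatas is ever held in memory.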
And maybe there are even better ways of doing it. I haven't looked too deeply into ChromaDB.
Regarding your "potential" disagreement with the implementation: what exactly do you disagree with? Is it the implementation itself, or the concept?
If it's the actual implementation you disagree with, I would be more than happy to have a conversation around it. The provided implementation is just meant as a proof of concept to put forward one idea of how this problem can be fixed. Just as a reminder of what I mean when I say "problem":
patient_index = VectorStore.from_documents(patient_records)
hospital_procedures_index = VectorStore.from_documents(hospital_procedures_docs)
......
......
hospital_procedures_index = load_index_from_storage(index_id="hospital_procedures_idx")
procedures = hospital_procedures_index.as_retriever().retrieve("<query>")
The code above does not do what one intuitively would expect! :)
Let's say "another" developer does something like this:
hospital_procedures_index = load_index_from_storage(index_id="hospital_procedures_idx")
procedures = hospital_procedures_index.as_retriever().retrieve("<query>")
# procedures <----- May contain sensitive patient data!!
I have seen this happen, and it's unlikely that this would get caught during normal testing. (Why would you check if the hospital_procedures_idx contains patient data?)
This would of course be an absolute disaster!
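To make the isolation argument concrete, here is a toy illustration (not LlamaIndex code, and all names are hypothetical) of the behavior the PR argues for: every node in a shared store carries an `__index_id__` tag, and retrieval filters on it first, so loading one index can never surface nodes that were inserted under another:

```python
class SharedVectorStore:
    """Toy shared backend where every node is tagged with its index id."""

    def __init__(self):
        self._nodes = []  # (index_id, text) pairs in one shared backend

    def add(self, index_id, texts):
        self._nodes.extend((index_id, t) for t in texts)

    def retrieve(self, index_id, query):
        # Filter by index id FIRST; similarity scoring (omitted here)
        # would then only ever rank nodes belonging to this index.
        return [t for iid, t in self._nodes if iid == index_id]

store = SharedVectorStore()
store.add("patient_idx", ["Patient John Doe, blood type O-"])
store.add("hospital_procedures_idx", ["Hand-washing protocol v2"])

# Only procedure documents come back, no matter what else shares the store.
procedures = store.retrieve("hospital_procedures_idx", "<query>")
```

With this invariant in place, the leak in the example above becomes impossible by construction rather than something a test would need to catch.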
I don't mean to create a big rant here, but in my opinion this is a HUGE design flaw of the whole storage/"persistence" layer of LlamaIndex that really should be addressed. The best solution would probably be to rewrite the whole subsystem, as there are several other somewhat related issues that could be addressed at the same time. For example:
- Recursive retrieval (which I think is one of the most exciting features of llamaIndex core. Other frameworks don't have this at all!)
- Proper persistence of "object_map" in the Retrievers. (The object_map is lost after persistence)
Implementing the changes I propose would be an easy and quick fix to at least make it robust and usable.
Fixing this subsystem would make LlamaIndex the most Kick-ass framework in this space! (In my opinion :) )
Sorry for the delay on this! Being totally honest, I'm not sure this will get merged (or at least not anytime soon). I really appreciate the effort to get this working. I'm not 100% sure I agree with the implementation just yet, but also, there's nothing wrong with installing/maintaining your fork if you need this functionality. For transparency, we are currently focusing on some multimodal refactors that we really need to get in first.
Description
Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.
Fixes # (issue)
New Package?
Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?
Version Bump?
Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)
Type of Change
Please delete options that are not relevant.
How Has This Been Tested?
Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.
Suggested Checklist:
Ran make format; make lint to appease the lint gods