
Support for multiple indexes in a vector store #16972

Open

vaaale wants to merge 5 commits into main

Conversation

@vaaale commented Nov 15, 2024

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes # (issue)

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

  • I added new unit tests to cover this change
  • I believe this change is already covered by existing unit tests

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Nov 15, 2024
@@ -235,36 +240,46 @@ def from_params(

    def class_name(cls) -> str:
        return "ChromaVectorStore"

    def move_nodes(self, from_index_id: str, to_index_id: str):
        all_nodes = self._collection.get(where={"__index_id__": from_index_id})
Collaborator
this would bring every node into memory. Not ideal tbh for large datasets

@vaaale (Author) commented Dec 13, 2024
First of all, thank you for getting back to me on this.

I agree with your comment on the ChromaDB implementation. Not ideal!
Maybe something like this is better:

    all_metadatas = self._collection.get(where={"__index_id__": from_index_id}, include=[IncludeEnum.metadatas])
    ids = all_metadatas["ids"]
    updated_metadatas = [{**m, "__index_id__": to_index_id} for m in all_metadatas["metadatas"]]
    self._collection.update(ids, metadatas=updated_metadatas)

And maybe there are even better ways of doing it. I haven't looked too deeply into ChromaDB.
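To make the batching idea concrete, here is a rough sketch of a paged variant that never materializes the whole result set in memory at once. `FakeCollection` is a hypothetical in-memory stand-in (not chromadb's API) and the batch size is deliberately tiny for illustration; a real implementation would use the store's own paging mechanism.

```python
BATCH = 2  # tiny for illustration; a real value might be 1000+


class FakeCollection:
    """Hypothetical stand-in for a vector-store collection (not chromadb)."""

    def __init__(self, records):
        # records: id -> metadata dict
        self.records = dict(records)

    def get(self, where, limit=None, offset=0):
        key, value = next(iter(where.items()))
        ids = sorted(i for i, m in self.records.items() if m.get(key) == value)
        page = ids[offset:offset + limit] if limit is not None else ids
        return {"ids": page, "metadatas": [self.records[i] for i in page]}

    def update(self, ids, metadatas):
        for i, m in zip(ids, metadatas):
            self.records[i] = m


def move_nodes(collection, from_index_id, to_index_id):
    while True:
        # Always page from offset 0: once a batch is re-tagged, those
        # records no longer match the filter, so the next page is fresh.
        batch = collection.get(where={"__index_id__": from_index_id}, limit=BATCH)
        if not batch["ids"]:
            break
        retagged = [{**m, "__index_id__": to_index_id} for m in batch["metadatas"]]
        collection.update(batch["ids"], retagged)
```

The key design point is that only metadata pages of bounded size are ever in memory, instead of every node's content.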

Regarding your "potential" disagreement with the implementation. What exactly do you disagree with? Is it the implementation, or the concept?

If it's the actual implementation you disagree with, I would be more than happy to have a conversation about it. The provided implementation is just meant as a proof of concept to put forward an idea of how this problem can be fixed. As a reminder of what I mean by "problem":

    patient_index = VectorStore.from_documents(patient_records)
    hospital_procedures_index = VectorStore.from_documents(hospital_procedures_docs)
    ......
    ......
    hospital_procedures_index = load_index_from_storage(index_id="hospital_procedures_idx")
    procedures = hospital_procedures_index.as_retriever().retrieve("<query>")

The code above does not do what one intuitively would expect! :)
Let's say "another" developer does something like this:

    hospital_procedures_index = load_index_from_storage(index_id="hospital_procedures_idx")
    procedures = hospital_procedures_index.as_retriever().retrieve("<query>")

    # procedures  <----- May contain sensitive patient data!!

I have seen this happen, and it's unlikely that this would get caught during normal testing. (Why would you check if the hospital_procedures_idx contains patient data?)
This would of course be an absolute disaster!
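To make the leak scenario concrete, here is a toy illustration of both the failure and the proposed fix. `ToyStore` is hypothetical and not LlamaIndex or chromadb API; actual query matching is elided. The point is only that without a per-index filter a retriever sees every node in the shared collection, while scoping on an `__index_id__` tag restores isolation.

```python
class ToyStore:
    """Hypothetical shared store; all indexes write into one collection."""

    def __init__(self):
        self.nodes = []  # list of (index_id, text) pairs

    def add(self, index_id, texts):
        self.nodes.extend((index_id, t) for t in texts)

    def retrieve(self, query, index_id=None):
        # Real similarity matching elided: return everything visible
        # to this retriever, optionally scoped to one index id.
        return [t for iid, t in self.nodes if index_id is None or iid == index_id]


store = ToyStore()
store.add("patient_records_idx", ["Patient J. Doe: diagnosis ..."])
store.add("hospital_procedures_idx", ["Procedure: hand hygiene ..."])

# Unscoped retrieval: patient data leaks into the procedures results.
leaky = store.retrieve("<query>")

# Scoped retrieval: only nodes tagged with this retriever's index id.
scoped = store.retrieve("<query>", index_id="hospital_procedures_idx")
```

Here `leaky` contains the patient record while `scoped` does not, which is exactly the isolation the `__index_id__` metadata tag is meant to guarantee.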

I don't mean to create a big rant here, but in my opinion this is a HUGE design flaw in the whole storage/persistence layer of LlamaIndex that really should be addressed. The best solution would probably be to rewrite the whole subsystem, since several other somewhat related issues could be addressed at the same time. For example:

  • Recursive retrieval (which I think is one of the most exciting features of LlamaIndex core. Other frameworks don't have this at all!)
  • Proper persistence of the "object_map" in the Retrievers. (The object_map is currently lost after persistence.)

Implementing the changes I propose would be an easy and quick fix to at least make it robust and usable.

Fixing this subsystem would make LlamaIndex the most kick-ass framework in this space! (In my opinion :) )

@logan-markewich (Collaborator) commented:
Sorry for the delay on this! Being totally honest, I'm not sure this will get merged (or at least not anytime soon).

I really appreciate the effort to get this working. Not 100% sure I agree with the implementation just yet, but also, nothing wrong with installing/maintaining your fork if you need this functionality.

For transparency, we are currently focusing on some multimodal refactors that we really need to get in first.
