Duplicated records in Chroma vectorstore after multiple cell executions #1

Open
labdmitriy opened this issue Feb 7, 2024 · 1 comment

Comments

@labdmitriy

labdmitriy commented Feb 7, 2024

Hi @rlancemartin,

First of all, thanks a lot for this series of lessons!

This is probably a known fact, but it was not obvious to me when I first ran into it. If we run the cell containing this code from your Jupyter notebook for Lessons 1-4 multiple (for example, k) times:

vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=embeddings)
retriever = vectorstore.as_retriever()

Then there will be k copies of each original record, because this method adds the documents again even if the collection already exists.
We can check this with, for example:

vectorstore_data = vectorstore.get()
print(len(vectorstore_data['documents']))
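
The total length only shows that the store has grown. To see which chunks are repeated, and how many times, the texts can be counted directly. This is a minimal sketch: `count_duplicates` is a hypothetical helper, and `sample_documents` stands in for the list under `vectorstore_data['documents']` after the cell has been run twice.

```python
from collections import Counter


def count_duplicates(documents):
    """Return {text: count} for every text that appears more than once."""
    counts = Counter(documents)
    return {text: n for text, n in counts.items() if n > 1}


# Stand-in for vectorstore.get()["documents"] after running the cell twice.
sample_documents = ["chunk A", "chunk B", "chunk A", "chunk B"]
duplicates = count_duplicates(sample_documents)
print(duplicates)
```

In the notebook, the same helper could be called as `count_duplicates(vectorstore_data['documents'])`. An empty result would mean no text is stored more than once.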

As far as I remember, I saw similar behavior with the LangChain wrapper for the Weaviate database.

So, as a quick workaround, we can delete the default collection (which is named "langchain") before adding the documents:

collection_name = 'langchain'
Chroma(collection_name=collection_name).delete_collection()
vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=embeddings)
retriever = vectorstore.as_retriever()
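
Another possible approach, not taken from the notebook, is to give every chunk a stable ID derived from its text and pass those IDs when adding documents. This is a sketch under two assumptions: that `Chroma.from_documents` accepts an `ids` argument, and that the wrapper upserts records whose IDs already exist rather than appending them. `content_ids` is a hypothetical helper name.

```python
import hashlib


def content_ids(texts):
    """Derive a stable ID for each chunk from a SHA-256 hash of its text."""
    return [hashlib.sha256(text.encode("utf-8")).hexdigest() for text in texts]


sample_texts = ["chunk A", "chunk B"]
ids = content_ids(sample_texts)

# Hypothetical usage in the notebook (not executed here); assumes the
# wrapper upserts on matching IDs, so a re-run overwrites instead of appending:
# ids = content_ids([doc.page_content for doc in splits])
# vectorstore = Chroma.from_documents(documents=splits,
#                                     embedding=embeddings,
#                                     ids=ids)
```

Unlike deleting the collection, this keeps existing records in place. Note that two chunks with identical text would get the same ID, so a batch containing repeated chunks would need to be deduplicated first.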

Since there are no warnings or errors about the existing collection, this behavior is easy to miss, so I hope this will be useful to someone.

P.S. I also noticed that in Part 4 here, 4 documents are retrieved and 2 of them are duplicates of the other two.

Thank you.

@rlancemartin
Collaborator

Yes! This is a good call out.

I will add a note in the notebooks on this.
