Duplicated records in Chroma vectorstore after multiple cell executions #1

labdmitriy · 2024-02-07T17:02:47Z

First of all, thanks a lot for this series of lessons!

This may be a known fact, but it was not clear to me when I first ran into it: if we run the cell containing this code from your Jupyter notebook for Lessons 1-4 multiple times (for example, k times):

vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=embeddings)
retriever = vectorstore.as_retriever()

Then the vectorstore will contain k copies of each original record, because this method adds documents even if the collection already exists.
We can check this with the following code, for example:

vectorstore_data = vectorstore.get()
print(len(vectorstore_data['documents']))
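The length check above only shows the total growing. To see which records are repeated, the returned texts can be counted. The following is a minimal sketch, not part of the notebook: `count_duplicates` is a hypothetical helper, and `sample_documents` is a stand-in for `vectorstore_data['documents']` after running the cell twice.

```python
from collections import Counter


def count_duplicates(documents):
    """Return {text: count} for every text that occurs more than once."""
    counts = Counter(documents)
    return {text: n for text, n in counts.items() if n > 1}


# Stand-in for vectorstore.get()['documents'] after two executions of the cell
sample_documents = ["chunk a", "chunk b", "chunk a", "chunk b"]

duplicates = count_duplicates(sample_documents)
print(len(sample_documents), "records,", len(duplicates), "distinct texts duplicated")
```

With the real store, passing `vectorstore_data['documents']` instead of `sample_documents` should return an empty dict after a single execution, provided the splits themselves contain no repeated chunks.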

As far as I remember, I saw similar behavior with the LangChain wrapper for the Weaviate database.

So, as a quick workaround, we can delete the default collection (which is named "langchain") before adding documents:

collection_name = 'langchain'
Chroma(collection_name=collection_name).delete_collection()
vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=embeddings)
retriever = vectorstore.as_retriever()
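A possible alternative to deleting the collection is to give every chunk a deterministic id, so that re-running the cell writes to the same ids instead of appending new records. This is only a sketch under assumptions that are not confirmed in this thread: that `Chroma.from_documents` accepts an `ids` argument and that the wrapper upserts records by id. `stable_ids` is a hypothetical helper name.

```python
import hashlib
from collections import defaultdict


def stable_ids(texts):
    """Derive a reproducible id for each text from its content.

    Repeated texts get an occurrence suffix so that ids stay unique
    within one batch while remaining identical across reruns.
    """
    seen = defaultdict(int)
    ids = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        ids.append(f"{digest}-{seen[digest]}")
        seen[digest] += 1
    return ids


# Assumed usage with the notebook's objects (not executed here):
# ids = stable_ids([doc.page_content for doc in splits])
# vectorstore = Chroma.from_documents(documents=splits,
#                                     embedding=embeddings,
#                                     ids=ids)
```

The ids depend only on chunk text and order of repetition, so they stay the same across executions as long as `splits` is unchanged. If the splitting parameters change, old chunks would remain in the collection, and deleting the collection as shown above is still the simpler reset.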

Since there are no warnings or errors about an existing collection, this behavior may not be noticed immediately, so I hope this will be useful to someone.

P.S. I also noticed that in Part 4 here, 4 documents are retrieved, 2 of which are duplicates of the others.

Thank you.

rlancemartin · 2024-02-18T01:09:35Z

Yes! This is a good call out.

I will add a note in the notebooks on this.
