Add embeddings cache #8976
Merged: 5 commits into master, Aug 10, 2023
Conversation

eyurtsev (Collaborator) commented Aug 9, 2023

This PR adds the ability to cache embeddings temporarily or store them persistently.

An included notebook shows how to set up the cache and how to use it with a vectorstore.
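To make the thread easier to follow, here is a rough sketch of the usage pattern such a notebook would demonstrate. It is based on the interface discussed in this PR, but the exact names (`LocalFileStore`, `from_bytes_store`, the `namespace` argument) are assumptions and may differ from the merged version:

```python
# Sketch only: wrap an underlying embeddings model with a byte store so
# each text is embedded once and later requests are served from the cache.
from langchain.embeddings import CacheBackedEmbeddings, OpenAIEmbeddings
from langchain.storage import LocalFileStore
from langchain.vectorstores import FAISS

underlying = OpenAIEmbeddings()
store = LocalFileStore("./embedding_cache/")  # persistent on-disk store

# The namespace keeps caches for different models from colliding (assumed arg).
embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying, store, namespace=underlying.model
)

# The first call computes and stores embeddings; re-running is a cache hit.
db = FAISS.from_texts(["hello world", "goodbye world"], embedder)
```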


@eyurtsev requested a review from baskaryan (Aug 9, 2023 16:41)
@dosubot added labels: Ɑ: embeddings, 🤖:enhancement (Aug 9, 2023)
baskaryan (Collaborator) left a comment:

high level: are we sure we prefer explicitly wrapping the underlying model vs having a separate global cache? that's what we have for chat and LLMs, which I think is slightly nicer UX. but I definitely see how this makes for cleaner code.

```python
    return cast(List[float], json.loads(serialized_value.decode()))


class CacheBackedEmbedder(Embeddings):
```
baskaryan (Collaborator) commented Aug 9, 2023:

nit: would call these CacheBackedEmbeddings and underlying_embeddings. I don't disagree that embedder might be the clearer name, but I think consistency is the higher priority.

(Several more review threads on libs/langchain/langchain/embeddings/cache.py were marked resolved.)
```python
        Returns:
            The embedding for the given text.
        """
        # Query is not cached at the moment.
```
Collaborator: why not?

eyurtsev (Author): see doc-string above ^
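For readers who don't have the diff open: the docstring referenced here explains that queries bypass the cache entirely. A minimal sketch of that pass-through behavior, with class and attribute names assumed rather than copied from the diff:

```python
from typing import List


class CacheBackedEmbedder:  # the real class also subclasses Embeddings
    def __init__(self, underlying_embeddings) -> None:
        self.underlying_embeddings = underlying_embeddings

    def embed_query(self, text: str) -> List[float]:
        """Embed a query. Queries are not cached at the moment:
        the call goes straight to the underlying embeddings model."""
        return self.underlying_embeddings.embed_query(text)
```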

@eyurtsev requested a review from baskaryan (Aug 9, 2023 19:09)

```python
import json
from typing import Sequence

def _value_serializer(value: Sequence[float]) -> bytes:
    """Serialize a sequence of floats as UTF-8 encoded JSON bytes."""
    return json.dumps(value).encode()
```
Collaborator: Is this the kind of thing where you should specify the codec, or not?

eyurtsev (Author): could you paraphrase?

the store that goes into the embedder accepts an encoder for the key, and also a serializer (both encoder and decoder) for the value.
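To illustrate the key-encoder / value-serializer split described here, a self-contained toy store; all names below are hypothetical, and the PR's actual interface lives in libs/langchain/langchain/embeddings/cache.py:

```python
import hashlib
import json
from typing import Dict, List, Optional, Sequence


class EncodedStore:
    """Toy key-value store with a key encoder and a value serializer.

    Hypothetical illustration only: keys are hashed text, values are
    JSON-serialized embedding vectors.
    """

    def __init__(self, namespace: str = "") -> None:
        self._data: Dict[str, bytes] = {}
        self.namespace = namespace

    def _key_encoder(self, text: str) -> str:
        # Hash the text so arbitrary documents become fixed-length keys.
        return self.namespace + hashlib.sha1(text.encode()).hexdigest()

    @staticmethod
    def _value_serializer(value: Sequence[float]) -> bytes:
        return json.dumps(list(value)).encode()

    @staticmethod
    def _value_deserializer(serialized: bytes) -> List[float]:
        return json.loads(serialized.decode())

    def set(self, text: str, embedding: Sequence[float]) -> None:
        self._data[self._key_encoder(text)] = self._value_serializer(embedding)

    def get(self, text: str) -> Optional[List[float]]:
        raw = self._data.get(self._key_encoder(text))
        return None if raw is None else self._value_deserializer(raw)
```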

Collaborator: This is great! I see we only have in-memory and file store storage implementations at the moment, so maybe it's too early to provide more guidance here, but I'd imagine users would want some direction on the recommended approach to avoid redundantly storing docs in a vector store.

eyurtsev (Author): Could you paraphrase a bit? Are you thinking about actual docs in the vectorstore, or caches for embeddings?

For the former, I'm planning on PRing something next week to help with that!

This PR is for storing the hashes and the resulting embeddings in the key-value store.

This data won't consume much space. The larger issue is that it can create a lot of keys, and the key-value store may have trouble listing them (e.g., it's not a good idea to have 1 million files in a single directory). For the file system store, we should rewrite it to keep the keys in a tree structure when this becomes an issue; a sketch of that idea follows below.

I'm guessing that once we reach the ballpark of ~1 million docs that need to be embedded, some of these design decisions will have to be re-evaluated.
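A hedged sketch of the tree-structure idea above: fan keys out into subdirectories by hash prefix so no single directory has to hold millions of files. This is not part of the PR, and the helper name and layout are assumptions:

```python
import hashlib
from pathlib import Path


def sharded_path(root: Path, key: str, depth: int = 2, width: int = 2) -> Path:
    """Map a cache key to root/<p1>/<p2>/<digest> by hash prefix.

    With width=2 hex characters per level, each directory holds at most
    256 subdirectories, so ~1M keys spread to roughly 16 files per leaf.
    """
    digest = hashlib.sha1(key.encode()).hexdigest()
    parts = [digest[i * width : (i + 1) * width] for i in range(depth)]
    return root.joinpath(*parts, digest)


# e.g. sharded_path(Path("cache"), "some-doc-hash")
# -> cache/8a/3f/8a3f...  (directories created lazily on write)
```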

baskaryan (Collaborator) left a comment:

lgtm!

eyurtsev (Author) commented Aug 9, 2023:

@baskaryan will review the current LLM cache setup

eyurtsev (Author) replied:

> high level: are we sure we prefer explicitly wrapping the underlying model vs having a separate global cache? that's what we have for chat and LLMs, which I think is slightly nicer UX. but I definitely see how this makes for cleaner code.

@baskaryan I think we can add the global cache specification in the future (I think it's complementary to these changes), although I'm not sure I like the idea: it may encourage large structural issues in user code just to save a few lines.
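For context on the trade-off being debated, a rough contrast of the two styles. The global-cache lines follow the LLM cache API as it existed around this time; the wrapper lines paraphrase this PR. Both are sketches, not exact code:

```python
# Global-cache style (what chat models and LLMs use today):
import langchain
from langchain.cache import InMemoryCache

langchain.llm_cache = InMemoryCache()  # every LLM call in the process consults this

# Explicit-wrapper style (this PR): caching is visible at the call site.
# cached = CacheBackedEmbeddings.from_bytes_store(underlying_embeddings, store)
# Only calls made through `cached` hit the cache; other code is unaffected.
```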

@eyurtsev merged commit 5e05ba2 into master (Aug 10, 2023)
@eyurtsev deleted the eugene/embeddings_cache branch (Aug 10, 2023 15:15)