Add embeddings cache #8976
Conversation
High level: are we sure we prefer explicitly wrapping the underlying model vs. having a separate global cache? That's what we have for chat models and LLMs, which I think is slightly nicer UX. But I definitely see how this makes for cleaner code.
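For context, a toy sketch of the two patterns under discussion; all names here are illustrative stand-ins, not the PR's actual classes:

```python
from typing import Dict, List


class Embedder:
    """Toy underlying model: embeds a text as [len(text)]."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[float(len(t))] for t in texts]


# Pattern 1 (this PR): explicit wrapping, so the cache is visible at the call site.
class CachingWrapper:
    def __init__(self, underlying: Embedder) -> None:
        self.underlying = underlying
        self.cache: Dict[str, List[float]] = {}

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        missing = [t for t in texts if t not in self.cache]
        for t, vec in zip(missing, self.underlying.embed_documents(missing)):
            self.cache[t] = vec
        return [self.cache[t] for t in texts]


# Pattern 2 (current chat/LLM style): a module-level global that models
# consult implicitly, so call sites never mention the cache.
GLOBAL_EMBEDDINGS_CACHE: Dict[str, List[float]] = {}
```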
    return cast(List[float], json.loads(serialized_value.decode()))

class CacheBackedEmbedder(Embeddings):
Nit: would call these CacheBackedEmbeddings and underlying_embeddings. I don't disagree that embedder might be the clearer name, but I think consistency is the higher priority.
    Returns:
        The embedding for the given text.
    """
    # Query is not cached at the moment.
why not?
see doc-string above ^
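A minimal sketch of the behavior being discussed, with hypothetical names (the real class delegates per its doc-string):

```python
from typing import List


class CacheBackedSketch:
    def __init__(self, underlying, store) -> None:
        self.underlying = underlying
        self.store = store

    def embed_query(self, text: str) -> List[float]:
        # Query is not cached at the moment: queries tend to be unique per
        # request, so caching them would mostly add writes without cache hits.
        return self.underlying.embed_query(text)
```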
def _value_serializer(value: Sequence[float]) -> bytes:
    """Serialize a value."""
    return json.dumps(value).encode()
Is this the kind of thing where you should specify the codec, or not?
Could you paraphrase?
The store that goes into the embedder accepts an encoder for the key, and also a serializer (both an encoder and a decoder) for the value.
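Roughly, the pluggable pieces described above might look like this; `_key_encoder` and `_value_deserializer` are illustrative, while the serializer mirrors the excerpt in this diff:

```python
import hashlib
import json
from typing import List, Sequence


def _key_encoder(text: str) -> str:
    """Hash the raw text so keys are fixed-length and filesystem-safe."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def _value_serializer(value: Sequence[float]) -> bytes:
    # In Python 3, .encode() already defaults to UTF-8, so naming the codec
    # here is only for explicitness.
    return json.dumps(list(value)).encode("utf-8")


def _value_deserializer(serialized: bytes) -> List[float]:
    return json.loads(serialized.decode("utf-8"))
```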
This is great! I see we only have in-memory and file-system store implementations at the moment, so maybe it's too early to provide more guidance here, but I'd imagine users will want some direction on the recommended approach for avoiding redundantly storing docs in a vector store.
Could you paraphrase a bit? Are you thinking about actual docs in the vector store, or caches for embeddings?
For the former, I'm planning on PRing something next week to help with that!
This PR is for storing the hashes and the resulting embeddings in the key-value store.
This data won't consume much space. The larger issue is that it can create a lot of keys, and the key-value store may have trouble listing them (e.g., it's not a good idea to have 1 million files in a single directory). For the file system store, we should re-write it to store the keys in a tree structure when this becomes an issue.
I'm guessing that once we reach the ballpark of ~1 million docs that need to be embedded, some of the design decisions will have to be re-evaluated.
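For what it's worth, one common way to keep a file-based store from putting a million files in one directory is to shard keys by hash prefix. A sketch, not the shipped implementation:

```python
from pathlib import Path


def sharded_path(root: Path, key: str) -> Path:
    """Map a hex key like 'ab12cd...' to root/ab/12/ab12cd..., so each
    directory holds at most 256 subdirectories."""
    return root / key[:2] / key[2:4] / key


def write_value(root: Path, key: str, value: bytes) -> None:
    path = sharded_path(root, key)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(value)
```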
lgtm!
@baskaryan will review the current LLM cache setup.
@baskaryan I think we can add the global cache specification in the future (I think it's complementary to these changes) -- although I am not sure that I like the idea of doing it -- I think it may encourage large structural issues with user code only to save a few lines of code.
This PR adds the ability to temporarily cache or persistently store embeddings.
A notebook has been included showing how to set up the cache and how to use it
with a vectorstore.
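A sketch of the usage the notebook demonstrates; names follow the discussion above (CacheBackedEmbeddings vs. CacheBackedEmbedder) and may not match the merged API exactly:

```python
from langchain.embeddings import CacheBackedEmbeddings, OpenAIEmbeddings
from langchain.storage import LocalFileStore
from langchain.vectorstores import FAISS

underlying_embeddings = OpenAIEmbeddings()
store = LocalFileStore("./embeddings_cache/")  # persists across runs

cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings, store, namespace=underlying_embeddings.model
)

# The first call computes and stores each text's embedding under a hash key;
# re-indexing the same texts later is served from the cache.
db = FAISS.from_texts(["hello world", "goodbye world"], cached_embedder)
```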