This module converts unstructured data to embeddings, which are stored in the VectorStore. Vector similarity can then be used for semantic search.
The `TextEncoder` converts text inputs to embeddings. To work with the LangChain agent and vector store, this module must implement the LangChain `Embeddings` interface.
By default, it uses a modified version of LangChain's `HuggingFaceEmbeddings`. The modified HuggingFace encoder can return normalized embeddings, controlled by the `norm` parameter.
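For context, normalization here means scaling each vector to unit L2 length, so that cosine similarity reduces to a dot product. A minimal illustration of the idea (not the library's actual code):

```python
from typing import List


def l2_normalize(embedding: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm, as norm=True does for each embedding."""
    length = sum(x * x for x in embedding) ** 0.5
    return [x / length for x in embedding] if length > 0 else embedding
```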
To configure the default encoder, change parameter values in `config.py`. Modify the embedding configs to, for example, switch models or disable normalization:
```python
# Text embedding
TEXTENCODER_CONFIG = {
    'model': 'multi-qa-mpnet-base-cos-v1',
    'norm': True
}
```
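Under the hood, the default encoder presumably forwards these values to the underlying model. A rough sketch of the wiring (the exact constructor call is an assumption; stock `HuggingFaceEmbeddings`, unlike the project's modified encoder, does not accept `norm`):

```python
from langchain.embeddings import HuggingFaceEmbeddings

from config import TEXTENCODER_CONFIG

# Hypothetical wiring: stock HuggingFaceEmbeddings only takes the model name;
# the project's modified encoder additionally consumes the 'norm' flag.
base_encoder = HuggingFaceEmbeddings(model_name=TEXTENCODER_CONFIG['model'])
```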
Then the encoder can be used as follows:

```python
from text_encoder import TextEncoder

encoder = TextEncoder()

# Generate embeddings for a list of documents
doc_embeddings = encoder.embed_documents(['test'])

# Generate embedding for a text input
query_embedding = encoder.embed_query('test')
```
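With the default config above (`norm=True`), each returned vector should have unit L2 length. A quick sanity check, assuming NumPy is available:

```python
import numpy as np

# embed_query returns one vector; with norm=True its L2 norm should be ~1.0.
assert abs(np.linalg.norm(query_embedding) - 1.0) < 1e-6

# embed_documents returns one vector per input document.
assert len(doc_embeddings) == 1
```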
To change the embedding method used in operations, modify `__init__.py` to import a different `TextEncoder`. For example, changing `.langchain_huggingface` to `.openai_embedding` in the init file switches to OpenAI embeddings. Remember to update `config.py` as well for the new embedding method.
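A sketch of that one-line switch in `__init__.py` (assuming the submodules are named as above):

```python
# Default: the modified HuggingFace encoder.
# from .langchain_huggingface import TextEncoder

# Switch to OpenAI embeddings (remember to update config.py as well).
from .openai_embedding import TextEncoder
```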
To customize the module, the text encoder should inherit from the LangChain `Embeddings` base class, overriding the `embed_documents` and `embed_query` methods:
```python
from typing import List

from langchain.embeddings.base import Embeddings


def test_encoder(text: str) -> List[float]:
    """Dummy encoder for illustration: returns a fixed 2-dimensional embedding."""
    return [0., 0.]


class TextEncoder(Embeddings):
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed search docs."""
        outputs = []
        for t in texts:
            embed = test_encoder(t)
            outputs.append(embed)
        return outputs

    def embed_query(self, text: str) -> List[float]:
        """Embed query text."""
        return test_encoder(text)
```
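Once defined, the custom encoder drops in anywhere the default one is used; with the dummy `test_encoder` above:

```python
encoder = TextEncoder()

print(encoder.embed_query('hello'))         # [0.0, 0.0]
print(encoder.embed_documents(['a', 'b']))  # [[0.0, 0.0], [0.0, 0.0]]
```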