Add ObjectBox Vector Store Integration (#16314)
commit 2b1dc7d (1 parent: 62849af)
Showing 13 changed files with 1,044 additions and 0 deletions.
294 changes: 294 additions & 0 deletions
docs/docs/examples/vector_stores/ObjectBoxIndexDemo.ipynb
@@ -0,0 +1,294 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# ObjectBox VectorStore Demo\n", | ||
"\n", | ||
"This notebook will demonstrate the use of [ObjectBox](https://objectbox.io/) as an efficient, on-device vector-store with LlamaIndex. We will consider a simple RAG use-case where given a document, the user can ask questions and get relevant answers from a LLM in natural language. The RAG pipeline will be configured along the following verticals:\n", | ||
"\n", | ||
"* A builtin [`SimpleDirectoryReader` reader](https://docs.llamaindex.ai/en/stable/examples/data_connectors/simple_directory_reader/) from LlamaIndex\n", | ||
"* A builtin [`SentenceSplitter` node-parser](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_splitter/) from LlamaIndex\n", | ||
"* Models from [HuggingFace as embedding providers](https://docs.llamaindex.ai/en/stable/examples/embeddings/huggingface/)\n", | ||
"* [ObjectBox](https://objectbox.io/) as NoSQL and vector datastore\n", | ||
"* Google's [Gemini](https://docs.llamaindex.ai/en/stable/examples/llm/gemini/) as a remote LLM service\n", | ||
"\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 1) Installing dependencies\n", | ||
"\n", | ||
"We install integrations for HuggingFace and Gemini to use along with LlamaIndex" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/1.6 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r", | ||
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━\u001b[0m \u001b[32m1.5/1.6 MB\u001b[0m \u001b[31m40.2 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r", | ||
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.6/1.6 MB\u001b[0m \u001b[31m25.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", | ||
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.0/4.0 MB\u001b[0m \u001b[31m44.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", | ||
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.5/1.5 MB\u001b[0m \u001b[31m38.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", | ||
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.1 MB\u001b[0m \u001b[31m37.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", | ||
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m76.4/76.4 kB\u001b[0m \u001b[31m5.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", | ||
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.9/77.9 kB\u001b[0m \u001b[31m4.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", | ||
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m49.3/49.3 kB\u001b[0m \u001b[31m3.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", | ||
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m3.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", | ||
"\u001b[?25h" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"!pip install llama_index_vector_stores_objectbox --quiet\n", | ||
"!pip install llama-index --quiet\n", | ||
"!pip install llama-index-embeddings-huggingface --quiet\n", | ||
"!pip install llama-index-llms-gemini --quiet" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 2) Downloading the documents" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!mkdir -p 'data/paul_graham/'\n", | ||
"!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 3) Setup a LLM for RAG (Gemini)\n", | ||
"\n", | ||
"We use Google Gemini's cloud-based API as a LLM. You can get an API-key from the [console](https://aistudio.google.com/app/apikey)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from llama_index.llms.gemini import Gemini\n", | ||
"import getpass\n", | ||
"\n", | ||
"gemini_key_api = getpass.getpass(\"Gemini API Key: \")\n", | ||
"gemini_llm = Gemini(api_key=gemini_key_api)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 4) Setup an embedding model for RAG (HuggingFace `bge-small-en-v1.5`)\n", | ||
"\n", | ||
"HuggingFace hosts a variety of embedding models, which could be observed from the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n", | ||
"\n", | ||
"hf_embedding = HuggingFaceEmbedding(model_name=\"BAAI/bge-base-en-v1.5\")\n", | ||
"embedding_dim = 384" | ||
] | ||
}, | ||
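{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"As a quick optional check, we can embed a short string and confirm that the model's output dimension matches the `embedding_dim` configured above. This is a minimal sketch using the embedding model's standard `get_text_embedding` method." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Sanity check: the embedding length should equal `embedding_dim` (384 for bge-small-en-v1.5)\n", | ||
"sample_vector = hf_embedding.get_text_embedding(\"hello world\")\n", | ||
"print(f\"Model output dimension: {len(sample_vector)}\")\n", | ||
"assert len(sample_vector) == embedding_dim" | ||
] | ||
}, | ||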
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 5) Prepare documents and nodes\n", | ||
"\n", | ||
"In a RAG pipeline, the first step is to read the given documents. We use the `SimpleDirectoryReader` that selects the best file reader by checking the file extension from the directory.\n", | ||
"\n", | ||
"Next, we produce chunks (text subsequences) from the contents read by the `SimpleDirectoryReader` from the documents. A `SentenceSplitter` is a text-splitter that preserves sentence boundaries while splitting the text into chunks of size `chunk_size`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from llama_index.core import SimpleDirectoryReader\n", | ||
"from llama_index.core.node_parser import SentenceSplitter\n", | ||
"\n", | ||
"reader = SimpleDirectoryReader(\"./data/paul_graham\")\n", | ||
"documents = reader.load_data()\n", | ||
"\n", | ||
"node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)\n", | ||
"nodes = node_parser.get_nodes_from_documents(documents)" | ||
] | ||
}, | ||
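{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Optionally, we can inspect the result of chunking. The following minimal sketch prints the number of nodes and a short preview of the first chunk's text." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Inspect the chunking result: node count and a preview of the first chunk\n", | ||
"print(f\"Number of nodes: {len(nodes)}\")\n", | ||
"print(nodes[0].get_content()[:200])" | ||
] | ||
}, | ||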
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 6) Configure `ObjectBoxVectorStore`\n", | ||
"\n", | ||
"The `ObjectBoxVectorStore` can be initialized with several options:\n", | ||
"\n", | ||
"- `embedding_dim` (required): The dimensions of the embeddings that the vector DB will hold\n", | ||
"- `distance_type`: Choose from `COSINE`, `DOT_PRODUCT`, `DOT_PRODUCT_NON_NORMALIZED` and `EUCLIDEAN`\n", | ||
"- `db_directory`: The path of the directory where the `.mdb` ObjectBox database file should be created\n", | ||
"- `clear_db`: Deletes the existing database file if it exists on `db_directory`\n", | ||
"- `do_log`: Enables logging from the ObjectBox integration" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from llama_index.vector_stores.objectbox import ObjectBoxVectorStore\n", | ||
"from llama_index.core import StorageContext, VectorStoreIndex, Settings\n", | ||
"from objectbox import VectorDistanceType\n", | ||
"\n", | ||
"vector_store = ObjectBoxVectorStore(\n", | ||
" embedding_dim,\n", | ||
" distance_type=VectorDistanceType.COSINE,\n", | ||
" db_directory=\"obx_data\",\n", | ||
" clear_db=False,\n", | ||
" do_log=True,\n", | ||
")\n", | ||
"\n", | ||
"storage_context = StorageContext.from_defaults(vector_store=vector_store)\n", | ||
"\n", | ||
"Settings.llm = gemini_llm\n", | ||
"Settings.embed_model = hf_embedding\n", | ||
"\n", | ||
"index = VectorStoreIndex(nodes=nodes, storage_context=storage_context)" | ||
] | ||
}, | ||
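{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Because ObjectBox persists the vectors on disk (in the `obx_data` directory configured above), a later session can reconnect to the same database instead of re-indexing. Below is a minimal sketch, assuming the same directory, embedding model and dimension, that rebuilds an index handle with LlamaIndex's `VectorStoreIndex.from_vector_store`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Reconnect to the existing ObjectBox database created above (sketch; assumes\n", | ||
"# the same `obx_data` directory and embedding settings are still in place)\n", | ||
"existing_store = ObjectBoxVectorStore(\n", | ||
" embedding_dim,\n", | ||
" distance_type=VectorDistanceType.COSINE,\n", | ||
" db_directory=\"obx_data\",\n", | ||
" clear_db=False,\n", | ||
" do_log=True,\n", | ||
")\n", | ||
"index_from_store = VectorStoreIndex.from_vector_store(existing_store)" | ||
] | ||
}, | ||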
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 7) Chat with the document" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"query_engine = index.as_query_engine()\n", | ||
"response = query_engine.query(\"Who is Paul Graham?\")\n", | ||
"print(response)" | ||
] | ||
}, | ||
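{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"For a multi-turn conversation over the same index, LlamaIndex also offers chat engines. The following is a minimal sketch using the default chat mode; follow-up questions reuse the conversation history." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Multi-turn chat over the same index using LlamaIndex's chat engine\n", | ||
"chat_engine = index.as_chat_engine()\n", | ||
"print(chat_engine.chat(\"Who is Paul Graham?\"))\n", | ||
"print(chat_engine.chat(\"What did he work on after that?\"))" | ||
] | ||
}, | ||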
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Optional: Configuring `ObjectBoxVectorStore` as a retriever\n", | ||
"\n", | ||
"A LlamaIndex [retriever](https://docs.llamaindex.ai/en/stable/module_guides/querying/retriever/) is responsible for fetching similar chunks from a vector DB given a query.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"retriever = index.as_retriever()\n", | ||
"response = retriever.retrieve(\"What did the author do growing up?\")\n", | ||
"\n", | ||
"for node in response:\n", | ||
" print(\"Retrieved chunk text:\\n\", node.node.get_text())\n", | ||
" print(\"Retrieved chunk metadata:\\n\", node.node.get_metadata_str())\n", | ||
" print(\"\\n\\n\\n\")" | ||
] | ||
}, | ||
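{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The retriever's behaviour can be tuned; for example, `similarity_top_k` controls how many chunks are returned. A minimal sketch that retrieves the top three chunks and prints their similarity scores:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Retrieve the three most similar chunks and print their similarity scores\n", | ||
"retriever_top3 = index.as_retriever(similarity_top_k=3)\n", | ||
"for result in retriever_top3.retrieve(\"What did the author do growing up?\"):\n", | ||
" print(result.score, result.node.get_text()[:80])" | ||
] | ||
}, | ||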
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Optional: Removing chunks associated with a single query using `delete_nodes`\n", | ||
"\n", | ||
"We can use the `ObjectBoxVectorStore.delete_nodes` method to remove chunks (nodes) from the vector DB providing a list containing node IDs as an argument." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"response = retriever.retrieve(\"What did the author do growing up?\")\n", | ||
"\n", | ||
"node_ids = []\n", | ||
"for node in response:\n", | ||
" node_ids.append(node.node_id)\n", | ||
"print(f\"Nodes to be removed: {node_ids}\")\n", | ||
"\n", | ||
"print(f\"No. of vectors before deletion: {vector_store.count()}\")\n", | ||
"vector_store.delete_nodes(node_ids)\n", | ||
"print(f\"No. of vectors after deletion: {vector_store.count()}\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Optional: Removing a single document from the vector DB\n", | ||
"\n", | ||
"The `ObjectBoxVectorStore.delete` method can be used to remove chunks (nodes) associated with a single document whose `id_` is provided as an argument.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"document = documents[0]\n", | ||
"print(f\"Document to be deleted {document.id_}\")\n", | ||
"\n", | ||
"print(f\"No. of vectors before deletion: {vector_store.count()}\")\n", | ||
"vector_store.delete(document.id_)\n", | ||
"print(f\"No. of vectors after document deletion: {vector_store.count()}\")" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"colab": { | ||
"provenance": [] | ||
}, | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"name": "python" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 0 | ||
} |