{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ObjectBox VectorStore Demo\n",
"\n",
"This notebook will demonstrate the use of [ObjectBox](https://objectbox.io/) as an efficient, on-device vector-store with LlamaIndex. We will consider a simple RAG use-case where given a document, the user can ask questions and get relevant answers from a LLM in natural language. The RAG pipeline will be configured along the following verticals:\n",
"\n",
"* A builtin [`SimpleDirectoryReader` reader](https://docs.llamaindex.ai/en/stable/examples/data_connectors/simple_directory_reader/) from LlamaIndex\n",
"* A builtin [`SentenceSplitter` node-parser](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_splitter/) from LlamaIndex\n",
"* Models from [HuggingFace as embedding providers](https://docs.llamaindex.ai/en/stable/examples/embeddings/huggingface/)\n",
"* [ObjectBox](https://objectbox.io/) as NoSQL and vector datastore\n",
"* Google's [Gemini](https://docs.llamaindex.ai/en/stable/examples/llm/gemini/) as a remote LLM service\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1) Installing dependencies\n",
"\n",
"We install integrations for HuggingFace and Gemini to use along with LlamaIndex"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/1.6 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r",
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━\u001b[0m \u001b[32m1.5/1.6 MB\u001b[0m \u001b[31m40.2 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.6/1.6 MB\u001b[0m \u001b[31m25.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.0/4.0 MB\u001b[0m \u001b[31m44.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.5/1.5 MB\u001b[0m \u001b[31m38.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.1 MB\u001b[0m \u001b[31m37.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m76.4/76.4 kB\u001b[0m \u001b[31m5.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.9/77.9 kB\u001b[0m \u001b[31m4.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m49.3/49.3 kB\u001b[0m \u001b[31m3.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m3.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25h"
]
}
],
"source": [
"!pip install llama_index_vector_stores_objectbox --quiet\n",
"!pip install llama-index --quiet\n",
"!pip install llama-index-embeddings-huggingface --quiet\n",
"!pip install llama-index-llms-gemini --quiet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2) Downloading the documents"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!mkdir -p 'data/paul_graham/'\n",
"!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3) Setup a LLM for RAG (Gemini)\n",
"\n",
"We use Google Gemini's cloud-based API as a LLM. You can get an API-key from the [console](https://aistudio.google.com/app/apikey)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.llms.gemini import Gemini\n",
"import getpass\n",
"\n",
"gemini_key_api = getpass.getpass(\"Gemini API Key: \")\n",
"gemini_llm = Gemini(api_key=gemini_key_api)"
]
},
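{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (assuming the API key entered above is valid), we can ask the LLM for a short completion before wiring it into the RAG pipeline:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick check that the Gemini LLM responds before building the pipeline\n",
"print(gemini_llm.complete(\"Explain retrieval-augmented generation in one sentence.\"))"
]
},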
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4) Setup an embedding model for RAG (HuggingFace `bge-small-en-v1.5`)\n",
"\n",
"HuggingFace hosts a variety of embedding models, which could be observed from the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
"\n",
"hf_embedding = HuggingFaceEmbedding(model_name=\"BAAI/bge-base-en-v1.5\")\n",
"embedding_dim = 384"
]
},
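{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, we can verify that the model produces embeddings whose dimensionality matches the `embedding_dim` we just configured:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Embed a short probe string and confirm the vector length matches embedding_dim\n",
"sample_embedding = hf_embedding.get_text_embedding(\"Hello, world!\")\n",
"print(f\"Embedding dimensions: {len(sample_embedding)}\")\n",
"assert len(sample_embedding) == embedding_dim"
]
},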
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5) Prepare documents and nodes\n",
"\n",
"In a RAG pipeline, the first step is to read the given documents. We use the `SimpleDirectoryReader` that selects the best file reader by checking the file extension from the directory.\n",
"\n",
"Next, we produce chunks (text subsequences) from the contents read by the `SimpleDirectoryReader` from the documents. A `SentenceSplitter` is a text-splitter that preserves sentence boundaries while splitting the text into chunks of size `chunk_size`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.core import SimpleDirectoryReader\n",
"from llama_index.core.node_parser import SentenceSplitter\n",
"\n",
"reader = SimpleDirectoryReader(\"./data/paul_graham\")\n",
"documents = reader.load_data()\n",
"\n",
"node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)\n",
"nodes = node_parser.get_nodes_from_documents(documents)"
]
},
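{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, we can inspect the parsed nodes to confirm that the chunking behaved as expected:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Each node holds a chunk of at most 512 tokens with a 20-token overlap\n",
"print(f\"Number of documents: {len(documents)}\")\n",
"print(f\"Number of nodes: {len(nodes)}\")\n",
"print(\"First chunk preview:\\n\", nodes[0].get_content()[:200])"
]
},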
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6) Configure `ObjectBoxVectorStore`\n",
"\n",
"The `ObjectBoxVectorStore` can be initialized with several options:\n",
"\n",
"- `embedding_dim` (required): The dimensions of the embeddings that the vector DB will hold\n",
"- `distance_type`: Choose from `COSINE`, `DOT_PRODUCT`, `DOT_PRODUCT_NON_NORMALIZED` and `EUCLIDEAN`\n",
"- `db_directory`: The path of the directory where the `.mdb` ObjectBox database file should be created\n",
"- `clear_db`: Deletes the existing database file if it exists on `db_directory`\n",
"- `do_log`: Enables logging from the ObjectBox integration"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.vector_stores.objectbox import ObjectBoxVectorStore\n",
"from llama_index.core import StorageContext, VectorStoreIndex, Settings\n",
"from objectbox import VectorDistanceType\n",
"\n",
"vector_store = ObjectBoxVectorStore(\n",
" embedding_dim,\n",
" distance_type=VectorDistanceType.COSINE,\n",
" db_directory=\"obx_data\",\n",
" clear_db=False,\n",
" do_log=True,\n",
")\n",
"\n",
"storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
"\n",
"Settings.llm = gemini_llm\n",
"Settings.embed_model = hf_embedding\n",
"\n",
"index = VectorStoreIndex(nodes=nodes, storage_context=storage_context)"
]
},
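{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since ObjectBox persists the vectors on disk under `db_directory`, a later session can reopen the same database instead of re-indexing. A minimal sketch, assuming the embedding model is configured via `Settings` as above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Reopen the persisted database (clear_db=False keeps the existing vectors)\n",
"# and rebuild the index directly from the vector store, skipping re-ingestion\n",
"existing_vector_store = ObjectBoxVectorStore(\n",
"    embedding_dim,\n",
"    distance_type=VectorDistanceType.COSINE,\n",
"    db_directory=\"obx_data\",\n",
"    clear_db=False,\n",
"    do_log=True,\n",
")\n",
"index = VectorStoreIndex.from_vector_store(existing_vector_store)"
]
},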
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7) Chat with the document"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query_engine = index.as_query_engine()\n",
"response = query_engine.query(\"Who is Paul Graham?\")\n",
"print(response)"
]
},
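{
"cell_type": "markdown",
"metadata": {},
"source": [
"The query engine can also stream tokens as they are generated, which is useful for interactive applications. A small sketch using LlamaIndex's streaming support:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Stream the answer token by token instead of waiting for the full response\n",
"streaming_query_engine = index.as_query_engine(streaming=True)\n",
"streaming_response = streaming_query_engine.query(\"What did Paul Graham work on?\")\n",
"streaming_response.print_response_stream()"
]
},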
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Optional: Configuring `ObjectBoxVectorStore` as a retriever\n",
"\n",
"A LlamaIndex [retriever](https://docs.llamaindex.ai/en/stable/module_guides/querying/retriever/) is responsible for fetching similar chunks from a vector DB given a query.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"retriever = index.as_retriever()\n",
"response = retriever.retrieve(\"What did the author do growing up?\")\n",
"\n",
"for node in response:\n",
" print(\"Retrieved chunk text:\\n\", node.node.get_text())\n",
" print(\"Retrieved chunk metadata:\\n\", node.node.get_metadata_str())\n",
" print(\"\\n\\n\\n\")"
]
},
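{
"cell_type": "markdown",
"metadata": {},
"source": [
"The number of chunks fetched can be tuned with `similarity_top_k` when creating the retriever; each retrieved node also carries a similarity score. A short sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fetch the three most similar chunks and inspect their similarity scores\n",
"top_k_retriever = index.as_retriever(similarity_top_k=3)\n",
"results = top_k_retriever.retrieve(\"What did the author do growing up?\")\n",
"\n",
"for result in results:\n",
"    print(f\"Score: {result.score} | Node ID: {result.node_id}\")"
]
},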
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Optional: Removing chunks associated with a single query using `delete_nodes`\n",
"\n",
"We can use the `ObjectBoxVectorStore.delete_nodes` method to remove chunks (nodes) from the vector DB providing a list containing node IDs as an argument."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = retriever.retrieve(\"What did the author do growing up?\")\n",
"\n",
"node_ids = []\n",
"for node in response:\n",
" node_ids.append(node.node_id)\n",
"print(f\"Nodes to be removed: {node_ids}\")\n",
"\n",
"print(f\"No. of vectors before deletion: {vector_store.count()}\")\n",
"vector_store.delete_nodes(node_ids)\n",
"print(f\"No. of vectors after deletion: {vector_store.count()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Optional: Removing a single document from the vector DB\n",
"\n",
"The `ObjectBoxVectorStore.delete` method can be used to remove chunks (nodes) associated with a single document whose `id_` is provided as an argument.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"document = documents[0]\n",
"print(f\"Document to be deleted {document.id_}\")\n",
"\n",
"print(f\"No. of vectors before deletion: {vector_store.count()}\")\n",
"vector_store.delete(document.id_)\n",
"print(f\"No. of vectors after document deletion: {vector_store.count()}\")"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}