{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ObjectBox VectorStore Demo\n",
"\n",
"This notebook will demonstrate the use of [ObjectBox](https://objectbox.io/) as an efficient, on-device vector-store with LlamaIndex. We will consider a simple RAG use-case where given a document, the user can ask questions and get relevant answers from a LLM in natural language. The RAG pipeline will be configured along the following verticals:\n",
"\n",
"* A builtin [`SimpleDirectoryReader` reader](https://docs.llamaindex.ai/en/stable/examples/data_connectors/simple_directory_reader/) from LlamaIndex\n",
"* A builtin [`SentenceSplitter` node-parser](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_splitter/) from LlamaIndex\n",
"* Models from [HuggingFace as embedding providers](https://docs.llamaindex.ai/en/stable/examples/embeddings/huggingface/)\n",
"* [ObjectBox](https://objectbox.io/) as NoSQL and vector datastore\n",
"* Google's [Gemini](https://docs.llamaindex.ai/en/stable/examples/llm/gemini/) as a remote LLM service\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1) Installing dependencies\n",
"\n",
"We install integrations for HuggingFace and Gemini to use along with LlamaIndex"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/1.6 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r",
"\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━\u001b[0m \u001b[32m1.5/1.6 MB\u001b[0m \u001b[31m40.2 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.6/1.6 MB\u001b[0m \u001b[31m25.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.0/4.0 MB\u001b[0m \u001b[31m44.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.5/1.5 MB\u001b[0m \u001b[31m38.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.1 MB\u001b[0m \u001b[31m37.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m76.4/76.4 kB\u001b[0m \u001b[31m5.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.9/77.9 kB\u001b[0m \u001b[31m4.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m49.3/49.3 kB\u001b[0m \u001b[31m3.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m3.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25h"
]
}
],
"source": [
"!pip install llama_index_vector_stores_objectbox --quiet\n",
"!pip install llama-index --quiet\n",
"!pip install llama-index-embeddings-huggingface --quiet\n",
"!pip install llama-index-llms-gemini --quiet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2) Downloading the documents"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!mkdir -p 'data/paul_graham/'\n",
"!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3) Setup a LLM for RAG (Gemini)\n",
"\n",
"We use Google Gemini's cloud-based API as a LLM. You can get an API-key from the [console](https://aistudio.google.com/app/apikey)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.llms.gemini import Gemini\n",
"import getpass\n",
"\n",
"gemini_key_api = getpass.getpass(\"Gemini API Key: \")\n",
"gemini_llm = Gemini(api_key=gemini_key_api)"
]
},
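{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (assuming the API key entered above is valid), we can ask the LLM for a short completion before wiring it into the RAG pipeline:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick check that the Gemini LLM responds before building the pipeline\n",
"print(gemini_llm.complete(\"Explain retrieval-augmented generation in one sentence.\"))"
]
},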
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4) Setup an embedding model for RAG (HuggingFace `bge-small-en-v1.5`)\n",
"\n",
"HuggingFace hosts a variety of embedding models, which could be observed from the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
"\n",
"hf_embedding = HuggingFaceEmbedding(model_name=\"BAAI/bge-base-en-v1.5\")\n",
"embedding_dim = 384"
]
},
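{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, we can verify that the model produces embeddings whose dimensionality matches the `embedding_dim` we just configured:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Embed a short probe string and confirm the vector length matches embedding_dim\n",
"sample_embedding = hf_embedding.get_text_embedding(\"Hello, world!\")\n",
"print(f\"Embedding dimensions: {len(sample_embedding)}\")\n",
"assert len(sample_embedding) == embedding_dim"
]
},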
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5) Prepare documents and nodes\n",
"\n",
"In a RAG pipeline, the first step is to read the given documents. We use the `SimpleDirectoryReader` that selects the best file reader by checking the file extension from the directory.\n",
"\n",
"Next, we produce chunks (text subsequences) from the contents read by the `SimpleDirectoryReader` from the documents. A `SentenceSplitter` is a text-splitter that preserves sentence boundaries while splitting the text into chunks of size `chunk_size`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.core import SimpleDirectoryReader\n",
"from llama_index.core.node_parser import SentenceSplitter\n",
"\n",
"reader = SimpleDirectoryReader(\"./data/paul_graham\")\n",
"documents = reader.load_data()\n",
"\n",
"node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)\n",
"nodes = node_parser.get_nodes_from_documents(documents)"
]
},
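{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, we can inspect the parsed nodes to confirm that the chunking behaved as expected:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Each node holds a chunk of at most 512 tokens with a 20-token overlap\n",
"print(f\"Number of documents: {len(documents)}\")\n",
"print(f\"Number of nodes: {len(nodes)}\")\n",
"print(\"First chunk preview:\\n\", nodes[0].get_content()[:200])"
]
},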
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6) Configure `ObjectBoxVectorStore`\n",
"\n",
"The `ObjectBoxVectorStore` can be initialized with several options:\n",
"\n",
"- `embedding_dim` (required): The dimensions of the embeddings that the vector DB will hold\n",
"- `distance_type`: Choose from `COSINE`, `DOT_PRODUCT`, `DOT_PRODUCT_NON_NORMALIZED` and `EUCLIDEAN`\n",
"- `db_directory`: The path of the directory where the `.mdb` ObjectBox database file should be created\n",
"- `clear_db`: Deletes the existing database file if it exists on `db_directory`\n",
"- `do_log`: Enables logging from the ObjectBox integration"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.vector_stores.objectbox import ObjectBoxVectorStore\n",
"from llama_index.core import StorageContext, VectorStoreIndex, Settings\n",
"from objectbox import VectorDistanceType\n",
"\n",
"vector_store = ObjectBoxVectorStore(\n",
" embedding_dim,\n",
" distance_type=VectorDistanceType.COSINE,\n",
" db_directory=\"obx_data\",\n",
" clear_db=False,\n",
" do_log=True,\n",
")\n",
"\n",
"storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
"\n",
"Settings.llm = gemini_llm\n",
"Settings.embed_model = hf_embedding\n",
"\n",
"index = VectorStoreIndex(nodes=nodes, storage_context=storage_context)"
]
},
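{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since ObjectBox persists the vectors on disk under `db_directory`, a later session can reopen the same database instead of re-indexing. A minimal sketch, assuming the embedding model is configured via `Settings` as above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Reopen the persisted database (clear_db=False keeps the existing vectors)\n",
"# and rebuild the index directly from the vector store, skipping re-ingestion\n",
"existing_vector_store = ObjectBoxVectorStore(\n",
"    embedding_dim,\n",
"    distance_type=VectorDistanceType.COSINE,\n",
"    db_directory=\"obx_data\",\n",
"    clear_db=False,\n",
"    do_log=True,\n",
")\n",
"index = VectorStoreIndex.from_vector_store(existing_vector_store)"
]
},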
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7) Chat with the document"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query_engine = index.as_query_engine()\n",
"response = query_engine.query(\"Who is Paul Graham?\")\n",
"print(response)"
]
},
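{
"cell_type": "markdown",
"metadata": {},
"source": [
"The query engine can also stream tokens as they are generated, which is useful for interactive applications. A small sketch using LlamaIndex's streaming support:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Stream the answer token by token instead of waiting for the full response\n",
"streaming_query_engine = index.as_query_engine(streaming=True)\n",
"streaming_response = streaming_query_engine.query(\"What did Paul Graham work on?\")\n",
"streaming_response.print_response_stream()"
]
},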
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Optional: Configuring `ObjectBoxVectorStore` as a retriever\n",
"\n",
"A LlamaIndex [retriever](https://docs.llamaindex.ai/en/stable/module_guides/querying/retriever/) is responsible for fetching similar chunks from a vector DB given a query.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"retriever = index.as_retriever()\n",
"response = retriever.retrieve(\"What did the author do growing up?\")\n",
"\n",
"for node in response:\n",
" print(\"Retrieved chunk text:\\n\", node.node.get_text())\n",
" print(\"Retrieved chunk metadata:\\n\", node.node.get_metadata_str())\n",
" print(\"\\n\\n\\n\")"
]
},
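{
"cell_type": "markdown",
"metadata": {},
"source": [
"The number of chunks fetched can be tuned with `similarity_top_k` when creating the retriever; each retrieved node also carries a similarity score. A short sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fetch the three most similar chunks and inspect their similarity scores\n",
"top_k_retriever = index.as_retriever(similarity_top_k=3)\n",
"results = top_k_retriever.retrieve(\"What did the author do growing up?\")\n",
"\n",
"for result in results:\n",
"    print(f\"Score: {result.score} | Node ID: {result.node_id}\")"
]
},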
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Optional: Removing chunks associated with a single query using `delete_nodes`\n",
"\n",
"We can use the `ObjectBoxVectorStore.delete_nodes` method to remove chunks (nodes) from the vector DB providing a list containing node IDs as an argument."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = retriever.retrieve(\"What did the author do growing up?\")\n",
"\n",
"node_ids = []\n",
"for node in response:\n",
" node_ids.append(node.node_id)\n",
"print(f\"Nodes to be removed: {node_ids}\")\n",
"\n",
"print(f\"No. of vectors before deletion: {vector_store.count()}\")\n",
"vector_store.delete_nodes(node_ids)\n",
"print(f\"No. of vectors after deletion: {vector_store.count()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Optional: Removing a single document from the vector DB\n",
"\n",
"The `ObjectBoxVectorStore.delete` method can be used to remove chunks (nodes) associated with a single document whose `id_` is provided as an argument.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"document = documents[0]\n",
"print(f\"Document to be deleted {document.id_}\")\n",
"\n",
"print(f\"No. of vectors before deletion: {vector_store.count()}\")\n",
"vector_store.delete(document.id_)\n",
"print(f\"No. of vectors after document deletion: {vector_store.count()}\")"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}