Added tutorial

MinishLab · Nov 22, 2024 · 1ccb7a5
1 parent 55631e3
commit 1ccb7a5
Showing 1 changed file with 226 additions and 0 deletions.
diff --git a/tutorials/semantic_chunking.ipynb b/tutorials/semantic_chunking.ipynb
@@ -0,0 +1,226 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Semantic Chunking with Chonkie and Model2Vec**\n",
+    "\n",
+    "Semantic chunking is the task of identifying the semantic boundaries in a piece of text. In this tutorial, we will use the [Chonkie](https://github.com/bhavnicksm/chonkie) library to perform semantic chunking on the book War and Peace. Chonkie is a lightweight, fast library for semantic chunking with pre-trained models. It supports our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062) out of the box, and we will use one of them in this tutorial.\n",
+    "\n",
+    "After chunking our text, we will be using [Vicinity](https://github.com/MinishLab/vicinity), a lightweight nearest neighbors library, to create an index of our chunks and query them."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Install the necessary libraries\n",
+    "!pip install \"chonkie[semantic]\" datasets model2vec numpy requests tqdm vicinity\n",
+    "\n",
+    "# Import the necessary libraries\n",
+    "import random\n",
+    "import re\n",
+    "import requests\n",
+    "from chonkie import SemanticChunker\n",
+    "from model2vec import StaticModel\n",
+    "from vicinity import Vicinity\n",
+    "\n",
+    "random.seed(0)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Loading and pre-processing**\n",
+    "\n",
+    "First, we will download War and Peace and apply some basic pre-processing."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# URL for War and Peace on Project Gutenberg\n",
+    "url = \"https://www.gutenberg.org/files/2600/2600-0.txt\"\n",
+    "\n",
+    "# Download the book and fail early on an HTTP error\n",
+    "response = requests.get(url, timeout=30)\n",
+    "response.raise_for_status()\n",
+    "book_text = response.text\n",
+    "\n",
+    "def preprocess_text(text: str, min_length: int = 5) -> str:\n",
+    "    \"\"\"Flatten line breaks and drop sentences with fewer than `min_length` words.\"\"\"\n",
+    "    text = text.replace(\"\\n\", \" \")\n",
+    "    text = text.replace(\"\\r\", \" \")\n",
+    "    sentences = re.findall(r'[^.!?]*[.!?]', text)\n",
+    "    # Filter out sentences with fewer than min_length whitespace-separated words\n",
+    "    filtered_sentences = [sentence.strip() for sentence in sentences if len(sentence.split()) >= min_length]\n",
+    "    # Recombine the filtered sentences\n",
+    "    return ' '.join(filtered_sentences)\n",
+    "\n",
+    "# Preprocess the text\n",
+    "book_text = preprocess_text(book_text)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Initialize a SemanticChunker from Chonkie with the potion-base-8M model\n",
+    "chunker = SemanticChunker(\n",
+    "    embedding_model=\"minishlab/potion-base-8M\",\n",
+    "    similarity_threshold=0.3\n",
+    ")\n",
+    "\n",
+    "# Chunk the text\n",
+    "chunks = chunker.chunk(book_text)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "And that's it: we chunked the entirety of War and Peace in ~3 seconds. Not bad! Let's look at some example chunks."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      " “Yes, I am well,” he said in  answer to Pierre’s question, and smiled. To Pierre that smile said plainly: “I am well, but my health is now of  no use to anyone. ”    After a few words to Pierre about the awful roads from the Polish  frontier, about people he had met in Switzerland who knew Pierre, and  about M. Dessalles, whom he had brought from abroad to be his son’s  tutor, Prince Andrew again joined warmly in the conversation about  Speránski which was still going on between the two old men. “If there were treason, or proofs of secret relations with Napoleon,  they would have been made public,” he said with warmth and haste. “I do  not, and never did, like Speránski personally, but I like justice! ”    Pierre now recognized in his friend a need with which he was only too  familiar, to get excited and to have arguments about extraneous matters  in order to stifle thoughts that were too oppressive and too intimate. When Prince Meshchérski had left, Prince Andrew took Pierre’s arm and  asked him into the room that had been assigned him. A bed had been made  up there, and some open portmanteaus and trunks stood about. Prince  Andrew went to one and took out a small casket, from which he drew a  packet wrapped in paper. He did it all silently and very quickly. He  stood up and coughed.\n",
+      " ” These were the questions each man of the  troops on the high ground above the bridge involuntarily asked himself  with a sinking heart—watching the bridge and the hussars in the bright  evening light and the blue tunics advancing from the other side with  their bayonets and guns. The hussars will get it hot!\n",
+      " ” even Denísov cried to his adversary. Pierre, with a gentle smile of pity and remorse, his arms and legs  helplessly spread out, stood with his broad chest directly facing  Dólokhov and looked sorrowfully at him.\n",
+      " ” said the officer of the suite, “that’s  grapeshot.\n",
+      " No, really, have you anything against me?\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Print a few example chunks\n",
+    "for _ in range(5):\n",
+    "    chunk = random.choice(chunks)\n",
+    "    print(chunk.text)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Those look good. Next, let's create a vector search index with Vicinity and Model2Vec.\n",
+    "\n",
+    "**Creating a vector search index**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Initialize an embedding model and encode the chunk texts\n",
+    "model = StaticModel.from_pretrained(\"minishlab/potion-base-8M\")\n",
+    "chunk_texts = [chunk.text for chunk in chunks]\n",
+    "chunk_embeddings = model.encode(chunk_texts)\n",
+    "\n",
+    "# Create a Vicinity instance\n",
+    "vicinity = Vicinity.from_vectors_and_items(vectors=chunk_embeddings, items=chunk_texts)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now that we have our index, let's run a few queries against it.\n",
+    "\n",
+    "**Querying the index**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Query: Napoleon\n",
+      "--------------------------------------------------\n",
+      " He is alive,” said Napoleon. \n",
+      "\n",
+      " Why, that must be  Napoleon’s own. \n",
+      "\n",
+      " Napoleon’s position is most brilliant. \n",
+      "\n",
+      "Query: The battle of Austerlitz\n",
+      "--------------------------------------------------\n",
+      " On the first arrival of the news of the battle of Austerlitz, Moscow had  been bewildered. \n",
+      "\n",
+      " I remember his limited, self-satisfied face on the  field of Austerlitz. \n",
+      "\n",
+      " That  city is taken; the Russian army suffers heavier losses than the opposing  armies had suffered in the former war from Austerlitz to Wagram. \n",
+      "\n",
+      "Query: Paris\n",
+      "--------------------------------------------------\n",
+      " “I have been in Paris. \n",
+      "\n",
+      " A man who doesn’t know Paris  is a savage. You can tell a Parisian two leagues off. Paris is Talma, la  Duchénois, Potier, the Sorbonne, the boulevards,” and noticing that  his conclusion was weaker than what had gone before, he added quickly:  “There is only one Paris in the world. You have been to Paris and have  remained Russian. \n",
+      "\n",
+      " Well, what is Paris saying? \n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "queries = [\"Napoleon\", \"The battle of Austerlitz\", \"Paris\"]\n",
+    "for query in queries:\n",
+    "    print(f\"Query: {query}\\n{'-' * 50}\")\n",
+    "    query_embedding = model.encode(query)\n",
+    "    results = vicinity.query(query_embedding, k=3)[0]\n",
+    "\n",
+    "    for result in results:\n",
+    "        print(result[0], \"\\n\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "These do look like relevant chunks. Nice! That's it for this tutorial. We were able to chunk, index, and query War and Peace in less than 5 seconds using Chonkie, Vicinity, and Model2Vec."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "3.10.12",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
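
For readers who want to see what the `similarity_threshold` and the nearest-neighbour query in the notebook are doing conceptually, here is a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: `embed` is a toy bag-of-words stand-in for a real model such as potion-base-8M, `toy_semantic_chunks` is a simple greedy merge rather than Chonkie's actual algorithm, and `query` is a brute-force search rather than Vicinity's implementation.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (hypothetical stand-in for a real model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_semantic_chunks(sentences: list[str], similarity_threshold: float = 0.3) -> list[str]:
    """Greedily append each sentence to the current chunk while it stays similar enough."""
    chunks: list[list[str]] = []
    for sentence in sentences:
        if chunks and cosine(embed(" ".join(chunks[-1])), embed(sentence)) >= similarity_threshold:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
        else:
            chunks.append([sentence])  # similarity dropped: start a new chunk
    return [" ".join(chunk) for chunk in chunks]


def query(chunks: list[str], text: str, k: int = 1) -> list[str]:
    """Brute-force nearest-neighbour search: rank all chunks by cosine similarity."""
    query_embedding = embed(text)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), query_embedding), reverse=True)
    return ranked[:k]


sentences = [
    "Napoleon rode across the field of battle.",
    "The battle was lost and Napoleon left the field.",
    "Paris is full of theatres and boulevards.",
    "The boulevards of Paris were crowded.",
]
chunks = toy_semantic_chunks(sentences)
top = query(chunks, "Paris")

for chunk in chunks:
    print(chunk)
print("Best match for 'Paris':", top[0])
```

In this sketch a lower threshold merges more sentences into each chunk and a higher one splits more aggressively. The notebook's `similarity_threshold=0.3` plays a comparable role, but the exact boundary rule is Chonkie's own.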