Showing 1 changed file with 226 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,226 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"**Semantic Chunking with Chonkie and Model2Vec**\n", | ||
"\n", | ||
"Semantic chunking is the task of identifying the semantic boundaries in a piece of text. In this tutorial, we will use the [Chonkie](https://github.com/bhavnicksm/chonkie) library to perform semantic chunking on the book War and Peace. Chonkie provides a lightweight and fast solution to semantic chunking using pre-trained models. It supports our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062) out of the box, which we will be using in this tutorial.\n", | ||
"\n", | ||
"After chunking our text, we will be using [Vicinity](https://github.com/MinishLab/vicinity), a lightweight nearest neighbors library, to create an index of our chunks and query them." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Install the necessary libraries (chonkie and requests are imported below)\n", | ||
"!pip install chonkie datasets model2vec numpy requests tqdm vicinity\n", | ||
"\n", | ||
"# Import the necessary libraries\n", | ||
"import random\n", | ||
"import re\n", | ||
"import requests\n", | ||
"from chonkie import SemanticChunker\n", | ||
"from model2vec import StaticModel\n", | ||
"from vicinity import Vicinity\n", | ||
"\n", | ||
"random.seed(0)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"**Loading and pre-processing**\n", | ||
"\n", | ||
"First, we will download War and Peace and apply some basic pre-processing." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# URL for War and Peace on Project Gutenberg\n", | ||
"url = \"https://www.gutenberg.org/files/2600/2600-0.txt\"\n", | ||
"\n", | ||
"# Download the book and fail early on an HTTP error\n", | ||
"response = requests.get(url)\n", | ||
"response.raise_for_status()\n", | ||
"book_text = response.text\n", | ||
"\n", | ||
"def preprocess_text(text: str, min_length: int = 5) -> str:\n", | ||
"    \"\"\"Flatten line breaks and drop sentences with fewer than `min_length` words.\"\"\"\n", | ||
"    text = text.replace(\"\\n\", \" \")\n", | ||
"    text = text.replace(\"\\r\", \" \")\n", | ||
"    # Split into sentences, each ending in '.', '!' or '?'\n", | ||
"    sentences = re.findall(r'[^.!?]*[.!?]', text)\n", | ||
"    # Filter out sentences with fewer words than the specified minimum\n", | ||
"    filtered_sentences = [sentence.strip() for sentence in sentences if len(sentence.split()) >= min_length]\n", | ||
"    # Recombine the filtered sentences\n", | ||
"    return ' '.join(filtered_sentences)\n", | ||
"\n", | ||
"# Preprocess the text\n", | ||
"book_text = preprocess_text(book_text)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 14, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Initialize a SemanticChunker from Chonkie with the potion-base-8M model\n", | ||
"chunker = SemanticChunker(\n", | ||
"    embedding_model=\"minishlab/potion-base-8M\",\n", | ||
"    similarity_threshold=0.3\n", | ||
")\n", | ||
"\n", | ||
"# Chunk the text\n", | ||
"chunks = chunker.chunk(book_text)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"And that's it: we chunked the entirety of War and Peace in about 3 seconds. Not bad! Let's look at some example chunks." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 15, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
" “Yes, I am well,” he said in answer to Pierre’s question, and smiled. To Pierre that smile said plainly: “I am well, but my health is now of no use to anyone. ” After a few words to Pierre about the awful roads from the Polish frontier, about people he had met in Switzerland who knew Pierre, and about M. Dessalles, whom he had brought from abroad to be his son’s tutor, Prince Andrew again joined warmly in the conversation about Speránski which was still going on between the two old men. “If there were treason, or proofs of secret relations with Napoleon, they would have been made public,” he said with warmth and haste. “I do not, and never did, like Speránski personally, but I like justice! ” Pierre now recognized in his friend a need with which he was only too familiar, to get excited and to have arguments about extraneous matters in order to stifle thoughts that were too oppressive and too intimate. When Prince Meshchérski had left, Prince Andrew took Pierre’s arm and asked him into the room that had been assigned him. A bed had been made up there, and some open portmanteaus and trunks stood about. Prince Andrew went to one and took out a small casket, from which he drew a packet wrapped in paper. He did it all silently and very quickly. He stood up and coughed.\n", | ||
" ” These were the questions each man of the troops on the high ground above the bridge involuntarily asked himself with a sinking heart—watching the bridge and the hussars in the bright evening light and the blue tunics advancing from the other side with their bayonets and guns. The hussars will get it hot!\n", | ||
" ” even Denísov cried to his adversary. Pierre, with a gentle smile of pity and remorse, his arms and legs helplessly spread out, stood with his broad chest directly facing Dólokhov and looked sorrowfully at him.\n", | ||
" ” said the officer of the suite, “that’s grapeshot.\n", | ||
" No, really, have you anything against me?\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Print a few example chunks\n", | ||
"for _ in range(5):\n", | ||
"    chunk = random.choice(chunks)\n", | ||
"    print(chunk.text)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Those look good. Next, let's create a vector search index with Vicinity and Model2Vec.\n", | ||
"\n", | ||
"**Creating a vector search index**" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 11, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Initialize an embedding model and encode the chunk texts\n", | ||
"model = StaticModel.from_pretrained(\"minishlab/potion-base-8M\")\n", | ||
"chunk_texts = [chunk.text for chunk in chunks]\n", | ||
"chunk_embeddings = model.encode(chunk_texts)\n", | ||
"\n", | ||
"# Create a Vicinity instance\n", | ||
"vicinity = Vicinity.from_vectors_and_items(vectors=chunk_embeddings, items=chunk_texts)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now that we have our index, let's run a few queries against it.\n", | ||
"\n", | ||
"**Querying the index**" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Query: Napoleon\n", | ||
"--------------------------------------------------\n", | ||
" He is alive,” said Napoleon. \n", | ||
"\n", | ||
" Why, that must be Napoleon’s own. \n", | ||
"\n", | ||
" Napoleon’s position is most brilliant. \n", | ||
"\n", | ||
"Query: The battle of Austerlitz\n", | ||
"--------------------------------------------------\n", | ||
" On the first arrival of the news of the battle of Austerlitz, Moscow had been bewildered. \n", | ||
"\n", | ||
" I remember his limited, self-satisfied face on the field of Austerlitz. \n", | ||
"\n", | ||
" That city is taken; the Russian army suffers heavier losses than the opposing armies had suffered in the former war from Austerlitz to Wagram. \n", | ||
"\n", | ||
"Query: Paris\n", | ||
"--------------------------------------------------\n", | ||
" “I have been in Paris. \n", | ||
"\n", | ||
" A man who doesn’t know Paris is a savage. You can tell a Parisian two leagues off. Paris is Talma, la Duchénois, Potier, the Sorbonne, the boulevards,” and noticing that his conclusion was weaker than what had gone before, he added quickly: “There is only one Paris in the world. You have been to Paris and have remained Russian. \n", | ||
"\n", | ||
" Well, what is Paris saying? \n", | ||
"\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"queries = [\"Napoleon\", \"The battle of Austerlitz\", \"Paris\"]\n", | ||
"for query in queries:\n", | ||
"    print(f\"Query: {query}\\n{'-' * 50}\")\n", | ||
"    query_embedding = model.encode(query)\n", | ||
"    results = vicinity.query(query_embedding, k=3)[0]\n", | ||
"\n", | ||
"    for result in results:\n", | ||
"        print(result[0], \"\\n\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"These indeed look like relevant chunks, nice! That's it for this tutorial. We were able to chunk, index, and query War and Peace in less than 5 seconds using Chonkie, Vicinity, and Model2Vec." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "3.10.12", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.12" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
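
For readers who want a feel for what the notebook's three steps (chunk, index, query) do conceptually, here is a minimal standard-library sketch. It is not how Chonkie, Model2Vec, or Vicinity are implemented: the `embed` function is a toy bag-of-words stand-in for a potion model, and `toy_semantic_chunks` and `query` are hypothetical helper names introduced only for illustration. The sketch assumes a simple rule: consecutive sentences stay in the same chunk while their cosine similarity is at or above `similarity_threshold`, and querying is a brute-force similarity ranking.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (an assumption, not a real model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_semantic_chunks(text: str, similarity_threshold: float = 0.3) -> list[str]:
    """Group consecutive sentences; open a new chunk when similarity drops below the threshold."""
    # Same sentence-splitting regex as the notebook's preprocess_text
    sentences = [s.strip() for s in re.findall(r"[^.!?]*[.!?]", text)]
    chunks: list[list[str]] = []
    for sentence in sentences:
        # Compare the new sentence with the last sentence of the current chunk
        if chunks and cosine(embed(chunks[-1][-1]), embed(sentence)) >= similarity_threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return [" ".join(chunk) for chunk in chunks]


def query(chunks: list[str], text: str, k: int = 3) -> list[tuple[str, float]]:
    """Brute-force nearest neighbours: rank every chunk by similarity to the query."""
    query_vec = embed(text)
    scored = [(chunk, cosine(embed(chunk), query_vec)) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


sample = (
    "Napoleon rode toward the bridge. Napoleon watched the bridge in silence. "
    "Paris was full of rumours. The rumours in Paris grew louder."
)
toy_chunks = toy_semantic_chunks(sample)
top_hit = query(toy_chunks, "Paris", k=1)[0][0]
print(toy_chunks)
print(top_hit)
```

With this sample, the two Napoleon sentences share enough words to stay together, while the first Paris sentence shares none with its predecessor and so opens a new chunk. A real embedding model captures similarity in meaning rather than word overlap, which is why the notebook relies on potion-base-8M instead.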