diff --git a/README.md b/README.md index f46ab38..798c43f 100644 --- a/README.md +++ b/README.md @@ -381,13 +381,13 @@ result = embeddings.search("Risotto", 1) Model2Vec is the default model for semantic chunking in [Chonkie](https://github.com/bhavnicksm/chonkie). To use Model2Vec for semantic chunking in Chonkie, simply install Chonkie with `pip install chonkie[semantic]` and use one of the `potion` models in the `SemanticChunker` class. The following code snippet shows how to use Model2Vec in Chonkie: ```python -from chonkie import SemanticChunker +from chonkie import SDPMChunker # Create some example text to chunk text = "It's dangerous to go alone! Take this." # Initialize the SemanticChunker with a potion model -chunker = SemanticChunker( +chunker = SDPMChunker( embedding_model="minishlab/potion-base-8M", similarity_threshold=0.3 ) diff --git a/tutorials/semantic_chunking.ipynb b/tutorials/semantic_chunking.ipynb index 551d5a6..7da6045 100644 --- a/tutorials/semantic_chunking.ipynb +++ b/tutorials/semantic_chunking.ipynb @@ -13,9 +13,73 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 26, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. 
Disabling parallelism to avoid deadlocks...\n", + "To disable this warning, you can either:\n", + "\t- Avoid using `tokenizers` before the fork if possible\n", + "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: datasets in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (3.1.0)\n", + "Requirement already satisfied: model2vec in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (0.3.3)\n", + "Requirement already satisfied: numpy in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (2.1.3)\n", + "Requirement already satisfied: tqdm in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (4.67.0)\n", + "Requirement already satisfied: vicinity in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (0.2.1)\n", + "Requirement already satisfied: xxhash in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from datasets) (3.5.0)\n", + "Requirement already satisfied: requests>=2.32.2 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from datasets) (2.32.3)\n", + "Requirement already satisfied: fsspec[http]<=2024.9.0,>=2023.1.0 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from datasets) (2024.9.0)\n", + "Requirement already satisfied: filelock in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from datasets) (3.16.1)\n", + "Requirement already satisfied: multiprocess<0.70.17 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from datasets) (0.70.16)\n", + "Requirement already satisfied: huggingface-hub>=0.23.0 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from datasets) (0.26.2)\n", + "Requirement already satisfied: 
dill<0.3.9,>=0.3.0 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from datasets) (0.3.8)\n", + "Requirement already satisfied: pandas in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from datasets) (2.2.3)\n", + "Requirement already satisfied: pyarrow>=15.0.0 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from datasets) (18.0.0)\n", + "Requirement already satisfied: pyyaml>=5.1 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from datasets) (6.0.2)\n", + "Requirement already satisfied: packaging in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from datasets) (24.2)\n", + "Requirement already satisfied: aiohttp in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from datasets) (3.11.7)\n", + "Requirement already satisfied: tokenizers>=0.20 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from model2vec) (0.20.3)\n", + "Requirement already satisfied: jinja2 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from model2vec) (3.1.4)\n", + "Requirement already satisfied: setuptools in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from model2vec) (65.5.0)\n", + "Requirement already satisfied: safetensors in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from model2vec) (0.4.5)\n", + "Requirement already satisfied: rich in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from model2vec) (13.9.4)\n", + "Requirement already satisfied: orjson in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from vicinity) (3.10.11)\n", + "Requirement already satisfied: async-timeout<6.0,>=4.0 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from aiohttp->datasets) (5.0.1)\n", + "Requirement already 
satisfied: frozenlist>=1.1.1 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from aiohttp->datasets) (1.5.0)\n", + "Requirement already satisfied: propcache>=0.2.0 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from aiohttp->datasets) (0.2.0)\n", + "Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from aiohttp->datasets) (2.4.3)\n", + "Requirement already satisfied: attrs>=17.3.0 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from aiohttp->datasets) (24.2.0)\n", + "Requirement already satisfied: multidict<7.0,>=4.5 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from aiohttp->datasets) (6.1.0)\n", + "Requirement already satisfied: yarl<2.0,>=1.17.0 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from aiohttp->datasets) (1.18.0)\n", + "Requirement already satisfied: aiosignal>=1.1.2 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from aiohttp->datasets) (1.3.1)\n", + "Requirement already satisfied: typing-extensions>=3.7.4.3 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from huggingface-hub>=0.23.0->datasets) (4.12.2)\n", + "Requirement already satisfied: idna<4,>=2.5 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from requests>=2.32.2->datasets) (3.10)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from requests>=2.32.2->datasets) (3.4.0)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from requests>=2.32.2->datasets) (2024.8.30)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in 
/Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from requests>=2.32.2->datasets) (2.2.3)\n", + "Requirement already satisfied: MarkupSafe>=2.0 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from jinja2->model2vec) (3.0.2)\n", + "Requirement already satisfied: python-dateutil>=2.8.2 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from pandas->datasets) (2.9.0.post0)\n", + "Requirement already satisfied: pytz>=2020.1 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from pandas->datasets) (2024.2)\n", + "Requirement already satisfied: tzdata>=2022.7 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from pandas->datasets) (2024.2)\n", + "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from rich->model2vec) (2.18.0)\n", + "Requirement already satisfied: markdown-it-py>=2.2.0 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from rich->model2vec) (3.0.0)\n", + "Requirement already satisfied: mdurl~=0.1 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from markdown-it-py>=2.2.0->rich->model2vec) (0.1.2)\n", + "Requirement already satisfied: six>=1.5 in /Users/thomasvandongen/.pyenv/versions/3.10.12/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n", + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.3.1\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" + ] + } + ], "source": [ "# Install the necessary libraries\n", "!pip install datasets model2vec numpy 
tqdm vicinity\n", @@ -25,7 +89,7 @@ "import re\n", "import requests\n", "from time import perf_counter\n", - "from chonkie import SemanticChunker\n", + "from chonkie import SDPMChunker\n", "from model2vec import StaticModel\n", "from vicinity import Vicinity\n", "\n", @@ -79,20 +143,21 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Time taken: 1.9901759159984067\n" + "Number of chunks: 7261\n", + "Time taken: 2.201361084007658\n" ] } ], "source": [ "# Initialize a SemanticChunker from Chonkie with the potion-base-8M model\n", - "chunker = SemanticChunker(\n", + "chunker = SDPMChunker(\n", " embedding_model=\"minishlab/potion-base-8M\",\n", " similarity_threshold=0.3\n", ")\n", @@ -100,6 +165,7 @@ "# Chunk the text\n", "time = perf_counter()\n", "chunks = chunker.chunk(book_text)\n", + "print(f\"Number of chunks: {len(chunks)}\")\n", "print(f\"Time taken: {perf_counter() - time}\")" ] }, @@ -112,22 +178,22 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - " “He is sleeping well as it is, after a sleepless night. \n", + " And what role is your young monarch playing in that monstrous crowd? \n", "\n", - " In the yard, at the gates, at the window of the wings, wounded officers and their orderlies were to be seen. \n", + " How can you chuck it in like that or shove it under the cord where it’ll get rubbed? \n", "\n", - " Toward dawn, Count Orlóv-Denísov, who had dozed off, was awakened by a deserter from the French army being brought to him. This was a Polish sergeant of Poniatowski’s corps, who explained in Polish that he had come over because he had been slighted in the service: that he ought long ago to have been made an officer, that he was braver than any of them, and so he had left them and wished to pay them out. 
He said that Murat was spending the night less than a mile from where they were, and that if they would let him have a convoy of a hundred men he would capture him alive. Count Orlóv-Denísov consulted his fellow officers. \n", + " The general’s face clouded, his lips quivered and trembled. He took out a notebook, hurriedly scribbled something in pencil, tore out the leaf, gave it to Kozlóvski, stepped quickly to the window, and threw himself into a chair, gazing at those in the room as if asking, “Why do they look at me? ” Then he lifted his head, stretched his neck as if he intended to say something, but immediately, with affected indifference, began to hum to himself, producing a queer sound which immediately broke off. \n", "\n", - " But before the words were well out of his mouth, his cap flew off and a fierce blow jerked his head to one side. \n", + " “I like your being businesslike about it. ” And patting Berg on the shoulder he got up, wishing to end the conversation. But Berg, smiling pleasantly, explained that if he did not know for certain how much Véra would have and did not receive at least part of the dowry in advance, he would have to break matters off. “Because, consider, Count—if I allowed myself to marry now without having definite means to maintain my wife, I should be acting badly. ” The conversation ended by the count, who wished to be generous and to avoid further importunity, saying that he would give a note of hand for eighty thousand rubles. Berg smiled meekly, kissed the count on the shoulder, and said that he was very grateful, but that it was impossible for him to arrange his new life without receiving thirty thousand in ready money. “Or at least twenty thousand, Count,” he added, “and then a note of hand for only sixty thousand. ” “Yes, yes, all right! ” said the count hurriedly. “Only excuse me, my dear fellow, I’ll give you twenty thousand and a note of hand for eighty thousand as well. 
” CHAPTER XII Natásha was sixteen and it was the year 1809, the very year to which she had counted on her fingers with Borís after they had kissed four years ago. Since then she had not seen him. Before Sónya and her mother, if Borís happened to be mentioned, she spoke quite freely of that episode as of some childish, long-forgotten matter that was not worth mentioning. But in the secret depths of her soul the question whether her engagement to Borís was a jest or an important, binding promise tormented her. \n", "\n", - " Any guard might arrest him, but by strange chance no one does so and all rapturously greet the man they cursed the day before and will curse again a month later. \n", + " Borís came to the Rostóvs’ box, received their congratulations very simply, and raising his eyebrows with an absent-minded smile conveyed to Natásha and Sónya his fiancée’s invitation to her wedding, and went away. Natásha with a gay, coquettish smile talked to him, and congratulated on his approaching wedding that same Borís with whom she had formerly been in love. In the state of intoxication she was in, everything seemed simple and natural. The scantily clad Hélène smiled at everyone in the same way, and Natásha gave Borís a similar smile. \n", "\n" ] } @@ -150,14 +216,14 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Time taken: 1.5010225840087514\n" + "Time taken: 1.6793621249962598\n" ] } ], @@ -177,14 +243,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Done! We embedded all our chunks and created an in index in 1.5 seconds. Now that we have our index, let's query it with some queries.\n", + "Done! We embedded all our chunks and created an index in ~1.5 seconds. 
Now that we have our index, let's query it with some queries.\n", "\n", "**Querying the index**" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 48, "metadata": {}, "outputs": [ { @@ -193,27 +259,27 @@ "text": [ "Query: Napoleon\n", "--------------------------------------------------\n", - " He is alive,” said Napoleon. \n", - "\n", " Why, that must be Napoleon’s own. \n", "\n", - " Napoleon’s position is most brilliant. \n", + " That Napoleon has left Moscow? \n", + "\n", + " Napoleon was to enter the town next day. \n", "\n", "Query: The battle of Austerlitz\n", "--------------------------------------------------\n", - " On the first arrival of the news of the battle of Austerlitz, Moscow had been bewildered. \n", - "\n", " I remember his limited, self-satisfied face on the field of Austerlitz. \n", "\n", " That city is taken; the Russian army suffers heavier losses than the opposing armies had suffered in the former war from Austerlitz to Wagram. \n", "\n", + " Behave as you did at Austerlitz, Friedland, Vítebsk, and Smolénsk. \n", + "\n", "Query: Paris\n", "--------------------------------------------------\n", " “I have been in Paris. \n", "\n", " A man who doesn’t know Paris is a savage. You can tell a Parisian two leagues off. Paris is Talma, la Duchénois, Potier, the Sorbonne, the boulevards,” and noticing that his conclusion was weaker than what had gone before, he added quickly: “There is only one Paris in the world. You have been to Paris and have remained Russian. \n", "\n", - " Well, what is Paris saying? \n", + " It rises again from the same point as before—Paris. \n", "\n" ] }