From d83db10d2b02d3f40700ead4de53de62f9d917b3 Mon Sep 17 00:00:00 2001
From: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Date: Tue, 8 Oct 2024 14:00:29 +0200
Subject: [PATCH] chore: fix Docling example (Colab env integration, typos)

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
---
 .../data_connectors/DoclingReaderDemo.ipynb | 132 ++++++++----------
 1 file changed, 56 insertions(+), 76 deletions(-)

diff --git a/docs/docs/examples/data_connectors/DoclingReaderDemo.ipynb b/docs/docs/examples/data_connectors/DoclingReaderDemo.ipynb
index dbcb237268da4..8a8ce610a53d1 100644
--- a/docs/docs/examples/data_connectors/DoclingReaderDemo.ipynb
+++ b/docs/docs/examples/data_connectors/DoclingReaderDemo.ipynb
@@ -27,7 +27,7 @@
    "source": [
     "[Docling](https://github.com/DS4SD/docling) extracts PDF documents into a rich representation (incl. layout, tables etc.), which it can export to Markdown or JSON.\n",
     "\n",
-    "The `DoclingReader` seamlessly integrates Docling into LlamaIndex, enabling you to:\n",
+    "Docling Reader and Docling Node Parser presented in this notebook seamlessly integrate Docling into LlamaIndex, enabling you to:\n",
     "- use PDF documents in your LLM applications with ease and speed, and\n",
     "- leverage Docling's rich format for advanced, document-native grounding."
    ]
@@ -36,31 +36,32 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Notebook setup"
+    "## Setup"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "> 👉 For best conversion speed, use GPU acceleration whenever available (e.g. if running on Colab, use a GPU-enabled runtime)."
+    "- 👉 For best conversion speed, use GPU acceleration whenever available; e.g. if running on Colab, use GPU-enabled runtime.\n",
+    "- Notebook uses HuggingFace's Inference API; for increased LLM quota, token can be provided via env var `HF_TOKEN`.\n",
+    "- Requirements can be installed as shown below (`--no-warn-conflicts` meant for Colab's pre-populated Python env; feel free to remove for stricter usage):"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Note: you may need to restart the kernel to use updated packages.\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
-    "%pip install -q llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-readers-file python-dotenv"
+    "%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-readers-file python-dotenv"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can now define the main parameters:"
    ]
   },
   {
@@ -74,14 +75,28 @@
     "import os\n",
     "from dotenv import load_dotenv\n",
     "\n",
+    "\n",
+    "def get_env_from_colab_or_os(key):\n",
+    "    try:\n",
+    "        from google.colab import userdata\n",
+    "\n",
+    "        try:\n",
+    "            return userdata.get(key)\n",
+    "        except userdata.SecretNotFoundError:\n",
+    "            pass\n",
+    "    except ImportError:\n",
+    "        pass\n",
+    "    return os.getenv(key)\n",
+    "\n",
+    "\n",
     "load_dotenv()\n",
-    "source = \"https://arxiv.org/pdf/2408.09869\" # Docling Technical Report\n",
-    "query = \"Which are the main AI models in Docling?\"\n",
-    "embed_model = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")\n",
-    "gen_model = HuggingFaceInferenceAPI(\n",
-    "    token=os.getenv(\"HF_TOKEN\"),\n",
+    "EMBED_MODEL = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")\n",
+    "GEN_MODEL = HuggingFaceInferenceAPI(\n",
+    "    token=get_env_from_colab_or_os(\"HF_TOKEN\"),\n",
     "    model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n",
-    ")"
+    ")\n",
+    "SOURCE = \"https://arxiv.org/pdf/2408.09869\" # Docling Technical Report\n",
+    "QUERY = \"Which are the main AI models in Docling?\""
    ]
   },
   {
@@ -96,7 +111,7 @@
    "metadata": {},
    "source": [
     "To create a simple RAG pipeline, we can:\n",
-    "- define a `DoclingPDFReader`, which by default exports to Markdown, and\n",
+    "- define a `DoclingReader`, which by default exports to Markdown, and\n",
     "- use a standard node parser for these Markdown-based docs, e.g. a `MarkdownNodeParser`"
    ]
   },
@@ -105,20 +120,6 @@
    "execution_count": null,
    "metadata": {},
    "outputs": [
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "4b7b5ee0f1b945f49103169144091dfa",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "Fetching 10 files: 0%| | 0/10 [00:00