community[patch]: Microsoft Azure Document Intelligence updates (lang…

…chain-ai#16932) - **Description:** Update Azure Document Intelligence implementation by Microsoft team and RAG cookbook with Azure AI Search --------- Co-authored-by: Lu Zhang (AI) <[email protected]> Co-authored-by: Yateng Hong <[email protected]> Co-authored-by: teethache <[email protected]> Co-authored-by: Lu Zhang <[email protected]> Co-authored-by: Eugene Yurtsev <[email protected]> Co-authored-by: Bagatur <[email protected]> Co-authored-by: Bagatur <[email protected]>
billytrend-cohere · Mar 27, 2024 · f12cb0b · f12cb0b
1 parent cd79305
commit f12cb0b
Show file tree

Hide file tree

Showing 12 changed files with 708 additions and 71 deletions.
diff --git a/cookbook/rag_semantic_chunking_azureaidocintelligence.ipynb b/cookbook/rag_semantic_chunking_azureaidocintelligence.ipynb
diff --git a/docs/docs/integrations/document_loaders/azure_document_intelligence.ipynb b/docs/docs/integrations/document_loaders/azure_document_intelligence.ipynb
@@ -14,32 +14,30 @@
    "metadata": {},
    "source": [
     ">[Azure AI Document Intelligence](https://aka.ms/doc-intelligence) (formerly known as `Azure Form Recognizer`) is machine-learning \n",
-    ">based service that extracts text (including handwriting), tables or key-value-pairs from\n",
-    ">scanned documents or images.\n",
+    ">based service that extracts texts (including handwriting), tables, document structures (e.g., titles, section headings, etc.) and key-value-pairs from\n",
+    ">digital or scanned PDFs, images, Office and HTML files.\n",
     ">\n",
-    ">Document Intelligence supports `PDF`, `JPEG`, `PNG`, `BMP`, or `TIFF`.\n",
+    ">Document Intelligence supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.\n",
     "\n",
-    "This current implementation of a loader using `Document Intelligence` can incorporate content page-wise and turn it into LangChain documents.\n"
+    "This current implementation of a loader using `Document Intelligence` can incorporate content page-wise and turn it into LangChain documents. The default output format is markdown, which can be easily chained with `MarkdownHeaderTextSplitter` for semantic document chunking. You can also use `mode=\"single\"` or `mode=\"page\"` to return pure texts in a single page or document split by page.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Prerequisite\n",
+    "\n",
+    "An Azure AI Document Intelligence resource in one of the 3 preview regions: **East US**, **West US2**, **West Europe** - follow [this document](https://learn.microsoft.com/azure/ai-services/document-intelligence/create-document-intelligence-resource?view=doc-intel-4.0.0) to create one if you don't have. You will be passing `<endpoint>` and `<key>` as parameters to the loader."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "\n",
-      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.3.2\u001b[0m\n",
-      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpython3 -m pip install --upgrade pip\u001b[0m\n",
-      "Note: you may need to restart the kernel to use updated packages.\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
-    "%pip install --upgrade --quiet  langchain langchain-community azure-ai-documentintelligence -q"
+    "%pip install --upgrade --quiet  langchain langchain-community azure-ai-documentintelligence"
    ]
   },
   {
@@ -106,7 +104,7 @@
    "metadata": {},
    "source": [
     "## Example 2\n",
-    "The input file can also be URL path."
+    "The input file can also be a public URL path. E.g., https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/rest-api/layout.png."
    ]
   },
   {
@@ -123,6 +121,101 @@
     "documents = loader.load()"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "documents"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Example 3\n",
+    "You can also specify `mode=\"page\"` to load document by pages."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader\n",
+    "\n",
+    "file_path = \"<filepath>\"\n",
+    "endpoint = \"<endpoint>\"\n",
+    "key = \"<key>\"\n",
+    "loader = AzureAIDocumentIntelligenceLoader(\n",
+    "    api_endpoint=endpoint,\n",
+    "    api_key=key,\n",
+    "    file_path=file_path,\n",
+    "    api_model=\"prebuilt-layout\",\n",
+    "    mode=\"page\",\n",
+    ")\n",
+    "\n",
+    "documents = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The output will be each page stored as a separate document in the list:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for document in documents:\n",
+    "    print(f\"Page Content: {document.page_content}\")\n",
+    "    print(f\"Metadata: {document.metadata}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Example 4\n",
+    "You can also specify `analysis_feature=[\"ocrHighResolution\"]` to enable add-on capabilities. For more information, see: https://aka.ms/azsdk/python/documentintelligence/analysisfeature."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader\n",
+    "\n",
+    "file_path = \"<filepath>\"\n",
+    "endpoint = \"<endpoint>\"\n",
+    "key = \"<key>\"\n",
+    "analysis_features = [\"ocrHighResolution\"]\n",
+    "loader = AzureAIDocumentIntelligenceLoader(\n",
+    "    api_endpoint=endpoint,\n",
+    "    api_key=key,\n",
+    "    file_path=file_path,\n",
+    "    api_model=\"prebuilt-layout\",\n",
+    "    analysis_features=analysis_features,\n",
+    ")\n",
+    "\n",
+    "documents = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The output contains the LangChain document recognized with high resolution add-on capability:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,

diff --git a/docs/docs/integrations/document_loaders/microsoft_excel.ipynb b/docs/docs/integrations/document_loaders/microsoft_excel.ipynb
@@ -43,13 +43,60 @@
     "docs[0]"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "729ab1a2",
+   "metadata": {},
+   "source": [
+    "## Using Azure AI Document Intelligence\n",
+    "\n",
+    ">[Azure AI Document Intelligence](https://aka.ms/doc-intelligence) (formerly known as `Azure Form Recognizer`) is machine-learning \n",
+    ">based service that extracts texts (including handwriting), tables, document structures (e.g., titles, section headings, etc.) and key-value-pairs from\n",
+    ">digital or scanned PDFs, images, Office and HTML files.\n",
+    ">\n",
+    ">Document Intelligence supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.\n",
+    "\n",
+    "This current implementation of a loader using `Document Intelligence` can incorporate content page-wise and turn it into LangChain documents. The default output format is markdown, which can be easily chained with `MarkdownHeaderTextSplitter` for semantic document chunking. You can also use `mode=\"single\"` or `mode=\"page\"` to return pure texts in a single page or document split by page.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fbe5c77d",
+   "metadata": {},
+   "source": [
+    "### Prerequisite\n",
+    "\n",
+    "An Azure AI Document Intelligence resource in one of the 3 preview regions: **East US**, **West US2**, **West Europe** - follow [this document](https://learn.microsoft.com/azure/ai-services/document-intelligence/create-document-intelligence-resource?view=doc-intel-4.0.0) to create one if you don't have. You will be passing `<endpoint>` and `<key>` as parameters to the loader."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "9ab94bde",
+   "id": "fda529f8",
    "metadata": {},
    "outputs": [],
-   "source": []
+   "source": [
+    "%pip install --upgrade --quiet  langchain langchain-community azure-ai-documentintelligence"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "aa008547",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader\n",
+    "\n",
+    "file_path = \"<filepath>\"\n",
+    "endpoint = \"<endpoint>\"\n",
+    "key = \"<key>\"\n",
+    "loader = AzureAIDocumentIntelligenceLoader(\n",
+    "    api_endpoint=endpoint, api_key=key, file_path=file_path, api_model=\"prebuilt-layout\"\n",
+    ")\n",
+    "\n",
+    "documents = loader.load()"
+   ]
   }
  ],
  "metadata": {

diff --git a/docs/docs/integrations/document_loaders/microsoft_powerpoint.ipynb b/docs/docs/integrations/document_loaders/microsoft_powerpoint.ipynb
@@ -76,7 +76,7 @@
    "id": "525d6b67",
    "metadata": {},
    "source": [
-    "## Retain Elements\n",
+    "### Retain Elements\n",
     "\n",
     "Under the hood, `Unstructured` creates different \"elements\" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode=\"elements\"`."
    ]
@@ -124,13 +124,60 @@
     "data[0]"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "b97180c2",
+   "metadata": {},
+   "source": [
+    "## Using Azure AI Document Intelligence\n",
+    "\n",
+    ">[Azure AI Document Intelligence](https://aka.ms/doc-intelligence) (formerly known as `Azure Form Recognizer`) is machine-learning \n",
+    ">based service that extracts texts (including handwriting), tables, document structures (e.g., titles, section headings, etc.) and key-value-pairs from\n",
+    ">digital or scanned PDFs, images, Office and HTML files.\n",
+    ">\n",
+    ">Document Intelligence supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.\n",
+    "\n",
+    "This current implementation of a loader using `Document Intelligence` can incorporate content page-wise and turn it into LangChain documents. The default output format is markdown, which can be easily chained with `MarkdownHeaderTextSplitter` for semantic document chunking. You can also use `mode=\"single\"` or `mode=\"page\"` to return pure texts in a single page or document split by page.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "11851fd0",
+   "metadata": {},
+   "source": [
+    "## Prerequisite\n",
+    "\n",
+    "An Azure AI Document Intelligence resource in one of the 3 preview regions: **East US**, **West US2**, **West Europe** - follow [this document](https://learn.microsoft.com/azure/ai-services/document-intelligence/create-document-intelligence-resource?view=doc-intel-4.0.0) to create one if you don't have. You will be passing `<endpoint>` and `<key>` as parameters to the loader."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
    "id": "381d4139",
    "metadata": {},
    "outputs": [],
-   "source": []
+   "source": [
+    "%pip install --upgrade --quiet  langchain langchain-community azure-ai-documentintelligence"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "077525b8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader\n",
+    "\n",
+    "file_path = \"<filepath>\"\n",
+    "endpoint = \"<endpoint>\"\n",
+    "key = \"<key>\"\n",
+    "loader = AzureAIDocumentIntelligenceLoader(\n",
+    "    api_endpoint=endpoint, api_key=key, file_path=file_path, api_model=\"prebuilt-layout\"\n",
+    ")\n",
+    "\n",
+    "documents = loader.load()"
+   ]
   }
  ],
  "metadata": {

diff --git a/docs/docs/integrations/document_loaders/microsoft_word.ipynb b/docs/docs/integrations/document_loaders/microsoft_word.ipynb
@@ -147,7 +147,7 @@
    "id": "525d6b67",
    "metadata": {},
    "source": [
-    "## Retain Elements\n",
+    "### Retain Elements\n",
     "\n",
     "Under the hood, Unstructured creates different \"elements\" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode=\"elements\"`."
    ]
@@ -192,6 +192,59 @@
    "source": [
     "data[0]"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c1f3b83f",
+   "metadata": {},
+   "source": [
+    "## Using Azure AI Document Intelligence\n",
+    "\n",
+    ">[Azure AI Document Intelligence](https://aka.ms/doc-intelligence) (formerly known as `Azure Form Recognizer`) is machine-learning \n",
+    ">based service that extracts texts (including handwriting), tables, document structures (e.g., titles, section headings, etc.) and key-value-pairs from\n",
+    ">digital or scanned PDFs, images, Office and HTML files.\n",
+    ">\n",
+    ">Document Intelligence supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.\n",
+    "\n",
+    "This current implementation of a loader using `Document Intelligence` can incorporate content page-wise and turn it into LangChain documents. The default output format is markdown, which can be easily chained with `MarkdownHeaderTextSplitter` for semantic document chunking. You can also use `mode=\"single\"` or `mode=\"page\"` to return pure texts in a single page or document split by page.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a5bd47c2",
+   "metadata": {},
+   "source": [
+    "## Prerequisite\n",
+    "\n",
+    "An Azure AI Document Intelligence resource in one of the 3 preview regions: **East US**, **West US2**, **West Europe** - follow [this document](https://learn.microsoft.com/azure/ai-services/document-intelligence/create-document-intelligence-resource?view=doc-intel-4.0.0) to create one if you don't have. You will be passing `<endpoint>` and `<key>` as parameters to the loader."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "71cbdfe0",
+   "metadata": {},
+   "source": [
+    "%pip install --upgrade --quiet  langchain langchain-community azure-ai-documentintelligence"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "691bd9e8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader\n",
+    "\n",
+    "file_path = \"<filepath>\"\n",
+    "endpoint = \"<endpoint>\"\n",
+    "key = \"<key>\"\n",
+    "loader = AzureAIDocumentIntelligenceLoader(\n",
+    "    api_endpoint=endpoint, api_key=key, file_path=file_path, api_model=\"prebuilt-layout\"\n",
+    ")\n",
+    "\n",
+    "documents = loader.load()"
+   ]
   }
  ],
  "metadata": {

diff --git a/docs/docs/integrations/platforms/microsoft.mdx b/docs/docs/integrations/platforms/microsoft.mdx
@@ -84,10 +84,11 @@ from langchain.document_loaders import AzureAIDataLoader
 
 >[Azure AI Document Intelligence](https://aka.ms/doc-intelligence) (formerly known
 > as `Azure Form Recognizer`) is machine-learning
-> based service that extracts text (including handwriting), tables or key-value-pairs
-> from scanned documents or images.
+> based service that extracts texts (including handwriting), tables, document structures, 
+> and key-value-pairs
+> from digital or scanned PDFs, images, Office and HTML files.
 >
->Document Intelligence supports `PDF`, `JPEG`, `PNG`, `BMP`, or `TIFF`.
+> Document Intelligence supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.
 
 First, you need to install a python package.