diff --git a/gen-ai/Bedrock/04-idp-genai-advanced-rag.ipynb b/gen-ai/Bedrock/04-idp-genai-advanced-rag.ipynb new file mode 100644 index 0000000..82f2982 --- /dev/null +++ b/gen-ai/Bedrock/04-idp-genai-advanced-rag.ipynb @@ -0,0 +1,2748 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "377cae8e-8e75-49e8-932a-e7109e15e41d", + "metadata": {}, + "source": [ + "# Document Layout Aware Processing and Retrieval Augmented Generation.\n", + "\n", + "This notebook was tested on a SageMaker Studio Notebook `Data Science 3.0` kernel and `ml.t3.xlarge` instance.\n", + "\n", + "---\n", + "---\n", + "\n", + "## Contents\n", + "\n", + "1. [Objective](#Objective)\n", + "1. [Background](#Background-(Problem-Description-and-Approach))\n", + "1. [Document Extraction](#Document-Extraction)\n", + "1. [Document Processing](#Document-Processing)\n", + "1. [Document Chunking](#Document-Chunking)\n", + "1. [Indexing](#Indexing)\n", + "1. [RAG](#RAG)\n", + "1. [CleanUp](#CleanUp)\n", + "1. [Conclusion](#Conclusion)" + ] + }, + { + "cell_type": "markdown", + "id": "ead3d274-9977-44be-91e2-bf3b4bbbb745", + "metadata": {}, + "source": [ + "---\n", + "\n", + "## Objective\n", + "\n", + "This example notebook guides you through the process of utilizing Amazon Textract's layout feature. This feature allows you to extract content from your document while maintaining its layout and reading format. 
The Amazon Textract Layout feature can detect the following elements:\n", + "- Titles\n", + "- Headers\n", + "- Sub-headers\n", + "- Text\n", + "- Tables\n", + "- Figures\n", + "- Lists\n", + "- Footers\n", + "- Page Numbers\n", + "- Key-Value pairs\n", + "\n", + "Here is a snapshot of the Textract Layout feature applied to a page of the Amazon Sustainability report, as shown in the Textract console:\n", + "\n", + "\n", + "The [Amazon Textract Textractor Library](https://aws-samples.github.io/amazon-textract-textractor/index.html) works with Textract features to simplify document processing. You can start by checking out the [examples in the documentation](https://aws-samples.github.io/amazon-textract-textractor/notebooks/layout_analysis_for_text_linearization.html).\n", + "This notebook uses the Textractor library to call Amazon Textract and interpret its response. It enriches the extracted document text with XML tags that delineate sections, which enables layout-aware chunking and indexing of the document into a vector database (DB). The goal is to improve Retrieval Augmented Generation (RAG) performance.\n", + "\n", + "---\n", + "\n", + "## Background (Problem Description and Approach)\n", + "\n", + "- **Problem statement**: \n", + "RAG is a technique for improving the effectiveness of Large Language Models (LLMs) on lengthy textual content. Although RAG is widely adopted, implementing it requires initial processing to extract and segment text into meaningful chunks, which is especially challenging for complex assets like PDFs. Many document parsing approaches overlook layout semantics or use simplistic methods such as fixed-window splitting, with no awareness of document structure or elements. This can disrupt contextual continuity and diminish the performance of RAG systems. 
An optimal RAG input pipeline would divide PDF text into vectorized segments aligned with layout and content semantics, preserving informational integrity for the LLM. In essence, a context-aware parsing phase is pivotal for enabling RAG techniques to realize their full potential, particularly when handling long or complex documents.\n", + "\n", + "- **Our approach**: \n", + "\n", + "\n", + "\n", + "1. Upload the multi-page document to Amazon S3.\n", + "2. Call the Amazon Textract `StartDocumentAnalysis` API to extract the document text, including layout and tables. The response provides structured text aligned with the original document formatting, along with a pandas table for each table detected in the document.\n", + "3. Enrich this extracted text further with XML tags indicating semantic sections, adding contextual metadata through the Textractor library.\n", + "4. The Textractor library extracts tables in plain text, maintaining their original layout. However, for easier processing and manipulation, it's advisable to convert them to CSV format. This method replaces the plain text tables with their CSV counterparts obtained from Textract's table feature.\n", + "5. In this approach, the extracted text is segmented based on document title sections, the highest hierarchy level in a document. Each subsection within the title section is then chunked according to a maximum word threshold. The following outlines our approach to chunking subsection elements:\n", + "\n", + " - **Tables:** Tables are chunked row by row until the maximum number of alphanumeric words is reached. For each table chunk, the column headers are added to the table along with the table header, typically the sentence or paragraph preceding the table in the document. 
This ensures that the table's context is retained in each chunk.\n", + " \n", + " \n", + " \n", + " To handle tables with merged cells, this solution first unmerges any merged cell ranges, then copies the original merged cell value into each of the corresponding individual cells.\n", + " \n", + " \n", + " \n", + " - **List:** Chunking lists found in documents can be challenging. Naive chunking methods often split list items by sentence or newline characters. However, this approach presents issues because only the first list chunk typically contains the list title, which provides essential information about the list items. Consequently, subsequent list chunks lose their context. In this notebook, lists are chunked based on their individual list items. Additionally, the header of the list is added to each list chunk so that the context of the list is preserved in each chunk.\n", + " \n", + " \n", + " - **Section and subsection:** The structure of a document can generally be categorized into titles, sections, and paragraphs. A paragraph is typically the smallest unit of a document that conveys information independently, particularly within the context of a section or subsection header. In this method, text sections are chunked based on paragraphs, and the section header is added to each paragraph chunk (as well as tables and lists) within that section of the document.\n", + " \n", + " \n", + "6. Metadata is appended to each respective chunk during indexing, encompassing:\n", + " - The complete CSV tables detected within the chunk.\n", + " - The section header ID associated with the chunk.\n", + " - The section title ID linked to the chunk.\n", + " \n", + " When retrieving a passage based on hybrid search (combining semantic and text matching), there's flexibility in the amount of content forwarded to the LLM. 
Some queries may need additional context, so users can choose whether to send the corresponding chunk's subsection or its entire title section, depending on the use case.\n", + "\n", + " *Some chunks may exceed the fixed word count threshold due to paragraph preservation and the handling of complex tables. \n", + "\n", + "**Prerequisite:**\n", + "- [Amazon Bedrock model access](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html)" + ] + }, + { + "cell_type": "markdown", + "id": "2ba85156-5ae0-4e53-8aaf-0ec4ef1d7e2b", + "metadata": {}, + "source": [ + "## Step 1: Setup" + ] + }, + { + "cell_type": "markdown", + "id": "b2e75351-bc80-4416-a8d2-e2be0cadb07b", + "metadata": {}, + "source": [ + "Install the required packages." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa108161-6352-4d2d-b2c9-cad24879b579", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!pip install --force-reinstall amazon-textract-textractor==1.7.11\n", + "!pip install inflect\n", + "!pip install requests-aws4auth\n", + "!pip install opensearch-py\n", + "!pip install anthropic" + ] + }, + { + "cell_type": "markdown", + "id": "ff63129e-f3c9-41ea-bf48-3d2608a2531a", + "metadata": {}, + "source": [ + "Restart the Kernel \\\n", + "Click **Kernel** on the top bar, then **Restart Kernel**. Continue with the cells below." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "5d14636c-4f02-4b2d-874a-cc44de0c85c7", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml\n", + "sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml\n" + ] + } + ], + "source": [ + "import os\n", + "from PIL import Image\n", + "import pandas as pd\n", + "import re\n", + "import json\n", + "import uuid\n", + "from textractor import Textractor\n", + "from textractor.visualizers.entitylist import EntityList\n", + "from textractor.data.constants import TextractFeatures\n", + "import io\n", + "import inflect\n", + "from collections import OrderedDict\n", + "import boto3\n", + "import time\n", + "import a_opensearch_utilities_\n", + "import sagemaker\n", + "import openpyxl\n", + "from openpyxl.cell import Cell\n", + "from openpyxl.worksheet.cell_range import CellRange\n", + "s3=boto3.client(\"s3\")\n", + "from botocore.config import Config\n", + "config = Config(\n", + " read_timeout=600, \n", + " retries = dict(\n", + " max_attempts = 5 \n", + " )\n", + ")\n", + "from anthropic import Anthropic\n", + "client = Anthropic()\n", + "bedrock_runtime = boto3.client(service_name='bedrock-runtime',region_name='us-east-1',config=config)" + ] + }, + { + "cell_type": "markdown", + "id": "e3623dfa-1367-4d1d-9291-e16792aceab5", + "metadata": {}, + "source": [ + "# Create OpenSearch Serverless Collection" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "6b5a2da9-ef97-4ae6-a044-a9e4aba14c75", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import boto3\n", + "import time\n", + "import json\n", + "import os\n", + "vector_store_name = 'idp-workshop'\n", + "index_name = \"idp-workshop-rag\"\n", + "encryption_policy_name = \"idp-workshop-rag\"\n", + "network_policy_name = 
\"idp-workshop-rag\"\n", + "access_policy_name = 'idp-workshop-rag'\n", + "identity = boto3.client('sts').get_caller_identity()['Arn']\n", + "\n", + "aoss_client = boto3.client('opensearchserverless')\n", + "\n", + "security_policy = aoss_client.create_security_policy(\n", + " name = encryption_policy_name,\n", + " policy = json.dumps(\n", + " {\n", + " 'Rules': [{'Resource': ['collection/' + vector_store_name],\n", + " 'ResourceType': 'collection'}],\n", + " 'AWSOwnedKey': True\n", + " }),\n", + " type = 'encryption'\n", + ")\n", + "\n", + "network_policy = aoss_client.create_security_policy(\n", + " name = network_policy_name,\n", + " policy = json.dumps(\n", + " [\n", + " {'Rules': [{'Resource': ['collection/' + vector_store_name],\n", + " 'ResourceType': 'collection'}],\n", + " 'AllowFromPublic': True}\n", + " ]),\n", + " type = 'network'\n", + ")\n", + "\n", + "collection = aoss_client.create_collection(name=vector_store_name,type='VECTORSEARCH')\n", + "\n", + "while True:\n", + " status = aoss_client.list_collections(collectionFilters={'name':vector_store_name})['collectionSummaries'][0]['status']\n", + " if status in ('ACTIVE', 'FAILED'): break\n", + " time.sleep(10)\n", + "\n", + "access_policy = aoss_client.create_access_policy(\n", + " name = access_policy_name,\n", + " policy = json.dumps(\n", + " [\n", + " {\n", + " 'Rules': [\n", + " {\n", + " 'Resource': ['collection/' + vector_store_name],\n", + " 'Permission': [\n", + " 'aoss:CreateCollectionItems',\n", + " 'aoss:DeleteCollectionItems',\n", + " 'aoss:UpdateCollectionItems',\n", + " 'aoss:DescribeCollectionItems'],\n", + " 'ResourceType': 'collection'\n", + " },\n", + " {\n", + " 'Resource': ['index/' + vector_store_name + '/*'],\n", + " 'Permission': [\n", + " 'aoss:CreateIndex',\n", + " 'aoss:DeleteIndex',\n", + " 'aoss:UpdateIndex',\n", + " 'aoss:DescribeIndex',\n", + " 'aoss:ReadDocument',\n", + " 'aoss:WriteDocument'],\n", + " 'ResourceType': 'index'\n", + " }],\n", + " 'Principal': 
[identity],\n", + " 'Description': 'Easy data policy'}\n", + " ]),\n", + " type = 'data'\n", + ")\n", + "\n", + "host = collection['createCollectionDetail']['id'] + '.' + aoss_client.meta.region_name + '.aoss.amazonaws.com'" + ] + }, + { + "cell_type": "markdown", + "id": "f3e2fefa-df80-48ec-8cfa-17c1099039f1", + "metadata": {}, + "source": [ + "Utility function for generating embeddings with the Amazon Titan Text Embeddings V2 model. " + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "8144b0ae-4209-4a0d-aa71-6fa8097d1653", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "def _get_emb_(passage):\n", + " \"\"\"\n", + " This function takes a passage of text and returns the corresponding text embedding using the Amazon Titan V2 Embedding model.\n", + " \"\"\"\n", + " response = bedrock_runtime.invoke_model(body=json.dumps({\"inputText\":passage,\"dimensions\":1024,\"normalize\":False}),\n", + " modelId=\"amazon.titan-embed-text-v2:0\", \n", + " accept=\"application/json\", \n", + " contentType=\"application/json\")\n", + "\n", + " response_body = json.loads(response.get('body').read())\n", + " embedding=response_body['embedding'] \n", + " return embedding" + ] + }, + { + "cell_type": "markdown", + "id": "e67f9a42", + "metadata": {}, + "source": [ + "Utility functions for running inference on Anthropic Claude models on Bedrock." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "5351c441-f635-412d-9318-0a3b5b55cbe6", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import base64\n", + "import random\n", + "from botocore.exceptions import ClientError\n", + "\n", + "def bedrock_streemer(response):\n", + " stream = response.get('body')\n", + " answer = \"\"\n", + " input_tokens, output_tokens = 0, 0 # defaults in case no invocation metrics arrive\n", + " if stream:\n", + " for event in stream:\n", + " chunk = event.get('chunk')\n", + " if chunk:\n", + " chunk_obj = json.loads(chunk.get('bytes').decode())\n", + " if \"delta\" in chunk_obj: \n", + " delta = chunk_obj['delta']\n", + " if \"text\" in delta:\n", + " text=delta['text'] \n", + " print(text, end=\"\")\n", + " answer+=str(text) \n", + " if \"amazon-bedrock-invocationMetrics\" in chunk_obj:\n", + " input_tokens= chunk_obj['amazon-bedrock-invocationMetrics']['inputTokenCount']\n", + " output_tokens=chunk_obj['amazon-bedrock-invocationMetrics']['outputTokenCount']\n", + " print(f\"\\nInput Tokens: {input_tokens}\\nOutput Tokens: {output_tokens}\")\n", + " return answer,input_tokens, output_tokens\n", + "\n", + "def bedrock_claude_(chat_history,system_message, prompt,model_id,image_path=None):\n", + " content=[]\n", + " if image_path: \n", + " if not isinstance(image_path, list):\n", + " image_path=[image_path] \n", + " for img in image_path:\n", + " s3 = boto3.client('s3')\n", + " match = re.match(\"s3://(.+?)/(.+)\", img)\n", + " image_name=os.path.basename(img)\n", + " _,ext=os.path.splitext(image_name)\n", + " if \"jpg\" in ext: ext=\".jpeg\" \n", + " if match:\n", + " bucket_name = match.group(1)\n", + " key = match.group(2) \n", + " obj = s3.get_object(Bucket=bucket_name, Key=key)\n", + " base_64_encoded_data = base64.b64encode(obj['Body'].read())\n", + " base64_string = base_64_encoded_data.decode('utf-8')\n", + " content.extend([{\"type\":\"text\",\"text\":image_name},{\n", + " \"type\": \"image\",\n", + " \"source\": {\n", + " \"type\": \"base64\",\n", + " \"media_type\": f\"image/{ext.lower().replace('.','')}\",\n", + " \"data\": base64_string\n", + " 
}\n", + " }])\n", + " \n", + " content.append({\n", + " \"type\": \"text\",\n", + " \"text\": prompt\n", + " })\n", + " chat_history.append({\"role\": \"user\",\n", + " \"content\": content})\n", + " prompt = {\n", + " \"anthropic_version\": \"bedrock-2023-05-31\",\n", + " \"max_tokens\": 1500,\n", + " \"temperature\": 0.1,\n", + " \"system\":system_message,\n", + " \"messages\": chat_history\n", + " }\n", + " answer = \"\"\n", + " prompt = json.dumps(prompt)\n", + " response = bedrock_runtime.invoke_model_with_response_stream(body=prompt, modelId=model_id, accept=\"application/json\", contentType=\"application/json\")\n", + " answer,input_tokens,output_tokens=bedrock_streemer(response) \n", + " return answer, input_tokens, output_tokens\n", + "\n", + "def _invoke_bedrock_with_retries(current_chat, chat_template, question, model_id, image_path):\n", + " max_retries = 5\n", + " backoff_base = 2\n", + " max_backoff = 3 # Maximum backoff time in seconds\n", + " retries = 0\n", + "\n", + " while True:\n", + " try:\n", + " response,input_tokens,output_tokens = bedrock_claude_(current_chat, chat_template, question, model_id, image_path)\n", + " return response,input_tokens,output_tokens\n", + " except ClientError as e:\n", + " if e.response['Error']['Code'] == 'ThrottlingException':\n", + " if retries < max_retries:\n", + " # Throttling, exponential backoff\n", + " sleep_time = min(max_backoff, backoff_base ** retries + random.uniform(0, 1))\n", + " time.sleep(sleep_time)\n", + " retries += 1\n", + " else:\n", + " raise e\n", + " elif e.response['Error']['Code'] == 'ModelStreamErrorException':\n", + " if retries < max_retries:\n", + " # Model stream error, retry with exponential backoff\n", + " sleep_time = min(max_backoff, backoff_base ** retries + random.uniform(0, 1))\n", + " time.sleep(sleep_time)\n", + " retries += 1\n", + " else:\n", + " raise e\n", + " else:\n", + " # Some other API error, rethrow\n", + " raise\n", + " " + ] + }, + { + "cell_type": "markdown", + "id": 
"f7c7d155-4e4f-4fca-860f-bce713ceee42", + "metadata": { + "tags": [] + }, + "source": [ + "## Document Extraction\n", + "We use the Amazon 2024 10K report as an example document. Using the Textractor library, we call the Amazon Textract `StartDocumentAnalysis` API to start an asynchronous job that extracts the document text and identifies additional elements such as document layout and tables." + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "5c1d6899-82bd-49aa-8d64-ff4ebfdb235b", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "BUCKET= sagemaker.Session().default_bucket()\n", + "extractor = Textractor(region_name=\"us-east-1\")\n", + "file=\"amazon-2024-10k.pdf\"\n", + "doc_id= os.path.basename(file)\n", + "file_name, ext = os.path.splitext(file)\n", + "\n", + "document = extractor.start_document_analysis(\n", + " file_source=f'../samples/{file}',\n", + " features=[TextractFeatures.LAYOUT,TextractFeatures.TABLES],\n", + " # client_request_token=doc_id,\n", + " save_image=False,\n", + " s3_upload_path=f\"s3://{BUCKET}\",\n", + " s3_output_path=f\"s3://{BUCKET}/textract-output/{file_name}/\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "efebf919-47a1-462d-8f21-e7ffe23cf6c3", + "metadata": {}, + "source": [ + "Using the Textractor linearization function, we enrich the extracted content with XML tags while hiding certain page sections such as headers, footers, and non-essential images.\n", + "\n", + "We tag tables, lists, title sections, and sub-sections so that these document elements can be identified and chunked efficiently.\n", + "\n", + "These tags are used to identify the various document elements and handle them appropriately. " + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "1f086761-25bd-4535-b1b1-10724abceae9", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\n", + "<
>
Competition
<
>\n", + "\n", + "Our businesses encompass a large variety of product types, service offerings, and delivery channels. The worldwide marketplace in which we compete is evolving rapidly and intensely competitive, and we face a broad array of competitors from many different industry sectors around the world. Our current and potential competitors include: (1) physical, e-commerce, and omnichannel retailers, publishers, vendors, distributors, manufacturers, and producers of the products we offer and sell to consumers and businesses; (2) publishers, producers, and distributors of physical, digital, and interactive media of all types and all distribution channels; (3) web search engines, comparison shopping websites, social networks, web portals, and other online and app-based means of discovering, using, or acquiring goods and services, either directly or in collaboration with other retailers; (4) companies that provide e-commerce services, including website development and hosting, omnichannel sales, inventory and supply chain management, advertising, fulfillment, customer service, and payment processing; (5) companies that provide fulfillment and logistics services for themselves or for third parties, whether online or offline; (6) companies that provide information technology services or products, including on-premises or cloud-based infrastructure and other services; (7) companies that design, manufacture, market, or sell consumer electronics, telecommunication, and electronic devices; (8) companies that sell grocery products online and in physical stores; and (9) companies that provide advertising services, whether in digital or other formats. We believe that the principal competitive factors in our retail businesses include selection, price, and convenience, including fast and reliable fulfillment. 
Additional competitive factors for our seller and enterprise services include the quality, speed, and reliability of our services and tools, as well as customers' ability and willingness to change business practices. Some of our current and potential competitors have greater resources, longer histories, more customers, greater brand recognition, and greater control over inputs critical to our various businesses. They may secure better terms from suppliers, adopt more aggressive pricing, pursue restrictive distribution agreements that restrict our access to supply, direct consumers to their own offerings instead of ours, lock-in potential customers with restrictive terms, and devote more resources to technology, infrastructure, fulfillment, and marketing. The internet facilitates competitive entry and comparison shopping, which enhances the ability of new, smaller, or lesser-known businesses to compete against us. Each of our businesses is also subject to rapid change and the development of new business models and the entry of new and well-funded competitors. Other companies also may enter into business combinations or alliances that strengthen their competitive positions. \n", + "\n", + "<
>
Intellectual Property
<
>\n", + "\n", + "We regard our trademarks, service marks, copyrights, patents, domain names, trade dress, trade secrets, proprietary technologies, and similar intellectual property as critical to our success, and we rely on trademark, copyright, and patent law, trade-secret protection, and confidentiality and/or license agreements with our employees, customers, partners, and others to protect our proprietary rights. We have registered, or applied for the registration of, a number of U.S. and international domain names, trademarks, service marks, and copyrights. Additionally, we have filed U.S. and international patent applications covering certain of our proprietary technology. \n", + "\n", + "Seasonality \n", + "\n", + "Our business is affected by seasonality, which historically has resulted in higher sales volume during our fourth quarter, which ends December 31. \n", + "\n", + "<
>
Human Capital
<
>\n", + "\n", + "Our employees are critical to our mission of being Earth's most customer-centric company. As of December 31, 2023, we employed approximately 1,525,000 full-time and part-time employees. Additionally, we use independent contractors and temporary personnel to supplement our workforce. Competition for qualified personnel is intense, particularly for software engineers, computer scientists, and other technical staff, and constrained labor markets have increased competition for personnel across other parts of our business. \n", + "\n", + "As we strive to be Earth's best employer, we focus on investment and innovation, inclusion and diversity, safety, and engagement to hire and develop the best talent. We rely on numerous and evolving initiatives to implement these objectives and invent mechanisms for talent development, including competitive pay and benefits, flexible work arrangements, and skills training and educational programs such as Amazon Career Choice (education funding for eligible employees) and the Amazon Technical Academy (software development engineer training). Over 175,000 Amazon employees around the world have participated in Career Choice. We also continue to inspect and refine the mechanisms we use to hire, develop, evaluate, and retain our employees to promote equity for all candidates and employees. In addition, safety is integral to everything we do at Amazon and we continue to invest in safety improvements such as capital improvements, new safety technology, vehicle safety controls, and engineering ergonomic solutions. Our safety team is dedicated to using the science of safety to solve complex problems and establish new industry best practices. We also provide mentorship and support resources to our employees, and have deployed numerous programs that advance employee engagement, communication, and feedback. 
\n", + "\n", + "\n" + ] + } + ], + "source": [ + "from textractor.data.text_linearization_config import TextLinearizationConfig\n", + "\n", + "config = TextLinearizationConfig(\n", + " hide_figure_layout=False,\n", + " title_prefix=\"<><title>\",\n", + " title_suffix=\"<>\",\n", + " hide_header_layout=True,\n", + " section_header_prefix=\"<
>
\",\n", + " section_header_suffix=\"
<
>\",\n", + " table_prefix=\"\",\n", + " table_suffix=\"
\",\n", + " list_layout_prefix=\"<>\",\n", + " list_layout_suffix=\"<>\",\n", + " hide_footer_layout=True,\n", + " hide_page_num_layout=True,\n", + ")\n", + "\n", + "print(document.pages[3].get_text(config=config))" + ] + }, + { + "cell_type": "markdown", + "id": "29a89081-022e-4451-b3be-38fd4f697e0d", + "metadata": {}, + "source": [ + "## Document Processing" + ] + }, + { + "cell_type": "markdown", + "id": "1b05fa55-fe4e-462e-9908-b8cae7d731d2", + "metadata": {}, + "source": [ + "The following cells define a Python function `split_list_items_` and a script that processes a document containing tables and text, converting tables to CSV format while preserving the document structure of text and tables.\n", + "\n", + "The function `split_list_items_` takes a string of page text in which lists are marked by specific XML tags. It parses the string and returns a Python list in which each tagged list is kept as a single item and the remaining text is split on newline characters.\n", + "\n", + "The script that follows the function processes each page of the document. It identifies tables, converts them to CSV format, and wraps them in XML tags for identification. If a page contains lists, the script uses `split_list_items_` to handle them. The processed content is stored in dictionaries keyed by page index for later use.\n", + "\n", + "The `layout_table_to_excel` function writes a table to Excel format to handle spanned columns and rows in complex tables. It copies the value of each merged cell into every cell of the merged range to help keep the integrity of complex tables, and returns the result as a pandas DataFrame.\n", + "\n", + "Together, these steps keep tables properly formatted while preserving the document's structure of text and lists, which supports data extraction and processing for documents with mixed content types." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "7211dead-9643-4afb-b9d0-cc59f905ce1b", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "def strip_newline(cell):\n", + " \"\"\"\n", + " A utility function to strip newline characters from a cell.\n", + " Parameters:\n", + " cell (str): The cell value.\n", + " Returns:\n", + " str: The cell value with newline characters removed.\n", + " \"\"\"\n", + " return str(cell).strip()\n", + "\n", + "def layout_table_to_excel(document, ids,csv_seperator): \n", + " \"\"\"\n", + " Converts an Excel table from a document to a Pandas DataFrame, \n", + " handling duplicated values across merged cells.\n", + "\n", + " Args:\n", + " document: Document containing Excel table \n", + " ids: ID of the Excel table in the document\n", + " csv_seperator: Separator for CSV string conversion\n", + "\n", + " Returns: \n", + " Pandas DataFrame representation of the Excel table\n", + " \"\"\"\n", + " # save the table in excel format to preserve the structure of any merged cells\n", + " buffer = io.BytesIO() \n", + " document.tables[ids].to_excel(buffer)\n", + " buffer.seek(0)\n", + " # Load workbook, get active worksheet\n", + " wb = openpyxl.load_workbook(buffer)\n", + " worksheet = wb.active\n", + " # Unmerge cells, duplicate merged values to individual cells\n", + " all_merged_cell_ranges: list[CellRange] = list(\n", + " worksheet.merged_cells.ranges\n", + " )\n", + " for merged_cell_range in all_merged_cell_ranges:\n", + " merged_cell: Cell = merged_cell_range.start_cell\n", + " worksheet.unmerge_cells(range_string=merged_cell_range.coord)\n", + " for row_index, col_index in merged_cell_range.cells:\n", + " cell: Cell = worksheet.cell(row=row_index, column=col_index)\n", + " cell.value = merged_cell.value\n", + " # determine table header index\n", + " df = pd.DataFrame(worksheet.values)\n", + " df=df.map(strip_newline)\n", + " df0=df.to_csv(sep=csv_seperator,index=False, 
header=None)\n", + " row_count=len([x for x in df0.split(\"\\n\") if x])\n", + " if row_count>1:\n", + " if not all(value.strip() == '' for value in df0.split(\"\\n\")[0].split(csv_seperator)): \n", + " row_count=1\n", + " # attach table column names\n", + " column_row=0 if row_count==1 else 1\n", + " df.columns = df.iloc[column_row] \n", + " df = df[column_row+1:]\n", + " return df\n", + "\n", + "def split_list_items_(items):\n", + " \"\"\"\n", + " Splits the given string into a list of items, handling nested lists.\n", + "\n", + " Parameters:\n", + " items (str): The input string containing items and possibly nested lists.\n", + "\n", + " Returns:\n", + " list: A list containing the items extracted from the input string.\n", + " \"\"\"\n", + " parts = re.split(\"(<>|<>)\", items) \n", + " output = []\n", + "\n", + " inside_list = False\n", + " list_item = \"\"\n", + "\n", + " for p in parts:\n", + " if p == \"<>\":\n", + " inside_list = True \n", + " list_item=p\n", + " elif p == \"<>\":\n", + " inside_list = False\n", + " list_item += p\n", + " output.append(list_item)\n", + " list_item = \"\" \n", + " elif inside_list:\n", + " list_item += p.strip()\n", + " else:\n", + " output.extend(p.split('\\n'))\n", + " return output" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "a011ade1-bd57-49e3-a531-297e3c34d4f3", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import io\n", + "\"\"\"\n", + "This script processes a document containing tables and text. It converts the tables into CSV format \n", + "and wraps them with XML tags for easy identification. 
The document structure with text and tables is maintained.\n", + "\"\"\"\n", + "csv_seperator=\"|\" \n", + "document_holder={}\n", + "table_page={}\n", + "count=0\n", + "# Whether to handle merged cells by duplicating merged value across corresponding individual cells\n", + "unmerge_span_cells=True \n", + "# Loop through each page in the document\n", + "for ids,page in enumerate(document.pages):\n", + " table_count=len([word for word in page.get_text(config=config).split() if \"\" in word]) # get the number of table in the extracted document page by header we set earlier\n", + " assert table_count==len(page.tables) # check that number of tables per page is same as *tables extracted by textract TABLE feature\n", + " content=page.get_text(config=config).split(\"\")\n", + " document_holder[ids]=[] \n", + " for idx,item in enumerate(content):\n", + " if \"
\" in item: \n", + " if unmerge_span_cells:\n", + " df=layout_table_to_excel(document, count,csv_seperator)\n", + " else:\n", + " df0= document.tables[count].to_pandas(use_columns=False).to_csv(header=False, index=None,sep=csv_seperator)\n", + " row_count=len([x for x in df0.split(\"\\n\") if x]) #Check the number of rows in the parsed table to determine how to read the table headers. if table row count is 1 then headers is obviously at 0 else headers may or may not be at 0\n", + " #Check if the first row in the csv is empty headers\n", + " if row_count>1:\n", + " if not all(value.strip() == '' for value in df0.split(\"\\n\")[0].split(csv_seperator)): \n", + " row_count=1\n", + " df=pd.read_csv(io.StringIO(df0), sep=csv_seperator, \n", + " header=0 if row_count==1 else 1, keep_default_na=False) # read table with appropiate column headers\n", + " df.rename(columns=lambda x: '' if str(x).startswith('Unnamed:') else x, inplace=True) \n", + " table=df.to_csv(index=None, sep=csv_seperator)\n", + "\n", + " if ids in table_page:\n", + " table_page[ids].append(table)\n", + " else:\n", + " table_page[ids]=[table]\n", + " # Extract table data and remaining content\n", + " pattern = re.compile(r'
(.*?)(
)', re.DOTALL) \n", + " data=item\n", + " table_match = re.search(pattern, data)\n", + " table_data = table_match.group(1) if table_match else '' \n", + " remaining_content = data[table_match.end():] if table_match else data \n", + " content[idx]=f\"<>
{table}
<>\" ## attach xml tags to differentiate table from other text\n", + " count+=1\n", + " # Check for list items in remaining content\n", + " if \"<>\" in remaining_content:\n", + " output=split_list_items_(remaining_content)\n", + " output=[x.strip() for x in output if x.strip()]\n", + " document_holder[ids].extend([content[idx]]+output) \n", + " else:\n", + " document_holder[ids].extend([content[idx]]+[x.strip() for x in remaining_content.split('\\n') if x.strip()]) # split other text by new line to be independent items in the python list.\n", + " else: \n", + " # Check for list items and tables in remaining content\n", + " if \"<>\" in item and \"\" not in item: \n", + " output=split_list_items_(item)\n", + " output=[x.strip() for x in output if x.strip()]\n", + " document_holder[ids].extend(output)\n", + " else:\n", + " document_holder[ids].extend([x.strip() for x in item.split(\"\\n\") if x.strip()])" + ] + }, + { + "cell_type": "markdown", + "id": "f8f4f4ba-e1f4-4f9d-a7ec-49f9eb3ead44", + "metadata": {}, + "source": [ + "Here we first flatten a nested list into a single list and then join its elements using newline characters. Subsequently, the string is split into segments based on the `` tag (split by title section hierarchy), generating a list of sub-section segments. Following this, the function `sub_header_content_splitta` is defined to process a string, splitting it by XML tags and extracting text segments, excluding segments containing specific XML tags such as `
`, ``, or `
`. This function takes a string as input, applies a regular expression pattern to split it by XML tags, and iterates through the resulting segments to filter out those containing the specified XML tags. The extracted text segments are then returned as a list. " + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "21eb2fde-7547-4390-831b-5c13b198d775", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# # Flatten the nested list document_holder into a single list and Join the flattened list by \"\\n\"\n", + "flattened_list = [item for sublist in document_holder.values() for item in sublist]\n", + "result = \"\\n\".join( flattened_list)\n", + "header_split=result.split(\"\")\n", + "\n", + "def sub_header_content_splitta(string): \n", + " \"\"\"\n", + " Splits the input string by XML tags and returns a list containing the segments of text,\n", + " excluding segments containing specific XML tags such as \"
\", \"\", or \"
\".\n", + "\n", + " Parameters:\n", + " string (str): The input string to be processed.\n", + "\n", + " Returns:\n", + " list: A list containing the segments of text extracted from the input string.\n", + " \"\"\" \n", + " pattern = re.compile(r'<<[^>]+>>')\n", + " segments = re.split(pattern, string)\n", + " result = []\n", + " for segment in segments:\n", + " if segment.strip():\n", + " if \"
\" not in segment and \"\" not in segment and \"
\" not in segment:\n", + " segment=[x.strip() for x in segment.split('\\n') if x.strip()]\n", + " result.extend(segment)\n", + " else:\n", + " result.append(segment)\n", + " return result\n" + ] + }, + { + "cell_type": "markdown", + "id": "29dd9cf4-ee87-4233-bd12-01f50c00b8b7", + "metadata": {}, + "source": [ + "## Document Chunking\n", + "\n", + "Example page showing the hierarchy of elements in a document. The hierarchy flows down from **Section Title** -> **Section Header** -> **Paragraphs**, where the section title is a superset of the section header and its paragraphs, tables, lists, etc. \n", + "\n", + "\n", + "\n", + "This cell iterates through the document per title section and chunks the content within each sub-section in the following manner:\n", + "- It uses the number of words as the chunking threshold.\n", + "- It looks for the different XML tags to identify the document elements. These elements include `section titles`, `section headers`, `tables`, `lists` and `paragraphs`.\n", + " - It iterates through the section headers within a section title, identified by the **header** tags, and only chunks content within each section header. Therefore, chunks do not overflow into a different section header even if the word threshold has not been met. This helps us create a hierarchical mapping of each chunk to its parent entities (section header and section title).\n", + " - If a table XML tag is found, it checks whether there is a sentence before that table (the heuristic employed here is that the sentence before a table is usually the table header) and uses it as the table header. It then splits the table by rows until the desired chunk size is reached and appends the corresponding section header and table column names to each table chunk.\n", + " - If a list is found, it splits the list by items until the desired chunk size is reached. 
It employs the same heuristic as above and appends the list header to every list chunk.\n", + " - For other text, it chunks by paragraphs and appends the section header name to the corresponding chunks.\n", + "- A dictionary containing the complete hierarchical relationship of each chunk to its corresponding parent entities is stored as a JSON object to be used for hierarchical retrieval, a.k.a. **Small-to-Big**.\n", + "- The complete table and list found in each chunk are also stored for metadata purposes." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "919c8154-00b5-466b-b3f6-54f91cde19fb", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import re\n", + "import pandas as pd\n", + "from io import StringIO\n", + "\n", + "max_words = 200\n", + "chunks = {}\n", + "table_header_dict={} \n", + "chunk_header_mapping={}\n", + "list_header_dict={}\n", + "\n", + "# iterate through each title section\n", + "for title_ids, items in enumerate(header_split):\n", + " title_chunks = []\n", + " current_chunk = []\n", + " num_words = 0 \n", + " table_header_dict[title_ids]={}\n", + " chunk_header_mapping[title_ids]={}\n", + " list_header_dict[title_ids]={}\n", + " chunk_counter=0\n", + " for item_ids,item in enumerate(items.split('')): \n", + " lines=sub_header_content_splitta(item) \n", + " SECTION_HEADER=None \n", + " TITLES=None\n", + " num_words = 0 \n", + " for ids_line,line in enumerate(lines):\n", + " \n", + " if line.strip():\n", + " if \"\" in line: \n", + " TITLES=re.findall(r'<title>(.*?)', line)[0].strip()\n", + " line=TITLES \n", + " if re.sub(r'<[^>]+>', '', \"\".join(lines)).strip()==TITLES:\n", + " chunk_header_mapping[title_ids][chunk_counter]=lines\n", + " chunk_counter+=1\n", + " if \"
\" in line: \n", + " SECTION_HEADER=re.findall(r'
(.*?)
', line)[0].strip()\n", + " line=SECTION_HEADER \n", + " first_header_portion=True\n", + " next_num_words = num_words + len(re.findall(r'\\w+', line)) \n", + "\n", + " if \"
\" not in line and \"\" not in line:\n", + " if next_num_words > max_words and \"\".join(current_chunk).strip()!=SECTION_HEADER and current_chunk and \"\".join(current_chunk).strip()!=TITLES:\n", + " \n", + " if SECTION_HEADER :\n", + " if first_header_portion:\n", + " first_header_portion=False \n", + " else:\n", + " current_chunk.insert(0, SECTION_HEADER.strip()) \n", + " \n", + " title_chunks.append(current_chunk) \n", + " chunk_header_mapping[title_ids][chunk_counter]=lines\n", + " \n", + " current_chunk = []\n", + " num_words = 0 \n", + " chunk_counter+=1\n", + " \n", + " current_chunk.append(line) \n", + " num_words += len(re.findall(r'\\w+', line))\n", + "\n", + " \"\"\"\n", + " Goal is to segment out table items and chunks intelligently.\n", + " We chunk the table by rows and for each chunk of the table we append the table column headers\n", + " and table headers if any. This way we preserve the table information across each chunks.\n", + " This will help improve semantic search where all the chunks relating to a table would be in the \n", + " top k=n response giving the LLM mcomplet information on the table.\n", + " \"\"\"\n", + "\n", + " if \"
\" in line:\n", + " # Get table header which is usually line before table in document \n", + " line_index=lines.index(line)\n", + " if line_index!=0 and \"
\" not in lines[line_index-1] and \"\" not in lines[line_index-1]: #Check if table is first item on the page, then they wont be a header (header may be included it table) and also if table is the the last item in the list\n", + " header=lines[line_index-1].replace(\"
\",\"\").replace(\"
\",\"\")\n", + " else:\n", + " header=\"\" \n", + " \n", + " table = line.split(\"
\")[-1].split(\"
\")[0] # get table from demarcators \n", + " df=pd.read_csv(io.StringIO(table), sep=csv_seperator, keep_default_na=False,header=None)\n", + " df.columns = df.iloc[0]\n", + " df = df[1:]\n", + " df.rename(columns=lambda x: '' if str(x).startswith('Unnamed:') else x, inplace=True) \n", + " table_chunks = []\n", + " curr_chunk = [df.columns.to_list()] #start current chunk with table column names \n", + " words=len(re.findall(r'\\w+', str(current_chunk)+\" \"+str(curr_chunk))) \n", + " # Iterate through the rows in the table\n", + " for row in df.itertuples(index=False):\n", + " curr_chunk.append(row) \n", + " words+=len(re.findall(r'\\w+', str(row)))\n", + " if words > max_words: \n", + " if [x for x in table_header_dict[title_ids] if chunk_counter == x]:\n", + " table_header_dict[title_ids][chunk_counter].extend([header]+[table])\n", + " else:\n", + " table_header_dict[title_ids][chunk_counter]=[header]+[table] \n", + " table_chunks.append(\"\\n\".join([csv_seperator.join(str(x) for x in curr_chunk[0])] + [csv_seperator.join(str(x) for x in r) for r in curr_chunk[1:]])) #join chunk lines together to for a csv \n", + " tab_chunk=\"\\n\".join([csv_seperator.join(str(x) for x in curr_chunk[0])] + [csv_seperator.join(str(x) for x in r) for r in curr_chunk[1:]]) #join chunk lines together to for a csv\n", + " words = len(re.findall(r'\\w+', str(curr_chunk[0]))) # set word count to word length of column header names\n", + " if header: #If header attach header to table \n", + " if current_chunk and current_chunk[-1].strip().lower()==header.strip().lower(): #check if header is in the chunk and remove to avoid duplicacy of header in chunk \n", + " current_chunk.pop(-1)\n", + " # Append section header to table\n", + " if SECTION_HEADER and SECTION_HEADER.lower().strip() != header.lower().strip():\n", + " if first_header_portion:\n", + " first_header_portion=False\n", + " else:\n", + " current_chunk.insert(0, SECTION_HEADER.strip()) \n", + " 
current_chunk.extend([header.strip()+':' if not header.strip().endswith(':') else header.strip() ]+[tab_chunk]) #enrich table header with ':'\n", + " title_chunks.append(current_chunk) \n", + " \n", + " else:\n", + " if SECTION_HEADER:\n", + " if first_header_portion:\n", + " first_header_portion=False\n", + " else:\n", + " current_chunk.insert(0, SECTION_HEADER.strip()) \n", + " current_chunk.extend([tab_chunk])\n", + " title_chunks.append(current_chunk) \n", + " chunk_header_mapping[title_ids][chunk_counter]=lines\n", + " chunk_counter+=1\n", + " num_words=0\n", + " current_chunk=[]\n", + " curr_chunk = [curr_chunk[0]]\n", + " \n", + " if curr_chunk != [df.columns.to_list()] and lines.index(line) == len(lines)-1: #if table chunk still remaining and table is last item in page append as last chunk\n", + " table_chunks.append(\"\\n\".join([csv_seperator.join(str(x) for x in curr_chunk[0])] + [csv_seperator.join(str(x) for x in r) for r in curr_chunk[1:]]))\n", + " tab_chunk=\"\\n\".join([csv_seperator.join(str(x) for x in curr_chunk[0])] + [csv_seperator.join(str(x) for x in r) for r in curr_chunk[1:]]) \n", + " if [x for x in table_header_dict[title_ids] if chunk_counter == x]:\n", + " table_header_dict[title_ids][chunk_counter].extend([header]+[table])\n", + " else:\n", + " table_header_dict[title_ids][chunk_counter]=[header]+[table] \n", + " \n", + " if header: \n", + " if current_chunk and current_chunk[-1].strip().lower()==header.strip().lower():#check if header is in the chunk and remove to avoid duplicacy of header in chunk\n", + " current_chunk.pop(-1) \n", + " if SECTION_HEADER and SECTION_HEADER.lower().strip() != header.lower().strip():\n", + " if first_header_portion:\n", + " first_header_portion=False\n", + " else:\n", + " current_chunk.insert(0, SECTION_HEADER.strip()) \n", + " current_chunk.extend([header.strip()+':' if not header.strip().endswith(':') else header.strip() ]+[tab_chunk])\n", + " title_chunks.append(current_chunk) \n", + " else:\n", + " 
if SECTION_HEADER:\n", + " if first_header_portion:\n", + " first_header_portion=False\n", + " else:\n", + " current_chunk.insert(0, SECTION_HEADER.strip()) \n", + " current_chunk.extend([tab_chunk])\n", + " title_chunks.append(current_chunk) \n", + " chunk_header_mapping[title_ids][chunk_counter]=lines\n", + " chunk_counter+=1\n", + " num_words=0\n", + " current_chunk=[]\n", + " elif curr_chunk != [df.columns.to_list()] and lines.index(line) != len(lines)-1: #if table is not last item in page and max word threshold is not reached, send no next loop\n", + " table_chunks.append(\"\\n\".join([csv_seperator.join(str(x) for x in curr_chunk[0])] + [csv_seperator.join(str(x) for x in r) for r in curr_chunk[1:]]))\n", + " tab_chunk=\"\\n\".join([csv_seperator.join(str(x) for x in curr_chunk[0])] + [csv_seperator.join(str(x) for x in r) for r in curr_chunk[1:]])\n", + " \n", + " if [x for x in table_header_dict[title_ids] if chunk_counter == x]:\n", + " table_header_dict[title_ids][chunk_counter].extend([header]+[table])\n", + " else:\n", + " table_header_dict[title_ids][chunk_counter]=[header]+[table] \n", + " if header: \n", + " if current_chunk and current_chunk[-1].strip().lower()==header.strip().lower():#check if header is in the chunk and remove to avoid duplicacy of header in chunk\n", + " current_chunk.pop(-1) \n", + " current_chunk.extend([header.strip()+':' if not header.strip().endswith(':') else header.strip() ]+[tab_chunk])\n", + " else:\n", + " current_chunk.extend([tab_chunk]) \n", + " num_words=words\n", + " \n", + "\n", + " \"\"\"\n", + " Goal is to segment out list items and chunk intelligently.\n", + " We chunk each list by items in the list and \n", + " for each list chunk we append the list header to the chunk to preserve the information of the list across chunks.\n", + " This would boost retrieval process where question pertaining to a list will have all list chunks within\n", + " the topK=n responses.\n", + " \"\"\"\n", + "\n", + " if \"\" in 
line:\n", + " # Get list header which is usually line before list in document\n", + " line_index=lines.index(line)\n", + " if line_index!=0 and \"\" not in lines[line_index-1] and \"\" not in lines[line_index-1]: #Check if table or list is the previous item on the page, then they wont be a header\n", + " header=lines[line_index-1].replace(\"
\",\"\").replace(\"
\",\"\")\n", + " else:\n", + " header=\"\" \n", + " list_pattern = re.compile(r'(.*?)(?:|$)', re.DOTALL) ## Grab all list contents within the list xml tags \n", + " list_match = re.search(list_pattern, line)\n", + " list_ = list_match.group(1)\n", + " list_lines=list_.split(\"\\n\") \n", + "\n", + " curr_chunk = [] \n", + " words=len(re.findall(r'\\w+', str(current_chunk))) #start word count from any existing chunk\n", + " # Iterate through the items in the list\n", + " for lyst_item in list_lines:\n", + " curr_chunk.append(lyst_item) \n", + " words+=len(re.findall(r'\\w+', lyst_item)) \n", + " if words >= max_words: # \n", + " if [x for x in list_header_dict[title_ids] if chunk_counter == x]:\n", + " list_header_dict[title_ids][chunk_counter].extend([header]+[list_])\n", + " else:\n", + " list_header_dict[title_ids][chunk_counter]=[header]+[list_] \n", + " words=0 \n", + " list_chunk=\"\\n\".join(curr_chunk)\n", + " if header: # attach list header \n", + " if current_chunk and current_chunk[-1].strip().lower()==header.strip().lower():#check if header is in the chunk and remove to avoid duplicacy of header in chunk \n", + " current_chunk.pop(-1) \n", + " # Append section content header to list\n", + " if SECTION_HEADER and SECTION_HEADER.lower().strip() != header.lower().strip():\n", + " if first_header_portion:\n", + " first_header_portion=False\n", + " else:\n", + " current_chunk.insert(0, SECTION_HEADER.strip())\n", + " \n", + " current_chunk.extend([header.strip()+':' if not header.strip().endswith(':') else header.strip() ]+[list_chunk]) \n", + " title_chunks.append(current_chunk) \n", + " \n", + " else:\n", + " if SECTION_HEADER:\n", + " if first_header_portion:\n", + " first_header_portion=False\n", + " else:\n", + " current_chunk.insert(0, SECTION_HEADER.strip())\n", + " \n", + " current_chunk.extend([list_chunk])\n", + " title_chunks.append(current_chunk) \n", + " chunk_header_mapping[title_ids][chunk_counter]=lines\n", + " chunk_counter+=1\n", + " 
num_words=0\n", + " current_chunk=[]\n", + " curr_chunk = []\n", + " if curr_chunk and lines.index(line) == len(lines)-1: #if list chunk still remaining and list is last item in page append as last chunk\n", + " list_chunk=\"\\n\".join(curr_chunk)\n", + " if [x for x in list_header_dict[title_ids] if chunk_counter == x]:\n", + " list_header_dict[title_ids][chunk_counter].extend([header]+[list_])\n", + " else:\n", + " list_header_dict[title_ids][chunk_counter]=[header]+[list_] \n", + " if header: \n", + " if current_chunk and current_chunk[-1].strip().lower()==header.strip().lower(): #check if header is in the chunk and remove to avoid duplicacy of header in chunk\n", + " current_chunk.pop(-1) \n", + " if SECTION_HEADER and SECTION_HEADER.lower().strip() != header.lower().strip():\n", + " if first_header_portion:\n", + " first_header_portion=False\n", + " else:\n", + " current_chunk.insert(0, SECTION_HEADER.strip()) \n", + " current_chunk.extend([header.strip()+':' if not header.strip().endswith(':') else header.strip() ]+[list_chunk])\n", + " title_chunks.append(current_chunk) \n", + " else:\n", + " if SECTION_HEADER:\n", + " if first_header_portion:\n", + " first_header_portion=False\n", + " else:\n", + " current_chunk.insert(0, SECTION_HEADER.strip()) \n", + " current_chunk.extend([list_chunk])\n", + " title_chunks.append(current_chunk) \n", + " chunk_header_mapping[title_ids][chunk_counter]=lines\n", + " chunk_counter+=1\n", + " num_words=0\n", + " current_chunk=[]\n", + " elif curr_chunk and lines.index(line) != len(lines)-1: #if list is not last item in page and max word threshold is not reached, send to next loop \n", + " list_chunk=\"\\n\".join(curr_chunk)\n", + " if [x for x in list_header_dict[title_ids] if chunk_counter == x]:\n", + " list_header_dict[title_ids][chunk_counter].extend([header]+[list_])\n", + " else:\n", + " list_header_dict[title_ids][chunk_counter]=[header]+[list_] \n", + " if header: \n", + " if current_chunk and 
current_chunk[-1].strip().lower()==header.strip().lower():#check if header is in the chunk and remove to avoid duplicacy of header in chunk\n", + " current_chunk.pop(-1) \n", + " current_chunk.extend([header.strip()+':' if not header.strip().endswith(':') else header.strip() ]+[list_chunk])\n", + " else:\n", + " current_chunk.extend([list_chunk]) \n", + " num_words=words\n", + "\n", + "\n", + " if current_chunk and \"\".join(current_chunk).strip()!=SECTION_HEADER and \"\".join(current_chunk).strip()!=TITLES:\n", + " \n", + " if SECTION_HEADER:\n", + " if first_header_portion:\n", + " first_header_portion=False\n", + " else:\n", + " current_chunk.insert(0, SECTION_HEADER.strip()) \n", + " title_chunks.append(current_chunk)\n", + " chunk_header_mapping[title_ids][chunk_counter]=lines\n", + " current_chunk=[]\n", + " chunk_counter+=1\n", + " if current_chunk:\n", + " \n", + " title_chunks.append(current_chunk) \n", + " chunk_header_mapping[title_ids][chunk_counter]=lines\n", + " chunks[title_ids] = title_chunks" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "9501b756-ba00-48f7-a400-49bae1d834d6", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Chunk 1:\n", + "AMAZON.COM, INC. CONSOLIDATED STATEMENTS OF CASH FLOWS (in millions)\n", + "AMAZON.COM, INC. 
CONSOLIDATED STATEMENTS OF CASH FLOWS (in millions) :\n", + "|Year Ended December 31,|Year Ended December 31,|Year Ended December 31,\n", + "|2021|2022|2023\n", + "CASH, CASH EQUIVALENTS, AND RESTRICTED CASH, BEGINNING OF PERIOD|$ 42,377|$ 36,477|$ 54,253\n", + "OPERATING ACTIVITIES:|OPERATING ACTIVITIES:|OPERATING ACTIVITIES:|OPERATING ACTIVITIES:\n", + "Net income (loss)|33,364|(2,722)|30,425\n", + "Adjustments to reconcile net income (loss) to net cash from operating activities:|Adjustments to reconcile net income (loss) to net cash from operating activities:|Adjustments to reconcile net income (loss) to net cash from operating activities:|Adjustments to reconcile net income (loss) to net cash from operating activities:\n", + "Depreciation and amortization of property and equipment and capitalized content costs, operating lease assets, and other|34,433|41,921|48,663\n", + "Stock-based compensation|12,757|19,621|24,023\n", + "Non-operating expense (income), net|(14,306)|16,966|(748)\n", + "Deferred income taxes|(310)|(8,148)|(5,876)\n", + "Changes in operating assets and liabilities:|Changes in operating assets and liabilities:|Changes in operating assets and liabilities:|Changes in operating assets and liabilities:\n", + "\n", + "\n", + "Chunk 2:\n", + "AMAZON.COM, INC. 
CONSOLIDATED STATEMENTS OF CASH FLOWS (in millions) :\n", + "|Year Ended December 31,|Year Ended December 31,|Year Ended December 31,\n", + "Inventories|(9,487)|(2,592)|1,449\n", + "Accounts receivable, net and other|(9,145)|(8,622)|(8,348)\n", + "Other assets|(9,018)|(13,275)|(12,265)\n", + "Accounts payable|3,602|2,945|5,473\n", + "Accrued expenses and other|2,123|(1,558)|(2,428)\n", + "Unearned revenue|2,314|2,216|4,578\n", + "Net cash provided by (used in) operating activities|46,327|46,752|84,946\n", + "INVESTING ACTIVITIES:|INVESTING ACTIVITIES:|INVESTING ACTIVITIES:|INVESTING ACTIVITIES:\n", + "Purchases of property and equipment|(61,053)|(63,645)|(52,729)\n", + "Proceeds from property and equipment sales and incentives|5,657|5,324|4,596\n", + "Acquisitions, net of cash acquired, non-marketable investments, and other|(1,985)|(8,316)|(5,839)\n", + "Sales and maturities of marketable securities|59,384|31,601|5,627\n", + "Purchases of marketable securities|(60,157)|(2,565)|(1,488)\n", + "\n", + "\n", + "Chunk 3:\n", + "AMAZON.COM, INC. 
CONSOLIDATED STATEMENTS OF CASH FLOWS (in millions) :\n", + "|Year Ended December 31,|Year Ended December 31,|Year Ended December 31,\n", + "Net cash provided by (used in) investing activities|(58,154)|(37,601)|(49,833)\n", + "FINANCING ACTIVITIES:|FINANCING ACTIVITIES:|FINANCING ACTIVITIES:|FINANCING ACTIVITIES:\n", + "Common stock repurchased|-|(6,000)|-\n", + "Proceeds from short-term debt, and other|7,956|41,553|18,129\n", + "Repayments of short-term debt, and other|(7,753)|(37,554)|(25,677)\n", + "Proceeds from long-term debt|19,003|21,166|-\n", + "Repayments of long-term debt|(1,590)|(1,258)|(3,676)\n", + "Principal repayments of finance leases|(11,163)|(7,941)|(4,384)\n", + "Principal repayments of financing obligations|(162)|(248)|(271)\n", + "Net cash provided by (used in) financing activities|6,291|9,718|(15,879)\n", + "Foreign currency effect on cash, cash equivalents, and restricted cash|(364)|(1,093)|403\n", + "Net increase (decrease) in cash, cash equivalents, and restricted cash|(5,900)|17,776|19,637\n", + "\n", + "\n", + "Chunk 4:\n", + "AMAZON.COM, INC. CONSOLIDATED STATEMENTS OF CASH FLOWS (in millions) :\n", + "|Year Ended December 31,|Year Ended December 31,|Year Ended December 31,\n", + "CASH, CASH EQUIVALENTS, AND RESTRICTED CASH, END OF PERIOD|$ 36,477|$ 54,253|$ 73,890\n", + "See accompanying notes to consolidated financial statements.\n", + "\n", + "\n" + ] + } + ], + "source": [ + "# Print chunks per title section\n", + "for i, chunk in enumerate(chunks[8][:10], start=1):\n", + " print(f'Chunk {i}:')\n", + " for item in chunk:\n", + " print(item)\n", + " print('\\n')" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "c665aeba-cb90-40c4-b40b-367cd5874a57", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "FORM 10-K \n", + "FORM 10-K For the Fiscal Year Ended December 31, 2023 \n", + "PART II \n", + "Item 7. 
Management's Discussion and Analysis of Financial Condition and Results of Operations \n", + "Table of Contents \n", + "INDEX TO CONSOLIDATED FINANCIAL STATEMENTS \n", + "Report of Independent Registered Public Accounting Firm \n", + "AMAZON.COM, INC. CONSOLIDATED STATEMENTS OF CASH FLOWS (in millions) \n", + "CONSOLIDATED STATEMENTS OF OPERATIONS \n", + "AMAZON.COM, INC. CONSOLIDATED STATEMENTS OF COMPREHENSIVE INCOME (LOSS) (in millions) \n", + "CONSOLIDATED BALANCE SHEETS \n", + "AMAZON.COM, INC. CONSOLIDATED STATEMENTS OF STOCKHOLDERS' EQUITY (in millions) \n", + "AMAZON.COM, INC. NOTES TO CONSOLIDATED FINANCIAL STATEMENTS \n", + "Note 2 - FINANCIAL INSTRUMENTS \n", + "Report of Independent Registered Public Accounting Firm \n", + "PART III \n", + "PART IV \n", + "SIGNATURES \n", + "AMAZON.COM, INC. GLOBAL RESTRICTED STOCK UNIT AWARD AGREEMENT \n", + "ACCEPTANCE AND ACKNOWLEDGMENT \n", + "GLOBAL RESTRICTED STOCK UNIT AWARD AGREEMENT \n", + "LIST OF SIGNIFICANT SUBSIDIARIES \n", + "Consent of Independent Registered Public Accounting Firm \n", + "CERTIFICATIONS \n", + "CERTIFICATIONS \n", + "Certification Pursuant to 18 U.S.C. Section 1350 \n", + "Certification Pursuant to 18 U.S.C. 
Section 1350 \n", + "CLAWBACK POLICY \n" + ] + } + ], + "source": [ + "# List the names of the title sections the document was split into\n", + "for x in chunk_header_mapping:\n", + " if chunk_header_mapping[x]:\n", + " try:\n", + " title_pattern = re.compile(r'(.*?)(?:|$)', re.DOTALL) \n", + " title_match = re.search(title_pattern, chunk_header_mapping[x][0][0])\n", + " title_ = title_match.group(1) if title_match else \"\"\n", + " print(title_, end='\\n')\n", + " except Exception: # skip sections without a parsable title\n", + " continue" + ] + }, + { + "cell_type": "markdown", + "id": "8123dba4-c57c-4fe2-85f5-b17b4bf0d29b", + "metadata": {}, + "source": [ + "Upload the section contents (title and headers for each chunk) to Amazon S3." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "445e5827-3821-4c59-9e02-64c06db499e4", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "with open (f\"{doc_id}.json\", \"w\") as f:\n", + " json.dump(chunk_header_mapping,f)\n", + "s3.upload_file(f\"{doc_id}.json\", BUCKET, f\"{doc_id}.json\")" + ] + }, + { + "cell_type": "markdown", + "id": "fec2ac60-137c-4679-a0de-446e9f531008", + "metadata": {}, + "source": [ + "## Indexing" + ] + }, + { + "cell_type": "markdown", + "id": "6b5f93e9-4813-4b4e-a040-87e2370ec64b", + "metadata": {}, + "source": [ + "\n", + "Here's a sample script for indexing document chunks into an [Amazon OpenSearch Serverless](https://aws.amazon.com/blogs/big-data/introducing-the-vector-engine-for-amazon-opensearch-serverless-now-in-preview/) collection.\n", + "\n", + "This code block creates an index in Amazon OpenSearch (a provisioned Service domain or a Serverless collection) and then indexes the document chunks. The index mapping incorporates metadata fields such as: \n", + "- document name, \n", + "- complete chunk tables, \n", + "- header section IDs, \n", + "- complete chunk list,\n", + "- and title section IDs. 
\n", + "\n", + "However, you can decide which metadata is useful to keep as part of the index.\n", + "\n", + "The section IDs are used to map the corresponding passages (chunks) to their higher-level content, namely the complete section header and the complete section title that each chunk belongs to. This is an advanced retrieval technique called **Small-to-Big**, where child chunks (passages) are used to retrieve parent chunks (section headers or titles) in situations where more context is needed and the flow of information must be preserved. These parent chunks can be stored in the same OpenSearch domain or in a different storage system. This implementation uses Amazon S3 for storing the parent chunks.\n", + "\n", + "The complete table and list associated with each chunk are also indexed as additional metadata. This gives flexibility at retrieval time: when a retrieved passage contains only part of a table or list, the complete element can be supplied as extra context.\n", + "\n", + "We use an embedding model to generate embeddings and subsequently index them. This implementation uses `Amazon Titan Embedding`. The output vector size of the Titan embedding model used here is 1024; however, Amazon Titan v2 also supports output vector sizes of 256 and 512, which reduce the size of your OpenSearch index with a minimal reduction in accuracy.\n", + "\n", + "**Note:** Certain chunks may exceed the word threshold set for chunking in the previous cells because tables are chunked by row and section paragraphs vary in size. This might result in a token-limit-exceeded error for certain embedding models.\n", + "\n", + "Make sure to replace the **domain_endpoint** variable with the endpoint of the Amazon OpenSearch Service domain (2.11 and higher) or Serverless collection you created in your account.\n", + "\n", + "If using Amazon OpenSearch Serverless, set `openserach_serverless` to True." 
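As a rough illustration of the Small-to-Big lookup described above, the sketch below resolves a retrieved child chunk back to its parent section. It assumes the hit carries the `section_title_ids` and `section_header_ids` fields from the index mapping, and that the parent mapping is the JSON written from `chunk_header_mapping`. The helper name `fetch_parent_section` and the toy data are hypothetical and not part of this notebook.

```python
import json


def fetch_parent_section(parent_mapping, hit_source):
    """Return the parent section lines for one retrieved child chunk.

    Hypothetical helper: `parent_mapping` is the dict loaded from the JSON
    dump of chunk_header_mapping, and `hit_source` is the stored metadata
    of one indexed passage.
    """
    # json.dump turns integer dict keys into strings, so convert the IDs.
    title_key = str(hit_source["section_title_ids"])
    header_key = str(hit_source["section_header_ids"])
    return parent_mapping.get(title_key, {}).get(header_key, [])


# Toy stand-in for the {doc_id}.json file (assumed shape, made-up content).
toy_mapping = json.loads(json.dumps({
    0: {0: ["Overview", "First paragraph.", "Second paragraph."]},
    1: {0: ["Cash Flows", "A table chunk."], 1: ["Cash Flows", "Another chunk."]},
}))

# Toy stand-in for the metadata stored alongside an indexed passage.
toy_hit = {"passage": "A table chunk.", "section_title_ids": 1, "section_header_ids": 0}

parent_lines = fetch_parent_section(toy_mapping, toy_hit)
print("\n".join(parent_lines))
```

In this notebook's setting the mapping would be loaded from the S3 object uploaded earlier rather than built inline. Looking up by string keys matters because the integer IDs used during chunking do not survive the JSON round trip as integers.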
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7f459d61-102c-44c4-8b65-245b74c5d883", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth\n", + "from requests_aws4auth import AWS4Auth\n", + "\n", + "\"\"\"\n", + "This script demonstrates indexing documents into an Amazon OpenSearch Serverless collection using AWS Identity and Access Management (IAM) for authentication.\n", + "\"\"\"\n", + "service = 'aoss'\n", + "# replace with your OpenSearch Service domain/Serverless endpoint\n", + "domain_endpoint = host\n", + "\n", + "credentials = boto3.Session().get_credentials()\n", + "awsauth = AWSV4SignerAuth(credentials, \"us-east-1\", service)\n", + "os_ = OpenSearch(\n", + " hosts = [{'host': domain_endpoint, 'port': 443}],\n", + " http_auth = awsauth,\n", + " use_ssl = True,\n", + " verify_certs = True,\n", + " timeout=120, \n", + " # http_compress = True, # enables gzip compression for request bodies\n", + " connection_class = RequestsHttpConnection\n", + ")\n", + "\n", + "# Sample OpenSearch domain index mapping\n", + "mapping = {\n", + " 'settings': {\n", + " 'index': { \n", + " 'knn': True,\n", + " \"knn.algo_param.ef_search\": 100, \n", + " }\n", + " },\n", + "\n", + " 'mappings': { \n", + " 'properties': {\n", + " 'embedding': {\n", + " 'type': 'knn_vector', \n", + " 'dimension':1024, #change to match the output vector size of the embedding model\n", + " \"method\": {\n", + " \"name\": \"hnsw\", \n", + " \"space_type\": \"cosinesimil\",\n", + " \"engine\": \"nmslib\",\n", + " \"parameters\": {\n", + " \"ef_construction\": 256,\n", + " \"m\": 48\n", + " }\n", + " }\n", + " },\n", + "\n", + " 'passage': {\n", + " 'type': 'text'\n", + " },\n", + "\n", + " 'doc_id': {\n", + " 'type': 'keyword'\n", + " },\n", + " \n", + " 'table': {\n", + " 'type': 'text'\n", + " },\n", + " \n", + " 'list': {\n", + " 'type': 'text'\n", + " },\n", + " 'section_header_ids': {\n", + " 
'type': 'text'\n", + " },\n", + " 'section_title_ids': {\n", + " 'type': 'text'\n", + " },\n", + "\n", + " }\n", + " }\n", + " }\n", + "\n", + "domain_index =f\"test-index\" #domain index name\n", + "\n", + "if not os_.indices.exists(index=domain_index): \n", + " os_.indices.create(index=domain_index, body=mapping)\n", + " # Verify that the index has been created\n", + " if os_.indices.exists(index=domain_index):\n", + " print(f\"Index {domain_index} created successfully.\")\n", + " else:\n", + " print(f\"Failed to create index '{domain_index}'.\")\n", + "else:\n", + " print(f'{domain_index} Index already exists!')\n", + "\n", + "i = 1\n", + "SAGEMAKER=boto3.client('sagemaker-runtime')\n", + "for ids, chunkks in chunks.items(): # Iterate through the page title chunks \n", + " index_adjuster=len(chunk_header_mapping[ids])%len(chunkks)\n", + " for chunk_ids,chunk in enumerate(chunkks): # iterating through section header chunks \n", + " chunk_ids+=index_adjuster\n", + " passage_chunk=\"\\n\".join(chunk).replace(\"\",\"\").replace(\"\",\"\")\n", + " if passage_chunk.strip():\n", + " embedding=_get_emb_(passage_chunk) \n", + " table=[]\n", + " if ids in table_header_dict:\n", + " if [x for x in table_header_dict[ids] if x ==chunk_ids]: \n", + " table=\"\\n\".join(table_header_dict[ids][chunk_ids])\n", + " lists=[]\n", + " if ids in list_header_dict:\n", + " if [x for x in list_header_dict[ids] if x ==chunk_ids]: \n", + " lists=\"\\n\".join(list_header_dict[ids][chunk_ids])\n", + " documentt = { \n", + " 'doc_id':doc_id, #doc name \n", + " 'passage': passage_chunk,\n", + " 'embedding': embedding,\n", + " 'table':table,\n", + " \"list\":lists, \n", + " \"section_header_ids\":chunk_ids, #Store id of the header section\n", + " \"section_title_ids\":ids #Store id of the title section\n", + " }\n", + "\n", + " try:\n", + " response = os_.index(index=domain_index, body=documentt)\n", + " i += 1\n", + " # Check the response to see if the indexing was successful\n", + " if 
response[\"result\"] == \"created\":\n", + " print(f\"Document indexed successfully with ID: {response['_id']}\")\n", + " else:\n", + " print(\"Failed to index document.\")\n", + " except RequestError as e:\n", + " logging.error(f\"Error indexing document to index '{domain_index}': {e}\")\n", + " else:\n", + " continue " + ] + }, + { + "cell_type": "markdown", + "id": "bfb7c3cc-beb5-48e2-abaa-66b46c5e8373", + "metadata": {}, + "source": [ + "# RAG" + ] + }, + { + "cell_type": "markdown", + "id": "c16e8791-8b4f-4999-b42a-b89d3a7d3d01", + "metadata": {}, + "source": [ + "Custom approach to combine lexical (keyword) and semantic search results from OpenSearch Serverless, as hybrid search is not natively integrated in the service." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "beb35407-0b7b-4858-97dc-4c4b1baa4e27", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "def normalize_scores_(scores,normalizer):\n", + " \"\"\"\n", + " Normalize scores using L2/min-max normalization.\n", + " :param scores: The list of scores to normalize.\n", + " :param normalizer: The normalization technique, either \"minmax\" or \"l2\".\n", + " :return: The normalized scores.\n", + " \"\"\"\n", + " if \"minmax\" in normalizer:\n", + " scores = np.array(scores)\n", + " return (scores - np.min(scores)) / (np.max(scores) - np.min(scores))\n", + " elif \"l2\" in normalizer:\n", + " scores = np.array(scores)\n", + " return scores / np.linalg.norm(scores)\n", + " else:\n", + " raise ValueError(\"enter either minmax or l2 as normalizer\")\n", + " \n", + "def interpolate_scores(lexical_score, semantic_score, alpha=0.5):\n", + " \"\"\"\n", + " Interpolate lexical and semantic scores using a weighted sum.\n", + " :param lexical_score: The normalized score from the lexical search.\n", + " :param semantic_score: The normalized score from the semantic search.\n", + " :param alpha: The interpolation weight (default: 0.5).\n", + " :return: The interpolated score.\n", + " \"\"\"\n", + " return alpha * 
lexical_score + (1 - alpha) * semantic_score\n", + "\n", + "def reciprocal_rank_fusion(lexical_results, semantic_results, k=60):\n", + " \"\"\"\n", + " Combine lexical and semantic search results using Reciprocal Rank Fusion (RRF).\n", + " :param lexical_results: The results from the lexical search.\n", + " :param semantic_results: The results from the semantic search.\n", + " :param k: The parameter for RRF (default: 60).\n", + " :return: The combined search results.\n", + " \"\"\"\n", + " combined_results = {}\n", + "\n", + " # RRF scores a document by its rank position in each result list (1 = best), not by its raw score.\n", + " # Hits are assumed to be returned sorted by descending score.\n", + " for rank, hit in enumerate(lexical_results['hits']['hits'], start=1):\n", + " doc_id = hit['_id']\n", + " if doc_id not in combined_results:\n", + " combined_results[doc_id] = {'_id': doc_id, '_source': hit['_source'], '_score': 0}\n", + " combined_results[doc_id]['_score'] += 1 / (k + rank)\n", + "\n", + " for rank, hit in enumerate(semantic_results['hits']['hits'], start=1):\n", + " doc_id = hit['_id']\n", + " if doc_id not in combined_results:\n", + " combined_results[doc_id] = {'_id': doc_id, '_source': hit['_source'], '_score': 0}\n", + " combined_results[doc_id]['_score'] += 1 / (k + rank)\n", + "\n", + " combined_results = list(combined_results.values())\n", + " combined_results = sorted(combined_results, key=lambda x: x['_score'], reverse=True)\n", + "\n", + " return {'hits': {'hits': combined_results}}\n", + "\n", + "def hybrid_search(top_K_results,lexical_results, semantic_results, interpolation_weight=0.5, normalizer=\"minmax\",use_rrf=False, rrf_k=60):\n", + " \"\"\"\n", + " Perform hybrid search by combining lexical and semantic search results.\n", + " :param top_K_results: The maximum number of combined results to return (applied on the score-interpolation path).\n", + " :param lexical_results: The results from the lexical search.\n", + " :param semantic_results: The results from the semantic search.\n", + " :param interpolation_weight: The interpolation weight for score interpolation.\n", + " :param normalizer: The normalization function (default: minmax normalization).\n", + " :param use_rrf: If True, combine results with Reciprocal Rank Fusion instead of score interpolation.\n", + " :param rrf_k: The k parameter passed to RRF (default: 60).\n", + " :return: The combined search results.\n", + " \"\"\"\n", + " \n", + " if use_rrf:\n", + " return 
reciprocal_rank_fusion(lexical_results, semantic_results, k=rrf_k)\n", + " \n", + " combined_results = []\n", + "\n", + " # Normalize the scores from lexical and semantic searches\n", + " lexical_scores = [hit['_score'] for hit in lexical_results['hits']['hits']]\n", + " semantic_scores = [hit['_score'] for hit in semantic_results['hits']['hits']]\n", + " normalized_lexical_scores = normalize_scores_(lexical_scores,normalizer)\n", + " normalized_semantic_scores = normalize_scores_(semantic_scores,normalizer)\n", + "\n", + " # Combine the results based on document IDs\n", + " lexical_docs = {hit['_id']: (hit, score) for hit, score in zip(lexical_results['hits']['hits'], normalized_lexical_scores)}\n", + " semantic_docs = {hit['_id']: (hit, score) for hit, score in zip(semantic_results['hits']['hits'], normalized_semantic_scores)}\n", + "\n", + " for doc_id in set(lexical_docs.keys()) | set(semantic_docs.keys()):\n", + " lexical_hit, lexical_score = lexical_docs.get(doc_id, (None, 0))\n", + " semantic_hit, semantic_score = semantic_docs.get(doc_id, (None, 0))\n", + "\n", + " if lexical_hit and semantic_hit:\n", + " # Interpolate scores if both lexical and semantic results are available\n", + " interpolated_score = interpolate_scores(lexical_score, semantic_score, interpolation_weight) \n", + " combined_hit = {\n", + " '_id': doc_id,\n", + " '_source': {**lexical_hit['_source']},\n", + " '_score': interpolated_score, \n", + " }\n", + " elif lexical_hit:\n", + " # Use lexical hit if only lexical result is available\n", + " combined_hit = {\n", + " '_id': doc_id,\n", + " '_source': lexical_hit['_source'],\n", + " '_score': lexical_score\n", + " }\n", + " else:\n", + " # Use semantic hit if only semantic result is available\n", + " combined_hit = {\n", + " '_id': doc_id,\n", + " '_source': semantic_hit['_source'],\n", + " '_score': semantic_score\n", + " }\n", + " combined_results.append(combined_hit)\n", + " # Sort the combined results by the blended score\n", + " 
combined_results = sorted(combined_results, key=lambda hit: hit['_score'], reverse=True)\n", + " return {'hits': {'hits': combined_results[:top_K_results]}}" + ] + }, + { + "cell_type": "markdown", + "id": "022372f4-4e51-44aa-b5aa-27042cf38885", + "metadata": { + "tags": [] + }, + "source": [ + "#### HYBRID SEARCH AND FILTER" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "id": "ab3f7c64-1ddb-4c4a-a050-67669632790b", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[{'_id': '1%3A0%3AQDNbHJABQBU2LhBAgZnN',\n", + " '_source': {'section_header_ids': 27,\n", + " 'section_title_ids': 4,\n", + " 'passage': 'Net Sales\\nNet sales include product and service sales. Product sales represent revenue from the sale of products and related shipping fees and digital media content where we record revenue gross. Service sales primarily represent third-party seller fees, which includes commissions and any related fulfillment and shipping fees, AWS sales, advertising services, Amazon Prime membership fees, and certain digital media content subscriptions. Net sales information is as follows (in millions):\\n|Year Ended December 31,|Year Ended December 31,\\nNorth America|13 %|12%\\nInternational|4|11\\nAWS|29|13\\nConsolidated|13|12\\nNet Sales Mix:|Net Sales Mix:|Net Sales Mix:\\nNorth America|61 %|61 %\\nInternational|23|23\\nAWS|16|16\\nConsolidated|100 %|100 %\\nSales increased 12% in 2023, compared to the prior year. Changes in foreign exchange rates reduced net sales by $71 million in 2023. For a discussion of the effect of foreign exchange rates on sales growth, see \"Effect of Foreign Exchange Rates\" below.\\nNorth America sales increased 12% in 2023, compared to the prior year. The sales growth primarily reflects increased unit sales, primarily by third-party sellers, advertising sales, and subscription services. 
Increased unit sales were driven largely by our continued focus on price, selection, and convenience for our customers, including from our shipping offers.',\n", + " 'list': [],\n", + " 'doc_id': 'amazon-2024-10k.pdf',\n", + " 'table': 'Net sales include product and service sales. Product sales represent revenue from the sale of products and related shipping fees and digital media content where we record revenue gross. Service sales primarily represent third-party seller fees, which includes commissions and any related fulfillment and shipping fees, AWS sales, advertising services, Amazon Prime membership fees, and certain digital media content subscriptions. Net sales information is as follows (in millions):\\n|Year Ended December 31,|Year Ended December 31,\\n|2022|2023\\nNet Sales:|Net Sales:|Net Sales:\\nNorth America|$ 315,880|$ 352,828\\nInternational|118,007|131,200\\nAWS|80,096|90,757\\nConsolidated|$ 513,983|$ 574,785\\nYear-over-year Percentage Growth (Decline):|Year-over-year Percentage Growth (Decline):|Year-over-year Percentage Growth (Decline):\\nNorth America|13%|12%\\nInternational|(8)|11\\nAWS|29|13\\nConsolidated|9|12\\nYear-over-year Percentage Growth, excluding the effect of foreign exchange rates:|Year-over-year Percentage Growth, excluding the effect of foreign exchange rates:|Year-over-year Percentage Growth, excluding the effect of foreign exchange rates:\\nNorth America|13 %|12%\\nInternational|4|11\\nAWS|29|13\\nConsolidated|13|12\\nNet Sales Mix:|Net Sales Mix:|Net Sales Mix:\\nNorth America|61 %|61 %\\nInternational|23|23\\nAWS|16|16\\nConsolidated|100 %|100 %\\n'},\n", + " '_score': 1.0},\n", + " {'_id': '1%3A0%3At-lbHJABdP9x6o9wgPm7',\n", + " '_source': {'section_header_ids': 26,\n", + " 'section_title_ids': 4,\n", + " 'passage': 'Net Sales\\nNet sales include product and service sales. Product sales represent revenue from the sale of products and related shipping fees and digital media content where we record revenue gross. 
Service sales primarily represent third-party seller fees, which includes commissions and any related fulfillment and shipping fees, AWS sales, advertising services, Amazon Prime membership fees, and certain digital media content subscriptions. Net sales information is as follows (in millions):\\n|Year Ended December 31,|Year Ended December 31,\\n|2022|2023\\nNet Sales:|Net Sales:|Net Sales:\\nNorth America|$ 315,880|$ 352,828\\nInternational|118,007|131,200\\nAWS|80,096|90,757\\nConsolidated|$ 513,983|$ 574,785\\nYear-over-year Percentage Growth (Decline):|Year-over-year Percentage Growth (Decline):|Year-over-year Percentage Growth (Decline):\\nNorth America|13%|12%\\nInternational|(8)|11\\nAWS|29|13\\nConsolidated|9|12\\nYear-over-year Percentage Growth, excluding the effect of foreign exchange rates:|Year-over-year Percentage Growth, excluding the effect of foreign exchange rates:|Year-over-year Percentage Growth, excluding the effect of foreign exchange rates:',\n", + " 'list': [],\n", + " 'doc_id': 'amazon-2024-10k.pdf',\n", + " 'table': 'Net sales include product and service sales. Product sales represent revenue from the sale of products and related shipping fees and digital media content where we record revenue gross. Service sales primarily represent third-party seller fees, which includes commissions and any related fulfillment and shipping fees, AWS sales, advertising services, Amazon Prime membership fees, and certain digital media content subscriptions. 
Net sales information is as follows (in millions):\\n|Year Ended December 31,|Year Ended December 31,\\n|2022|2023\\nNet Sales:|Net Sales:|Net Sales:\\nNorth America|$ 315,880|$ 352,828\\nInternational|118,007|131,200\\nAWS|80,096|90,757\\nConsolidated|$ 513,983|$ 574,785\\nYear-over-year Percentage Growth (Decline):|Year-over-year Percentage Growth (Decline):|Year-over-year Percentage Growth (Decline):\\nNorth America|13%|12%\\nInternational|(8)|11\\nAWS|29|13\\nConsolidated|9|12\\nYear-over-year Percentage Growth, excluding the effect of foreign exchange rates:|Year-over-year Percentage Growth, excluding the effect of foreign exchange rates:|Year-over-year Percentage Growth, excluding the effect of foreign exchange rates:\\nNorth America|13 %|12%\\nInternational|4|11\\nAWS|29|13\\nConsolidated|13|12\\nNet Sales Mix:|Net Sales Mix:|Net Sales Mix:\\nNorth America|61 %|61 %\\nInternational|23|23\\nAWS|16|16\\nConsolidated|100 %|100 %\\n'},\n", + " '_score': 0.8225578991974759},\n", + " {'_id': '1%3A0%3AnDNcHJABQBU2LhBAipng',\n", + " '_source': {'section_header_ids': 67,\n", + " 'section_title_ids': 14,\n", + " 'passage': 'International\\n(3) Includes commissions and any related fulfillment and shipping fees, and other third-party seller services. \\n(4) Includes sales of advertising services to sellers, vendors, publishers, authors, and others, through programs such as sponsored ads, display, and video advertising. \\n(5) Includes annual and monthly fees associated with Amazon Prime memberships, as well as digital video, audiobook, digital music, e-book, and other non- AWS subscription services. \\n(6) Includes sales related to various other offerings, such as certain licensing and distribution of video content, health care services, and shipping services, and our co-branded credit card agreements.\\nNet sales are attributed to countries primarily based on country-focused online and physical stores or, for AWS purposes, the selling entity. 
Net sales attributed to countries that represent a significant portion of consolidated net sales are as follows (in millions):\\n|Year Ended December 31,|Year Ended December 31,|Year Ended December 31,\\n|2021|2022|2023\\nUnited States|$ 314,006|$ 356,113|$ 395,637\\nGermany|37,326|33,598|37,588\\nUnited Kingdom|31,914|30,074|33,591\\nJapan|23,071|24,396|26,002',\n", + " 'list': '\\n(1) Includes product sales and digital media content where we record revenue gross. We leverage our retail infrastructure to offer a wide selection of consumable and durable goods that includes media products available in both a physical and digital format, such as books, videos, games, music, and software. These product sales include digital products sold on a transactional basis. Digital media content subscriptions that provide unlimited viewing or usage rights are included in \"Subscription services.\" \\n(2) Includes product sales where our customers physically select items in a store. Sales to customers who order goods online for delivery or pickup at our physical stores are included in \"Online stores.\" \\n(3) Includes commissions and any related fulfillment and shipping fees, and other third-party seller services. \\n(4) Includes sales of advertising services to sellers, vendors, publishers, authors, and others, through programs such as sponsored ads, display, and video advertising. \\n(5) Includes annual and monthly fees associated with Amazon Prime memberships, as well as digital video, audiobook, digital music, e-book, and other non- AWS subscription services. \\n(6) Includes sales related to various other offerings, such as certain licensing and distribution of video content, health care services, and shipping services, and our co-branded credit card agreements.',\n", + " 'doc_id': 'amazon-2024-10k.pdf',\n", + " 'table': 'Net sales are attributed to countries primarily based on country-focused online and physical stores or, for AWS purposes, the selling entity. 
Net sales attributed to countries that represent a significant portion of consolidated net sales are as follows (in millions):\\n|Year Ended December 31,|Year Ended December 31,|Year Ended December 31,\\n|2021|2022|2023\\nUnited States|$ 314,006|$ 356,113|$ 395,637\\nGermany|37,326|33,598|37,588\\nUnited Kingdom|31,914|30,074|33,591\\nJapan|23,071|24,396|26,002\\nRest of world|63,505|69,802|81,967\\nConsolidated|$ 469,822|$ 513,983|$ 574,785\\n'},\n", + " '_score': 0.7540932873641547},\n", + " {'_id': '1%3A0%3A5OlcHJABdP9x6o9wEPlI',\n", + " '_source': {'section_header_ids': 32,\n", + " 'section_title_ids': 13,\n", + " 'passage': \"Inventories\\nInventories, consisting of products available for sale, are primarily accounted for using the first-in, first-out method, and are valued at the lower of cost and net realizable value. This valuation requires us to make judgments, based on currently available information, about the likely method of disposition, such as through sales to individual customers, returns to product vendors, or liquidations, and expected recoverable values of each disposition category. The inventory valuation allowance, representing a write-down of inventory, was $2.8 billion and $3.0 billion as of December 31, 2022 and 2023.\\nWe provide Fulfillment by Amazon services in connection with certain of our sellers' programs. 
Third-party sellers maintain ownership of their inventory, regardless of whether fulfillment is provided by us or the third-party sellers, and therefore these products are not included in our inventories.\",\n", + " 'list': [],\n", + " 'doc_id': 'amazon-2024-10k.pdf',\n", + " 'table': []},\n", + " '_score': 0.5880444787362546},\n", + " {'_id': '1%3A0%3AE-lcHJABdP9x6o9wifrR',\n", + " '_source': {'section_header_ids': 66,\n", + " 'section_title_ids': 14,\n", + " 'passage': 'International\\nNet sales by groups of similar products and services, which also have similar economic characteristics, is as follows (in millions):\\n|Year Ended December 31,|Year Ended December 31,|Year Ended December 31,\\n|2021|2022|2023\\nNet Sales:|Net Sales:|Net Sales:|Net Sales:\\nOnline stores (1)|$ 222,075|$ 220,004|$ 231,872\\nPhysical stores (2)|17,075|18,963|20,030\\nThird-party seller services (3)|103,366|117,716|140,053\\nAdvertising services (4)|31,160|37,739|46,906\\nSubscription services (5)|31,768|35,218|40,209\\nAWS|62,202|80,096|90,757\\nOther (6)|2,176|4,247|4,958\\nConsolidated|$ 469,822|$ 513,983|$ 574,785\\n(1) Includes product sales and digital media content where we record revenue gross. We leverage our retail infrastructure to offer a wide selection of consumable and durable goods that includes media products available in both a physical and digital format, such as books, videos, games, music, and software. These product sales include digital products sold on a transactional basis. Digital media content subscriptions that provide unlimited viewing or usage rights are included in \"Subscription services.\" \\n(2) Includes product sales where our customers physically select items in a store. Sales to customers who order goods online for delivery or pickup at our physical stores are included in \"Online stores.\" ',\n", + " 'list': '\\n(1) Includes product sales and digital media content where we record revenue gross. 
We leverage our retail infrastructure to offer a wide selection of consumable and durable goods that includes media products available in both a physical and digital format, such as books, videos, games, music, and software. These product sales include digital products sold on a transactional basis. Digital media content subscriptions that provide unlimited viewing or usage rights are included in \"Subscription services.\" \\n(2) Includes product sales where our customers physically select items in a store. Sales to customers who order goods online for delivery or pickup at our physical stores are included in \"Online stores.\" \\n(3) Includes commissions and any related fulfillment and shipping fees, and other third-party seller services. \\n(4) Includes sales of advertising services to sellers, vendors, publishers, authors, and others, through programs such as sponsored ads, display, and video advertising. \\n(5) Includes annual and monthly fees associated with Amazon Prime memberships, as well as digital video, audiobook, digital music, e-book, and other non- AWS subscription services. 
\\n(6) Includes sales related to various other offerings, such as certain licensing and distribution of video content, health care services, and shipping services, and our co-branded credit card agreements.',\n", + " 'doc_id': 'amazon-2024-10k.pdf',\n", + " 'table': 'Net sales by groups of similar products and services, which also have similar economic characteristics, is as follows (in millions):\\n|Year Ended December 31,|Year Ended December 31,|Year Ended December 31,\\n|2021|2022|2023\\nNet Sales:|Net Sales:|Net Sales:|Net Sales:\\nOnline stores (1)|$ 222,075|$ 220,004|$ 231,872\\nPhysical stores (2)|17,075|18,963|20,030\\nThird-party seller services (3)|103,366|117,716|140,053\\nAdvertising services (4)|31,160|37,739|46,906\\nSubscription services (5)|31,768|35,218|40,209\\nAWS|62,202|80,096|90,757\\nOther (6)|2,176|4,247|4,958\\nConsolidated|$ 469,822|$ 513,983|$ 574,785\\n'},\n", + " '_score': 0.5217161270772315},\n", + " {'_id': '1%3A0%3AijNcHJABQBU2LhBAXZnG',\n", + " '_source': {'section_header_ids': 31,\n", + " 'section_title_ids': 14,\n", + " 'passage': 'Legal Proceedings\\nIn December 2018, Kove IO, Inc. filed a complaint against Amazon Web Services, Inc. in the United States District Court for the Northern District of Illinois. The complaint alleges, among other things, that Amazon S3 and DynamoDB infringe U.S. Patent Nos. 7,814,170 and 7,103,640, each entitled \"Network Distributed Tracking Wire Transfer Protocol\"; and 7,233,978, entitled \"Method and Apparatus for Managing Location Information in a Network Separate from the Data to Which the Location Information Pertains.\" The complaint seeks an unspecified amount of damages, enhanced damages, attorneys\\' fees, costs, interest, and injunctive relief. In March 2022, the case was stayed pending resolution of review petitions we filed with the United States Patent and Trademark Office. In November 2022, the stay was lifted. 
In July 2023, Kove alleged in its damages report that in the event of a finding of liability Amazon Web Services could be subject to $517 million to $1.03 billion in damages. We dispute the allegations of wrongdoing and intend to defend ourselves vigorously in this matter.',\n", + " 'list': [],\n", + " 'doc_id': 'amazon-2024-10k.pdf',\n", + " 'table': []},\n", + " '_score': 0.5211451616463562},\n", + " {'_id': '1%3A0%3AFOlcHJABdP9x6o9wjPoD',\n", + " '_source': {'section_header_ids': 68,\n", + " 'section_title_ids': 14,\n", + " 'passage': 'International\\nNet sales are attributed to countries primarily based on country-focused online and physical stores or, for AWS purposes, the selling entity. Net sales attributed to countries that represent a significant portion of consolidated net sales are as follows (in millions):\\n|Year Ended December 31,|Year Ended December 31,|Year Ended December 31,\\nRest of world|63,505|69,802|81,967\\nConsolidated|$ 469,822|$ 513,983|$ 574,785\\nTotal segment assets exclude corporate assets, such as cash and cash equivalents, marketable securities, other long-term investments, corporate facilities, goodwill and other acquired intangible assets, and tax assets. Technology infrastructure assets are allocated among the segments based on usage, with the majority allocated to the AWS segment. Total segment assets reconciled to consolidated amounts are as follows (in millions):\\n|December 31,|December 31,|December 31,\\n|2021|2022|2023\\nNorth America (1)|$ 161,255|$ 185,268|$ 196,029\\nInternational (1)|57,983|64,666|69,718\\nAWS (2)|63,835|88,491|108,533\\nCorporate|137,476|124,250|153,574\\nConsolidated|$ 420,549|$ 462,675|$ 527,854',\n", + " 'list': [],\n", + " 'doc_id': 'amazon-2024-10k.pdf',\n", + " 'table': 'Net sales are attributed to countries primarily based on country-focused online and physical stores or, for AWS purposes, the selling entity. 
Net sales attributed to countries that represent a significant portion of consolidated net sales are as follows (in millions):\\n|Year Ended December 31,|Year Ended December 31,|Year Ended December 31,\\n|2021|2022|2023\\nUnited States|$ 314,006|$ 356,113|$ 395,637\\nGermany|37,326|33,598|37,588\\nUnited Kingdom|31,914|30,074|33,591\\nJapan|23,071|24,396|26,002\\nRest of world|63,505|69,802|81,967\\nConsolidated|$ 469,822|$ 513,983|$ 574,785\\n\\nTotal segment assets exclude corporate assets, such as cash and cash equivalents, marketable securities, other long-term investments, corporate facilities, goodwill and other acquired intangible assets, and tax assets. Technology infrastructure assets are allocated among the segments based on usage, with the majority allocated to the AWS segment. Total segment assets reconciled to consolidated amounts are as follows (in millions):\\n|December 31,|December 31,|December 31,\\n|2021|2022|2023\\nNorth America (1)|$ 161,255|$ 185,268|$ 196,029\\nInternational (1)|57,983|64,666|69,718\\nAWS (2)|63,835|88,491|108,533\\nCorporate|137,476|124,250|153,574\\nConsolidated|$ 420,549|$ 462,675|$ 527,854\\n'},\n", + " '_score': 0.4485040611146583},\n", + " {'_id': '1%3A0%3AxulbHJABdP9x6o9ws_n3',\n", + " '_source': {'section_header_ids': 3,\n", + " 'section_title_ids': 5,\n", + " 'passage': 'Guidance\\nWe provided guidance on February 1, 2024, in our earnings release furnished on Form 8-K as set forth below. These forward-looking statements reflect Amazon.com\\'s expectations as of February 1, 2024, and are subject to substantial uncertainty. 
Our results are inherently unpredictable and may be materially affected by many factors, such as fluctuations in foreign exchange rates, changes in global economic and geopolitical conditions and customer demand and spending (including the impact of recessionary fears), inflation, interest rates, regional labor market constraints, world events, the rate of growth of the internet, online commerce, cloud services, and new and emerging technologies, as well as those outlined in Item 1A of Part I, \"Risk Factors.\"\\nFirst Quarter 2024 Guidance:\\nNet sales are expected to be between $138.0 billion and $143.5 billion, or to grow between 8% and 13% compared with first quarter 2023. This guidance anticipates a favorable impact of approximately 40 basis points from foreign exchange rates. \\nOperating income is expected to be between $8.0 billion and $12.0 billion, compared with $4.8 billion in first quarter 2023. This guidance includes approximately $0.9 billion lower depreciation expense due to an increase in the estimated useful life of our servers beginning on January 1, 2024. ',\n", + " 'list': 'First Quarter 2024 Guidance\\nNet sales are expected to be between $138.0 billion and $143.5 billion, or to grow between 8% and 13% compared with first quarter 2023. This guidance anticipates a favorable impact of approximately 40 basis points from foreign exchange rates. \\nOperating income is expected to be between $8.0 billion and $12.0 billion, compared with $4.8 billion in first quarter 2023. This guidance includes approximately $0.9 billion lower depreciation expense due to an increase in the estimated useful life of our servers beginning on January 1, 2024. 
\\nThis guidance assumes, among other things, that no additional business acquisitions, restructurings, or legal settlements are concluded.',\n", + " 'doc_id': 'amazon-2024-10k.pdf',\n", + " 'table': []},\n", + " '_score': 0.43465325177195013},\n", + " {'_id': '1%3A0%3ARDNbHJABQBU2LhBAj5ku',\n", + " '_source': {'section_header_ids': 35,\n", + " 'section_title_ids': 4,\n", + " 'passage': 'Fulfillment\\nFulfillment costs primarily consist of those costs incurred in operating and staffing our North America and International fulfillment centers, physical stores, and customer service centers and payment processing costs. While AWS payment processing and related transaction costs are included in \"Fulfillment,\" AWS costs are primarily classified as \"Technology and infrastructure.\" Fulfillment costs as a percentage of net sales may vary due to several factors, such as payment processing and related transaction costs, our level of productivity and accuracy, changes in volume, size, and weight of units received and\\nfulfilled, the extent to which third-party sellers utilize Fulfillment by Amazon services, timing of fulfillment network and physical store expansion, the extent we utilize fulfillment services provided by third parties, mix of products and services sold, and our ability to affect customer service contacts per unit by implementing improvements in our operations and enhancements to our customer self-service features. 
Additionally, sales by our sellers have higher payment processing and related transaction costs as a percentage of net sales compared to our retail sales because payment processing costs are based on the gross purchase price of underlying transactions.',\n", + " 'list': [],\n", + " 'doc_id': 'amazon-2024-10k.pdf',\n", + " 'table': []},\n", + " '_score': 0.40502606023551524},\n", + " {'_id': '1%3A0%3AuOlbHJABdP9x6o9wg_n0',\n", + " '_source': {'section_header_ids': 28,\n", + " 'section_title_ids': 4,\n", + " 'passage': 'Net Sales\\nInternational sales increased 11% in 2023, compared to the prior year. The sales growth primarily reflects increased unit sales, primarily by third-party sellers, advertising sales, and subscription services. Increased unit sales were driven largely by our continued focus on price, selection, and convenience for our customers, including from our shipping offers. Changes in foreign exchange rates increased International net sales by $88 million in 2023.\\nAWS sales increased 13% in 2023, compared to the prior year. 
The sales growth primarily reflects increased customer usage, partially offset by pricing changes, primarily driven by long-term customer contracts.',\n", + " 'list': [],\n", + " 'doc_id': 'amazon-2024-10k.pdf',\n", + " 'table': []},\n", + " '_score': 0.35945168243908654}]" + ] + }, + "execution_count": 75, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from opensearchpy import Transport\n", + "credentials = boto3.Session().get_credentials()\n", + "awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, \"us-east-1\", service, session_token=credentials.token)\n", + "transport = Transport(\n", + " hosts = [{'host': domain_endpoint, 'port': 443}],\n", + " http_auth = awsauth,\n", + " use_ssl = True,\n", + " verify_certs = True,\n", + " timeout=120, \n", + " # http_compress = True, # enables gzip compression for request bodies\n", + " connection_class = RequestsHttpConnection\n", + ")\n", + "question=\"Amazon net sales by geographical location and products in 2023?\"\n", + "embedding=_get_emb_(question)\n", + "# Top K returned results\n", + "top_K_results=10\n", + "# Define the search query\n", + "\n", + "search_requests = [\n", + " ({}, {\"query\": {\"match\": {\"passage\": question}}, \"size\": top_K_results, \"_source\": {\"exclude\": [\"embedding\"]}}),\n", + " ({}, {\"query\": {\"knn\": {\"embedding\": {\"vector\": embedding, \"k\": 3}}}, \"size\": top_K_results, \"_source\": {\"exclude\": [\"embedding\"]}})\n", + "]\n", + "\n", + "# Convert the search requests to NDJSON format\n", + "data = \"\"\n", + "for metadata, request in search_requests:\n", + " data += f\"{json.dumps(metadata)}\\n{json.dumps(request)}\\n\"\n", + "response = transport.perform_request(\"GET\", f\"/{domain_index}/_msearch\", body=data)\n", + "# Separate the results \n", + "lexical_search_results = response['responses'][0]\n", + "semantic_search_results = response['responses'][1]\n", + "# Use the custom hybrid search function\n", + "hybrid_results = 
hybrid_search(top_K_results,lexical_search_results, semantic_search_results, \n", + " interpolation_weight=0.5, normalizer=\"minmax\", use_rrf=False, rrf_k=100)\n", + "\n", + "# Use the combined hybrid results, or pass back only the lexical or only the semantic results instead\n", + "response= hybrid_results\n", + "response['hits']['hits']" + ] + }, + { + "cell_type": "markdown", + "id": "1a5f8a22-0afe-4c58-b3ab-c49843c16962", + "metadata": {}, + "source": [ + "For each matched passage chunk, you can choose how much hierarchical context to return: the chunk itself, its enclosing section header, or its enclosing section title. This gives flexibility in how much information is provided to the LLM and accommodates queries that benefit from access to the full section.\n", + "\n", + "This approach stores each chunk's hierarchical sections and titles in an S3 bucket. If more information than the passage chunk held in the OpenSearch index is needed, the corresponding section is retrieved using the section IDs indexed with the chunk." + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "id": "6c3a127e-74c4-4500-a48d-084d026108a4", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "def read_file_from_s3(bucket_name, key, section_content, title_id=None,section_id=None):\n", + " \"\"\"\n", + " Read a file from an S3 bucket and extract specific sections based on given parameters.\n", + " Parameters:\n", + " bucket_name (str): The name of the S3 bucket.\n", + " key (str): The key (path) of the file in the S3 bucket.\n", + " section_content (str): Specifies the type of section content to extract.\n", + " Possible values: \"section_header\" or \"section_title\".\n", + " title_id (str or int, optional): The ID of the title. Required for both \"section_header\" and \"section_title\".\n", + " section_id (str or int, optional): The ID of the section. 
Required if `section_content` is \"section_header\".\n", + "\n", + " Returns:\n", + " str or None: The extracted section content as a string. Returns None if there's an error.\n", + " \"\"\"\n", + " try: \n", + " response = s3.get_object(Bucket=bucket_name, Key=key) \n", + " file_content = response['Body'].read().decode('utf-8')\n", + " file_content=json.loads(file_content)\n", + " if section_content==\"section_header\":\n", + " passage=file_content[str(title_id)][str(section_id)]\n", + " elif section_content==\"section_title\": \n", + " # Join each sublist into a string\n", + " strings = [\"\\n\".join(sublist) for sublist in list(file_content[str(title_id)].values())]\n", + " # Drop duplicate strings while preserving their original order\n", + " passage = OrderedDict.fromkeys(strings)\n", + " return \"\\n\".join(passage)\n", + " except Exception as e:\n", + " print(f\"Error reading {key} from S3 bucket {bucket_name}:\", e)\n", + " return None\n", + "\n", + "class InvalidContentError(Exception):\n", + " pass\n", + "\n", + "# Extract relevant information from the search response\n", + "def content_extraction_os_(response:dict, table:bool, lyst:bool,section_content:str):\n", + " \"\"\"\n", + " Extracts content from the OpenSearch response based on specified parameters.\n", + "\n", + " Parameters:\n", + " response (dict): The response from OpenSearch containing search results.\n", + " table (bool): A boolean indicating whether to include table content.\n", + " lyst (bool): A boolean indicating whether to include list content.\n", + " section_content (str): The type of content to extract. Allowed values are 'passage', 'section_header', or 'section_title'.\n", + "\n", + " Returns:\n", + " tuple: (passages, tab, lst, passage, tables, lists): the concatenated passage, table, and list strings for the prompt, followed by the raw passages, tables, and lists.\n", + " \"\"\"\n", + " allowed_values = {\"passage\", \"section_header\", \"section_title\"} # Define allowed values\n", + " if section_content not in allowed_values:\n", + " raise InvalidContentError(f\"Invalid content type '{section_content}'. 
Allowed values are {', '.join(allowed_values)}.\")\n", + " \n", + " res=response['hits']['hits']\n", + " score = [str(x['_score']) for x in res] #retrieval score \n", + " # title_names = [x['_source']['title_headers'] for x in res] #doc page number of chunks\n", + " doc_name = [x['_source']['doc_id'] for x in res] # doc names\n", + " header_ids = [x['_source']['section_header_ids'] for x in res] # section header id\n", + " title_ids=[x['_source']['section_title_ids'] for x in res] # section title id\n", + " tables=\"\"\n", + " lists=\"\"\n", + " \n", + " if section_content==\"passage\":\n", + " passage = [x['_source'][\"passage\"] for x in res] #retrieved passages, here you can choose to retrieve the complete section header or title instead of the chunk passage\n", + " tables=[x['_source']['table'] for x in res] # tables in the corresponding chunk\n", + " lists=[x['_source']['list'] for x in res]\n", + " else:\n", + " passage=[]\n", + " for x in range(len(title_ids)):\n", + " passage.append(read_file_from_s3(BUCKET, f\"{doc_name[x]}.json\",section_content,title_ids[x],header_ids[x]))\n", + " passage=list(dict.fromkeys(t for t in passage if t is not None)) # drop failed reads and duplicates, keeping retrieval order\n", + " p = inflect.engine()\n", + " ## Concatenate passages, tables, and lists to use in the prompt template \n", + " passages=\"\"\n", + " tab=\"\"\n", + " lst=\"\"\n", + " for ids,text in enumerate(passage):\n", + " passages+=f\"<{p.ordinal(ids+1)}_passage>\\n{text}\\n\\n\"\n", + " if table and tables:\n", + " for ids,text in enumerate(tables): \n", + " tab+=f\"<{p.ordinal(ids+1)}_passage_table>\\n{text}\\n\\n\" #Table can be coupled with passage chunks to provide more information.\n", + " if lyst and lists:\n", + " for ids,text in enumerate(lists): \n", + " lst+=f\"<{p.ordinal(ids+1)}_passage_lists>\\n{text}\\n\\n\" \n", + " return passages, tab, lst,passage,tables,lists" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "id": "18ec4037-5b45-42ff-bda8-c5dc888b7c7e", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + 
"output_type": "stream", + "text": [ + "<1st_passage>\n", + "Net Sales\n", + "Net sales include product and service sales. Product sales represent revenue from the sale of products and related shipping fees and digital media content where we record revenue gross. Service sales primarily represent third-party seller fees, which includes commissions and any related fulfillment and shipping fees, AWS sales, advertising services, Amazon Prime membership fees, and certain digital media content subscriptions. Net sales information is as follows (in millions):\n", + "|Year Ended December 31,|Year Ended December 31,\n", + "North America|13 %|12%\n", + "International|4|11\n", + "AWS|29|13\n", + "Consolidated|13|12\n", + "Net Sales Mix:|Net Sales Mix:|Net Sales Mix:\n", + "North America|61 %|61 %\n", + "International|23|23\n", + "AWS|16|16\n", + "Consolidated|100 %|100 %\n", + "Sales increased 12% in 2023, compared to the prior year. Changes in foreign exchange rates reduced net sales by $71 million in 2023. For a discussion of the effect of foreign exchange rates on sales growth, see \"Effect of Foreign Exchange Rates\" below.\n", + "North America sales increased 12% in 2023, compared to the prior year. The sales growth primarily reflects increased unit sales, primarily by third-party sellers, advertising sales, and subscription services. Increased unit sales were driven largely by our continued focus on price, selection, and convenience for our customers, including from our shipping offers.\n", + "\n", + "<2nd_passage>\n", + "Net Sales\n", + "Net sales include product and service sales. Product sales represent revenue from the sale of products and related shipping fees and digital media content where we record revenue gross. Service sales primarily represent third-party seller fees, which includes commissions and any related fulfillment and shipping fees, AWS sales, advertising services, Amazon Prime membership fees, and certain digital media content subscriptions. 
Net sales information is as follows (in millions):\n", + "|Year Ended December 31,|Year Ended December 31,\n", + "|2022|2023\n", + "Net Sales:|Net Sales:|Net Sales:\n", + "North America|$ 315,880|$ 352,828\n", + "International|118,007|131,200\n", + "AWS|80,096|90,757\n", + "Consolidated|$ 513,983|$ 574,785\n", + "Year-over-year Percentage Growth (Decline):|Year-over-year Percentage Growth (Decline):|Year-over-year Percentage Growth (Decline):\n", + "North America|13%|12%\n", + "International|(8)|11\n", + "AWS|29|13\n", + "Consolidated|9|12\n", + "Year-over-year Percentage Growth, excluding the effect of foreign exchange rates:|Year-over-year Percentage Growth, excluding the effect of foreign exchange rates:|Year-over-year Percentage Growth, excluding the effect of foreign exchange rates:\n", + "\n", + "<3rd_passage>\n", + "International\n", + "(3) Includes commissions and any related fulfillment and shipping fees, and other third-party seller services. \n", + "(4) Includes sales of advertising services to sellers, vendors, publishers, authors, and others, through programs such as sponsored ads, display, and video advertising. \n", + "(5) Includes annual and monthly fees associated with Amazon Prime memberships, as well as digital video, audiobook, digital music, e-book, and other non- AWS subscription services. \n", + "(6) Includes sales related to various other offerings, such as certain licensing and distribution of video content, health care services, and shipping services, and our co-branded credit card agreements.\n", + "Net sales are attributed to countries primarily based on country-focused online and physical stores or, for AWS purposes, the selling entity. 
Net sales attributed to countries that represent a significant portion of consolidated net sales are as follows (in millions):\n", + "|Year Ended December 31,|Year Ended December 31,|Year Ended December 31,\n", + "|2021|2022|2023\n", + "United States|$ 314,006|$ 356,113|$ 395,637\n", + "Germany|37,326|33,598|37,588\n", + "United Kingdom|31,914|30,074|33,591\n", + "Japan|23,071|24,396|26,002\n", + "\n", + "<4th_passage>\n", + "Inventories\n", + "Inventories, consisting of products available for sale, are primarily accounted for using the first-in, first-out method, and are valued at the lower of cost and net realizable value. This valuation requires us to make judgments, based on currently available information, about the likely method of disposition, such as through sales to individual customers, returns to product vendors, or liquidations, and expected recoverable values of each disposition category. The inventory valuation allowance, representing a write-down of inventory, was $2.8 billion and $3.0 billion as of December 31, 2022 and 2023.\n", + "We provide Fulfillment by Amazon services in connection with certain of our sellers' programs. 
Third-party sellers maintain ownership of their inventory, regardless of whether fulfillment is provided by us or the third-party sellers, and therefore these products are not included in our inventories.\n", + "\n", + "<5th_passage>\n", + "International\n", + "Net sales by groups of similar products and services, which also have similar economic characteristics, is as follows (in millions):\n", + "|Year Ended December 31,|Year Ended December 31,|Year Ended December 31,\n", + "|2021|2022|2023\n", + "Net Sales:|Net Sales:|Net Sales:|Net Sales:\n", + "Online stores (1)|$ 222,075|$ 220,004|$ 231,872\n", + "Physical stores (2)|17,075|18,963|20,030\n", + "Third-party seller services (3)|103,366|117,716|140,053\n", + "Advertising services (4)|31,160|37,739|46,906\n", + "Subscription services (5)|31,768|35,218|40,209\n", + "AWS|62,202|80,096|90,757\n", + "Other (6)|2,176|4,247|4,958\n", + "Consolidated|$ 469,822|$ 513,983|$ 574,785\n", + "(1) Includes product sales and digital media content where we record revenue gross. We leverage our retail infrastructure to offer a wide selection of consumable and durable goods that includes media products available in both a physical and digital format, such as books, videos, games, music, and software. These product sales include digital products sold on a transactional basis. Digital media content subscriptions that provide unlimited viewing or usage rights are included in \"Subscription services.\" \n", + "(2) Includes product sales where our customers physically select items in a store. Sales to customers who order goods online for delivery or pickup at our physical stores are included in \"Online stores.\" \n", + "\n", + "<6th_passage>\n", + "Legal Proceedings\n", + "In December 2018, Kove IO, Inc. filed a complaint against Amazon Web Services, Inc. in the United States District Court for the Northern District of Illinois. The complaint alleges, among other things, that Amazon S3 and DynamoDB infringe U.S. Patent Nos. 
7,814,170 and 7,103,640, each entitled \"Network Distributed Tracking Wire Transfer Protocol\"; and 7,233,978, entitled \"Method and Apparatus for Managing Location Information in a Network Separate from the Data to Which the Location Information Pertains.\" The complaint seeks an unspecified amount of damages, enhanced damages, attorneys' fees, costs, interest, and injunctive relief. In March 2022, the case was stayed pending resolution of review petitions we filed with the United States Patent and Trademark Office. In November 2022, the stay was lifted. In July 2023, Kove alleged in its damages report that in the event of a finding of liability Amazon Web Services could be subject to $517 million to $1.03 billion in damages. We dispute the allegations of wrongdoing and intend to defend ourselves vigorously in this matter.\n", + "\n", + "<7th_passage>\n", + "International\n", + "Net sales are attributed to countries primarily based on country-focused online and physical stores or, for AWS purposes, the selling entity. Net sales attributed to countries that represent a significant portion of consolidated net sales are as follows (in millions):\n", + "|Year Ended December 31,|Year Ended December 31,|Year Ended December 31,\n", + "Rest of world|63,505|69,802|81,967\n", + "Consolidated|$ 469,822|$ 513,983|$ 574,785\n", + "Total segment assets exclude corporate assets, such as cash and cash equivalents, marketable securities, other long-term investments, corporate facilities, goodwill and other acquired intangible assets, and tax assets. Technology infrastructure assets are allocated among the segments based on usage, with the majority allocated to the AWS segment. 
Total segment assets reconciled to consolidated amounts are as follows (in millions):\n", + "|December 31,|December 31,|December 31,\n", + "|2021|2022|2023\n", + "North America (1)|$ 161,255|$ 185,268|$ 196,029\n", + "International (1)|57,983|64,666|69,718\n", + "AWS (2)|63,835|88,491|108,533\n", + "Corporate|137,476|124,250|153,574\n", + "Consolidated|$ 420,549|$ 462,675|$ 527,854\n", + "\n", + "<8th_passage>\n", + "Guidance\n", + "We provided guidance on February 1, 2024, in our earnings release furnished on Form 8-K as set forth below. These forward-looking statements reflect Amazon.com's expectations as of February 1, 2024, and are subject to substantial uncertainty. Our results are inherently unpredictable and may be materially affected by many factors, such as fluctuations in foreign exchange rates, changes in global economic and geopolitical conditions and customer demand and spending (including the impact of recessionary fears), inflation, interest rates, regional labor market constraints, world events, the rate of growth of the internet, online commerce, cloud services, and new and emerging technologies, as well as those outlined in Item 1A of Part I, \"Risk Factors.\"\n", + "First Quarter 2024 Guidance:\n", + "Net sales are expected to be between $138.0 billion and $143.5 billion, or to grow between 8% and 13% compared with first quarter 2023. This guidance anticipates a favorable impact of approximately 40 basis points from foreign exchange rates. \n", + "Operating income is expected to be between $8.0 billion and $12.0 billion, compared with $4.8 billion in first quarter 2023. This guidance includes approximately $0.9 billion lower depreciation expense due to an increase in the estimated useful life of our servers beginning on January 1, 2024. 
\n", + "\n", + "<9th_passage>\n", + "Fulfillment\n", + "Fulfillment costs primarily consist of those costs incurred in operating and staffing our North America and International fulfillment centers, physical stores, and customer service centers and payment processing costs. While AWS payment processing and related transaction costs are included in \"Fulfillment,\" AWS costs are primarily classified as \"Technology and infrastructure.\" Fulfillment costs as a percentage of net sales may vary due to several factors, such as payment processing and related transaction costs, our level of productivity and accuracy, changes in volume, size, and weight of units received and\n", + "fulfilled, the extent to which third-party sellers utilize Fulfillment by Amazon services, timing of fulfillment network and physical store expansion, the extent we utilize fulfillment services provided by third parties, mix of products and services sold, and our ability to affect customer service contacts per unit by implementing improvements in our operations and enhancements to our customer self-service features. Additionally, sales by our sellers have higher payment processing and related transaction costs as a percentage of net sales compared to our retail sales because payment processing costs are based on the gross purchase price of underlying transactions.\n", + "\n", + "<10th_passage>\n", + "Net Sales\n", + "International sales increased 11% in 2023, compared to the prior year. The sales growth primarily reflects increased unit sales, primarily by third-party sellers, advertising sales, and subscription services. Increased unit sales were driven largely by our continued focus on price, selection, and convenience for our customers, including from our shipping offers. Changes in foreign exchange rates increased International net sales by $88 million in 2023.\n", + "AWS sales increased 13% in 2023, compared to the prior year. 
The sales growth primarily reflects increased customer usage, partially offset by pricing changes, primarily driven by long-term customer contracts.\n", + "\n", + "\n" + ] + } + ], + "source": [ + "passages,tab,lyst,passage,tables,lists=content_extraction_os_(response, False,False, \"passage\")\n", + "print(passages)" + ] + }, + { + "cell_type": "markdown", + "id": "5e5e4258-95a2-4000-97e1-12659f4eef3a", + "metadata": {}, + "source": [ + "## Bedrock Anthropic LLM Inference" + ] + }, + { + "cell_type": "markdown", + "id": "aeb1161d-5e69-4832-9d27-a8e417a2d4a8", + "metadata": {}, + "source": [ + "A prompt template is used with placeholders for the retrieved passages as `passages` under **document** tags and for any standalone tables found within those passages as `tab` under **additional_information** tags.\\\n", + "Set the `csv_seperator` variable to the separator that was used during chunking; the default is the pipe character \"|\".\\\n", + "An Anthropic Claude model (Claude 3 or Claude 2) is used to generate a response to the user question." + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "id": "edc0c8ab-5cf5-4200-829f-08c576db2d45", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " Size of prompt token is 2938\n" + ] + } + ], + "source": [ + "csv_seperator=\"|\"\n", + "prompt_template=f\"\"\"You are a helpful, obedient and truthful financial assistant.\n", + "\n", + "<document>\n", + "{passages}\n", + "</document>\n", + "\n", + "<additional_information>\n", + "{tab}\n", + "</additional_information>\n", + "\n", + "\n", + "When providing your response based on the document:\n", + "1. Understand the question to know what is being asked of you.\n", + "2. Review the entire document provided and check if it contains relevant information to answer the question. Only pay attention to passages with relevant information.\n", + "3. Any tables provided within the document or additional information are delimited by {csv_seperator} character.\n", + "4. 
If the document is sufficient to answer the question, provide a comprehensive answer ENTIRELY based on the document provided. DO NOT make up answers not present in the document.\n", + "5. If the answer is not available in the document, say so.\n", + "\n", + "\n", + "Question: {question}\n", + "if able to answer:\n", + " Include in your response before your answer: \n", + " document or additional info tag(s) containing the relevant info\"\"\"\n", + "\n", + "print(f' Size of prompt token is {client.count_tokens(prompt_template)}')" + ] + }, + { + "cell_type": "markdown", + "id": "742791f3-351d-4c07-a612-fb148ddc88c1", + "metadata": {}, + "source": [ + "The ground truth can be found on page 69 of the Amazon 2024 10-K document." + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "id": "a878e4c1-b90a-4853-9e09-073408bbc2ac", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "3rd_passage, 5th_passage, 7th_passage\n", + "\n", + "Based on the information provided in the document, here are the key details on Amazon's net sales by geographical location and products in 2023:\n", + "\n", + "Net sales by geographical location in 2023:\n", + "- United States: $395,637 million\n", + "- Germany: $37,588 million\n", + "- United Kingdom: $33,591 million \n", + "- Japan: $26,002 million\n", + "- Rest of world: $81,967 million\n", + "\n", + "Net sales by product and service categories in 2023:\n", + "- Online stores: $231,872 million\n", + "- Third-party seller services: $140,053 million\n", + "- AWS: $90,757 million\n", + "- Advertising services: $46,906 million\n", + "- Subscription services: $40,209 million\n", + "- Physical stores: $20,030 million\n", + "- Other: $4,958 million\n", + "Input Tokens: 3076\n", + "Output Tokens: 213\n" + ] + } + ], + "source": [ + "model_id=\"anthropic.claude-3-haiku-20240307-v1:0\" 
# options: \"anthropic.claude-3-sonnet-20240229-v1:0\", \"anthropic.claude-v2\", \"anthropic.claude-3-haiku-20240307-v1:0\"\n", + "model_response,input_tokens, output_tokens=_invoke_bedrock_with_retries([], \"\", prompt_template,model_id , [])" + ] + }, + { + "cell_type": "markdown", + "id": "e59db78a-71ce-418b-9566-fc7ca8167490", + "metadata": {}, + "source": [ + "# Conclusion" + ] + }, + { + "cell_type": "markdown", + "id": "d98ecd86-b1a9-4fb7-8b99-f5d382f53d0d", + "metadata": {}, + "source": [ + "This notebook showed how to extract content from a document while maintaining its layout structure. We then processed and chunked the document so that the integrity of the information was preserved, and indexed the chunks together with their hierarchical metadata, which offers flexibility in information retrieval. \n", + "\n", + "Finally, we ran a RAG query and generated an answer grounded in the retrieved context." + ] + }, + { + "cell_type": "markdown", + "id": "a44100a6-c2e3-4d73-877f-5a99786ec2bd", + "metadata": {}, + "source": [ + "# Delete Resources" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b44ff2a4-4adf-46bb-8b4f-313a0a764464", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import botocore\n", + "\n", + "def deleteDomain(client, domainName):\n", + " \"\"\"Deletes an OpenSearch Serverless collection, identified by its collection ID. Deletion can take several minutes.\"\"\"\n", + " try:\n", + " response = client.delete_collection(\n", + " id=domainName\n", + " )\n", + " print('Sending domain deletion request...')\n", + " print(response)\n", + "\n", + " except botocore.exceptions.ClientError as error:\n", + " if error.response['Error']['Code'] == 'ResourceNotFoundException':\n", + " print('Domain not found. 
Please check the domain name.')\n", + " else:\n", + " raise error\n", + " \n", + "domain_id=aoss.batch_get_collection(\n", + " names=['idp-workshop-aoss'])['collectionDetails'][0]['id']\n", + "deleteDomain(aoss, domain_id)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cae65482-9f6a-45e9-9bbd-9a36c5518201", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "availableInstances": [ + { + "_defaultOrder": 0, + "_isFastLaunch": true, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 4, + "name": "ml.t3.medium", + "vcpuNum": 2 + }, + { + "_defaultOrder": 1, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 8, + "name": "ml.t3.large", + "vcpuNum": 2 + }, + { + "_defaultOrder": 2, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.t3.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 3, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.t3.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 4, + "_isFastLaunch": true, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 8, + "name": "ml.m5.large", + "vcpuNum": 2 + }, + { + "_defaultOrder": 5, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.m5.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 6, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.m5.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 7, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 64, + "name": "ml.m5.4xlarge", + "vcpuNum": 16 + }, + { + 
"_defaultOrder": 8, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 128, + "name": "ml.m5.8xlarge", + "vcpuNum": 32 + }, + { + "_defaultOrder": 9, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 192, + "name": "ml.m5.12xlarge", + "vcpuNum": 48 + }, + { + "_defaultOrder": 10, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 256, + "name": "ml.m5.16xlarge", + "vcpuNum": 64 + }, + { + "_defaultOrder": 11, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 384, + "name": "ml.m5.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 12, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 8, + "name": "ml.m5d.large", + "vcpuNum": 2 + }, + { + "_defaultOrder": 13, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.m5d.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 14, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.m5d.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 15, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 64, + "name": "ml.m5d.4xlarge", + "vcpuNum": 16 + }, + { + "_defaultOrder": 16, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 128, + "name": "ml.m5d.8xlarge", + "vcpuNum": 32 + }, + { + "_defaultOrder": 17, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 192, + "name": "ml.m5d.12xlarge", + "vcpuNum": 48 + }, + { + 
"_defaultOrder": 18, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 256, + "name": "ml.m5d.16xlarge", + "vcpuNum": 64 + }, + { + "_defaultOrder": 19, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 384, + "name": "ml.m5d.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 20, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": true, + "memoryGiB": 0, + "name": "ml.geospatial.interactive", + "supportedImageNames": [ + "sagemaker-geospatial-v1-0" + ], + "vcpuNum": 0 + }, + { + "_defaultOrder": 21, + "_isFastLaunch": true, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 4, + "name": "ml.c5.large", + "vcpuNum": 2 + }, + { + "_defaultOrder": 22, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 8, + "name": "ml.c5.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 23, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.c5.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 24, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.c5.4xlarge", + "vcpuNum": 16 + }, + { + "_defaultOrder": 25, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 72, + "name": "ml.c5.9xlarge", + "vcpuNum": 36 + }, + { + "_defaultOrder": 26, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 96, + "name": "ml.c5.12xlarge", + "vcpuNum": 48 + }, + { + "_defaultOrder": 27, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + 
"memoryGiB": 144, + "name": "ml.c5.18xlarge", + "vcpuNum": 72 + }, + { + "_defaultOrder": 28, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 192, + "name": "ml.c5.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 29, + "_isFastLaunch": true, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.g4dn.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 30, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.g4dn.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 31, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 64, + "name": "ml.g4dn.4xlarge", + "vcpuNum": 16 + }, + { + "_defaultOrder": 32, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 128, + "name": "ml.g4dn.8xlarge", + "vcpuNum": 32 + }, + { + "_defaultOrder": 33, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 4, + "hideHardwareSpecs": false, + "memoryGiB": 192, + "name": "ml.g4dn.12xlarge", + "vcpuNum": 48 + }, + { + "_defaultOrder": 34, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 256, + "name": "ml.g4dn.16xlarge", + "vcpuNum": 64 + }, + { + "_defaultOrder": 35, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 61, + "name": "ml.p3.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 36, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 4, + "hideHardwareSpecs": false, + "memoryGiB": 244, + "name": "ml.p3.8xlarge", + "vcpuNum": 32 + }, + { + "_defaultOrder": 37, + "_isFastLaunch": false, + "category": "Accelerated 
computing", + "gpuNum": 8, + "hideHardwareSpecs": false, + "memoryGiB": 488, + "name": "ml.p3.16xlarge", + "vcpuNum": 64 + }, + { + "_defaultOrder": 38, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 8, + "hideHardwareSpecs": false, + "memoryGiB": 768, + "name": "ml.p3dn.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 39, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.r5.large", + "vcpuNum": 2 + }, + { + "_defaultOrder": 40, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.r5.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 41, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 64, + "name": "ml.r5.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 42, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 128, + "name": "ml.r5.4xlarge", + "vcpuNum": 16 + }, + { + "_defaultOrder": 43, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 256, + "name": "ml.r5.8xlarge", + "vcpuNum": 32 + }, + { + "_defaultOrder": 44, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 384, + "name": "ml.r5.12xlarge", + "vcpuNum": 48 + }, + { + "_defaultOrder": 45, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 512, + "name": "ml.r5.16xlarge", + "vcpuNum": 64 + }, + { + "_defaultOrder": 46, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 768, + "name": "ml.r5.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 47, + "_isFastLaunch": false, + "category": 
"Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.g5.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 48, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.g5.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 49, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 64, + "name": "ml.g5.4xlarge", + "vcpuNum": 16 + }, + { + "_defaultOrder": 50, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 128, + "name": "ml.g5.8xlarge", + "vcpuNum": 32 + }, + { + "_defaultOrder": 51, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 256, + "name": "ml.g5.16xlarge", + "vcpuNum": 64 + }, + { + "_defaultOrder": 52, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 4, + "hideHardwareSpecs": false, + "memoryGiB": 192, + "name": "ml.g5.12xlarge", + "vcpuNum": 48 + }, + { + "_defaultOrder": 53, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 4, + "hideHardwareSpecs": false, + "memoryGiB": 384, + "name": "ml.g5.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 54, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 8, + "hideHardwareSpecs": false, + "memoryGiB": 768, + "name": "ml.g5.48xlarge", + "vcpuNum": 192 + }, + { + "_defaultOrder": 55, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 8, + "hideHardwareSpecs": false, + "memoryGiB": 1152, + "name": "ml.p4d.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 56, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 8, + "hideHardwareSpecs": false, + "memoryGiB": 1152, + "name": "ml.p4de.24xlarge", + "vcpuNum": 96 + }, + { + 
"_defaultOrder": 57, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.trn1.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 58, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 512, + "name": "ml.trn1.32xlarge", + "vcpuNum": 128 + }, + { + "_defaultOrder": 59, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 512, + "name": "ml.trn1n.32xlarge", + "vcpuNum": 128 + } + ], + "instance_type": "ml.t3.medium", + "kernelspec": { + "display_name": "Python 3 (Data Science 3.0)", + "language": "python", + "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/gen-ai/images/amazonsus2022.jpg b/gen-ai/images/amazonsus2022.jpg new file mode 100644 index 0000000..c22613a Binary files /dev/null and b/gen-ai/images/amazonsus2022.jpg differ diff --git a/gen-ai/images/complex-tables.png b/gen-ai/images/complex-tables.png new file mode 100644 index 0000000..c9fdc84 Binary files /dev/null and b/gen-ai/images/complex-tables.png differ diff --git a/gen-ai/images/layout-hierarchy.jpg b/gen-ai/images/layout-hierarchy.jpg new file mode 100644 index 0000000..8f9d4f5 Binary files /dev/null and b/gen-ai/images/layout-hierarchy.jpg differ diff --git a/gen-ai/images/list-chunker.png b/gen-ai/images/list-chunker.png new file mode 100644 index 0000000..e7e9412 Binary files /dev/null and b/gen-ai/images/list-chunker.png differ diff --git a/gen-ai/images/rag-sect.jpg 
b/gen-ai/images/rag-sect.jpg new file mode 100644 index 0000000..e8fc70e Binary files /dev/null and b/gen-ai/images/rag-sect.jpg differ diff --git a/gen-ai/images/table-chunkers.png b/gen-ai/images/table-chunkers.png new file mode 100644 index 0000000..cb0a674 Binary files /dev/null and b/gen-ai/images/table-chunkers.png differ diff --git a/gen-ai/images/text b/gen-ai/images/text new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/gen-ai/images/text @@ -0,0 +1 @@ + diff --git a/gen-ai/images/text-chunks.png b/gen-ai/images/text-chunks.png new file mode 100644 index 0000000..6f17dd8 Binary files /dev/null and b/gen-ai/images/text-chunks.png differ diff --git a/gen-ai/images/txt-layout-Page-2.jpg b/gen-ai/images/txt-layout-Page-2.jpg new file mode 100644 index 0000000..e3d2c5d Binary files /dev/null and b/gen-ai/images/txt-layout-Page-2.jpg differ diff --git a/gen-ai/samples/amazon-2024-10k.pdf b/gen-ai/samples/amazon-2024-10k.pdf new file mode 100644 index 0000000..1ec58c7 Binary files /dev/null and b/gen-ai/samples/amazon-2024-10k.pdf differ