Adding Azure Cog Search Vector capabilites, Azure Form Recognizer and…
… New Notebook for complex files
pablomarin committed Aug 19, 2023
1 parent 6640592 commit 5bde842
Showing 23 changed files with 3,199 additions and 3,031 deletions.
166 changes: 117 additions & 49 deletions 01-Load-Data-ACogSearch.ipynb
@@ -20,22 +20,16 @@
"\n",
"This notebook creates the following objects on your search service:\n",
"\n",
"+ search index\n",
"+ data source\n",
"+ skillset\n",
"+ search index\n",
"+ indexer\n",
"\n",
"This notebook calls the [Search REST APIs](https://docs.microsoft.com/rest/api/searchservice/), but you can also use the `azure-search-documents` client library in the Azure SDK for Python to perform the same steps. See this [Python quickstart](https://docs.microsoft.com/azure/search/search-get-started-python) for details.\n",
"\n",
"To run this notebook, you should have already created the Azure services listed in the README. Once you've done this, you can run all cells, but the query won't return results until the indexer has finished and the search index is loaded. \n",
"\n",
"We recommend running each step and making sure it completes before moving on.\n",
"\n",
"Reference:\n",
"\n",
"https://learn.microsoft.com/en-us/azure/search/cognitive-search-tutorial-blob\n",
"\n",
"https://github.com/Azure-Samples/azure-search-python-samples/blob/main/Tutorial-AI-Enrichment/PythonTutorial-AzureSearch-AIEnrichment.ipynb"
"We recommend running each step and making sure it completes before moving on."
]
},
{
@@ -101,7 +95,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"201\n",
"204\n",
"True\n"
]
}
@@ -117,9 +111,8 @@
" \"connectionString\": os.environ['BLOB_CONNECTION_STRING']\n",
" },\n",
" \"dataDeletionDetectionPolicy\" : {\n",
" \"@odata.type\" :\"#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy\"\n",
" }\n",
"\n",
" \"@odata.type\" :\"#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy\" # this makes sure that if the item is deleted from the source, it will be deleted from the index\n",
" },\n",
" \"container\": {\n",
" \"name\": BLOB_CONTAINER_NAME\n",
" }\n",
@@ -131,20 +124,24 @@
]
},
{
"cell_type": "code",
"execution_count": 5,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"# If you get a 403 status code, the endpoint or key is probably wrong; to debug, uncomment the line below\n",
"# r.text"
"- 201 - Successfully created\n",
"- 204 - Successfully overwritten\n",
"- 40X - Authentication Error\n",
"\n",
"For information on Change and Delete file detection please see [HERE](https://learn.microsoft.com/en-us/azure/search/search-howto-index-changed-deleted-blobs?tabs=rest-api)"
]
},
{
"cell_type": "markdown",
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"For information on Change and Delete file detection please see [HERE](https://learn.microsoft.com/en-us/azure/search/search-howto-index-changed-deleted-blobs?tabs=rest-api)"
"# If you get a 403 status code, the endpoint or key is probably wrong; to debug, uncomment the line below\n",
"# r.text"
]
},
{
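The status codes listed in the markdown cell above can be folded into a small helper so that each `requests.put` call reports what happened. This is a minimal sketch rather than code from the commit: `describe_put_status` is a hypothetical name, and the message for codes other than 201, 204, and 40X is an assumption.

```python
def describe_put_status(status_code: int) -> str:
    """Map the HTTP status of a create-or-update PUT to a short message."""
    if status_code == 201:
        return "Successfully created"
    if status_code == 204:
        return "Successfully overwritten"
    if 400 <= status_code < 410:
        # The notebook groups every 40X code as an authentication error;
        # r.text carries the service's own explanation.
        return "Authentication error (check the endpoint and key)"
    # Assumption: anything else is outside what the notebook documents.
    return f"Unexpected status {status_code}"


for code in (201, 204, 403):
    print(code, "-", describe_put_status(code))
```

In the notebook this could replace the bare `print(r.status_code)` with `print(describe_put_status(r.status_code))`.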
@@ -162,14 +159,14 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"201\n",
"204\n",
"True\n"
]
}
@@ -289,7 +286,7 @@
" {\n",
" \"@odata.type\": \"#Microsoft.Skills.Text.V3.EntityRecognitionSkill\",\n",
" \"context\": \"/document/pages/*\",\n",
" \"categories\": [\"Person\", \"Location\", \"Organization\", \"DateTime\", \"URL\", \"Email\"],\n",
" \"categories\": [\"Person\", \"URL\", \"Email\"],\n",
" \"minimumPrecision\": 0.5, \n",
" \"defaultLanguageCode\": \"en\",\n",
" \"inputs\": [\n",
@@ -308,14 +305,6 @@
" \"targetName\": \"persons\"\n",
" },\n",
" {\n",
" \"name\": \"locations\", \n",
" \"targetName\": \"locations\"\n",
" },\n",
" {\n",
" \"name\": \"organizations\", \n",
" \"targetName\": \"organizations\"\n",
" },\n",
" {\n",
" \"name\": \"urls\", \n",
" \"targetName\": \"urls\"\n",
" },\n",
@@ -361,14 +350,14 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"201\n",
"204\n",
"True\n"
]
}
@@ -390,8 +379,6 @@
" {\"name\": \"images_text\", \"type\": \"Collection(Edm.String)\", \"searchable\": \"true\", \"retrievable\": \"true\", \"sortable\": \"false\", \"filterable\": \"false\", \"facetable\": \"false\"},\n",
" {\"name\": \"keyPhrases\", \"type\": \"Collection(Edm.String)\", \"searchable\": \"true\", \"retrievable\": \"true\", \"sortable\": \"false\", \"filterable\": \"true\", \"facetable\": \"true\"},\n",
" {\"name\": \"persons\", \"type\": \"Collection(Edm.String)\", \"searchable\": \"true\", \"retrievable\": \"true\", \"sortable\": \"false\", \"filterable\": \"false\", \"facetable\": \"false\"},\n",
" {\"name\": \"locations\", \"type\": \"Collection(Edm.String)\", \"searchable\": \"true\", \"retrievable\": \"true\", \"sortable\": \"false\", \"filterable\": \"true\", \"facetable\": \"true\"},\n",
" {\"name\": \"organizations\", \"type\": \"Collection(Edm.String)\", \"searchable\": \"true\", \"retrievable\": \"true\", \"sortable\": \"false\", \"filterable\": \"true\", \"facetable\": \"true\"},\n",
" {\"name\": \"urls\", \"type\": \"Collection(Edm.String)\", \"searchable\": \"false\", \"retrievable\": \"true\", \"sortable\": \"false\", \"filterable\": \"false\", \"facetable\": \"false\"},\n",
" {\"name\": \"emails\", \"type\": \"Collection(Edm.String)\", \"searchable\": \"true\", \"retrievable\": \"true\", \"sortable\": \"false\", \"filterable\": \"true\", \"facetable\": \"false\"}\n",
" \n",
@@ -424,7 +411,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
@@ -439,6 +426,8 @@
},
"source": [
"### Semantic Search capabilities\n",
"As you can see above in the index payload, there is a `semantic configuration`. What is that?\n",
"\n",
"Azure Search has a feature called Semantic Search: a deep neural network, running inside the search engine, that tries to find results based on the semantic meaning of the query and the content rather than on keyword matching or counting. \n",
"From the [official documentation](https://learn.microsoft.com/en-us/azure/search/semantic-search-overview):\n",
"\n",
@@ -467,7 +456,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 12,
"metadata": {},
"outputs": [
{
@@ -533,14 +522,6 @@
" \"targetFieldName\" : \"persons\"\n",
" },\n",
" {\n",
" \"sourceFieldName\" : \"/document/pages/*/locations/*\", \n",
" \"targetFieldName\" : \"locations\"\n",
" },\n",
" {\n",
" \"sourceFieldName\": \"/document/pages/*/organizations/*\",\n",
" \"targetFieldName\": \"organizations\"\n",
" },\n",
" {\n",
" \"sourceFieldName\": \"/document/pages/*/urls/*\",\n",
" \"targetFieldName\": \"urls\"\n",
" },\n",
@@ -569,7 +550,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
@@ -586,7 +567,7 @@
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": 25,
"metadata": {
"tags": []
},
@@ -597,7 +578,7 @@
"text": [
"200\n",
"Status: inProgress\n",
"Items Processed: 390\n",
"Items Processed: 990\n",
"True\n"
]
}
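The status output in this hunk (`Status: inProgress`, `Items Processed: 990`) comes from the indexer status endpoint. Below is a minimal sketch of a formatter for that response. It assumes the JSON body has a `lastResult` object with `status`, `itemsProcessed`, and `itemsFailed` fields; `summarize_indexer_status` is a hypothetical helper, and the `itemsFailed` value in the sample is illustrative only.

```python
def summarize_indexer_status(status_json: dict) -> str:
    """Format the last indexer run from a status response body."""
    # lastResult may be missing or null before the first run has started.
    last = status_json.get("lastResult") or {}
    status = last.get("status", "unknown")
    processed = last.get("itemsProcessed", 0)
    failed = last.get("itemsFailed", 0)
    return f"Status: {status} | Items Processed: {processed} | Items Failed: {failed}"


# Illustrative payload shaped like the assumed response; not real service output.
sample = {"lastResult": {"status": "inProgress", "itemsProcessed": 990, "itemsFailed": 0}}
print(summarize_indexer_status(sample))
print(summarize_indexer_status({"lastResult": None}))
```

Guarding against a missing `lastResult` avoids an `AttributeError` if the status cell is run before the indexer has produced any result.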
@@ -617,15 +598,102 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**When the indexer finishes running, we will have all 9.8k documents indexed in our Search Engine!**"
"**When the indexer finishes running, we will have all 9.8k documents indexed in your Search Engine!**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating the corresponding vector-based index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Azure Cognitive Search now has vector search capabilities** ([Watch this video](https://aka.ms/Vector_SearchSnackableVideo)). The advantages of vector search in Azure Cognitive Search include its integration with the other capabilities of the service, the ability to use any type of data (text, image, audio, video, etc.) from diverse Azure datastores to inform a single generative AI-powered application, and support for vector fields in search indexes. It also offers pure vector search, hybrid retrieval, and a sophisticated re-ranking system powered by Bing in a single integrated solution (see the release [blog post](https://techcommunity.microsoft.com/t5/azure-ai-services-blog/announcing-vector-search-in-azure-cognitive-search-public/ba-p/3872868)).\n",
"\n",
"\n",
"![vector-search](https://techcommunity.microsoft.com/t5/image/serverpage/image-id/489211i001E2B9B34F483C2/image-dimensions/876x416?v=v2)\n",
"\n",
"\n",
"**The main limitations (for now) of vector search in Azure Cognitive Search are:**\n",
"\n",
"- It does not generate vector embeddings for the content. Users need to provide the embeddings themselves by using a service such as Azure OpenAI.\n",
"- There is no field type for a collection of vectors, so each document in the vector-based index must be either a small document or a chunk of a bigger document.\n",
"\n",
"We will come back to these limitations and address them in the next notebooks; for now, let's just create the corresponding vector-based index."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"201\n",
"True\n"
]
}
],
"source": [
"index_payload = {\n",
" \"name\": index_name + \"-vector\",\n",
" \"fields\": [\n",
" {\"name\": \"id\", \"type\": \"Edm.String\", \"key\": \"true\", \"filterable\": \"true\" },\n",
" {\"name\": \"title\",\"type\": \"Edm.String\",\"searchable\": \"true\",\"retrievable\": \"true\"},\n",
" {\"name\": \"chunk\",\"type\": \"Edm.String\",\"searchable\": \"true\",\"retrievable\": \"true\"},\n",
" {\"name\": \"chunkVector\",\"type\": \"Collection(Edm.Single)\",\"searchable\": \"true\",\"retrievable\": \"true\",\"dimensions\": 1536,\"vectorSearchConfiguration\": \"vectorConfig\"},\n",
" {\"name\": \"name\", \"type\": \"Edm.String\", \"searchable\": \"true\", \"retrievable\": \"true\", \"sortable\": \"false\", \"filterable\": \"false\", \"facetable\": \"false\"},\n",
" {\"name\": \"location\", \"type\": \"Edm.String\", \"searchable\": \"false\", \"retrievable\": \"true\", \"sortable\": \"false\", \"filterable\": \"false\", \"facetable\": \"false\"},\n",
"\n",
" ],\n",
" \"vectorSearch\": {\n",
" \"algorithmConfigurations\": [\n",
" {\n",
" \"name\": \"vectorConfig\",\n",
" \"kind\": \"hnsw\"\n",
" }\n",
" ]\n",
" },\n",
" \"semantic\": {\n",
" \"configurations\": [\n",
" {\n",
" \"name\": \"my-semantic-config\",\n",
" \"prioritizedFields\": {\n",
" \"titleField\": {\n",
" \"fieldName\": \"title\"\n",
" },\n",
" \"prioritizedContentFields\": [\n",
" {\n",
" \"fieldName\": \"chunk\"\n",
" }\n",
" ],\n",
" \"prioritizedKeywordsFields\": []\n",
" }\n",
" }\n",
" ]\n",
" }\n",
"}\n",
"\n",
"r = requests.put(os.environ['AZURE_SEARCH_ENDPOINT'] + \"/indexes/\" + index_name + \"-vector\",\n",
" data=json.dumps(index_payload), headers=headers, params=params)\n",
"print(r.status_code)\n",
"print(r.ok)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reference\n",
"# References\n",
"\n",
"- https://learn.microsoft.com/en-us/azure/search/cognitive-search-tutorial-blob\n",
"- https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/search/azure-search-documents/samples\n",
"- https://learn.microsoft.com/en-us/azure/search/search-get-started-python\n",
"- https://github.com/Azure-Samples/azure-search-python-samples/blob/main/Tutorial-AI-Enrichment/PythonTutorial-AzureSearch-AIEnrichment.ipynb"
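The two limitations noted in the vector-search cell (no built-in embedding generation, and one vector per document) mean each source file has to be split into chunks and embedded before upload. The following is a rough sketch of that preparation step against the `-vector` index schema added in this commit. It is not the author's method: `chunk_text`, `fake_embed`, and `build_vector_docs` are hypothetical names, the chunk size and overlap are arbitrary, and `fake_embed` is a stand-in that returns a dummy vector where a real embedding service such as Azure OpenAI would be called.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def fake_embed(text: str, dimensions: int = 1536) -> list[float]:
    """Placeholder only: a dummy vector of the right size, NOT a real embedding."""
    return [float(len(text) % 7)] * dimensions


def build_vector_docs(doc_id, title, name, location, text, embed=fake_embed):
    """Build an upload batch with one index document per chunk."""
    docs = []
    for n, chunk in enumerate(chunk_text(text)):
        docs.append({
            "@search.action": "upload",
            "id": f"{doc_id}_{n}",          # one key per chunk
            "title": title,
            "chunk": chunk,
            "chunkVector": embed(chunk),    # must match the 1536 dimensions in the index
            "name": name,
            "location": location,
        })
    return {"value": docs}


# Illustrative input; the file name and URL are made up.
payload = build_vector_docs(
    "doc1", "Sample title", "sample.pdf",
    "https://example.invalid/sample.pdf", "lorem ipsum " * 100,
)
print(len(payload["value"]), "chunks;",
      len(payload["value"][0]["chunkVector"]), "dimensions")
```

A batch shaped like this could then be posted to the index's document endpoint with the same `headers` and `params` the notebook already uses.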