[Kendra] Not able to add custom metadata for WebCrawler Datasource - create_data_source, update_data_Source #4293

ssmails · 2024-10-03T19:27:34Z

Describe the bug

Unable to add custom metadata to a WebCrawler data source via the boto3 create_data_source and update_data_source calls.

Expected Behavior

It should be possible to add custom metadata to a WebCrawler data source.

Current Behavior

Custom metadata cannot be added to a WebCrawler data source via boto3 create_data_source or update_data_source; both calls fail with a ValidationException.

Reproduction Steps

When I try to update an existing data source with 'CustomDocumentEnrichmentConfiguration', or create a WebCrawler data source with 'CustomDocumentEnrichmentConfiguration', I get the errors below:

adapter.kendraoperations.KendraAdapterException: Error creating Kendra data source: An error occurred (ValidationException) when calling the CreateDataSource operation: No document metadata configuration found for document attribute key 'data_source_id'.

ERROR:root:Error: An error occurred (ValidationException) when calling the UpdateDataSource operation: No document metadata configuration found for document attribute key 'website_creation_date'.

After researching this error further, it seems that 'DocumentMetadataConfigurations' needs to be added first.
How can we go about adding that? I don't see it in the boto3 Kendra calls:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/kendra.html

Simply adding it to the create_data_source or update_data_source call gives the error below:

Unknown parameter in input: "DocumentMetadataConfigurations", must be one of: Id, Name, IndexId, Configuration, VpcConfiguration, Description, Schedule, RoleArn, LanguageCode, CustomDocumentEnrichmentConfiguration
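
For anyone hitting the same ValidationException, a minimal diagnostic sketch follows. It assumes the attribute definitions live on the index rather than on the data source, and that describe_index returns them under 'DocumentMetadataConfigurations'. The helper name missing_attribute_keys is hypothetical, not part of boto3.

```python
def missing_attribute_keys(kendra, index_id, enrichment_config):
    """Return enrichment target keys that the index does not define yet.

    `kendra` is expected to be a boto3 Kendra client (or any object with a
    compatible describe_index method).
    """
    index = kendra.describe_index(Id=index_id)
    # Attribute names already registered on the index.
    defined = {
        cfg["Name"] for cfg in index.get("DocumentMetadataConfigurations", [])
    }
    # Attribute keys targeted by the inline enrichment rules.
    wanted = [
        inline.get("Target", {}).get("TargetDocumentAttributeKey")
        for inline in enrichment_config.get("InlineConfigurations", [])
    ]
    return [key for key in wanted if key and key not in defined]
```

Called with boto3.client('kendra'), the index ID, and the CustomDocumentEnrichmentConfiguration dict from the script below, a non-empty result would list the keys named in the error messages above.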

Possible Solution

No response

Additional Information/Context

import time
import logging
from datetime import datetime

import boto3
# boto3.set_stream_logger('')
from botocore.exceptions import ClientError



def webcrawler_update_metadata():
    try:
        logging.debug("START.")

        kendra = boto3.client(
            'kendra'
        )
        logging.info("Kendra client initialized successfully.")

        # Define the custom metadata
        custom_document_enrichment_configuration = {
            'InlineConfigurations': [
                {
                    'Target': {
                        'TargetDocumentAttributeKey': 'website_creation_date',
                        'TargetDocumentAttributeValue': {
                            'DateValue': datetime(2024, 10, 2).isoformat()  # Replace with the actual creation date
                        },
                        'TargetDocumentAttributeValueDeletion': False
                    }
                },
                {
                    'Target': {
                        'TargetDocumentAttributeKey': 'data_source_id',
                        'TargetDocumentAttributeValue': {
                            'StringValue': 'd8d5aa11-1261-4b5a-a406-60920d5120f1'  # Replace with your actual data source ID
                        },
                        'TargetDocumentAttributeValueDeletion': False
                    }
                }
            ]
        }

        # Define the document metadata configurations
        document_metadata_configurations = [
            {
                'Name': 'Author',
                'Type': 'STRING_VALUE'
            },
            {
                'Name': 'CreatedDate',
                'Type': 'DATE_VALUE'
            }
        ]


        # Update the data source
        response = kendra.update_data_source(
            Id='d8d5aa11-1261-4b5a-a406-60920d5120f1',
            IndexId='c5625f41-f9bb-47a8-adb4-8832b84e0254',
            Configuration={
                'WebCrawlerConfiguration': {
                    'Urls': {
                        'SeedUrlConfiguration': {
                            'SeedUrls': [
                                'https://stackoverflowteams.com/c/cisco-systems-outshift-rag/questions?tab=Newest',
                            ]
                        }
                    }
                }
            },
            # DocumentMetadataConfigurations=document_metadata_configurations
            CustomDocumentEnrichmentConfiguration={
                 'InlineConfigurations': [
                     {
                         'Target': {
                             'TargetDocumentAttributeKey': 'website_creation_date',
                             'TargetDocumentAttributeValue': {
                                 'DateValue': '2024-10-03T18:03:13Z'
                             }
                         }
                     },
                     {
                         'Target': {
                             'TargetDocumentAttributeKey': 'data_source_id',
                             'TargetDocumentAttributeValue': {
                                 'StringValue': 'global-ds-uuid'
                             }
                         }
                     }
                 ]
            }
        )
        print("update response:", response)
        time.sleep(300)

        response = kendra.start_data_source_sync_job(
            Id="your datasource id", 
            IndexId="your index" 
        )
        print("sync response:", response)
        sync_job_id = response['ExecutionId']
        logging.debug(f"Data source sync job started with ID: {sync_job_id}")
        time.sleep(600)

    except Exception as e:
        logging.error(f"Error: {str(e)}")


if __name__ == '__main__':
    print("before")
    webcrawler_update_metadata()

SDK version used

1.35.16

Environment details (OS name and version, etc.)

Mac

ssmails · 2024-10-03T19:28:11Z

@tim-finnigan could you kindly help with this? Thanks.

ssmails · 2024-10-04T17:31:05Z

@tim-finnigan I figured out the issue. However, it is not sufficiently documented in the boto3 docs or the Kendra docs.

1. An update_index call is needed first to add the document metadata configurations to the index.
2. The create_data_source call can then use CustomDocumentEnrichmentConfiguration with the metadata attributes added in step 1.
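
The two steps described above can be sketched roughly as follows. This is an illustration, not a confirmed reference implementation: it assumes the attributes are registered through the DocumentMetadataConfigurationUpdates parameter of update_index, that the index should be back in ACTIVE status before the data source is created, and that the attribute type can be inferred from the value key used in each inline target. The helper names, the polling defaults, and the VALUE_KEY_TO_TYPE mapping are hypothetical.

```python
import time

# Assumed mapping from the value key in an inline target to an index attribute type.
VALUE_KEY_TO_TYPE = {
    "StringValue": "STRING_VALUE",
    "StringListValue": "STRING_LIST_VALUE",
    "LongValue": "LONG_VALUE",
    "DateValue": "DATE_VALUE",
}


def metadata_updates_from_enrichment(enrichment_config):
    """Build one {'Name', 'Type'} entry per attribute key targeted by the inline rules."""
    updates, seen = [], set()
    for inline in enrichment_config.get("InlineConfigurations", []):
        target = inline.get("Target", {})
        key = target.get("TargetDocumentAttributeKey")
        value = target.get("TargetDocumentAttributeValue", {})
        if not key or key in seen:
            continue
        for value_key, attr_type in VALUE_KEY_TO_TYPE.items():
            if value_key in value:
                updates.append({"Name": key, "Type": attr_type})
                seen.add(key)
                break
    return updates


def register_attributes_then_create(kendra, index_id, data_source_kwargs,
                                    enrichment_config, poll_seconds=30, max_polls=20):
    """Step 1: add the attributes to the index. Step 2: create the data source."""
    updates = metadata_updates_from_enrichment(enrichment_config)
    kendra.update_index(Id=index_id, DocumentMetadataConfigurationUpdates=updates)

    # Wait for the index to report ACTIVE again (assumption: the update is asynchronous).
    for _ in range(max_polls):
        if kendra.describe_index(Id=index_id)["Status"] == "ACTIVE":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Index {index_id} did not become ACTIVE in time")

    return kendra.create_data_source(
        IndexId=index_id,
        CustomDocumentEnrichmentConfiguration=enrichment_config,
        **data_source_kwargs,
    )
```

Here kendra would be boto3.client('kendra'), and data_source_kwargs would carry the remaining create_data_source arguments such as Name, Type='WEBCRAWLER', RoleArn, and Configuration. The same ordering should apply when calling update_data_source instead.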

Closing this.

github-actions · 2024-10-04T17:31:23Z

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.

ssmails added the bug and needs-triage labels Oct 3, 2024

tim-finnigan self-assigned this Oct 3, 2024

tim-finnigan added the investigating label Oct 3, 2024

ssmails closed this as completed Oct 4, 2024
