[Kendra] Not able to add custom metadata for WebCrawler data source - create_data_source, update_data_source #4293

Closed
1 task
ssmails opened this issue Oct 3, 2024 · 3 comments
Assignees
Labels
bug This issue is a confirmed bug. investigating This issue is being investigated and/or work is in progress to resolve the issue. needs-triage This issue or PR still needs to be triaged.

Comments

ssmails commented Oct 3, 2024

Describe the bug

Custom metadata cannot be added to a WebCrawler data source through the boto3 create_data_source or update_data_source calls.

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

It should be possible to add custom metadata.

Current Behavior

Both create_data_source and update_data_source fail with a ValidationException when 'CustomDocumentEnrichmentConfiguration' is supplied for a WebCrawler data source.

Reproduction Steps

When I try to update an existing data source with 'CustomDocumentEnrichmentConfiguration', or create a WebCrawler data source with 'CustomDocumentEnrichmentConfiguration', I get the errors below:

adapter.kendraoperations.KendraAdapterException: Error creating Kendra data source: An error occurred (ValidationException) when calling the CreateDataSource operation: No document metadata configuration found for document attribute key 'data_source_id'.

ERROR:root:Error: An error occurred (ValidationException) when calling the UpdateDataSource operation: No document metadata configuration found for document attribute key 'website_creation_date'.   

After researching this error further, it seems that 'DocumentMetadataConfigurations' needs to be added first.
How can we go about adding that? I don't see it in the boto3 Kendra calls:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/kendra.html

Simply adding it to the create_data_source or update_data_source call gives the error below.

Unknown parameter in input: "DocumentMetadataConfigurations", must be one of: Id, Name, IndexId, Configuration, VpcConfiguration, Description, Schedule, RoleArn, LanguageCode, CustomDocumentEnrichmentConfiguration   
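
The first error names an attribute key that the index does not define, so one way to catch it before calling the data source operations is to compare the enrichment target keys with the attributes listed on the index. The sketch below is only an illustration: `missing_attribute_keys` is a hypothetical helper, and the sample dictionary is hand-written to resemble the `DocumentMetadataConfigurations` part of a `describe_index` response, not real output.

```python
def missing_attribute_keys(enrichment_config, index_description):
    """Return enrichment target keys that the index does not define."""
    # Attribute names the index already knows about.
    defined = {
        cfg["Name"]
        for cfg in index_description.get("DocumentMetadataConfigurations", [])
    }
    # Keys that the inline enrichment rules want to write to.
    targets = [
        inline["Target"]["TargetDocumentAttributeKey"]
        for inline in enrichment_config.get("InlineConfigurations", [])
        if "Target" in inline
    ]
    return [key for key in targets if key not in defined]


# Hand-written sample shaped like a describe_index response (assumption).
index_description = {
    "DocumentMetadataConfigurations": [
        {"Name": "_source_uri", "Type": "STRING_VALUE"},
    ]
}

# The same two target keys used in the reproduction script.
enrichment_config = {
    "InlineConfigurations": [
        {"Target": {"TargetDocumentAttributeKey": "website_creation_date"}},
        {"Target": {"TargetDocumentAttributeKey": "data_source_id"}},
    ]
}

missing = missing_attribute_keys(enrichment_config, index_description)
print(missing)  # ['website_creation_date', 'data_source_id']
```

With a real client, `index_description` would presumably come from `kendra.describe_index(Id=index_id)`. Any key reported as missing is one the data source calls would be expected to reject.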

Possible Solution

No response

Additional Information/Context

import time
import logging
from datetime import datetime

import boto3
#boto3.set_stream_logger('')
from botocore.exceptions import ClientError



def webcrawler_update_metadata():
    try:
        logging.debug("START.")

        kendra = boto3.client(
            'kendra'
        )
        logging.info("Kendra client initialized successfully.")

        # Define the custom metadata
        custom_document_enrichment_configuration = {
            'InlineConfigurations': [
                {
                    'Target': {
                        'TargetDocumentAttributeKey': 'website_creation_date',
                        'TargetDocumentAttributeValue': {
                            'DateValue': datetime(2024, 10, 2).isoformat()  # Replace with the actual creation date
                        },
                        'TargetDocumentAttributeValueDeletion': False
                    }
                },
                {
                    'Target': {
                        'TargetDocumentAttributeKey': 'data_source_id',
                        'TargetDocumentAttributeValue': {
                            'StringValue': 'd8d5aa11-1261-4b5a-a406-60920d5120f1'  # Replace with your actual data source ID
                        },
                        'TargetDocumentAttributeValueDeletion': False
                    }
                }
            ]
        }

        # Define the document metadata configurations
        document_metadata_configurations = [
            {
                'Name': 'Author',
                'Type': 'STRING_VALUE'
            },
            {
                'Name': 'CreatedDate',
                'Type': 'DATE_VALUE'
            }
        ]


        # Update the data source
        response = kendra.update_data_source(
            Id='d8d5aa11-1261-4b5a-a406-60920d5120f1',
            IndexId='c5625f41-f9bb-47a8-adb4-8832b84e0254',
            Configuration={
                'WebCrawlerConfiguration': {
                    'Urls': {
                        'SeedUrlConfiguration': {
                            'SeedUrls': [
                                'https://stackoverflowteams.com/c/cisco-systems-outshift-rag/questions?tab=Newest',
                            ]
                        }
                    }
                }
            },
            # DocumentMetadataConfigurations=document_metadata_configurations
            CustomDocumentEnrichmentConfiguration={
                 'InlineConfigurations': [
                     {
                         'Target': {
                             'TargetDocumentAttributeKey': 'website_creation_date',
                             'TargetDocumentAttributeValue': {
                                 'DateValue': '2024-10-03T18:03:13Z'
                             }
                         }
                     },
                     {
                         'Target': {
                             'TargetDocumentAttributeKey': 'data_source_id',
                             'TargetDocumentAttributeValue': {
                                 'StringValue': 'global-ds-uuid'
                             }
                         }
                     }
                 ]
            }
        )
        print("update response:", response)
        time.sleep(300)

        response = kendra.start_data_source_sync_job(
            Id="your datasource id", 
            IndexId="your index" 
        )
        print("sync response:", response)
        sync_job_id = response['ExecutionId']
        logging.debug(f"Data source sync job started with ID: {sync_job_id}")
        time.sleep(600)

    except Exception as e:
        logging.error(f"Error: {str(e)}")


if __name__ == '__main__':
    print("before")
    webcrawler_update_metadata()

SDK version used

1.35.16

Environment details (OS name and version, etc.)

Mac

@ssmails ssmails added bug This issue is a confirmed bug. needs-triage This issue or PR still needs to be triaged. labels Oct 3, 2024

ssmails commented Oct 3, 2024

@tim-finnigan, could you kindly help with this? Thanks.

@tim-finnigan tim-finnigan self-assigned this Oct 3, 2024
@tim-finnigan tim-finnigan added the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Oct 3, 2024

ssmails commented Oct 4, 2024

@tim-finnigan I figured out the issue. However, it is not sufficiently documented in the boto3 docs or the Kendra docs.

  1. An update_index call is needed first, to add the document metadata configurations to the index.
  2. This is followed by the create_data_source call, which uses CustomDocumentEnrichmentConfiguration with the metadata added in step 1.
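
The two steps above can be sketched roughly as follows. This is a minimal sketch under several assumptions, not a verified fix: the helper names (`attribute_definitions`, `enrichment_configuration`, `define_then_enrich`) are hypothetical, the attribute definitions are passed to `update_index` through its `DocumentMetadataConfigurationUpdates` parameter, and the polling loop assumes the index needs to return to `ACTIVE` before the data source call is made. It uses `update_data_source` as in the reproduction script; `create_data_source` accepts the same `CustomDocumentEnrichmentConfiguration` parameter.

```python
import time


def attribute_definitions():
    """Index-level definitions for the two custom keys from this issue."""
    return [
        {"Name": "website_creation_date", "Type": "DATE_VALUE"},
        {"Name": "data_source_id", "Type": "STRING_VALUE"},
    ]


def enrichment_configuration(created, source_label):
    """Inline enrichment rules that write to the two custom keys."""
    return {
        "InlineConfigurations": [
            {
                "Target": {
                    "TargetDocumentAttributeKey": "website_creation_date",
                    "TargetDocumentAttributeValue": {"DateValue": created},
                }
            },
            {
                "Target": {
                    "TargetDocumentAttributeKey": "data_source_id",
                    "TargetDocumentAttributeValue": {"StringValue": source_label},
                }
            },
        ]
    }


def define_then_enrich(kendra, index_id, data_source_id, enrichment,
                       poll_seconds=15, max_polls=40):
    """Step 1: define the attributes on the index. Step 2: attach enrichment."""
    kendra.update_index(
        Id=index_id,
        DocumentMetadataConfigurationUpdates=attribute_definitions(),
    )
    # Assumption: wait until the index reports ACTIVE again before step 2.
    for _ in range(max_polls):
        status = kendra.describe_index(Id=index_id)["Status"]
        if status == "ACTIVE":
            break
        if status == "FAILED":
            raise RuntimeError(f"Index {index_id} entered FAILED state")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Index {index_id} did not become ACTIVE in time")
    return kendra.update_data_source(
        Id=data_source_id,
        IndexId=index_id,
        CustomDocumentEnrichmentConfiguration=enrichment,
    )


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# define_then_enrich(
#     boto3.client("kendra"), "<index id>", "<data source id>",
#     enrichment_configuration("2024-10-03T18:03:13Z", "global-ds-uuid"),
# )
```

The client is passed in as an argument so the ordering of the calls can be checked with a stub before running it against a real index.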

Closing this.

@ssmails ssmails closed this as completed Oct 4, 2024

github-actions bot commented Oct 4, 2024

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
