This repository contains the solution for the API ingestion challenge. The challenge required ingesting data from an API into AWS on a cadence aligned with the frequency of changes in the data. Below is a summary of the solution:
- **Version Control:**
  - The codebase is versioned using Git.
- **Infrastructure-as-Code (IaC):**
  - AWS CDK v2 is used to define all AWS resources.
- **Lambda for Data Ingestion:**
  - AWS Lambda functions are used for ingesting data.
  - The data ingestion Lambda is triggered by an AWS EventBridge rule on an hourly basis (see the sketch after this list).
- **Data Storage:**
  - Data is ingested into a staging S3 bucket.
  - Separate buckets are used for the staging layer (raw JSON data) and the raw layer (Hive-partitioned CSV data).
- **ETL Step:**
  - Data is transformed and moved to a partitioned raw layer.
- **Programming Language:**
  - Python is used to implement both the Lambda functions and the CDK stack.
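As a minimal sketch of how the hourly trigger can be wired in CDK v2 (the runtime, handler name, and asset path below are illustrative assumptions, not copied from this repository's `cdk_stack.py`):

```python
# Minimal CDK v2 sketch of the hourly ingestion trigger (illustrative;
# handler name and asset path are assumptions, not the repository's exact code).
from aws_cdk import (
    Duration,
    Stack,
    aws_events as events,
    aws_events_targets as targets,
    aws_lambda as _lambda,
    aws_s3 as s3,
)
from constructs import Construct

class DataIngestionSketch(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Staging bucket the ingestion Lambda writes raw JSON into.
        staging_bucket = s3.Bucket(self, "DataFoundryStagingLayer")

        # Ingestion Lambda packaged from the zip built in Step 5 below.
        ingestion_fn = _lambda.Function(
            self, "DataIngestionLambda",
            runtime=_lambda.Runtime.PYTHON_3_11,
            handler="lambda_function.lambda_handler",  # assumed handler name
            code=_lambda.Code.from_asset("lambda_ingestion_api.zip"),
            environment={"STAGING_BUCKET": staging_bucket.bucket_name},
        )
        staging_bucket.grant_write(ingestion_fn)

        # EventBridge rule that fires every hour.
        events.Rule(
            self, "DataIngestionRule",
            schedule=events.Schedule.rate(Duration.hours(1)),
            targets=[targets.LambdaFunction(ingestion_fn)],
        )
```

Here `Schedule.rate(Duration.hours(1))` produces the hourly trigger that matches the cadence of changes in the source data.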
**Prerequisites:**

- AWS CLI configured with appropriate permissions.
- Python installed.
- CDK installed.
- Any additional packages for a Lambda function will either need to be added as a layer or included within the Lambda folder.
Most of the steps have already been performed. To replicate the deployment:

- Activate the virtual environment within the `cdk` folder.
- Configure profile/credentials within the AWS CLI.
- Run the `cdk bootstrap` and `cdk deploy` commands.
- Use the Step 5 command to rebuild the Lambda zip files on code change.
- Manually subscribe to the SNS topic after deployment to receive notifications.
Follow these steps to deploy the AWS CDK data pipeline:

1. **Create CDK Directory:**
   ```
   mkdir cdk
   cd cdk
   ```
2. **Initialize CDK App:** Initialize a new CDK app using Python as the language:
   ```
   cdk init app --language python
   ```
3. **Set up Virtual Environment:** Create and activate a virtual environment to isolate dependencies:
   ```
   python -m venv .venv
   .\.venv\Scripts\activate  # For Windows
   ```
4. **Update pip and Install Dependencies:** Ensure pip is up-to-date and install project dependencies:
   ```
   python -m pip install --upgrade pip
   pip install -r requirements.txt
   ```
5. **Zip Lambda Functions:** Navigate to the Lambda function directories and zip their contents:
   ```
   cd ../src/lambda_ingestion_api
   zip -r ../../cdk/lambda_ingestion_api.zip *
   cd ../lambda_dq
   zip -r ../../cdk/lambda_dq.zip *
   ```
6. **Navigate to CDK Directory:**
   ```
   cd ../../cdk
   ```
7. **Copy CDK Stack file:** Copy the CDK stack Python file to the CDK directory:
   ```
   copy ../src/cdk_stack.py cdk/.
   ```
8. **Bootstrap AWS Environment:** Bootstrap the AWS environment to prepare for CDK deployment:
   ```
   cdk bootstrap
   ```
9. **Deploy CDK Stack:** Deploy the CDK stack to AWS:
   ```
   cdk deploy
   ```
The `cdk_stack.py` script defines an AWS CDK stack that automates the setup of infrastructure for data ingestion and quality checks. Below is a summary of the key components:

- DataFoundryStagingLayer: S3 bucket used for staging data.
- DataFoundryRawLayer: S3 bucket used for storing raw data.
- DataIngestionTopic: SNS topic that provides notifications for data ingestion processes.
- **Data Ingestion Lambda:**
  - Triggered by an EventBridge rule.
  - Writes data to the staging bucket.
  - Publishes notifications to the SNS topic.
- **Data Quality Lambda:**
  - Performs data quality checks and partitioning.
  - Triggered when new files are added to the staging bucket.
  - Reads from the staging bucket.
  - Writes processed data to the raw bucket.
  - Publishes notifications to the SNS topic.
- **Permissions:** Lambda functions are granted permissions to read from/write to the S3 buckets and publish to the SNS topic.
- **DataIngestionRule:** Triggers the data ingestion Lambda at regular intervals (every hour).
- **S3 Event Notification:** Configured on the staging bucket to trigger the data quality Lambda when new files are uploaded (see the sketch below).
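A hedged sketch of the data-quality wiring described above; the construct IDs not named in this README (the Lambda ID, handler, and environment variable names) are assumptions:

```python
# Illustrative CDK v2 sketch of the data-quality wiring; only the bucket and
# topic IDs are taken from this README, everything else is assumed.
from aws_cdk import (
    Stack,
    aws_lambda as _lambda,
    aws_s3 as s3,
    aws_s3_notifications as s3n,
    aws_sns as sns,
)
from constructs import Construct

class DataQualitySketch(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        staging_bucket = s3.Bucket(self, "DataFoundryStagingLayer")
        raw_bucket = s3.Bucket(self, "DataFoundryRawLayer")
        topic = sns.Topic(self, "DataIngestionTopic")

        dq_fn = _lambda.Function(
            self, "DataQualityLambda",
            runtime=_lambda.Runtime.PYTHON_3_11,
            handler="lambda_function.lambda_handler",  # assumed handler name
            code=_lambda.Code.from_asset("lambda_dq.zip"),
            environment={
                "RAW_BUCKET": raw_bucket.bucket_name,
                "TOPIC_ARN": topic.topic_arn,
            },
        )

        # Grants mirror the permissions described above.
        staging_bucket.grant_read(dq_fn)
        raw_bucket.grant_write(dq_fn)
        topic.grant_publish(dq_fn)

        # Fire the DQ Lambda whenever a new object lands in staging.
        staging_bucket.add_event_notification(
            s3.EventType.OBJECT_CREATED,
            s3n.LambdaDestination(dq_fn),
        )
```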
**Data Ingestion Lambda (`lambda_ingestion_api`):**

- Dependencies: `requests`, a library for making HTTP requests.
- Fetches data from an API endpoint and ingests it into an Amazon S3 bucket.
- Currently fetches hourly forecast data from weather.gov for New York City (https://api.weather.gov/gridpoints/OKX/36,36/forecast/hourly).
- Handles errors during data fetching or ingestion and publishes notifications to an SNS topic.
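A minimal sketch of what such a handler can look like, assuming `STAGING_BUCKET` and `TOPIC_ARN` environment variables and a timestamped key layout (none of which are confirmed by this README):

```python
# Hedged sketch of the ingestion handler; env var names and the object-key
# layout are assumptions, not the repository's exact code.
import json
import os
from datetime import datetime, timezone

import boto3
import requests

API_URL = "https://api.weather.gov/gridpoints/OKX/36,36/forecast/hourly"
s3 = boto3.client("s3")
sns = boto3.client("sns")

def lambda_handler(event, context):
    try:
        response = requests.get(API_URL, timeout=30)
        response.raise_for_status()

        # Write the raw JSON payload to the staging bucket, keyed by timestamp.
        now = datetime.now(timezone.utc)
        key = f"forecast/{now:%Y-%m-%dT%H%M%S}.json"
        s3.put_object(
            Bucket=os.environ["STAGING_BUCKET"],
            Key=key,
            Body=json.dumps(response.json()),
        )
    except Exception as exc:
        # Surface fetch/ingestion failures via SNS, as described above.
        sns.publish(
            TopicArn=os.environ["TOPIC_ARN"],
            Subject="Data ingestion failed",
            Message=str(exc),
        )
        raise
```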
**Data Quality Lambda (`lambda_dq`):**

- Dependencies: `pandas`, a library for data manipulation.
- Performs data quality checks and partitioning on ingested data before storing it in an Amazon S3 bucket.
- Currently only checks for empty data, but more checks can be added based on requirements.
- Performs simple data manipulation, such as renaming columns and adding a new column for temperature in Celsius.
- Handles errors during data processing or writing and publishes notifications to an SNS topic.
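A hedged sketch of the data-quality handler, assuming the weather.gov payload shape (`properties.periods`), illustrative column names, and a Hive-style `year=/month=/day=` partition scheme:

```python
# Hedged sketch of the DQ handler; column names, partition scheme, and env
# vars are illustrative assumptions.
import json
import os
from urllib.parse import unquote_plus

import boto3
import pandas as pd

s3 = boto3.client("s3")
sns = boto3.client("sns")

def lambda_handler(event, context):
    try:
        # The S3 event notification carries the new staging object's location.
        record = event["Records"][0]["s3"]
        bucket = record["bucket"]["name"]
        key = unquote_plus(record["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        payload = json.loads(body)
        df = pd.DataFrame(payload["properties"]["periods"])  # hourly rows

        # Data quality check: reject empty payloads.
        if df.empty:
            raise ValueError(f"No records found in s3://{bucket}/{key}")

        # Simple transformations: rename a column and derive Celsius.
        df = df.rename(columns={"temperature": "temperature_f"})
        df["temperature_c"] = (df["temperature_f"] - 32) * 5 / 9

        # Write CSV into a Hive-style partition in the raw bucket.
        now = pd.Timestamp.now(tz="UTC")
        out_key = (
            f"forecast/year={now.year}/month={now.month:02d}/day={now.day:02d}/"
            f"{now:%H%M%S}.csv"
        )
        s3.put_object(
            Bucket=os.environ["RAW_BUCKET"],
            Key=out_key,
            Body=df.to_csv(index=False).encode("utf-8"),
        )
    except Exception as exc:
        # Surface processing/write failures via SNS, as described above.
        sns.publish(
            TopicArn=os.environ["TOPIC_ARN"],
            Subject="Data quality step failed",
            Message=str(exc),
        )
        raise
```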