
AWS CDK Data Pipeline Deployment and Documentation

Introduction

This repository contains the solution for the API ingestion challenge. The challenge required ingesting data from an API into AWS on a cadence aligned with the frequency of changes in the data. Below is a summary of the solution:

  1. Version Control:

    • The codebase is versioned using Git.
  2. Infrastructure-as-Code (IaC):

    • AWS CDK v2 is used to define all AWS resources.
  3. Lambda for Data Ingestion:

    • AWS Lambda functions are used for data ingestion.
    • The data ingestion Lambda is triggered by an AWS EventBridge rule on an hourly basis.
  4. Data Storage:

    • Data is ingested into a staging S3 bucket.
    • Separate buckets are used for the staging layer (raw JSON data) and the raw layer (Hive-partitioned CSV data).
  5. ETL Step:

    • Data is transformed and moved to a partitioned raw layer.
  6. Programming Language:

    • Python is used to implement both the Lambda functions and the CDK stack.

Prerequisites

  • AWS CLI configured with appropriate permissions.
  • Python installed.
  • AWS CDK installed.
  • Any additional packages for the Lambda functions will either need to be added as a layer or included within the Lambda folder.

Setup Instructions

Most of the steps have already been performed. To replicate the deployment:

  • Activate the virtual environment within the cdk folder.
  • Configure the profile/credentials within the AWS CLI.
  • Run the cdk bootstrap and cdk deploy commands.
  • Use the Step 5 command to rebuild the Lambda zip files on code change.
  • Manually subscribe to the SNS topic after deployment to receive notifications (see the example command below).
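
For example, an email endpoint can be subscribed with the AWS CLI; the topic ARN and email address below are placeholders to replace with the deployed topic's ARN and your own address:

    aws sns subscribe --topic-arn arn:aws:sns:<region>:<account-id>:DataIngestionTopic --protocol email --notification-endpoint <your-email@example.com>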

Follow these steps to deploy the AWS CDK data pipeline:

  1. Create CDK Directory:

    mkdir cdk
    cd cdk
  2. Initialize CDK App:
    Initialize a new CDK app using Python as the language:

    cdk init app --language python
  3. Set up Virtual Environment:
    Create and activate a virtual environment to isolate dependencies:

    python -m venv .venv
    .\.venv\Scripts\activate  # For Windows
  4. Update pip and Install Dependencies:
    Ensure pip is up-to-date and install project dependencies:

    python -m pip install --upgrade pip
    pip install -r requirements.txt
  5. Zip Lambda Functions:
    Navigate to the Lambda function directories and zip their contents:

    cd ../src/lambda_ingestion_api
    zip -r ../../cdk/lambda_ingestion_api.zip *
    
    cd ../lambda_dq
    zip -r ../../cdk/lambda_dq.zip *
  6. Navigate to CDK Directory:

    cd ../../cdk
  7. Copy CDK Stack file:
    Copy the CDK stack Python file to the CDK directory:

    copy ../src/cdk_stack.py cdk/.
  8. Bootstrap AWS Environment:
    Bootstrap the AWS environment to prepare for CDK deployment:

    cdk bootstrap
  9. Deploy CDK Stack:
    Deploy the CDK stack to AWS:

    cdk deploy

Architecture

Data Pipeline

CDK Stack Definition

The cdk_stack.py script defines an AWS CDK stack that automates the setup of the infrastructure for data ingestion and quality checks. Below is a summary of the key components; a brief code sketch follows the summary.

S3 Buckets:

  • DataFoundryStagingLayer: Used for staging data.
  • DataFoundryRawLayer: Used for storing raw data.

SNS Topic:

  • DataIngestionTopic: Provides notifications for data ingestion processes.

Lambda Functions:

  1. Data Ingestion Lambda:

    • Triggered by an EventBridge rule.
    • Writes data to the staging bucket.
    • Publishes notifications to the SNS topic.
  2. Data Quality Lambda:

    • Performs data quality checks and partitioning.
    • Triggered when new files are added to the staging bucket.
    • Reads from the staging bucket.
    • Writes processed data to the raw bucket.
    • Publishes notifications to the SNS topic.

Permissions:

  • Lambda functions are granted permissions to read from/write to S3 buckets and publish to the SNS topic.

EventBridge Rule:

  • DataIngestionRule: Triggers the data ingestion Lambda at regular intervals (every hour).

S3 Event Notification:

  • Configured on the staging bucket to trigger the data quality Lambda when new files are uploaded.
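
The following is a minimal sketch of how these components might be wired together in cdk_stack.py using CDK v2 in Python. The construct IDs and zip file names come from this document; the runtime version and handler names are assumptions, not verbatim from the repository:

    from aws_cdk import (
        Stack, Duration,
        aws_s3 as s3,
        aws_sns as sns,
        aws_lambda as _lambda,
        aws_events as events,
        aws_events_targets as targets,
        aws_s3_notifications as s3n,
    )
    from constructs import Construct

    class DataFoundryStack(Stack):
        def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
            super().__init__(scope, construct_id, **kwargs)

            # S3 buckets for the staging (raw JSON) and raw (partitioned CSV) layers
            staging_bucket = s3.Bucket(self, "DataFoundryStagingLayer")
            raw_bucket = s3.Bucket(self, "DataFoundryRawLayer")

            # SNS topic for pipeline notifications
            topic = sns.Topic(self, "DataIngestionTopic")

            # Data ingestion Lambda, packaged as a zip (see Step 5)
            ingestion_fn = _lambda.Function(
                self, "DataIngestionLambda",
                runtime=_lambda.Runtime.PYTHON_3_11,
                handler="lambda_function.lambda_handler",  # assumed handler name
                code=_lambda.Code.from_asset("lambda_ingestion_api.zip"),
            )

            # Data quality Lambda
            dq_fn = _lambda.Function(
                self, "DataQualityLambda",
                runtime=_lambda.Runtime.PYTHON_3_11,
                handler="lambda_function.lambda_handler",  # assumed handler name
                code=_lambda.Code.from_asset("lambda_dq.zip"),
            )

            # Permissions: read from/write to the buckets and publish to the topic
            staging_bucket.grant_write(ingestion_fn)
            staging_bucket.grant_read(dq_fn)
            raw_bucket.grant_write(dq_fn)
            topic.grant_publish(ingestion_fn)
            topic.grant_publish(dq_fn)

            # Hourly EventBridge rule triggering the ingestion Lambda
            events.Rule(
                self, "DataIngestionRule",
                schedule=events.Schedule.rate(Duration.hours(1)),
                targets=[targets.LambdaFunction(ingestion_fn)],
            )

            # S3 event notification: new staging objects trigger the DQ Lambda
            staging_bucket.add_event_notification(
                s3.EventType.OBJECT_CREATED,
                s3n.LambdaDestination(dq_fn),
            )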

Data Ingestion Lambda Function Documentation

Dependencies:

  • requests: Library for making HTTP requests.

Functionality:

  • Fetches data from an API endpoint and ingests it into an Amazon S3 bucket.
  • Currently fetches hourly forecast data from weather.gov for New York City. (https://api.weather.gov/gridpoints/OKX/36,36/forecast/hourly)
  • Handles errors during data fetching or ingestion and publishes notifications to an SNS topic.
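
A minimal sketch of the handler logic, assuming the bucket and topic are passed via environment variables; the variable names and object-key format are illustrative, not taken from the repository:

    import json
    import os
    from datetime import datetime, timezone

    import boto3
    import requests

    s3 = boto3.client("s3")
    sns = boto3.client("sns")

    API_URL = "https://api.weather.gov/gridpoints/OKX/36,36/forecast/hourly"

    def lambda_handler(event, context):
        bucket = os.environ["STAGING_BUCKET"]    # assumed env var
        topic_arn = os.environ["SNS_TOPIC_ARN"]  # assumed env var
        try:
            # Fetch the hourly forecast and fail fast on HTTP errors
            response = requests.get(API_URL, timeout=30)
            response.raise_for_status()

            # Write the raw JSON payload to the staging bucket, keyed by timestamp
            key = f"forecast_{datetime.now(timezone.utc):%Y%m%d%H%M%S}.json"
            s3.put_object(Bucket=bucket, Key=key, Body=response.text)

            sns.publish(TopicArn=topic_arn,
                        Message=f"Ingestion succeeded: s3://{bucket}/{key}")
            return {"statusCode": 200, "body": json.dumps({"key": key})}
        except Exception as exc:
            # Any fetch or write failure is reported via SNS before re-raising
            sns.publish(TopicArn=topic_arn, Message=f"Ingestion failed: {exc}")
            raise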

Data Quality Lambda Function Documentation

Dependencies:

  • pandas: Library for data manipulation.

Functionality:

  • Performs data quality checks and partitioning on ingested data before storing it into an Amazon S3 bucket.
  • Currently only checks for empty data, but more checks can be added based on requirements.
  • Performs simple data manipulation, such as renaming columns and adding a new column for temperature in Celsius.
  • Handles errors during data processing or writing and publishes notifications to an SNS topic.
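
A minimal sketch of the data quality handler, assuming the standard S3 event shape delivered by the bucket notification; the column names, environment variable names, and Hive partition keys are assumptions based on the weather.gov payload and the description above:

    import io
    import json
    import os
    from datetime import datetime, timezone

    import boto3
    import pandas as pd

    s3 = boto3.client("s3")
    sns = boto3.client("sns")

    def lambda_handler(event, context):
        raw_bucket = os.environ["RAW_BUCKET"]    # assumed env var
        topic_arn = os.environ["SNS_TOPIC_ARN"]  # assumed env var
        record = event["Records"][0]["s3"]
        src_bucket = record["bucket"]["name"]
        src_key = record["object"]["key"]
        try:
            # Read the staged JSON and flatten the forecast periods
            body = s3.get_object(Bucket=src_bucket, Key=src_key)["Body"].read()
            periods = json.loads(body)["properties"]["periods"]
            df = pd.DataFrame(periods)

            # Data quality check: reject empty payloads
            if df.empty:
                raise ValueError(f"Empty data in s3://{src_bucket}/{src_key}")

            # Simple transformations: rename a column and derive Celsius
            df = df.rename(columns={"temperature": "temperature_f"})
            df["temperature_c"] = (df["temperature_f"] - 32) * 5 / 9

            # Write CSV into a Hive-style partition in the raw bucket
            now = datetime.now(timezone.utc)
            dest_key = (f"year={now:%Y}/month={now:%m}/day={now:%d}/"
                        f"hour={now:%H}/forecast.csv")
            buf = io.StringIO()
            df.to_csv(buf, index=False)
            s3.put_object(Bucket=raw_bucket, Key=dest_key, Body=buf.getvalue())

            sns.publish(TopicArn=topic_arn,
                        Message=f"DQ succeeded: s3://{raw_bucket}/{dest_key}")
        except Exception as exc:
            # Any processing or write failure is reported via SNS before re-raising
            sns.publish(TopicArn=topic_arn, Message=f"DQ failed: {exc}")
            raise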
