Northcoders Data Engineering Final Project

This Northcoders Data Engineering final project is a pipeline that transforms a SQL database of a shop's sales records (in Online Transaction Processing, "OLTP", format) into a structured data warehouse (in Online Analytical Processing, "OLAP", format), all hosted in Amazon Web Services (AWS). See the end of the README for images of the entity relationship diagrams (ERDs) for the initial and transformed databases.

The Pipeline

1. The pipeline's "extraction" Lambda function collects both archived and new data entries by scanning the database periodically for updates. It converts new, unique data to CSV files, which are stored in an S3 bucket, and logs its activity to CloudWatch. The database credentials are stored in Secrets Manager, and Systems Manager is used to store the timestamps of previous runs (sketched below).
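A minimal sketch of what this extraction step might look like, in Python with boto3 and pg8000. The secret name OLTPCredentials comes from this README; the parameter name, bucket name, table, and column names are illustrative assumptions, and the real function in this repository may differ:

    import csv
    import io
    import json
    from datetime import datetime, timezone

    import boto3
    import pg8000.native

    def lambda_handler(event, context):
        # Credentials for the OLTP database, stored in Secrets Manager.
        secret = json.loads(
            boto3.client("secretsmanager")
            .get_secret_value(SecretId="OLTPCredentials")["SecretString"]
        )
        ssm = boto3.client("ssm")
        # Hypothetical parameter name; holds the previous run's timestamp.
        last_run = ssm.get_parameter(Name="/nc_pipeline/last_extraction")[
            "Parameter"
        ]["Value"]

        conn = pg8000.native.Connection(
            user=secret["user"],
            password=secret["password"],
            host=secret["host"],
            database=secret["database"],
        )
        # Only rows changed since the previous run are new and unique.
        rows = conn.run(
            "SELECT * FROM sales_order WHERE last_updated > :ts", ts=last_run
        )
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow([col["name"] for col in conn.columns])
        writer.writerows(rows)

        now = datetime.now(timezone.utc).isoformat()
        # Hypothetical bucket name; one CSV per table per run.
        boto3.client("s3").put_object(
            Bucket="nc-pipeline-ingestion",
            Key=f"sales_order/{now}.csv",
            Body=buffer.getvalue(),
        )
        # Record this run so the next invocation stays incremental.
        ssm.put_parameter(
            Name="/nc_pipeline/last_extraction", Value=now, Overwrite=True
        )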

2. Each upload to that bucket triggers a second "processing" Lambda function, which transforms and normalises the data using Pandas DataFrames and stores the results in Parquet format in a second S3 bucket (sketched below).
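A minimal sketch of the processing step. The event layout is the standard S3 object-created notification; the bucket name and the particular dimension/fact split are assumptions for illustration:

    import io

    import boto3
    import pandas as pd

    def lambda_handler(event, context):
        # Standard S3 ObjectCreated event structure.
        record = event["Records"][0]["s3"]
        bucket = record["bucket"]["name"]
        key = record["object"]["key"]

        s3 = boto3.client("s3")
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        df = pd.read_csv(io.BytesIO(body))

        # Illustrative normalisation: split the flat table into one
        # dimension frame and one fact frame for the star schema.
        dim_staff = (
            df[["staff_id", "first_name", "last_name"]]
            .drop_duplicates()
            .reset_index(drop=True)
        )
        fact_sales = df[
            ["sales_order_id", "staff_id", "units_sold", "unit_price"]
        ]

        for name, frame in [("dim_staff", dim_staff), ("fact_sales", fact_sales)]:
            out = io.BytesIO()
            frame.to_parquet(out, index=False)  # needs pyarrow or fastparquet
            s3.put_object(
                Bucket="nc-pipeline-processed",  # hypothetical bucket name
                Key=f"{name}/{key.removesuffix('.csv')}.parquet",
                Body=out.getvalue(),
            )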

3. Finally, a third "storage" Lambda function scans the second bucket periodically for updates, converts the Parquet data back to SQL, and loads it into a data warehouse arranged as a star schema (sketched below).
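A minimal sketch of the storage step. The secret name OLAPCredentials comes from this README; the bucket name, object key, table, and column names are assumptions, and the real function would track which files it has already loaded:

    import io
    import json

    import boto3
    import pandas as pd
    import pg8000.native

    def lambda_handler(event, context):
        s3 = boto3.client("s3")
        # Hypothetical bucket and key names.
        body = s3.get_object(
            Bucket="nc-pipeline-processed", Key="dim_staff/latest.parquet"
        )["Body"].read()
        df = pd.read_parquet(io.BytesIO(body))

        # Credentials for the warehouse, stored in Secrets Manager.
        secret = json.loads(
            boto3.client("secretsmanager")
            .get_secret_value(SecretId="OLAPCredentials")["SecretString"]
        )
        conn = pg8000.native.Connection(
            user=secret["user"],
            password=secret["password"],
            host=secret["host"],
            database=secret["database"],
        )
        # Load each row into the matching star-schema table.
        for row in df.itertuples(index=False):
            conn.run(
                "INSERT INTO dim_staff (staff_id, first_name, last_name) "
                "VALUES (:sid, :fn, :ln)",
                sid=row.staff_id, fn=row.first_name, ln=row.last_name,
            )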

The entire pipeline infrastructure is managed using Terraform.

Prerequisites

To use the pipeline, ensure you have met the following requirements:

  • You have installed the latest version of Python and set up a virtual environment in the nc_pipeline directory.
    python -m venv venv
    source venv/bin/activate
    
  • You have installed the required packages:
    make requirements
    

Using the Pipeline

To run the tests, the test SQL databases must first be created:

  • Create a .env file in the nc_pipeline directory with the following information (a sketch of how the tests can read these values follows this list):
    TEST_HOST = 'localhost'
    TEST_DATABASE = 'oltp_test'
    TEST_DATA_WAREHOUSE = 'olap_test'
    USER = <local sql username>
    PASSWORD = <local sql password>
  • Create the initial OLTP test database:
    psql -f tests/test_extraction/data/test-extraction.sql
  • Create the transformed OLAP test database:
    psql -f tests/test_storage/data/test-insertion.sql
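A minimal sketch of a test helper that consumes the .env values, assuming the python-dotenv package; the helper name is hypothetical:

    import os

    import pg8000.native
    from dotenv import load_dotenv

    # override=True because USER usually already exists in the shell
    # environment as the login name.
    load_dotenv(override=True)

    def get_test_connection():
        # Connects to the OLTP test database created above.
        return pg8000.native.Connection(
            user=os.environ["USER"],
            password=os.environ["PASSWORD"],
            host=os.environ["TEST_HOST"],
            database=os.environ["TEST_DATABASE"],
        )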

To use the pipeline, the AWS CLI must be installed and the OLTPCredentials and OLAPCredentials must be stored in AWS Secrets Manager.
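The README does not spell out the key names inside each secret; a hedged sketch of storing one with boto3, matching the key layout assumed in the extraction sketch above:

    import json

    import boto3

    # Key names are assumptions -- match whatever the Lambda code expects.
    boto3.client("secretsmanager").create_secret(
        Name="OLTPCredentials",
        SecretString=json.dumps({
            "user": "<db username>",
            "password": "<db password>",
            "host": "<db host>",
            "database": "<db name>",
        }),
    )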

Contributors

ERD Images

OLTP Database ERD: SampleDB

OLAP Database ERD: SampleDW-Sales
