This Northcoders Data Engineering Final Project is a pipeline which transforms an SQL database of a shop's sales records (in Online Transaction Processing "OLTP" format) to a structured data warehouse (in Online Analytical Processing "OLAP" format); all hosted in amazon web services (AWS). See the end of the ReadMe for images of the entity relationship diagrams (ERDs) for the initial and transformed databases.
1. The pipeline's "extraction" Lambda function collects both archive and new data entries by scanning the database periodically for updates. It converts new, unique data to CSV files which are stored in an S3 bucket; and logs in CloudWatch. The database credentials are stored in Secrets Manager; and Systems Manager is used to store timestamps.
2. Any bucket upload event triggers a second "processing" Lambda function which transforms and normalises the data using Pandas DataFrames and stores them in parquet format in a second S3 bucket.
3. Finally, a third "storage" Lambda function scans the second bucket periodically for updates, which the pipeline converts back to SQL and loads to a data warehouse in star format.
The entire pipeline infrastructure is managed using Terraform.
To use the pipeline, ensure you have met the following requirements:
- You have installed the latest version of Python and set up a virtual environment in the nc_pipeline directory.
python -m venv venv source venv/bin/activate
- You have installed the required packages:
make requirements
To run tests, the test SQL databases must be created.
- Create a .env file in the nc_pipeline directory with the following information:
TEST_HOST = 'localhost'
TEST_DATABASE = 'oltp_test'
TEST_DATA_WAREHOUSE = 'olap_test'
USER = <local sql username>
PASSWORD = <local sql password>
- Create initial otlp test database:
psql -f tests/test_extraction/data/test-extraction.sql
- Create transformed olap test database:
psql -f tests/test_storage/data/test-insertion.sql
To use the pipeline, the AWS CLI must be installed and the OLTPCredentials and OLAPCredentials must be stored in AWS Secrets Manager.
- Tom Avery @averz87 🧮
- Dylan Hickman-Singh @dylanhs33 ⚽
- Philippa Clarkson @philupa 🦔
- Philip Taylor @phil-taylor-sj ♟️
- Victoria Messam @VMess1 👾