This data engineering case was made for Digital Power. It is an ETL pipeline that gives insight into the estimated number of passengers on board of flights arriving at and departing from Schiphol Airport. It can be run both locally and in the cloud using AWS EC2 and Apache Airflow.
- It extracts `json` data from the Schiphol API, as well as `.csv` files with information on aircraft capacity and passenger load factors.
- It transforms the extracted data into a dataframe using Pandas.
- It loads the dataframe as a `.csv` file into AWS S3.
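
At a glance, the flow looks roughly like the sketch below. Note that the endpoint headers, file names, and column names in it are illustrative assumptions, not the repo's actual code:

```python
import boto3
import pandas as pd
import requests

# Extract: json flight data from the Schiphol API, plus local .csv files
# with aircraft capacities and passenger load factors (names assumed).
resp = requests.get(
    "https://api.schiphol.nl/public-flights/flights",
    headers={
        "app_id": "YOUR_APP_ID",
        "app_key": "YOUR_APP_KEY",
        "ResourceVersion": "v4",
        "Accept": "application/json",
    },
)
flights = pd.json_normalize(resp.json()["flights"])
capacity = pd.read_csv("aircraft_capacity.csv")           # seats per aircraft type
load_factors = pd.read_csv("passenger_load_factors.csv")  # share of seats filled

# Transform: estimate passengers on board as seats * load factor.
df = flights.merge(capacity, on="aircraft_type").merge(load_factors, on="month")
df["estimated_passengers"] = (df["seats"] * df["load_factor"]).round()

# Load: write the dataframe as a .csv file to an S3 bucket.
boto3.client("s3").put_object(
    Bucket="YOUR_BUCKET", Key="flights.csv", Body=df.to_csv(index=False)
)
```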
You can either run the project locally or in the cloud. I will give instructions for both, starting with the local setup.
- Install Python 3.10.4 and create a virtual environment.
- Clone the repository.

  ```bash
  git clone https://github.com/has-ctrl/flights-etl-pipeline.git
  ```
- Install the dependencies using `pip`.

  ```bash
  pip install -r requirements.txt
  ```
- Get Schiphol Flight API keys and enter them in `.secret/api_creds.json`.
- Create an AWS account and enter your API keys and S3 `bucket_url` in `.secret/aws_creds.json`. (A sketch of an assumed layout for both files follows this list.)
Run the
test.py
script and view resulting.csv
file in your S3 bucket.python test.py
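
The exact field names inside the two credential files aren't documented here, so the layout below is an assumption; a script could then read them like this:

```python
import json

import boto3

# Assumed layout of the credential files (actual field names may differ):
# .secret/api_creds.json -> {"app_id": "...", "app_key": "..."}
# .secret/aws_creds.json -> {"aws_access_key_id": "...",
#                            "aws_secret_access_key": "...",
#                            "bucket_url": "..."}
with open(".secret/api_creds.json") as f:
    api_creds = json.load(f)
with open(".secret/aws_creds.json") as f:
    aws_creds = json.load(f)

# Build an S3 client from the stored AWS keys.
s3 = boto3.client(
    "s3",
    aws_access_key_id=aws_creds["aws_access_key_id"],
    aws_secret_access_key=aws_creds["aws_secret_access_key"],
)
```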
To run the pipeline in the cloud instead:

- Create an AWS account, launch an EC2 instance (Ubuntu, with at least 4 GB of RAM, e.g. a t3.medium), and save the key-pair `.pem` file in your working directory.
- Connect to your instance using an SSH client with the key-pair `.pem` file, like so:

  ```bash
  ssh -i "KEY-PAIR.pem" ubuntu@ec2-X-XX-XX-XX.eu-central-1.compute.amazonaws.com
  ```
- Install Python and Apache Airflow, and clone the repository.

  ```bash
  sudo apt-get update
  sudo apt install python3-pip
  sudo pip install apache-airflow
  cd airflow/
  git clone https://github.com/has-ctrl/flights-etl-pipeline.git
  ```
- Install the dependencies using `pip`.

  ```bash
  pip install -r requirements.txt
  ```
- Get Schiphol Flight API keys and enter them in `.secret/api_creds.json`.
- Enter your AWS API keys and S3 `bucket_url` in `.secret/aws_creds.json` (the account was already created in the first step).
Edit the
airflow.cfg
config file by changing the dags_folder to/home/ubuntu/airflow/flights-etl-pipeline
and settingenable_xcom_pickling = True
. -
Run Airflow and copy username and password from the logged output.
airflow standalone
- Optionally, trigger the scheduler manually by logging into the Airflow console in your web browser via the instance's Public IPv4 DNS (e.g. `ec2-X-XX-XX-XX.eu-central-1.compute.amazonaws.com:8080/`) after opening port 8080 in the instance's security group.
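
For context on the `enable_xcom_pickling` setting: Airflow serialises data passed between tasks via XCom, and pickling must be enabled for a Pandas dataframe to make that trip. Below is a minimal sketch of how the extract/transform/load steps could be wired into a DAG; the task and function bodies are illustrative, not the repo's actual DAG:

```python
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**_):
    # Placeholder: in the real pipeline this would pull flights from the
    # Schiphol API and the capacity/load-factor .csv files.
    return pd.DataFrame({"flight": ["KL1001"], "seats": [180], "load_factor": [0.85]})


def transform(ti, **_):
    # XCom hands us the pickled dataframe from the extract task
    # (this is why enable_xcom_pickling = True is required).
    df = ti.xcom_pull(task_ids="extract")
    df["estimated_passengers"] = (df["seats"] * df["load_factor"]).round()
    return df


def load(ti, **_):
    df = ti.xcom_pull(task_ids="transform")
    df.to_csv("/tmp/flights.csv", index=False)  # real pipeline: upload to S3


with DAG(
    dag_id="flights_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3
```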