📚 A course brought to you by the Data Minded Academy.
These are the exercises used in the course Data Pipeline Part 2 at DSTI. The course has been developed by instructors at Data Minded. The exercises are meant to be completed in the lexicographical order determined by the name of their parent folders. That is, exercises inside the folder `b_foo` should be completed before those in `c_bar`, but both should come after those of `a_foo_bar`.
- Introduce good data engineering practices.
- Illustrate modular and easily testable data transformation pipelines using PySpark.
- Illustrate PySpark concepts such as lazy evaluation, caching & partitioning (not limited to these three, though).
- People working with (Py)Spark Structured Streaming and Apache Kafka, or soon to be working with them.
- Familiar with Python functions, variables, and the container data types `list`, `tuple`, `dict`, and `set`.
The lecturer first sets the right foundations for Python development and gradually builds up to Apache Kafka and Spark Structured Streaming data pipelines.
A high degree of participation is expected from the students: they will need to write code themselves and reason about the topics, so that they can better retain the knowledge.
Note: this course is not about writing the best streaming pipelines possible. There are many ways to skin a cat; in this course we show one (or sometimes a few) that should be suitable for the level of the participants.
Follow these instructions to set up JDK 11, Hadoop WinUtils, Spark binaries and environment variables on Windows/x64 System: Click Here
Spark Structured Streaming hands-on exercises
Open a new terminal and make sure you're in the `kafka_spark_streaming` directory. Then, run:
pip install -r requirements.txt
This will install any dependencies you might need to run this project in your virtual environment.
You will also need to download Nmap for Windows.
To start a listener with `ncat` (which ships with Nmap), open a new terminal and type `ncat -lk 9999`.
You can then send text to port 9999 simply by typing in that same terminal.
Check out `word_count.py` and implement the pure Python function `transform`. But first, make sure you understand the code and what it does.
You will start a new terminal with an `ncat` listener as explained before.
Count how many times each word appears in real-time.
Test your function with the `test_word_count.py` test file.
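If you get stuck, here is a minimal sketch of what such a `transform` could look like, assuming the socket source delivers a single string column named `value` (the exact signature expected by `word_count.py` may differ):

```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, explode, split


def transform(lines: DataFrame) -> DataFrame:
    # Split every incoming line on whitespace, turn the words into rows, and count them.
    words = lines.select(explode(split(col("value"), r"\s+")).alias("word"))
    return words.groupBy("word").count()
```

Keep in mind that a streaming aggregation like this needs `outputMode("complete")` or `outputMode("update")` when the query is started.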
Check out `file_streaming.py` and implement the pure Python function `transform`. But first, make sure you understand the code and what it does.
Have a look at the JSON files in `resources` > `invoices-json`.
You can see that there are nested fields. We want to flatten those files so that there are no nested fields in the final JSON files.
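One generic way to do this is to expand every struct column into top-level columns. This is only a sketch, not the course's reference solution; arrays of structs would additionally need `explode`, so check `utils > invoice_schema.py` to see what to expect:

```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import col
from pyspark.sql.types import StructType


def flatten_structs(df: DataFrame) -> DataFrame:
    # Replace each struct column with one top-level column per nested field,
    # e.g. a struct "Address" with field "City" becomes the column "Address_City".
    cols = []
    for field in df.schema.fields:
        if isinstance(field.dataType, StructType):
            cols += [
                col(f"{field.name}.{nested.name}").alias(f"{field.name}_{nested.name}")
                for nested in field.dataType.fields
            ]
        else:
            cols.append(col(field.name))
    return df.select(cols)
```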
Create a test file to test your function from exercise 2 (HINT: the invoice schema is already written in `utils` > `invoice_schema.py`).
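A possible shape for such a test, assuming `utils/invoice_schema.py` exposes the schema as a module-level variable called `schema` and that your function is called `transform` (both names are assumptions; adapt them to the actual code):

```python
import pytest
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

from file_streaming import transform          # assumption: function under test
from utils.invoice_schema import schema       # assumption: module-level schema variable


@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


def test_transform_leaves_no_nested_fields(spark):
    # An empty DataFrame with the real schema is enough to check the output columns.
    invoices = spark.createDataFrame([], schema)
    result = transform(invoices)
    assert not any(isinstance(f.dataType, StructType) for f in result.schema.fields)
```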
Repeat the previous exercise but using the Parquet file format instead of JSON. Adapt anything you need in your code.
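For reference, switching a streaming file source to Parquet usually comes down to changing the format string; in the sketch below `spark` and `schema` are assumed to come from your existing script, and the path is a placeholder rather than the folder used in the exercise:

```python
invoices = (
    spark.readStream
    .format("parquet")                    # was "json" in the previous exercise
    .schema(schema)                       # streaming file sources need an explicit schema
    .load("resources/invoices-parquet")   # placeholder path; point it at your Parquet folder
)
```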
Translate the exercises from the first day to use Spark Structured Streaming.
Apache Kafka hands-on exercises mixed with Spark Structured Streaming
Follow these instructions to set up Apache Kafka binaries and environment variables on Windows/x64 System: Click Here
IMPORTANT: Add the Spark SQL Kafka package to the `spark-defaults.conf` file in your Spark defaults folder, `C:\spark3\conf`:
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.4
Open a new terminal and make sure you're in the `kafka_spark_streaming` directory. Then, run:
pip install -r requirements.txt
This will install any dependencies you might need to run this project in your virtual environment.
Read the invoices that are being sent through Kafka in real time with Spark Structured Streaming and flatten the nested JSONs.
Check out `file_streaming.py` and implement the pure Python function `transform`. But first, make sure you understand the code and what it does.
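A rough sketch of the Kafka read side, assuming a broker on `localhost:9092`, a topic called `invoices`, and the invoice schema bound to a variable `schema` (all three are assumptions; check `kafka_commons.py` and the `kafka_scripts` for the real values):

```python
from pyspark.sql.functions import col, from_json

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")   # assumed broker address
    .option("subscribe", "invoices")                        # assumed topic name
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka delivers the payload as bytes in the "value" column: parse it with the invoice
# schema, then reuse the struct-flattening logic from the file streaming exercise.
invoices = (
    raw.select(from_json(col("value").cast("string"), schema).alias("invoice"))
       .select("invoice.*")
)
```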
- Read invoices from the Kafka topic.
- Create a notification record (with 3 fields):
  - `{"CustomerCardNo": "243252", "TotalAmount": 11115.0, "EarnedLoyaltyPoints": 2222.6}`
  - The column named `EarnedLoyaltyPoints` is a new column that you have to create; it is 20% of the `TotalAmount` column.
- Send the notification record to the Kafka topic:
  - A Kafka topic receives data as key-value pairs, so send the invoice number as the key and the notification record as the value.
Check out `notification.py` and implement the pure Python functions `transform` and `get_notification_dataframe`. But first, make sure you understand the code and what it does.
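One way the two functions could fit together, shown as a sketch only; the column names `CustomerCardNo` and `TotalAmount` come from the example record above, while `InvoiceNumber` is an assumed name for the invoice number column, so check the schema for the real one:

```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, struct, to_json


def get_notification_dataframe(invoices: DataFrame) -> DataFrame:
    # Build the three-field notification record; loyalty points are 20% of the total amount.
    return invoices.select(
        col("InvoiceNumber"),                                    # kept so it can become the Kafka key
        col("CustomerCardNo"),
        col("TotalAmount"),
        (col("TotalAmount") * 0.2).alias("EarnedLoyaltyPoints"),
    )


def transform(notifications: DataFrame) -> DataFrame:
    # The Kafka sink expects string or binary "key" and "value" columns.
    return notifications.select(
        col("InvoiceNumber").cast("string").alias("key"),
        to_json(struct("CustomerCardNo", "TotalAmount", "EarnedLoyaltyPoints")).alias("value"),
    )
```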
Let's do exercise 1 and exercise 2 at once: we'll read from the invoice topic and write to the output files at the same time as we write to the notifications topic. This saves execution time and resources, since we only need to read from Kafka once and can compute the transformations at the same time.
Check out `multi_query.py` and implement the pure Python functions `transform_flatten_reports` and `get_notification_df_transformed`. But first, make sure you understand the code and what it does.
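The gist of the multi-query pattern is a single streaming read feeding two independent `writeStream` queries. A sketch with placeholder paths, topic name and checkpoint locations, where `raw` stands for the DataFrame read from the invoice topic:

```python
flattened = transform_flatten_reports(raw)              # exercise function: flatten the invoices
notifications = get_notification_df_transformed(raw)    # exercise function: build key/value notifications

file_query = (
    flattened.writeStream
    .format("json")
    .option("path", "output/invoices-flat")             # placeholder output folder
    .option("checkpointLocation", "checkpoints/flat")   # each query needs its own checkpoint
    .start()
)

kafka_query = (
    notifications.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
    .option("topic", "notifications")                     # placeholder topic name
    .option("checkpointLocation", "checkpoints/notify")
    .start()
)

spark.streams.awaitAnyTermination()                      # keep both queries running
```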
- Create a test file to test your functions.
- Have a look at Amazon MSK and try to deploy a Kafka Cluster.
In the `resources` folder you will find all the input data (JSON, CSV, Parquet files) you need to do the exercises. The `utils` folder contains the `catalog.py` file, which was also used during the first class with the Spark DataFrame API, but this time adapted for Spark Structured Streaming. `invoice_schema.py` is the schema of the invoice messages written to the Kafka topic. Under `kafka_scripts` you will find all the necessary scripts to start Kafka (ZooKeeper, server, create topics, start producer, start consumer). In `kafka_commons.py` you will find common methods for all Kafka-related exercises.