This homework has two sections: the first focuses on theoretical questions about Kafka and streaming concepts, and the second asks you to create a small streaming application in your preferred programming language (Python or Java).
Please select the statements that are correct
- Kafka Node is responsible for storing topics [x]
- Zookeeper is removed from Kafka cluster starting from version 4.0 [x]
- Retention configuration ensures that messages do not get lost over a specific period of time. [x]
- Group-Id ensures the messages are distributed to associated consumers [x]
Please select the Kafka concepts that support reliability and availability
- Topic Replication [x]
- Topic Partitioning
- Consumer Group Id
- Ack All [x]
Please select the Kafka concepts that support scaling
- Topic Replication
- Topic Partitioning [x]
- Consumer Group Id [x]
- Ack All
Please select the attributes that are good candidates for a partitioning key. Consider the cardinality of the field you select and the scaling aspects of your application (a short sketch of how the key drives partition assignment follows the options).
- payment_type [x]
- vendor_id [x]
- passenger_count
- total_amount
- tpep_pickup_datetime
- tpep_dropoff_datetime
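
As a reminder of why cardinality matters here: the producer hashes the message key to pick a partition, so every record with the same key lands on the same partition and stays ordered relative to its peers. A minimal sketch, assuming the confluent-kafka Python client (the broker address, topic name, and sample ride are illustrative):

```python
# Sketch only: how the message key chooses a partition.
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # illustrative address

ride = {"vendor_id": "2", "payment_type": "1", "total_amount": 17.5}

# Records sharing a key hash to the same partition, which preserves their
# relative order and groups them for downstream keyed aggregations.
producer.produce(
    "rides_green",
    key=ride["vendor_id"],  # partition = hash(key) % number_of_partitions
    value=json.dumps(ride),
)
producer.flush()
```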
Which of the configurations below should be provided for a Kafka Consumer but are not needed for a Kafka Producer? (A sketch contrasting the two follows the options.)
- Deserializer Configuration [x]
- Topics Subscription [x]
- Bootstrap Server
- Group-Id [x]
- Offset [x]
- Cluster Key and Cluster-Secret
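
For reference, a minimal sketch assuming the confluent-kafka Python client (the broker address, group id, and topic name are illustrative): both sides need bootstrap servers, and on a secured cluster both need the key/secret, while the group id, offset-reset policy, and topic subscription are consumer-only.

```python
# Sketch only; values are illustrative.
from confluent_kafka import Consumer, Producer

# Producer: bootstrap servers (plus credentials on a secured cluster) suffice.
producer = Producer({"bootstrap.servers": "localhost:9092"})

# Consumer: additionally needs a group id and an offset policy, and must
# subscribe to topics before it can poll.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "rides-consumers",     # consumer-only
    "auto.offset.reset": "earliest",   # consumer-only: where to start reading
})
consumer.subscribe(["rides_green"])    # consumer-only: topic subscription
```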
Please implement a streaming application to find the popularity of PUlocationID across the green and fhv trip datasets. Please use the datasets fhv_tripdata_2019-01.csv.gz and green_tripdata_2019-01.csv.gz.
PS: If you encounter memory-related issues, you can use a smaller portion of these two datasets; it is not necessary to find the exact number for this question.
Your code should include the following:
- A producer that reads the CSV files and publishes the rides to the corresponding Kafka topics (such as rides_green, rides_fhv)
- A PySpark streaming application that reads the two Kafka topics, writes both of them into the topic rides_all, and applies aggregations to find the most popular pickup location
- Form for submitting: https://forms.gle/rK7268U92mHJBpmW7
- You can submit your homework multiple times. In this case, only the last submission will be used.
Deadline: 13 March (Monday), 22:00 CET
We will publish the solution here after the deadline.
For Question 6, ensure you:

- Download fhv_tripdata_2019-01.csv and green_tripdata_2019-01.csv into resources/fhv_tripdata and resources/green_tripdata respectively. PS: you need to unzip the compressed files first.
- Update the client.properties settings with your Confluent Cloud API keys and cluster details (a sketch of the expected format follows this list).
- Create the topics (all_rides, fhv_taxi_rides, green_taxi_rides) in the Confluent Cloud UI.
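
The exact values come from your own cluster; a librdkafka-style client.properties typically looks roughly like this (everything below is a placeholder):

```properties
# Placeholders only; substitute your cluster endpoint and API credentials.
bootstrap.servers=<CLUSTER_ENDPOINT>:9092
security.protocol=SASL_SSL
sasl.mechanisms=PLAIN
sasl.username=<CLUSTER_API_KEY>
sasl.password=<CLUSTER_API_SECRET>
```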
Run the producers for the two datasets:

```bash
python3 producer_confluent.py --type green
python3 producer_confluent.py --type fhv
```
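
For orientation, a minimal sketch of what a producer along these lines could look like. This is an assumed structure, not the official producer_confluent.py; the topic names, file paths, and read_config helper follow the steps above but are otherwise hypothetical.

```python
# Sketch of an assumed producer_confluent.py; not the official solution.
import argparse
import csv
import json

from confluent_kafka import Producer

TOPICS = {"green": "green_taxi_rides", "fhv": "fhv_taxi_rides"}
FILES = {
    "green": "resources/green_tripdata/green_tripdata_2019-01.csv",
    "fhv": "resources/fhv_tripdata/fhv_tripdata_2019-01.csv",
}


def read_config(path="client.properties"):
    """Parse key=value lines from client.properties into a librdkafka config dict."""
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                key, value = line.split("=", 1)
                config[key] = value
    return config


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--type", choices=["green", "fhv"], required=True)
    args = parser.parse_args()

    producer = Producer(read_config())
    with open(FILES[args.type], newline="") as f:
        for row in csv.DictReader(f):
            # The datasets spell the column differently: PULocationID (green)
            # vs PUlocationID (fhv).
            key = row.get("PULocationID") or row.get("PUlocationID") or ""
            producer.produce(TOPICS[args.type], key=key, value=json.dumps(row))
            producer.poll(0)  # serve delivery callbacks as we go
    producer.flush()


if __name__ == "__main__":
    main()
```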
Run the PySpark streaming application:

```bash
./spark-submit.sh streaming_confluent.py
```
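
And a minimal sketch of the streaming side; again an assumed structure rather than the official streaming_confluent.py. For brevity it counts pickup locations straight off the combined stream and prints to the console; writing the union back out to all_rides (a Kafka sink with a checkpoint location) and the SASL options needed for Confluent Cloud are left out.

```python
# Sketch of an assumed streaming_confluent.py; not the official solution.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rides-popularity").getOrCreate()

# Subscribing to both topics at once is equivalent to reading each topic
# separately and unioning the two streams.
rides = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # illustrative address
    .option("subscribe", "green_taxi_rides,fhv_taxi_rides")
    .option("startingOffsets", "earliest")
    .load()
)

# The producer keyed every record by its pickup location, so counting keys
# yields the popularity of each PUlocationID.
popularity = (
    rides.select(F.col("key").cast("string").alias("PUlocationID"))
    .groupBy("PUlocationID")
    .count()
    .orderBy(F.desc("count"))
)

query = (
    popularity.writeStream
    .outputMode("complete")  # sorting an aggregation requires complete mode
    .format("console")
    .start()
)
query.awaitTermination()
```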