- Introduction to Kafka Streams Processor API
- Explanation of the Code
- Concepts Used in This Application
- Project Setup
- Missing Components
Apache Kafka Streams is a client library for building applications and microservices where the input and output data are stored in Kafka clusters. It combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka's server-side cluster technology.
Kafka Streams simplifies the development of stream processing applications by providing a straightforward DSL for building robust and scalable stream processing applications in Java or Scala. It supports stateful and stateless processing, windowing, and time-based aggregations.
For more detailed information, refer to the official Kafka Streams documentation:
This Kafka Streams Example project demonstrates a comprehensive use of the Kafka Streams API to process streaming data. The project is structured into several Scala files, each serving a specific purpose in the stream processing pipeline.
- Applications.scala, ApplicationDetails.scala, ImportedApplications.scala: These files define case classes representing different types of data entities in our application. They are used to model the data that flows through our Kafka Streams application.
- Types.scala: This file defines type aliases and tuples used throughout the application, making the code more readable and maintainable.
- DataProcessor.scala: Contains the core processing logic. It defines a custom processor that joins data from different streams and manages state stores.
- Implicits.scala: Provides implicit SerDe (Serializer/Deserializer) functions for the data entities, enabling Kafka to serialize and deserialize these objects as they are sent to and received from topics.
- package.scala: Defines constants used throughout the application, such as topic names and state store names.
- KafkaStreamJoinsExample.scala: The main entry point of the application. It sets up the Kafka Streams topology, defining how different streams and tables are joined and processed.
- Stream-Table Joins: The application demonstrates how to perform joins between KStreams and KTables, which is a common operation in stream processing to enrich a real-time data stream with more static data.
- Stateful Processing: The use of state stores in
DataProcessor.scala
illustrates how to maintain and update state within a Kafka Streams application. - Retries: The application includes mechanisms to handle failed joins and retry processing, ensuring robustness in the face of data inconsistencies or operational issues.
The application simulates a scenario where application details are continuously streamed and need to be joined with corresponding application data. The result of this join is then processed and stored, demonstrating a typical use case in real-time data processing and aggregation.
This Kafka Streams Example serves as a comprehensive guide to understanding and implementing a Kafka Streams application using Scala, covering everything from basic setup to advanced stream processing techniques.
- KStream and KTable: Fundamental abstractions in Kafka Streams for representing real-time data streams and changelogs of state.
- State Stores: Local storage associated with stream processors for maintaining state.
- Processor API: Low-level API for stream processing, allowing more control over the processing logic.
- Serdes: Serialization and deserialization mechanisms used for converting data between Kafka's binary format and Java/Scala objects.
- Topology: The logical representation of the processing graph.
- Ensure Scala and sbt are installed on your system.
- Navigate to the project directory.
- Run
./gradlew data-source-applications:shadowJar
to build fat jar. Output will be indata-source-applications/build/libs/data-source-applications_2.12-0.1.0-all.jar
- Start your Kafka environment.
- Execute
create-topics.sh
to create the necessary Kafka topics. - Run
java -jar data-source-applications/build/libs/data-source-applications_2.12-0.1.0-all.jar
to start the Kafka Streams application.
This example is by no means production ready. It is lacking the following components.
- Tests
- Containerisation (docker & helm charts for deployment on kubernetes/openshift)
- Error handling (forwarding failures to error topics)
- Dependency injection (spring boot)
- Rocksdb as a backend for state stores (Highly recommended for production)