
Proposal: Migrating to Apache Kafka for Enhanced Data Processing and Reliability #265

pral2a opened this issue Oct 10, 2023 · 4 comments

pral2a commented Oct 10, 2023

Issue Description

The existing data ingestion architecture, which relies on MQTT for IoT communication, KairosDB for time-series storage, and a set of Rails functions orchestrated with Sidekiq and Redis, presents certain challenges that we should address. I propose evaluating a migration to a unified event streaming platform, Apache Kafka, leveraging Rails integrations such as Karafka or Racecar, among others.
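
As a rough illustration, a Karafka 2.x consumer replacing part of the current Sidekiq-based pipeline could look like the sketch below (the class name, topic, and `Storage.save_reading` call are hypothetical placeholders, not a final design):

```ruby
# Hypothetical sketch of a Karafka consumer for incoming readings.
class ReadingsConsumer < Karafka::BaseConsumer
  def consume
    messages.each do |message|
      # Each Kafka record replaces what is currently a Sidekiq job
      Storage.save_reading(message.payload) # placeholder persistence call
      mark_as_consumed(message)             # commit the offset once handled
    end
  end
end
```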

Design Goals

Improve Data Durability and Availability

By transitioning to Apache Kafka, we can enhance data durability and availability. Kafka provides robust data storage and replication mechanisms, ensuring data resilience even in the face of failures.
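
For instance, topics can be created with a replication factor greater than one so that every record is stored on several brokers. A minimal sketch using Karafka's admin API, assuming Karafka 2.x (the topic name and settings are illustrative only):

```ruby
# Create the ingest topic with 6 partitions and 3 replicas so that data
# survives the loss of a broker; require 2 in-sync replicas for acks.
Karafka::Admin.create_topic(
  'device-readings',            # hypothetical topic name
  6,                            # partitions
  3,                            # replication factor
  'min.insync.replicas' => '2'
)
```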

Make Internal Data Flows More Explicit

The proposed architecture will make data flows within the system more explicit and easier to understand. Kafka's topic-based approach allows for clear data segregation and routing.
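
As an example of what this could look like, the routing sketch below maps each pipeline stage to its own topic and consumer (all names are placeholders, assuming Karafka 2.x):

```ruby
# Hypothetical Karafka routing: one topic per stage makes each data flow
# explicit and independently consumable.
class KarafkaApp < Karafka::App
  setup do |config|
    config.kafka = { 'bootstrap.servers': 'localhost:9092' }
  end

  routes.draw do
    topic 'raw-readings' do
      consumer ValidationConsumer   # validate raw device messages
    end

    topic 'validated-readings' do
      consumer TimeSeriesConsumer   # write clean data to the time-series store
    end
  end
end
```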

Reduce Code Base Complexity

Consolidating on Kafka can simplify the architecture and reduce the codebase's complexity, which should make maintenance and troubleshooting easier.

Address Potential Issues with the Current MQTT Handler Gem

The current MQTT handler gem may have limitations or issues that we can overcome by connecting MQTT to Kafka directly. This might provide more flexibility and robustness in handling MQTT data.
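
If we bridged in-process rather than via an off-the-shelf proxy, a minimal sketch using the `mqtt` gem and WaterDrop (Karafka's producer) could look like this; broker addresses and topic names are placeholders:

```ruby
require 'mqtt'
require 'waterdrop'

# Kafka producer (WaterDrop is the producer library used by Karafka)
producer = WaterDrop::Producer.new do |config|
  config.kafka = { 'bootstrap.servers': 'localhost:9092' }
end

# Subscribe to all device reading topics and forward each MQTT message
# into a single raw-ingest Kafka topic.
MQTT::Client.connect('mqtt://localhost:1883') do |client|
  client.get('device/+/readings') do |topic, message|
    producer.produce_async(
      topic: 'raw-readings',
      key: topic,        # keep the original MQTT topic as the record key
      payload: message
    )
  end
end
```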

Additional Benefits

Integration with Apache Spark

The proposed Kafka-based architecture integrates well with Apache Spark, opening up the possibility of using Spark Streaming to process data in real time. It might also help bring post-processing tasks that currently go through the standard public API closer to the platform, while keeping them as independent Python definitions within the smartcitizen-data framework.

Next steps

  • Validate the existing overall architecture proposal
  • Compare Kafka to other existing solutions (long-term support, licensing, complexity, etc.)
  • Estimate required resources and phases
  • Look for similar open-source Rails projects integrating Kafka, for architecture best practices

timcowlishaw commented Oct 11, 2023

I can't speak to whether Kafka specifically is the right tool for this, but if this presented an opportunity to simplify the MQTT / data-ingest part of the Rails app I would very much welcome it - it's one of the areas with the most complexity and the least test coverage at the moment, so making changes there in its current state is very risky (see #263).

Moving the data pipeline out of Rails as far as possible (using Rails only as an API frontend over the datastore) would, I think, be ideal. Something like Spark might well be very useful to replace portions of the current data ingest pipeline in that case.

timcowlishaw commented:

I'm going to start having a look at the practicalities of this - Kafka Streams might be even better than Spark, and there's an MQTT-to-Kafka bridge available which could potentially simplify things considerably.

I'm gonna spend some time next week working out what would need to be replaced / rewritten in this system, and what the interfaces and boundaries would be with our current architecture, which should give us a better idea of costs and benefits.

@oscgonfer oscgonfer added this to the Post-refactor milestone Nov 9, 2023

timcowlishaw commented:

@oscgonfer and I had a chat this morning with Rune and Robert from NILU about how they use Kafka as part of the CitiObs project which was quite illuminating.

A few useful points:

  1. They use Kafka to broker and queue incoming data, which is then consumed by Apache Spark jobs for validation and for ingestion into an HBase database. This is similar to our proposed use case, and also covers the role that MQTT currently plays for us, since they receive data via a REST API rather than a messaging protocol.
  2. They use the Confluent Platform on their own hardware (i.e. not Confluent Cloud) under the community licence - they had the same reservations at first as I did about leaning too hard into a proprietary platform, but so far it has not caused any practical issues, and Confluent provides useful features (such as validation of incoming data against a schema at the point of ingest). This is especially relevant for us, as Confluent maintains an MQTT proxy for Kafka which seems difficult to use outside their platform.
  3. Brokering messages to different consumers who consume at different rates is not a problem, but it requires care - all of this is configurable by the developer in Kafka.
  4. They enrich incoming messages with schema data from their Postgres database at the point of ingest, before the messages hit the broker, which might be a neat way of getting around some of the tricky parts of providing data to third-party consumers that Oscar and I were discussing before Christmas (a rough sketch follows this list).
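
A rough sketch of what point 4 could look like in our stack, assuming an ActiveRecord `Device` model; the topic name and the `sensor_schema` column are hypothetical:

```ruby
require 'json'

# Hypothetical enrichment-at-ingest: look up device metadata in Postgres
# and merge it into the message before producing to the broker.
def enrich_and_publish(raw_payload, producer)
  reading = JSON.parse(raw_payload)
  device = Device.find_by(token: reading['device_token'])
  return if device.nil? # unknown device: drop (or route to a dead-letter topic)

  enriched = reading.merge(
    'device_id' => device.id,
    'sensor_schema' => device.sensor_schema, # placeholder column
    'owner_id' => device.owner_id
  )

  producer.produce_async(topic: 'enriched-readings', payload: enriched.to_json)
end
```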

I'll add more here if anything occurs to me! I think a useful next step when we have some more time would be to look into doing a little spike similar to my previous one using the Confluent stack to see whether that offers any advantages.

@oscgonfer oscgonfer modified the milestones: CitiObs, Long term Mar 20, 2024
@oscgonfer
Copy link
Contributor

Just dropping a quick message here.
I saw a potential alternative for this: https://nats.io/ @timcowlishaw
