
Proposal: Migrating to Apache Kafka for Enhanced Data Processing and Reliability #265

pral2a opened this issue Oct 10, 2023 · 4 comments

pral2a commented Oct 10, 2023

Issue Description

The existing data ingestion architecture, which relies on MQTT for IoT communication, KairosDB for time-series storage, and a set of Rails functions orchestrated with Sidekiq and Redis, presents certain challenges that we should address. I propose evaluating a migration to a unified event streaming platform, Apache Kafka, leveraging Rails integrations such as Karafka or Racecar, among others.
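
As a rough illustration, a Karafka 2.x consumer replacing part of the current Sidekiq-based pipeline could look like the sketch below (the class name, topic, and `Storage.save_reading` call are hypothetical placeholders, not a final design):

```ruby
# Hypothetical sketch of a Karafka consumer for incoming readings.
class ReadingsConsumer < Karafka::BaseConsumer
  def consume
    messages.each do |message|
      # Each Kafka record replaces what is currently a Sidekiq job
      Storage.save_reading(message.payload) # placeholder persistence call
      mark_as_consumed(message)             # commit the offset once handled
    end
  end
end
```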

Design Goals

Improve Data Durability and Availability

By transitioning to Apache Kafka, we can enhance data durability and availability. Kafka provides robust data storage and replication mechanisms, ensuring data resilience even in the face of failures.
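
For instance, topics can be created with a replication factor greater than one so that every record is stored on several brokers. A minimal sketch using Karafka's admin API, assuming Karafka 2.x (the topic name and settings are illustrative only):

```ruby
# Create the ingest topic with 6 partitions and 3 replicas so that data
# survives the loss of a broker; require 2 in-sync replicas for acks.
Karafka::Admin.create_topic(
  'device-readings',            # hypothetical topic name
  6,                            # partitions
  3,                            # replication factor
  'min.insync.replicas' => '2'
)
```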

Make Internal Data Flows More Explicit

The proposed architecture will make data flows within the system more explicit and easier to understand. Kafka's topic-based approach allows for clear data segregation and routing.
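
As an example of what this could look like, the routing sketch below maps each pipeline stage to its own topic and consumer (all names are placeholders, assuming Karafka 2.x):

```ruby
# Hypothetical Karafka routing: one topic per stage makes each data flow
# explicit and independently consumable.
class KarafkaApp < Karafka::App
  setup do |config|
    config.kafka = { 'bootstrap.servers': 'localhost:9092' }
  end

  routes.draw do
    topic 'raw-readings' do
      consumer ValidationConsumer   # validate raw device messages
    end

    topic 'validated-readings' do
      consumer TimeSeriesConsumer   # write clean data to the time-series store
    end
  end
end
```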

Reduce Code Base Complexity

Consolidating on Kafka can simplify the architecture and reduce the codebase's complexity, which should make maintenance and troubleshooting easier.

Address Potential Issues with the Current MQTT Handler Gem

The current MQTT handler gem may have limitations or issues that we can overcome by connecting MQTT to Kafka directly. This might provide more flexibility and robustness in handling MQTT data.
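
If we bridged in-process rather than via an off-the-shelf proxy, a minimal sketch using the `mqtt` gem and WaterDrop (Karafka's producer) could look like this; broker addresses and topic names are placeholders:

```ruby
require 'mqtt'
require 'waterdrop'

# Kafka producer (WaterDrop is the producer library used by Karafka)
producer = WaterDrop::Producer.new do |config|
  config.kafka = { 'bootstrap.servers': 'localhost:9092' }
end

# Subscribe to all device reading topics and forward each MQTT message
# into a single raw-ingest Kafka topic.
MQTT::Client.connect('mqtt://localhost:1883') do |client|
  client.get('device/+/readings') do |topic, message|
    producer.produce_async(
      topic: 'raw-readings',
      key: topic,        # keep the original MQTT topic as the record key
      payload: message
    )
  end
end
```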

Additional Benefits

Integration with Apache Spark

The proposed Kafka-based architecture integrates well with Apache Spark, opening up the possibility of using Spark Streaming to process data in real time. It might also help bring post-processing tasks that currently go through the standard public API closer to the platform, while keeping them as independent Python definitions within the smartcitizen-data framework.

Next steps

  • Validate the existing overall architecture proposal
  • Compare Kafka to other existing solutions (long-term support, licensing, complexity, etc.)
  • Estimate required resources and phases
  • Look for similar open-source Rails projects integrating Kafka, for architecture best practices

timcowlishaw commented Oct 11, 2023

I can't speak to whether Kafka specifically is the right tool for this, but if this presented an opportunity to simplify the MQTT / data-ingest part of the Rails app I would very much welcome it - it's one of the areas with the most complexity and the least test coverage at the moment, so making changes there in its current state is very risky (see #263).

Moving the data pipeline out of Rails as far as possible (using Rails only as an API frontend over the datastore) would, I think, be ideal. Something like Spark might well be very useful to replace portions of the current data ingest pipeline in that case.

timcowlishaw commented:

I'm going to start having a look at the practicalities of this - Kafka Streams might be even better than Spark, and there's an MQTT-to-Kafka bridge available which could potentially simplify things considerably.

I'm gonna spend some time next week working out what would need to be replaced / rewritten in this system, and what the interfaces and boundaries would be with our current architecture, which should give us a better idea of costs and benefits.

@oscgonfer oscgonfer added this to the Post-refactor milestone Nov 9, 2023

timcowlishaw commented:

@oscgonfer and I had a chat this morning with Rune and Robert from NILU about how they use Kafka as part of the CitiObs project which was quite illuminating.

A few useful points:

  1. They use Kafka to broker and queue incoming data, which is then consumed by Apache Spark jobs for validation and for ingestion into an HBase database. This is similar to our proposed use case, and also covers the role that MQTT currently plays for us, since they receive data via a REST API rather than a messaging protocol.
  2. They use the Confluent Platform on their own hardware (i.e. not Confluent Cloud) under the community licence - they had the same reservations at first as I did about leaning too hard into a proprietary platform, but so far it has not caused any practical issues, and Confluent provides useful features (such as validation of incoming data against a schema at the point of ingest). This is especially relevant for us, as Confluent maintains an MQTT proxy for Kafka which seems difficult to use outside their platform.
  3. Brokering messages to different consumers who consume at different rates is not a problem, but it requires care - all of this is configurable by the developer in Kafka.
  4. They enrich incoming messages with schema data from their Postgres database at the point of ingest, before the messages hit the broker, which might be a neat way of getting around some of the tricky parts of providing data to third-party consumers that Oscar and I were discussing before Christmas (a rough sketch follows this list).
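
A rough sketch of what point 4 could look like in our stack, assuming an ActiveRecord `Device` model; the topic name and the `sensor_schema` column are hypothetical:

```ruby
require 'json'

# Hypothetical enrichment-at-ingest: look up device metadata in Postgres
# and merge it into the message before producing to the broker.
def enrich_and_publish(raw_payload, producer)
  reading = JSON.parse(raw_payload)
  device = Device.find_by(token: reading['device_token'])
  return if device.nil? # unknown device: drop (or route to a dead-letter topic)

  enriched = reading.merge(
    'device_id' => device.id,
    'sensor_schema' => device.sensor_schema, # placeholder column
    'owner_id' => device.owner_id
  )

  producer.produce_async(topic: 'enriched-readings', payload: enriched.to_json)
end
```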

I'll add more here if anything occurs to me! I think a useful next step when we have some more time would be to look into doing a little spike similar to my previous one using the Confluent stack to see whether that offers any advantages.

@oscgonfer oscgonfer modified the milestones: CitiObs, Long term Mar 20, 2024
@oscgonfer
Copy link
Contributor

Just dropping a quick message here.
I saw a potential alternative for this: https://nats.io/ @timcowlishaw
