-
-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: Migrating to Apache Kafka for Enhanced Data Processing and Reliability #265
Comments
I can't speak to whether Kafka specifically is the right tool for this, but if this presented an opportunity to simplify the MQTT / Data ingest part of the rails app I would very much welcome it - it's one of the areas with the most complexity and the least test coverage at the moment, so making changes there in its current state is very risky (see #263 ) Moving the data pipeline out of rails as far as possible (and using it only as an API frontend over the datastore) i think would be the ideal. Something like Spark might well be very useful to replace portions of the current data ingest pipeline in that case |
I'm going to start having a look at the practicalities of this - Kafka Streams might be even better than Spark, and there's a MQTT-to Kafka bridge available which could potentially simplify things considerably. I'm gonna spend some time next week working out what would need to be replaced / rewritten in this system, and what the interfaces and boundaries would be with our current architecture, which should give us a better idea of costs and benefits. |
@oscgonfer and I had a chat this morning with Rune and Robert from NILU about how they use Kafka as part of the CitiObs project which was quite illuminating. A few useful points:
I'll add more here if anything occurs to me! I think a useful next step when we have some more time would be to look into doing a little spike similar to my previous one using the Confluent stack to see whether that offers any advantages. |
Just dropping a quick message here. |
Issue Description
The existing data ingestion architecture, which relies on MQTT for IOT communication, KairosDB for Time Series storage, and a set of Rails functions orchestrated with Sidekiq and Redis, presents certain challenges that we should address. I propose evaluating the migration to a unified event streaming platform, Apache Kafka, while leveraging tools like karafka for Rails or Racecar, among others.
Design Goals
Improve Data Durability and Availability
By transitioning to Apache Kafka, we can enhance data durability and availability. Kafka provides robust data storage and replication mechanisms, ensuring data resilience even in the face of failures.
Make Internal Data Flows More Explicit
The proposed architecture will make data flows within the system more explicit and easier to understand. Kafka's topic-based approach allows for clear data segregation and routing.
Reduce Code Base Complexity
Simplifying the architecture by consolidating on Kafka can potentially help reduce the codebase's complexity. This simplification can lead to easier maintenance and troubleshooting.
Address Potential Issues with the current MQTT Handler Gem
The current MQTT handler gem may have limitations or issues that we can overcome by connecting MQTT to Kafka directly. This might provide more flexibility and robustness in handling MQTT data.
Additional Benefits
Integration with Apache Spark
The proposed Kafka-based architecture can seamlessly integrate with Apache Spark. That opens up possibilities for utilizing Spark Streaming to process data in real-time. The architecture might also help to bring postprocessing tasks currently using the standard public API closer to the platform while keeping them as independent Python definitions within the smartcitizen-data framework.
Next steps
The text was updated successfully, but these errors were encountered: