A platform for IoT timeseries analytics using Apache Spark, FiloDB and the Flint ts library
A Big Data backed platform for analytics of timeseries IoT data in the cloud.
Some of the features of this project are:
- Ease of reading IoT device histories by metadata and time range using the project-haystack format
- Helpers and utilities for time series data analytics and processing
- Temporal (time-order preserving) merges, joins and other operations using the Flint ts library
- Support for streaming analytics, batch analytics and ad-hoc queries using Apache Spark and FiloDB (built on top of Apache Cassandra)
The platform is built on the following components:
- Apache Spark
- Apache Cassandra
- Elasticsearch
- Apache Kafka
- Spark Thrift Server
- Zeppelin Notebook
- Superset
Apache Spark, together with the FiloDB and Flint ts libraries, forms the core of this project.
Apache Spark™ is a fast and general engine for large-scale data processing.
Apache Spark can be used for batch processing, streaming, machine learning and graph analytics on data generated by IoT devices.
The ability to analyze time series data at scale is critical for the success of finance and IoT applications based on Spark. Flint is Two Sigma's implementation of highly optimized time series operations in Spark. It performs truly parallel and rich analyses on time series data by taking advantage of the natural ordering in time series data to provide locality-based optimizations.
The Flint ts library supports temporal (time-order preserving) joins, merges and other timeseries operations such as group by cycles (same timestamp), group by intervals, time-based windows, etc.
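Outside the helpers provided by this project, a plain Spark DataFrame can be wrapped into a Flint TimeSeriesDataFrame before applying these operations. A minimal sketch (the DataFrame name and its time column are assumptions, not part of this project's API):
from ts.flint import FlintContext

flint_context = FlintContext(sqlContext)
# "spark_df" is a hypothetical time-sorted Spark DataFrame with a "time" column
ts_df = flint_context.read.dataframe(spark_df)
# ts_df now supports temporal operations such as leftJoin, merge,
# summarizeWindows and summarizeIntervals (shown later in this document)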
FiloDB is a new open-source distributed, versioned, and columnar analytical database designed for modern streaming workloads.
FiloDB uses Apache Cassandra for data storage and is built around the Apache Spark framework. It offers flexible data partitioning, sorting on write, fast reads/writes and versioned data storage, which makes it well suited to timeseries and event data.
Currently tested with: Apache Spark 2.0.2
Libraries forked at:
- https://github.com/itsmeccr/FiloDB (Branch: v0.4_spark2)
- https://github.com/itsmeccr/flint
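For reference, a FiloDB dataset can also be read directly through its Spark data source; a minimal sketch, assuming the history dataset name used later in this document:
# Sketch only: read the FiloDB dataset through the "filodb.spark" data source
history_df = sqlContext.read.format("filodb.spark") \
    .option("dataset", "niagara4_history_v1") \
    .load()
history_df.printSchema()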
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
Apache Cassandra is used by FiloDB as the data store for IoT histories.
Elasticsearch is a distributed RESTful search engine built for the cloud.
Elasticsearch is used for storing and querying IoT metadata in project-haystack format.
Kafka® is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.
Apache Kafka is used as a buffer in streaming use cases.
The Spark Thrift Server ships with the Apache Spark distribution and can be used to expose Spark DataFrames as SQL tables. It is used to connect histories in FiloDB and metadata in Elasticsearch to visualization tools and other applications over JDBC.
Apache Zeppelin is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Apache Zeppelin is used to write business logic and rules and perform other data analytics.
Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application.
Apache Superset is used as a visualization tool by connecting to the Spark Thrift Server.
Metadata is stored in Elasticsearch, so the following conventions for Elasticsearch mappings and field names should be followed.
The general idea is to ETL haystack-format metadata from edge devices into Elasticsearch in the cloud.
Data is stored in Elasticsearch in project-haystack JSON format with some conventions: a boolean field has the suffix _bool_ and a numeric field has the suffix _num_.
Example 1 (Equipment Metadata):
{
  "_index": "metadata_equip_v1",
  "_type": "metadata",
  "_id": "site_boiler_1",
  "_score": 1,
  "_source": {
    "id": "r:site_boiler_1",
    "equip": "m:",
    "boiler": "m:",
    "hvac": "m:",
    "capacity": "n:100",
    "capacity_num_": 100,
    "name": "Boiler 1",
    "levelRef": "r:level1",
    "siteRef": "r:site"
  }
}
Example 2 (Points Metadata):
{
  "_index": "metadata_v2",
  "_type": "metadata",
  "_id": "OAH",
  "_score": 1,
  "_source": {
    "tz": "Sydney",
    "air": "m:",
    "point": "m:",
    "dis": "OAH",
    "analytics": "m:",
    "regionRef": "r:Western_Corridor",
    "his": "m:",
    "disMacro": "$equipRef $navName",
    "imported": "m:",
    "humidity": "m:",
    "navName": "OAH",
    "equipRef": "r:Site_Building_Info",
    "id": "r:Site_Building_Info_OAH",
    "levelRef": "r:Site_Plant",
    "kind": "Number",
    "siteRef": "r:Site",
    "unit": "%",
    "outside": "m:",
    "sensor": "m:"
  }
}
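As an illustration of why the typed suffixes matter, the equipment index from Example 1 could be read with the elasticsearch-hadoop Spark connector and filtered numerically on capacity_num_ (a sketch only; host, port and index names are taken from the examples in this document):
# Sketch: read the equipment metadata index and filter on the numeric field.
# The raw haystack value "n:100" is a string; capacity_num_ is a real number.
equips_df = sqlContext.read.format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "localhost") \
    .option("es.port", "9200") \
    .load("metadata_equip_v1/metadata")
equips_df.where("capacity_num_ >= 50").select("id", "name", "capacity_num_").show()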
Following are the fields for history data.
timestamp, datetime, pointName and value carry the timeseries data.
yearMonth, derived from the time column, is used for partitioning.
Within a partition the data is sorted by time, which is the main advantage of using FiloDB for timeseries analytics.
|-- pointName: string (nullable = false)
|-- timestamp: timestamp (nullable = false)
|-- datetime: long (nullable = false)
|-- unit: string (nullable = true)
|-- yearMonth: string (nullable = false)
|-- value: double (nullable = true)
Sample Data:
+--------------------+--------------------+-------------------+----+---------+-----+
|           pointName|           timestamp|           datetime|unit|yearMonth|value|
+--------------------+--------------------+-------------------+----+---------+-----+
|S.site.hayTest.Bool1|2018-02-02 00:19:...|1517530772000000000|    |  2018-02|  0.0|
|S.site.hayTest.Bool1|2018-02-02 00:20:...|1517530803000000000|    |  2018-02|  1.0|
|S.site.hayTest.Bool1|2018-02-02 00:20:...|1517530805000000000|    |  2018-02|  0.0|
|S.site.hayTest.Bool1|2018-02-02 00:20:...|1517530811000000000|    |  2018-02|  1.0|
|S.site.hayTest.Bool1|2018-02-02 00:20:...|1517530815000000000|    |  2018-02|  0.0|
|S.site.hayTest.Bool1|2018-02-02 00:20:...|1517530824000000000|    |  2018-02|  0.0|
+--------------------+--------------------+-------------------+----+---------+-----+
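For illustration only, the yearMonth partition column could be derived from the timestamp roughly as follows (the DataFrame name is hypothetical; the actual ingestion pipeline is not shown here):
import pyspark.sql.functions as func

# Sketch: derive the "yearMonth" partition column from the timestamp
history_df = history_df.withColumn("yearMonth", func.date_format("timestamp", "yyyy-MM"))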
meta.es.nodes -> The host address of the Elasticsearch installation for metadata. Default: localhost
meta.es.port -> The port of the Elasticsearch installation for metadata. Default: 9200
meta.es.resource -> The resource (index/type) of Elasticsearch to read metadata from. Default: metadata/metadata
Elasticsearch config can also be set with the following environment variables (see the sketch after the list):
META_ES_NODES
META_ES_PORT
META_ES_RESOURCE
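For example, the same settings could be exported from Python before the reader is created (values here are illustrative):
import os

# Illustrative values only
os.environ["META_ES_NODES"] = "localhost"
os.environ["META_ES_PORT"] = "9200"
os.environ["META_ES_RESOURCE"] = "niagara4_metadata_v2/metadata"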
from nubespark.ts.iot_his_reader import IOTHistoryReader
es_options = {
    "meta.es.nodes": "localhost",
    "meta.es.port": "9200",
    "meta.es.resource": "niagara4_metadata_v2/metadata",
    "meta.es.refresh": "true"
}
his_reader = IOTHistoryReader(sqlContext, dataset="niagara4_history_v1", view_name="niagara4_server1_his",
                              rule_on="equipRef == @S.site.hayTest", **es_options)
dataset -> name of the FiloDB history dataset
view_name -> temporary view created by Apache Spark SQL
rule_on -> equipment-level filter in haystack format
**es_options -> kwargs containing configuration options
his_reader.get_df().show()           # history dataframe
his_reader.get_metadata_df().show()  # metadata dataframe
from nubespark.ts.datetime.utils import *
boolHis = his_reader.metadata('his and point and id == @S.site.hayTest.Bool1').history("today").read()
history() takes either a tuple of two Python datetime.datetime elements or a string such as "today", "yesterday", "this_month", etc.; for example, this_month() = (datetime.datetime(2018, 2, 1, 0, 0), datetime.datetime(2018, 2, 28, 23, 59, 59, 999999)) for Feb 2018.
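An explicit range can therefore be passed as a tuple of datetimes; a sketch based on the description above:
from datetime import datetime

# First week of Feb 2018 as an explicit (start, end) tuple
num1His = his_reader.metadata('his and point and id == @S.site.hayTest.Num1') \
    .history((datetime(2018, 2, 1), datetime(2018, 2, 7))).read()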
The query may also return multiple points:
histories = his_reader.metadata("his and point and cur", key_col="id").history(this_month()).read()
The read operation can be performed with a metadata query, a history range, or both; at least one of the two must be present.
num1His = his_reader.metadata('his and point and id == @S.site.hayTest.Num1', key_col="id").read()
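Based on the above, a range-only read (no metadata query) might look like:
# Sketch: read yesterday's histories without an additional metadata query
yesterdayHis = his_reader.history("yesterday").read()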
1) Aggregating over the whole history
import pyspark.sql.functions as func
his_reader.get_df().select(func.min("timestamp").alias("min_date"), func.max("timestamp").alias("max_date")).show()
+--------------------+--------------------+
|            min_date|            max_date|
+--------------------+--------------------+
|2018-02-02 00:19:...|2018-03-25 21:50:...|
+--------------------+--------------------+
2) Joining timeseries dataframes (Temporal Joins)
num1His = his_reader.metadata('his and point and id == @S.site.hayTest.Num1').history(this_month()).read()
boolHis = his_reader.metadata('his and point and id == @S.site.hayTest.Bool1').history(this_month()).read()
joinedHis = num1His.leftJoin(boolHis, tolerance="1 days", key=["siteRef", "equipRef"], right_alias="bool1")
3) Merging timeseries dataframes (with same schema) preserving sorting by time
num1His = his_reader.metadata('his and point and id == @S.site.hayTest.Num1', key_col="id").history(this_month()).read()
num2His = his_reader.metadata('his and point and id == @S.site.hayTest.Num2', key_col="id").history(this_month()).read()
mergedHis = num1His.merge(num2His)
4) Time window based summarizers
# Showing variance of value in past 1 day using summarizeWindows
from ts.flint import summarizers, windows, clocks
variance_df = histories.summarizeWindows(windows.past_absolute_time("1days"),
                                         summarizer=summarizers.variance("value"),
                                         key=["siteRef", "equipName", "pointName"])
variance_df.where("pointName != 'S.site.hayTest.Bool1'").show()
+-------------------+--------------------+-------+-------------------+-------+--------------+--------------------+
|               time|           timestamp|  value|          pointName|siteRef|      equipRef|      value_variance|
+-------------------+--------------------+-------+-------------------+-------+--------------+--------------------+
|1520563500017999872|2018-03-09 02:45:...|50.3811|S.site.hayTest.Num2| S.site|S.site.hayTest|                 NaN|
|1520564400012999936|2018-03-09 03:00:...|50.6379|S.site.hayTest.Num2| S.site|S.site.hayTest| 0.03297311999999958|
|1520565300008000000|2018-03-09 03:15:...|50.7815|S.site.hayTest.Num2| S.site|S.site.hayTest| 0.04114789333333285|
|1520566200001999872|2018-03-09 03:30:...|50.6191|S.site.hayTest.Num2| S.site|S.site.hayTest|0.027521546666666372|
|1520567100028000000|2018-03-09 03:45:...|50.0814|S.site.hayTest.Num2| S.site|S.site.hayTest| 0.07545160999999947|
|1520568000007000064|2018-03-09 04:00:...|50.9536|S.site.hayTest.Num2| S.site|S.site.hayTest| 0.09462321466666548|
+-------------------+--------------------+-------+-------------------+-------+--------------+--------------------+
5) Time window based custom UDF
# Calculating moving average for past 2 days
from pyspark.sql.types import DoubleType
import pyspark.sql.functions as func
def movingAverage(window):
    # "window" holds the list of rows collected by addWindows for the current row
    nrows = len(window)
    if nrows == 0:
        return 0.0  # float, to match the DoubleType of the UDF
    return sum(row.value for row in window) / nrows
movingAverageUdf = func.udf(lambda window: movingAverage(window), DoubleType())
df = histories.addWindows(windows.past_absolute_time("2days"), key=["siteRef","equipName","pointName"])
movAvg_df = df.withColumn('movingAverage', movingAverageUdf(func.col("window_past_2days"))).drop("window_past_2days")
movAvg_df.where("movingAverage > 0").show()
+-------------------+--------------------+-------+-------------------+-------+--------------+------------------+
|               time|           timestamp|  value|          pointName|siteRef|      equipRef|     movingAverage|
+-------------------+--------------------+-------+-------------------+-------+--------------+------------------+
|1520563500017999872|2018-03-09 02:45:...|50.3811|S.site.hayTest.Num2| S.site|S.site.hayTest|           50.3811|
|1520564400012999936|2018-03-09 03:00:...|50.6379|S.site.hayTest.Num2| S.site|S.site.hayTest|           50.5095|
|1520565300008000000|2018-03-09 03:15:...|50.7815|S.site.hayTest.Num2| S.site|S.site.hayTest| 50.60016666666667|
|1520566200001999872|2018-03-09 03:30:...|50.6191|S.site.hayTest.Num2| S.site|S.site.hayTest|           50.6049|
|1520567100028000000|2018-03-09 03:45:...|50.0814|S.site.hayTest.Num2| S.site|S.site.hayTest|           50.5002|
|1520568000007000064|2018-03-09 04:00:...|50.9536|S.site.hayTest.Num2| S.site|S.site.hayTest|50.575766666666674|
|1520568900016999936|2018-03-09 04:15:...|50.4218|S.site.hayTest.Num2| S.site|S.site.hayTest| 50.55377142857144|
|1520569800011000064|2018-03-09 04:30:...| 50.213|S.site.hayTest.Num2| S.site|S.site.hayTest| 50.51117500000001|
|1520570700011000064|2018-03-09 04:45:...|50.2189|S.site.hayTest.Num2| S.site|S.site.hayTest| 50.47870000000001|
+-------------------+--------------------+-------+-------------------+-------+--------------+------------------+
6) Summarizing dataframes
summarized_stddev_df = movAvg_df.summarize(summarizers.stddev('value'), key=["siteRef", "equipName", "pointName"])
summarized_stddev_df.show()
+----+-------+--------------+--------------------+-------------------+
|time|siteRef|      equipRef|           pointName|       value_stddev|
+----+-------+--------------+--------------------+-------------------+
|   0| S.site|S.site.hayTest|S.site.hayTest.Bool1|                0.0|
|   0| S.site|S.site.hayTest| S.site.hayTest.Num2|0.28202272739583134|
|   0| S.site|S.site.hayTest| S.site.hayTest.Num1|0.28505566501478885|
+----+-------+--------------+--------------------+-------------------+
7) Rolling up histories (Summarizing by intervals)
from nubespark.ts.spark.utils import *
min_max = histories.select(func.min("timestamp").alias("min"), func.max("timestamp").alias("max")).collect()
clock = clocks.uniform(sqlContext, frequency="1day", offset="0ns", begin_date_time=str(min_max[0][0].date()),
                       end_date_time=str(min_max[0][1].date()), time_zone='GMT')
dailyAvg_df = histories.summarizeIntervals(clock, summarizers.mean("value"), key=["siteRef", "equipName", "pointName"])
dailyAvg_df = get_timestamp_col(dailyAvg_df, "timestamp", tz="GMT")
dailyAvg_df.show()
+-------------------+-------+--------------+--------------------+------------------+--------------------+
|               time|siteRef|      equipRef|           pointName|        value_mean|           timestamp|
+-------------------+-------+--------------+--------------------+------------------+--------------------+
|1520553600000000000| S.site|S.site.hayTest|S.site.hayTest.Bool1|               0.0|2018-03-09 00:00:...|
|1520553600000000000| S.site|S.site.hayTest| S.site.hayTest.Num2|50.553811764705884|2018-03-09 00:00:...|
|1520640000000000000| S.site|S.site.hayTest| S.site.hayTest.Num2|        50.4913375|2018-03-10 00:00:...|
|1520640000000000000| S.site|S.site.hayTest|S.site.hayTest.Bool1|               0.0|2018-03-10 00:00:...|
|1520726400000000000| S.site|S.site.hayTest| S.site.hayTest.Num2|      50.509790625|2018-03-11 00:00:...|
|1520726400000000000| S.site|S.site.hayTest|S.site.hayTest.Bool1|               0.0|2018-03-11 00:00:...|
+-------------------+-------+--------------+--------------------+------------------+--------------------+