
spark-iot-ts

A platform for IoT time series analytics using Apache Spark, FiloDB and the Flint time series library

Objective of project

A Big Data-backed platform for analytics of time series IoT data in the cloud.

Features of this project

Some of the features of this project are:

  • Easy reading of IoT device histories by metadata and time range, using the Project Haystack format
  • Helpers and utilities for time series data analytics and processing
  • Temporal (time-order-preserving) merges, joins and other operations using the Flint time series library
  • Support for streaming, batch analytics and ad-hoc queries using Apache Spark and FiloDB (built on top of Apache Cassandra)

Architecture

spark-iot-ts architecture

Technology Stack

  1. Apache Spark
  2. Apache Cassandra
  3. Elasticsearch
  4. Apache Kafka
  5. Spark Thrift Server
  6. Zeppelin Notebook
  7. Superset

Apache Spark

Apache Spark, together with the FiloDB library and the Flint time series library, forms the core of this project.

Apache Spark™ is a fast and general engine for large-scale data processing.

Apache Spark can be used for batch processing, streaming, machine learning and graph analytics on data generated by IoT devices.

The ability to analyze time series data at scale is critical for the success of finance and IoT applications based on Spark. Flint is Two Sigma's implementation of highly optimized time series operations in Spark. It performs truly parallel and rich analyses on time series data by taking advantage of the natural ordering in time series data to provide locality-based optimizations.

The Flint time series library supports temporal (time-order-preserving) joins, merges and other time series operations such as group by cycles (same timestamp), group by intervals, time-based windows, etc.

FiloDB is a new open-source distributed, versioned, and columnar analytical database designed for modern streaming workloads.

FiloDB uses Apache Cassandra for data storage and is built around the Apache Spark framework. It offers flexible data partitioning, sorting on write, fast reads and writes, and versioned data storage, which is well suited to time series and event data.
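
For orientation, a history dataset stored through FiloDB can be loaded into Spark with FiloDB's Spark data source; a minimal sketch, assuming the dataset name niagara4_history_v1 used by the reader examples later in this README:

# Minimal sketch: load a FiloDB dataset as a Spark DataFrame via the
# "filodb.spark" data source. The dataset name is an assumption taken
# from the reader examples below.
histories_df = (sqlContext.read
                .format("filodb.spark")
                .option("dataset", "niagara4_history_v1")
                .load())
histories_df.printSchema()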

Currently tested with Apache Spark 2.0.2. Libraries forked at:

Apache Cassandra

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Apache Cassandra is used by FiloDB as the data store for IoT histories.

Elasticsearch

Elasticsearch is a distributed RESTful search engine built for the cloud.

Elasticsearch is used for storing and querying IoT metadata in Project Haystack format.
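
As a minimal sketch (not the project's reader API), the metadata index can also be read directly into Spark with the elasticsearch-hadoop connector; host, port and index/type values are illustrative and mirror the example configuration further below:

# Sketch: read Project Haystack metadata from Elasticsearch into Spark
# using the elasticsearch-hadoop connector. Connection values are
# illustrative defaults; adjust to your deployment.
metadata_df = (sqlContext.read
               .format("org.elasticsearch.spark.sql")
               .option("es.nodes", "localhost")
               .option("es.port", "9200")
               .load("niagara4_metadata_v2/metadata"))
metadata_df.select("id", "siteRef", "equipRef").show()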

Apache Kafka

Kafka® is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.

Apache Kafka is used as a buffer in streaming use cases.

Spark Thrift Server

Spark Thrift Server ships with the Apache Spark distribution and can be used to expose Spark DataFrames as SQL tables. It is used to connect histories in FiloDB and metadata in Elasticsearch to visualization tools and other applications via JDBC.
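
A rough sketch of the idea, assuming the Thrift Server is started from the same Spark application (e.g. via HiveThriftServer2.startWithContext) so that registered views are visible to JDBC clients; the DataFrame, view name and JDBC URL are illustrative:

# Sketch: expose a DataFrame as a SQL table for JDBC clients. This assumes
# the Thrift Server shares the Spark session of this application.
histories_df.createOrReplaceTempView("iot_histories")

# A JDBC client (Superset, beeline, ...) could then connect with a URL like
#   jdbc:hive2://<thrift-server-host>:10000
# and run queries such as:
#   SELECT pointName, avg(value) FROM iot_histories GROUP BY pointName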

Zeppelin Notebook

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

Apache Zeppelin is used to write business logic and rules and perform other data analytics.

Superset

Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application.

Apache Superset is used as the visualization tool, connected via the Spark Thrift Server.

Schema

Metadata Schema

Metadata is stored in Elasticsearch, so the usual Elasticsearch conventions for mappings and field names should be followed. The general idea is to ETL Haystack-format metadata from edge devices into Elasticsearch in the cloud. Data is stored in Elasticsearch as Project Haystack JSON with some conventions: a boolean value gets an additional field with the suffix _bool_ and a numeric value gets one with the suffix _num_ (e.g. capacity_num_ below). Example 1 (Equipment Metadata):

{
  "_index": "metadata_equip_v1",
  "_type": "metadata",
  "_id": "site_boiler_1",
  "_score": 1,
  "_source": {
    "id": "r:site_boiler_1",
    "equip": "m:",
    "boiler": "m:",
    "hvac": "m:",
    "capacity": "n:100",
    "capacity_num_": 100,
    "name": "Boiler 1",
    "levelRef": "r:level1",
    "siteRef": "r:site"
  }
}

Example 2 (Points Metadata):

{
  "_index": "metadata_v2",
  "_type": "metadata",
  "_id": "OAH",
  "_score": 1,
  "_source": {
    "tz": "Sydney",
    "air": "m:",
    "point": "m:",
    "dis": "OAH",
    "analytics": "m:",
    "regionRef": "r:Western_Corridor",
    "his": "m:",
    "disMacro": "$equipRef $navName",
    "imported": "m:",
    "humidity": "m:",
    "navName": "OAH",
    "equipRef": "r:Site_Building_Info",
    "id": "r:Site_Building_Info_OAH",
    "levelRef": "r:Site_Plant",
    "kind": "Number",
    "siteRef": "r:Site",
    "unit": "%",
    "outside": "m:",
    "sensor": "m:"
  }
}

Histories Schema

The following are the fields for history data.

timestamp, datetime, pointName and value hold the time series data itself.

yearMonth, derived from the time column, is the partition column (see the sketch after the sample data below).

Within a partition, the data is sorted by time, which is the main advantage of using FiloDB for time series analytics.

 |-- pointName: string (nullable = false)
 |-- timestamp: timestamp (nullable = false)
 |-- datetime: long (nullable = false)
 |-- unit: string (nullable = true)
 |-- yearMonth: string (nullable = false)
 |-- value: double (nullable = true)

Sample Data:

+--------------------+--------------------+-------------------+----+---------+-----+
|           pointName|           timestamp|           datetime|unit|yearMonth|value|
+--------------------+--------------------+-------------------+----+---------+-----+
|S.site.hayTest.Bool1|2018-02-02 00:19:...|1517530772000000000|    |  2018-02|  0.0|
|S.site.hayTest.Bool1|2018-02-02 00:20:...|1517530803000000000|    |  2018-02|  1.0|
|S.site.hayTest.Bool1|2018-02-02 00:20:...|1517530805000000000|    |  2018-02|  0.0|
|S.site.hayTest.Bool1|2018-02-02 00:20:...|1517530811000000000|    |  2018-02|  1.0|
|S.site.hayTest.Bool1|2018-02-02 00:20:...|1517530815000000000|    |  2018-02|  0.0|
|S.site.hayTest.Bool1|2018-02-02 00:20:...|1517530824000000000|    |  2018-02|  0.0|
+--------------------+--------------------+-------------------+----+---------+-----+
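
As a sketch of the partitioning convention above, the yearMonth column can be derived from the timestamp column with a standard Spark SQL function (histories_df stands for any DataFrame with the schema shown):

import pyspark.sql.functions as func

# Derive the yearMonth partition column (e.g. "2018-02") from timestamp,
# matching the schema and sample data shown above.
histories_df = histories_df.withColumn(
    "yearMonth", func.date_format(func.col("timestamp"), "yyyy-MM"))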

Examples

Initializing reader

Configuration Options (Optional)

  • meta.es.nodes -> The host address of elasticsearch installation for metadata. Default: localhost
  • meta.es.port -> The port of elasticsearch installation for metadata. Default: 9200
  • meta.es.resource -> The resource (index/type) of elasticsearch to read metadata from. Default: metadata/metadata

Elasticsearch config can also be set via the following environment variables:

  • META_ES_NODES
  • META_ES_PORT
  • META_ES_RESOURCE

from nubespark.ts.iot_his_reader import IOTHistoryReader

es_options = {
    "meta.es.nodes": "localhost",
    "meta.es.port": "9200",
    "meta.es.resource": "niagara4_metadata_v2/metadata",
    "meta.es.refresh": "true"
}

his_reader = IOTHistoryReader(sqlContext, dataset="niagara4_history_v1",
                              view_name="niagara4_server1_his",
                              rule_on="equipRef == @S.site.hayTest",
                              **es_options)

  • dataset -> name of FiloDB history dataset
  • view_name -> temporary view created by Apache Spark SQL
  • rule_on -> equipment level filter in haystack format
  • **es_options -> kwargs containing configuration options

Accessing history as Spark dataframe

his_reader.get_df().show()

Accessing metadata as Spark dataframe

his_reader.get_metadata_df().show()

Reading point using metadata and history range

from nubespark.ts.datetime.utils import *
boolHis = his_reader.metadata('his and point and id == @S.site.hayTest.Bool1').history("today").read()

history takes either a tuple of two Python datetime.datetime objects or a named range string such as "today", "yesterday", "this_month", etc. For example, this_month() = (datetime.datetime(2018, 2, 1, 0, 0), datetime.datetime(2018, 2, 28, 23, 59, 59, 999999)) for February 2018.
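
For example, an explicit range can be passed as a (start, end) tuple; a sketch reusing the point id from the example above:

from datetime import datetime

# Sketch: pass an explicit (start, end) tuple of datetime objects instead
# of a named range such as "today" or this_month().
feb_range = (datetime(2018, 2, 1, 0, 0), datetime(2018, 2, 28, 23, 59, 59))
boolHis = his_reader.metadata('his and point and id == @S.site.hayTest.Bool1') \
    .history(feb_range) \
    .read()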

The query may return multiple points as well:

histories = his_reader.metadata("his and point and cur", key_col="id").history(this_month()).read()

The read operation can be performed with a metadata query, a history range, or both; at least one of the two must be present.

num1His = his_reader.metadata('his and point and id == @S.site.hayTest.Num1', key_col="id").read()

Using Apache Spark and the Flint time series library for time series operations

1) Aggregating on whole history

import pyspark.sql.functions as func
his_reader.get_df().select(func.min("timestamp").alias("min_date"), func.max("timestamp").alias("max_date")).show()
+--------------------+--------------------+
|            min_date|            max_date|
+--------------------+--------------------+
|2018-02-02 00:19:...|2018-03-25 21:50:...|
+--------------------+--------------------+
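
The same idea extends to grouped aggregates, for example per-point minimum/maximum timestamps and sample counts; a sketch using standard Spark SQL functions:

import pyspark.sql.functions as func

# Sketch: per-point aggregation over the whole history with a plain groupBy.
his_reader.get_df() \
    .groupBy("pointName") \
    .agg(func.min("timestamp").alias("min_date"),
         func.max("timestamp").alias("max_date"),
         func.count("value").alias("n_samples")) \
    .show()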

2) Joining timeseries dataframes (Temporal Joins)

num1His = his_reader.metadata('his and point and id == @S.site.hayTest.Num1').history(this_month()).read()
boolHis = his_reader.metadata('his and point and id == @S.site.hayTest.Bool1').history(this_month()).read()

joinedHis = num1His.leftJoin(boolHis, tolerance="1 days", key =["siteRef", "equipRef"], right_alias="bool1")

3) Merging timeseries dataframes (with same schema) preserving sorting by time

num1His = his_reader.metadata('his and point and id == @S.site.hayTest.Num1', key_col="id").history(this_month()).read()
num2His = his_reader.metadata('his and point and id == @S.site.hayTest.Num2', key_col="id").history(this_month()).read()

mergedHis = num1His.merge(num2His)

4) Time window based summarizers

# Showing variance of value in past 1 day using summarizeWindows

from ts.flint import summarizers, windows, clocks

variance_df = histories.summarizeWindows(
    windows.past_absolute_time("1days"),
    summarizer=summarizers.variance("value"),
    key=["siteRef", "equipName", "pointName"])

variance_df.where("pointName != 'S.site.hayTest.Bool1'").show()

+-------------------+--------------------+-------+-------------------+-------+--------------+--------------------+
|               time|           timestamp|  value|          pointName|siteRef|     equipRef |      value_variance|
+-------------------+--------------------+-------+-------------------+-------+--------------+--------------------+
|1520563500017999872|2018-03-09 02:45:...|50.3811|S.site.hayTest.Num2| S.site|S.site.hayTest|                 NaN|
|1520564400012999936|2018-03-09 03:00:...|50.6379|S.site.hayTest.Num2| S.site|S.site.hayTest| 0.03297311999999958|
|1520565300008000000|2018-03-09 03:15:...|50.7815|S.site.hayTest.Num2| S.site|S.site.hayTest| 0.04114789333333285|
|1520566200001999872|2018-03-09 03:30:...|50.6191|S.site.hayTest.Num2| S.site|S.site.hayTest|0.027521546666666372|
|1520567100028000000|2018-03-09 03:45:...|50.0814|S.site.hayTest.Num2| S.site|S.site.hayTest| 0.07545160999999947|
|1520568000007000064|2018-03-09 04:00:...|50.9536|S.site.hayTest.Num2| S.site|S.site.hayTest| 0.09462321466666548|
+-------------------+--------------------+-------+-------------------+-------+--------------+--------------------+

5) Time window based custom UDF

# Calculating moving average for past 2 days

from pyspark.sql.types import DoubleType
import pyspark.sql.functions as func

def movingAverage(window):
    # "window" is the list of rows collected by addWindows below
    nrows = len(window)
    if nrows == 0:
        return 0.0  # return a float to match the UDF's DoubleType
    return sum(row.value for row in window) / nrows

movingAverageUdf = func.udf(movingAverage, DoubleType())

df = histories.addWindows(windows.past_absolute_time("2days"), key=["siteRef","equipName","pointName"])
movAvg_df = df.withColumn('movingAverage', movingAverageUdf(func.col("window_past_2days"))).drop("window_past_2days")
movAvg_df.where("movingAverage > 0").show()

+-------------------+--------------------+-------+-------------------+-------+--------------+------------------+
|               time|           timestamp|  value|          pointName|siteRef|     equipRef |     movingAverage|
+-------------------+--------------------+-------+-------------------+-------+--------------+------------------+
|1520563500017999872|2018-03-09 02:45:...|50.3811|S.site.hayTest.Num2| S.site|S.site.hayTest|           50.3811|
|1520564400012999936|2018-03-09 03:00:...|50.6379|S.site.hayTest.Num2| S.site|S.site.hayTest|           50.5095|
|1520565300008000000|2018-03-09 03:15:...|50.7815|S.site.hayTest.Num2| S.site|S.site.hayTest| 50.60016666666667|
|1520566200001999872|2018-03-09 03:30:...|50.6191|S.site.hayTest.Num2| S.site|S.site.hayTest|           50.6049|
|1520567100028000000|2018-03-09 03:45:...|50.0814|S.site.hayTest.Num2| S.site|S.site.hayTest|           50.5002|
|1520568000007000064|2018-03-09 04:00:...|50.9536|S.site.hayTest.Num2| S.site|S.site.hayTest|50.575766666666674|
|1520568900016999936|2018-03-09 04:15:...|50.4218|S.site.hayTest.Num2| S.site|S.site.hayTest| 50.55377142857144|
|1520569800011000064|2018-03-09 04:30:...| 50.213|S.site.hayTest.Num2| S.site|S.site.hayTest| 50.51117500000001|
|1520570700011000064|2018-03-09 04:45:...|50.2189|S.site.hayTest.Num2| S.site|S.site.hayTest| 50.47870000000001|
+-------------------+--------------------+-------+-------------------+-------+--------------+------------------+
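
For comparison only (not part of the Flint API), a similar 2-day moving average can be computed with plain Spark SQL window functions by ordering on the timestamp cast to epoch seconds; a sketch:

from pyspark.sql import Window
import pyspark.sql.functions as func

# Sketch: 2-day moving average with Spark SQL window functions instead of
# Flint's addWindows. rangeBetween operates on epoch seconds here.
two_days = 2 * 24 * 60 * 60  # window size in seconds
w = (Window.partitionBy("siteRef", "equipRef", "pointName")
     .orderBy(func.col("timestamp").cast("long"))
     .rangeBetween(-two_days, 0))

movAvg_sql_df = histories.withColumn("movingAverage", func.avg("value").over(w))
movAvg_sql_df.show()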

6) Summarizing dataframes

summarized_stddev_df = movAvg_df.summarize(summarizers.stddev('value'), key=["siteRef", "equipName", "pointName"])
summarized_stddev_df.show()

+----+-------+--------------+--------------------+-------------------+
|time|siteRef|     equipRef |           pointName|       value_stddev|
+----+-------+--------------+--------------------+-------------------+
|   0| S.site|S.site.hayTest|S.site.hayTest.Bool1|                0.0|
|   0| S.site|S.site.hayTest| S.site.hayTest.Num2|0.28202272739583134|
|   0| S.site|S.site.hayTest| S.site.hayTest.Num1|0.28505566501478885|
+----+-------+--------------+--------------------+-------------------+

7) Rolling up histories (Summarizing by intervals)

from nubespark.ts.spark.utils import *
min_max = histories.select(func.min("timestamp").alias("min"), func.max("timestamp").alias("max")).collect()

clock = clocks.uniform(sqlContext, frequency="1day", offset="0ns",
                       begin_date_time=str(min_max[0][0].date()),
                       end_date_time=str(min_max[0][1].date()),
                       time_zone='GMT')

dailyAvg_df = histories.summarizeIntervals(clock, summarizers.mean("value"), key=["siteRef", "equipName", "pointName"])
dailyAvg_df = get_timestamp_col(dailyAvg_df, "timestamp", tz="GMT")
dailyAvg_df.show()
+-------------------+-------+--------------+--------------------+------------------+--------------------+
|               time|siteRef|     equipRef |           pointName|        value_mean|           timestamp|
+-------------------+-------+--------------+--------------------+------------------+--------------------+
|1520553600000000000| S.site|S.site.hayTest|S.site.hayTest.Bool1|               0.0|2018-03-09 00:00:...|
|1520553600000000000| S.site|S.site.hayTest| S.site.hayTest.Num2|50.553811764705884|2018-03-09 00:00:...|
|1520640000000000000| S.site|S.site.hayTest| S.site.hayTest.Num2|        50.4913375|2018-03-10 00:00:...|
|1520640000000000000| S.site|S.site.hayTest|S.site.hayTest.Bool1|               0.0|2018-03-10 00:00:...|
|1520726400000000000| S.site|S.site.hayTest| S.site.hayTest.Num2|      50.509790625|2018-03-11 00:00:...|
|1520726400000000000| S.site|S.site.hayTest|S.site.hayTest.Bool1|               0.0|2018-03-11 00:00:...|
+-------------------+-------+--------------+--------------------+------------------+--------------------+
