-
Notifications
You must be signed in to change notification settings - Fork 92
Approximate Count and Spatial Queries
We can push approximate count and spatial filters to Druid, if the index has the relevant HyperLogLog, Sketch or Spatial Index on the column.
Approximate Aggregations are pushed to Druid if the pushHLLTODruid option is true on the Druid DataSource(by default this is true). The approx_count_distinct is translated to a hyperUnique aggregation in Druid; whereas DataSketchhive udfs are pushed down as a ThetaSketch aggregator to druid.
When defining a Druid DataSource define DruidRelationColumnInfos to describe the association of source columns to the relevant HyperLogLog, Sketch or Spatial Index. For example consider a ZipCodes DataSet where each row captures the recording of presence(no other metric) at a given time, location(longitude, latitude) and place(state, county, city, zipcode). We build a Druid Index with a HLL and Sketch for the City column(i.e. the city dimension is replaced by an approximate sketch data-structure of it) and a Spatial Index based on the latitude/longitude fields. The Field definition will be:
[
{
"jsonClass":"DruidRelationColumnInfo",
"column":"city",
"hllMetric":"unique_city",
"sketchMetric":"city_sketch"
},
{
"jsonClass":"DruidRelationColumnInfo",
"column":"latitude",
"spatialIndex":{
"druidColumn":"coordinates",
"spatialPosition":0,
"minValue":-90.0,
"maxValue":90.0
}
},
{
"jsonClass":"DruidRelationColumnInfo",
"column":"longitude",
"spatialIndex":{
"druidColumn":"coordinates",
"spatialPosition":1,
"minValue":-180.0,
"maxValue":180.0
}
}
]
The entire DDL for the Druid DataSource is:
CREATE TABLE if not exists zipCodes
USING org.sparklinedata.druid
OPTIONS (sourceDataframe "default.zipCodesBase",
timeDimensionColumn "record_date",
druidDatasource "zipCodes",
druidHost 'localhost',
zkQualifyDiscoveryNames "true",
numProcessingThreadsPerHistorical '1',
nonAggregateQueryHandling "push_project_and_filters",
columnInfos '[ { "jsonClass" : "DruidRelationColumnInfo", "column" : "city", "hllMetric" : "unique_city", "sketchMetric" : "city_sketch" }, { "jsonClass" : "DruidRelationColumnInfo", "column" : "latitude", "spatialIndex" : { "druidColumn" : "coordinates", "spatialPosition" : 0, "minValue" : -90.0, "maxValue" : 90.0 } }, { "jsonClass" : "DruidRelationColumnInfo", "column" : "longitude", "spatialIndex" : { "druidColumn" : "coordinates", "spatialPosition" : 1, "minValue" : -180.0, "maxValue" : 180.0 } } ]',
allowTopNRewrite "true")
Now we can write queries that compute the approximate count for cities. Here are couple of examples:
select approx_count_distinct(city)
from zipCodes;
select approx_count_distinct(city)
from zipCodesFull
where substring(state,1,1) = 'N';
These get executed as hyperunique aggregations on Druid. For the second query the Druid Query is:
{
"jsonClass":"TimeSeriesQuerySpec",
"queryType":"timeseries",
"dataSource":"zipCodesAll",
"intervals":[
"2016-01-01T00:00:00.000Z/2016-01-01T00:00:01.000Z"
],
"granularity":"all",
"filter":{
"jsonClass":"JavascriptFilterSpec",
"type":"javascript",
"dimension":"state",
"function":"function (state) {\n \n \n \n\n return(((state).substr(0, 1)) == (\"N\"));\n }"
},
"aggregations":[
{
"type":"hyperUnique",
"name":"alias-1",
"fieldName":"unique_city"
}
],
"context":{
"queryId":"query-1091728133402033"
}
}
Coming Soon
You can use Druid Spatial Indexing to build a Spatial Index of 1 or more dimension columns. You then have to describe in the DDL the source column association to the Spatial Index using DruidRelationColumnInfos. For the ZipCodes dataset containing latitude/longitude information for each fact, we describe this so:
[
...,
{
"jsonClass":"DruidRelationColumnInfo",
"column":"latitude",
"spatialIndex":{
"druidColumn":"coordinates",
"spatialPosition":0,
"minValue":-90.0,
"maxValue":90.0
}
},
{
"jsonClass":"DruidRelationColumnInfo",
"column":"longitude",
"spatialIndex":{
"druidColumn":"coordinates",
"spatialPosition":1,
"minValue":-180.0,
"maxValue":180.0
}
}
]
The complete DDL for the ZipCode dataset was given in the Approximate Count Queries section.
You can now write Select and GroupBy queries with filters on the latitude/longitude that get pushed as Druid Spatial Filters with Rectangular Bounds. For example the following query:
select latitude, longitude
from zipCodes
where latitude > 42.5 and longitude is not null
generates the following Druid Query:
{
"jsonClass":"SelectSpecWithIntervals",
"queryType":"select",
"dataSource":"zipCodes",
"dimensions":[
"coordinates",
"coordinates"
],
"metrics":[
"count"
],
"filter":{
"jsonClass":"SpatialFilterSpec",
"dimension":"coordinates",
"bound":{
"jsonClass":"RectangularBound",
"minCoords":[
42.5,
-180.0
],
"maxCoords":[
90.0,
180.0
],
"type":"rectangular"
},
"type":"spatial"
},
"pagingSpec":{
"pagingIdentifiers":{
},
"threshold":10000
},
"intervals":[
"2016-01-01T00:00:00.000Z/2016-01-01T00:00:01.000Z"
],
"descending":false,
"granularity":"all",
"context":{
"queryId":"query-1154108458312940"
}
}
And the following GroupBy query:
select approx_count_distinct(city)
from zipCodes
where latitude > 0 and longitude is not null
and latitude < 18 or (longitude > -80 and longitude < 10)
generates the following Druid Query:
{
"jsonClass":"TimeSeriesQuerySpec",
"queryType":"timeseries",
"dataSource":"zipCodes",
"intervals":[
"2016-01-01T00:00:00.000Z/2016-01-01T00:00:01.000Z"
],
"granularity":"all",
"filter":{
"jsonClass":"LogicalFilterSpec",
"type":"or",
"fields":[
{
"jsonClass":"LogicalFilterSpec",
"type":"and",
"fields":[
{
"jsonClass":"SpatialFilterSpec",
"dimension":"coordinates",
"bound":{
"jsonClass":"RectangularBound",
"minCoords":[
4.9E-324,
-180.0
],
"maxCoords":[
18.0,
180.0
],
"type":"rectangular"
},
"type":"spatial"
}
]
},
{
"jsonClass":"LogicalFilterSpec",
"type":"and",
"fields":[
{
"jsonClass":"SpatialFilterSpec",
"dimension":"coordinates",
"bound":{
"jsonClass":"RectangularBound",
"minCoords":[
-90.0,
-80.0
],
"maxCoords":[
90.0,
10.0
],
"type":"rectangular"
},
"type":"spatial"
}
]
}
]
},
"aggregations":[
{
"type":"hyperUnique",
"name":"alias-1",
"fieldName":"unique_city"
}
],
"context":{
"queryId":"query-1154332234832112"
}
}
Note how where possible multiple SQL predicates on spatial dimensions are combined into a single Druid Spatial Filter
- Overview
- Quick Start
-
User Guide
- [Defining a DataSource on a Flattened Dataset](https://github.com/SparklineData/spark-druid-olap/wiki/Defining-a Druid-DataSource-on-a-Flattened-Dataset)
- Defining a Star Schema
- Sample Queries
- Approximate Count and Spatial Queries
- Druid Datasource Options
- Sparkline SQLContext Options
- Using Tableau with Sparkline
- How to debug a Query Plan?
- Running the ThriftServer with Sparklinedata components
- [Setting up multiple Sparkline ThriftServers - Load Balancing & HA] (https://github.com/SparklineData/spark-druid-olap/wiki/Setting-up-multiple-Sparkline-ThriftServers-(Load-Balancing-&-HA))
- Runtime Views
- Sparkline SQL extensions
- Sparkline Pluggable Modules
- Dev. Guide
- Reference Architectures
- Releases
- Cluster Spinup Tool
- TPCH Benchmark