
0.1.0 (Storm 1.0) COUNT DISTINCT and micro-batching

@akshaisarma akshaisarma released this 10 Jan 01:06
· 150 commits to master since this release

This release adds the first of the DataSketches-based aggregations: COUNT DISTINCT. It lets you count the distinct values for a set of fields. The count is exact up to a certain number of values and approximate after that. However, the error bounds are completely measurable. The result metadata field will expose the standard deviations and other metadata to you, if you choose to turn it on. See the new settings added below.
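To illustrate the "exact up to a point, approximate after" behavior, here is a minimal toy sketch of a k-minimum-values distinct counter. It is not the DataSketches implementation and not Bullet's code. The class `KMVSketch`, its methods, and the choice of hash are all hypothetical. The parameter `k` is only loosely analogous to the sketch entries setting, and joining fields with "|" mirrors the composite field separator setting by assumption.

```python
import hashlib


class KMVSketch:
    """Toy k-minimum-values distinct counter (illustration only)."""

    def __init__(self, k=16384):
        self.k = k            # maximum number of retained hash values
        self.hashes = set()   # retained hash values, each in [0, 1)
        self.theta = 1.0      # only hashes strictly below theta are retained

    @staticmethod
    def _hash(value):
        # Map a string to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, *fields, separator="|"):
        # Several fields are combined into one composite value before hashing.
        h = self._hash(separator.join(str(f) for f in fields))
        if h >= self.theta or h in self.hashes:
            return
        self.hashes.add(h)
        if len(self.hashes) > self.k:
            # Over capacity: drop the largest hash and lower theta to it.
            largest = max(self.hashes)
            self.hashes.remove(largest)
            self.theta = largest

    def was_estimated(self):
        # While theta is still 1.0, every distinct value has been retained.
        return self.theta < 1.0

    def estimate(self):
        # Exact while theta == 1.0; otherwise scale the retained count up.
        return len(self.hashes) / self.theta


sketch = KMVSketch(k=256)
for i in range(100):
    sketch.update("user", i)
    sketch.update("user", i)  # duplicates do not change the count
print(sketch.estimate(), sketch.was_estimated())
```

With 100 distinct values and `k=256`, the count is still exact. Once more than `k` distinct values arrive, `theta` drops below 1.0 and the result becomes an estimate whose relative error shrinks roughly as one over the square root of `k`.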

This release also makes the RAW (LIMIT) aggregation capable of micro-batching. Previously, it was hard-coded to micro-batch with a batch size of 1, which let a RAW query finish very quickly once the overall number of records it was looking for was reached. With larger micro-batches, you can trade off some of that performance to reduce the number of times a Filter Bolt emits a batch of records to the Join Bolt for a query, if this is something you need to tweak.
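The trade-off can be sketched with a small generator that groups records into batches before handing them downstream. This is only an illustration, not the Filter Bolt's actual logic. The function `micro_batches` is hypothetical, and flushing the leftover partial batch at the end of the input is an assumption made to keep the example self-contained.

```python
def micro_batches(records, batch_size=1):
    """Yield lists of up to batch_size records (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            # A full batch corresponds to one emit to the downstream stage.
            yield batch
            batch = []
    if batch:
        # Assumption: any partial batch is flushed when the input ends.
        yield batch


records = list(range(10))
emits_size_1 = list(micro_batches(records, batch_size=1))
emits_size_4 = list(micro_batches(records, batch_size=4))
print(len(emits_size_1), len(emits_size_4))
```

A batch size of 1 produces one emit per record, so a downstream consumer sees each record as soon as possible. A batch size of 4 cuts the number of emits for the same ten records to three, at the cost of records waiting in the buffer until the batch fills.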

The new settings added/modified in this release (take a look at bullet_defaults.yaml for what these mean):

rule.aggregation.composite.field.separator: "|"
rule.aggregation.raw.micro.batch.size: 1
rule.aggregation.count.distinct.sketch.entries: 16384
rule.aggregation.count.distinct.sketch.sampling: 1.0
rule.aggregation.count.distinct.sketch.family: "Alpha"
rule.aggregation.count.distinct.sketch.resize.factor: 8
result.metadata.metrics:
    - name: "Rule Identifier"
      key: "rule_id"
    - name: "Rule Body"
      key: "rule_body"
    - name: "Creation Time"
      key: "rule_receive_time"
    - name: "Termination Time"
      key: "rule_finish_time"
    - name: "Aggregation Metadata"
      key: "aggregation"
    - name: "Estimated Result"
      key: "wasEstimated"
    - name: "Standard Deviations"
      key: "standardDeviations"
    - name: "Sketch Family"
      key: "sketchFamily"
    - name: "Sketch Size"
      key: "sketchSize"
    - name: "Sketch Theta"
      key: "sketchTheta"