0.2.0 (Storm 0.10) GROUP BY and DISTINCT

@akshaisarma akshaisarma released this 26 Jan 22:56
· 169 commits to master since this release

Bullet now supports GROUP BY and DISTINCT. Operations such as SUM, COUNT, MIN, MAX and AVG can now be performed per group (a set of fields). DISTINCT is simply a GROUP BY without any operations.
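
To illustrate the semantics only, here is a minimal Python sketch of per-group operations and of DISTINCT as a GROUP BY with no operations. It is not Bullet's implementation or query syntax (see the README for that): the `group_by` helper, the `(operation, field)` tuple format and the sample records are hypothetical, and the grouping here is exact and done in memory.

```python
from collections import defaultdict

# Hypothetical reducers, named after the operations listed in these notes.
REDUCERS = {
    "COUNT": len,
    "SUM": sum,
    "MIN": min,
    "MAX": max,
    "AVG": lambda values: sum(values) / len(values),
}

def group_by(records, fields, operations=()):
    """Group records by `fields` and apply each (operation, field) pair per group.

    With no operations, the result is just the distinct field combinations.
    """
    groups = defaultdict(list)
    for record in records:
        key = tuple(record.get(f) for f in fields)
        groups[key].append(record)

    results = []
    for key, members in groups.items():
        row = dict(zip(fields, key))
        for op, field in operations:
            # COUNT needs no field; pass None to count the group's records.
            values = members if field is None else [m[field] for m in members]
            row[f"{op}({field or '*'})"] = REDUCERS[op](values)
        results.append(row)
    return results

records = [
    {"country": "US", "browser": "firefox", "duration": 10},
    {"country": "US", "browser": "firefox", "duration": 30},
    {"country": "CA", "browser": "chrome", "duration": 5},
]

# GROUP BY country, browser with SUM, AVG and COUNT per group.
per_group = group_by(
    records,
    ["country", "browser"],
    [("SUM", "duration"), ("AVG", "duration"), ("COUNT", None)],
)

# DISTINCT country, browser: the same grouping with no operations.
distinct = group_by(records, ["country", "browser"])
```

Here `per_group` holds one row per (country, browser) pair with its aggregates, and `distinct` holds only the unique pairs.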

Operations done using a GROUP aggregation also try to cast non-numeric values to numbers. This is particularly useful if your data schema does not use proper types. For instance, you may put a string, a number and a boolean into the same map (converting everything to strings) because these values are conceptually related. Because the number is now stored as a string, you would otherwise be unable to perform operations like SUM or AVG on it. If you run such operations on these mistyped fields, Bullet tries to cast each string to a number where possible.
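
The casting idea can be sketched in Python as follows. This is an illustration under assumptions, not Bullet's code: the `to_number` and `sum_field` helpers are hypothetical, and skipping values that cannot be cast (including booleans) is an assumption, since these notes do not say how such values are handled.

```python
def to_number(value):
    """Return `value` as a float if it can be read as a number, else None."""
    if isinstance(value, bool):
        return None  # assumption: booleans are not treated as numbers here
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def sum_field(records, field):
    """SUM over `field`, casting where possible and skipping what cannot be cast."""
    numbers = (to_number(record.get(field)) for record in records)
    return sum(n for n in numbers if n is not None)

# A field whose values were mostly stored as strings.
records = [
    {"value": "12.5"},   # numeric string: cast to 12.5
    {"value": "7"},      # numeric string: cast to 7.0
    {"value": "true"},   # not a number: skipped in this sketch
    {"value": 3},        # already a number
]

total = sum_field(records, "value")
```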

Several new settings related to these aggregations were added, and some existing ones changed. Previously, the aggregation max size setting also controlled the RAW aggregation size limits. RAW aggregations now have a separate setting for their size limits. The size for GROUP BY, including its maximum value, is driven by the aggregation size setting.

rule.aggregation.max.size: 512
rule.aggregation.raw.max.size: 30
rule.aggregation.group.sketch.entries: 512
rule.aggregation.group.sketch.sampling: 1.0
rule.aggregation.group.sketch.resize.factor: 8
result.metadata.metrics:
    - name: "Uniques Estimate"
      key: "uniquesEstimate"

See the README for examples and how to write these queries. See bullet_defaults.yaml for the default values and explanations for the new settings added.

Some methods were changed or moved, and package names were changed (com.yahoo.bullet.drpc -> com.yahoo.bullet.storm). These changes should be largely irrelevant unless you depended on the code directly.