
Releases: bullet-db/bullet-storm

0.2.1 (Storm 1.0) Acking support and bug fixes

16 Feb 01:11

Acking support is enabled by replacing a Storm-provided component (backtype.storm.drpc.PrepareRequest) with a custom one, PrepareRequestBolt.java. The custom component does not anchor tuples, while the original did. Because DRPC tuples were anchored, any query lasting longer than the topology-level timeout (default 30 s) could not be acked in time and was automatically failed. Since this would not work for Bullet, we previously recommended turning off acking entirely. With this change, you can now enable acking. This lets you control how your data source (spout or topology) emits data to the FilterBolts, giving you reliability if you need it. As before, anchoring is not done in the Bullet components, due to how many tuples may potentially need to be kept track of for aggregations.

Bug fix: The setting rule.aggregation.raw.max.size was not being honored. It now is.

Bug fix: You could previously submit a GROUP type aggregation query that specified no fields or operations. This caused the worker to crash. It now reports an error.

#11

0.2.1 (Storm 0.10) Acking support and bug fixes

16 Feb 01:10

Acking support is enabled by replacing a Storm-provided component (backtype.storm.drpc.PrepareRequest) with a custom one, PrepareRequestBolt.java. The custom component does not anchor tuples, while the original did. Because DRPC tuples were anchored, any query lasting longer than the topology-level timeout (default 30 s) could not be acked in time and was automatically failed. Since this would not work for Bullet, we previously recommended turning off acking entirely. With this change, you can now enable acking. This lets you control how your data source (spout or topology) emits data to the FilterBolts, giving you reliability if you need it. As before, anchoring is not done in the Bullet components, due to how many tuples may potentially need to be kept track of for aggregations.

Bug fix: The setting rule.aggregation.raw.max.size was not being honored. It now is.

Bug fix: You could previously submit a GROUP type aggregation query that specified no fields or operations. This caused the worker to crash. It now reports an error.

#11

0.2.0 (Storm 1.0) GROUP BY and DISTINCT

26 Jan 22:55

Bullet now supports GROUP BY and DISTINCT. All the operations like SUM, COUNT, MIN, MAX, and AVG can now be performed per group (a set of fields). DISTINCT is simply a GROUP BY without any operations.
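As a rough sketch of what such a query might look like (the field names here are hypothetical, and the exact query shape may differ; see the README for the authoritative format):

```json
{
  "filters": [],
  "aggregation": {
    "type": "GROUP",
    "size": 500,
    "fields": { "demographics.country": "country" },
    "attributes": {
      "operations": [
        { "type": "COUNT", "newName": "count" },
        { "type": "AVG", "field": "duration", "newName": "avg_duration" }
      ]
    }
  }
}
```

A DISTINCT query would follow the same shape, with `fields` set but no `operations`.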

Operations done using a GROUP aggregation also try to cast non-numeric values to numbers. This is particularly useful if your data schema does not use proper types. For instance, you may put a string, a number, and a boolean into the same map (converting everything to strings) because the key values are conceptually related. Normally, converting your number to a string would prevent operations like SUM or AVG on it; Bullet will instead try to cast the string back to a number, if possible, when you perform such operations on these mistyped fields.

Several new settings related to these new aggregations were added, and existing ones changed. Previously, the aggregation max size setting controlled the RAW aggregation size limits. There is now a separate setting for the RAW aggregation size limits, while the size for GROUP BY (and its maximum) is still driven by the aggregation size settings.

rule.aggregation.max.size: 512
rule.aggregation.raw.max.size: 30
rule.aggregation.group.sketch.entries: 512
rule.aggregation.group.sketch.sampling: 1.0
rule.aggregation.group.sketch.resize.factor: 8
result.metadata.metrics:
    - name: "Uniques Estimate"
      key: "uniquesEstimate"

See the README for examples and how to write these queries. See bullet_defaults.yaml for the default values and explanations for the new settings added.

There were some changes to methods and their locations and package names (com.yahoo.bullet.drpc -> com.yahoo.bullet.storm). These should be largely irrelevant unless you were depending on the code directly.

0.2.0 (Storm 0.10) GROUP BY and DISTINCT

26 Jan 22:56

Bullet now supports GROUP BY and DISTINCT. All the operations like SUM, COUNT, MIN, MAX, and AVG can now be performed per group (a set of fields). DISTINCT is simply a GROUP BY without any operations.

Operations done using a GROUP aggregation also try to cast non-numeric values to numbers. This is particularly useful if your data schema does not use proper types. For instance, you may put a string, a number, and a boolean into the same map (converting everything to strings) because the key values are conceptually related. Normally, converting your number to a string would prevent operations like SUM or AVG on it; Bullet will instead try to cast the string back to a number, if possible, when you perform such operations on these mistyped fields.

Several new settings related to these new aggregations were added, and existing ones changed. Previously, the aggregation max size setting controlled the RAW aggregation size limits. There is now a separate setting for the RAW aggregation size limits, while the size for GROUP BY (and its maximum) is still driven by the aggregation size settings.

rule.aggregation.max.size: 512
rule.aggregation.raw.max.size: 30
rule.aggregation.group.sketch.entries: 512
rule.aggregation.group.sketch.sampling: 1.0
rule.aggregation.group.sketch.resize.factor: 8
result.metadata.metrics:
    - name: "Uniques Estimate"
      key: "uniquesEstimate"

See the README for examples and how to write these queries. See bullet_defaults.yaml for the default values and explanations for the new settings added.

There were some changes to methods and their locations and package names (com.yahoo.bullet.drpc -> com.yahoo.bullet.storm). These should be largely irrelevant unless you were depending on the code directly.

0.1.0 (Storm 0.10) COUNT DISTINCT and micro-batching

10 Jan 01:07

This release adds the first of the DataSketches-based aggregations: COUNT DISTINCT. It enables you to count distinct values for a set of fields. The count is exact up to a certain number of values and approximate after that; however, the error bounds are completely measurable. The result metadata field will expose the standard deviations and other metadata to you, if you choose to turn it on. See the new settings added below.
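As a rough sketch of such a query (the field names here are hypothetical, and the exact query shape may differ; see the README for the authoritative format):

```json
{
  "aggregation": {
    "type": "COUNT DISTINCT",
    "fields": { "userId": "userId" },
    "attributes": { "newName": "uniqueUsers" }
  }
}
```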

This release also makes the RAW (LIMIT) aggregation capable of configurable micro-batching. Previously, it was hard-coded to use a micro-batch of size 1, which made the RAW aggregation return very quickly once the overall number of records the query was looking for was reached. With configurable micro-batches, you can trade off that latency to reduce the number of times a Filter Bolt emits a batch of records to the Join Bolt for a query, if this is something you need to tweak.

The new settings added/modified in this release (take a look at bullet_defaults.yaml for what these mean):

rule.aggregation.composite.field.separator: "|"
rule.aggregation.raw.micro.batch.size: 1
rule.aggregation.count.distinct.sketch.entries: 16384
rule.aggregation.count.distinct.sketch.sampling: 1.0
rule.aggregation.count.distinct.sketch.family: "Alpha"
rule.aggregation.count.distinct.sketch.resize.factor: 8
result.metadata.metrics:
    - name: "Rule Identifier"
      key: "rule_id"
    - name: "Rule Body"
      key: "rule_body"
    - name: "Creation Time"
      key: "rule_receive_time"
    - name: "Termination Time"
      key: "rule_finish_time"
    - name: "Aggregation Metadata"
      key: "aggregation"
    - name: "Estimated Result"
      key: "wasEstimated"
    - name: "Standard Deviations"
      key: "standardDeviations"
    - name: "Sketch Family"
      key: "sketchFamily"
    - name: "Sketch Size"
      key: "sketchSize"
    - name: "Sketch Theta"
      key: "sketchTheta"

0.1.0 (Storm 1.0) COUNT DISTINCT and micro-batching

10 Jan 01:06

This release adds the first of the DataSketches-based aggregations: COUNT DISTINCT. It enables you to count distinct values for a set of fields. The count is exact up to a certain number of values and approximate after that; however, the error bounds are completely measurable. The result metadata field will expose the standard deviations and other metadata to you, if you choose to turn it on. See the new settings added below.

This release also makes the RAW (LIMIT) aggregation capable of configurable micro-batching. Previously, it was hard-coded to use a micro-batch of size 1, which made the RAW aggregation return very quickly once the overall number of records the query was looking for was reached. With configurable micro-batches, you can trade off that latency to reduce the number of times a Filter Bolt emits a batch of records to the Join Bolt for a query, if this is something you need to tweak.

The new settings added/modified in this release (take a look at bullet_defaults.yaml for what these mean):

rule.aggregation.composite.field.separator: "|"
rule.aggregation.raw.micro.batch.size: 1
rule.aggregation.count.distinct.sketch.entries: 16384
rule.aggregation.count.distinct.sketch.sampling: 1.0
rule.aggregation.count.distinct.sketch.family: "Alpha"
rule.aggregation.count.distinct.sketch.resize.factor: 8
result.metadata.metrics:
    - name: "Rule Identifier"
      key: "rule_id"
    - name: "Rule Body"
      key: "rule_body"
    - name: "Creation Time"
      key: "rule_receive_time"
    - name: "Termination Time"
      key: "rule_finish_time"
    - name: "Aggregation Metadata"
      key: "aggregation"
    - name: "Estimated Result"
      key: "wasEstimated"
    - name: "Standard Deviations"
      key: "standardDeviations"
    - name: "Sketch Family"
      key: "sketchFamily"
    - name: "Sketch Size"
      key: "sketchSize"
    - name: "Sketch Theta"
      key: "sketchTheta"

0.0.3 (Storm 0.10) The rest of the Group All operations

22 Dec 21:48

You can now perform SUM, COUNT, MIN, MAX and AVG across your entire dataset (as opposed to per group).
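As a hypothetical sketch of a Group All query (the field names and exact query shape are illustrative only; see the README for the authoritative format):

```json
{
  "aggregation": {
    "type": "GROUP",
    "attributes": {
      "operations": [
        { "type": "SUM", "field": "price", "newName": "totalPrice" },
        { "type": "COUNT", "newName": "numRecords" }
      ]
    }
  }
}
```

With no `fields` specified, the operations apply across the entire dataset rather than per group.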

0.0.3 (Storm 1.0) The rest of the Group All operations

22 Dec 21:48

You can now perform SUM, COUNT, MIN, MAX and AVG across your entire dataset (as opposed to per group).

0.0.2 (Storm 0.10) Removing localGrouping

21 Dec 19:34

This change removes the localOrShuffleGrouping partitioning used when building the Storm topology. This is relevant when your DataSource spouts do not have the same number of executors as the FilterBolts, causing skew.

localOrShuffleGrouping prefers local executors, but Bullet requires equal load on each FilterBolt, since all of them run all of the queries simultaneously. Skew reduces overall scalability.

0.0.2 (Storm 1.0) Removing localGrouping

21 Dec 19:34

This change removes the localOrShuffleGrouping partitioning used when building the Storm topology. This is relevant when your DataSource spouts do not have the same number of executors as the FilterBolts, causing skew.

localOrShuffleGrouping prefers local executors, but Bullet requires equal load on each FilterBolt, since all of them run all of the queries simultaneously. Skew reduces overall scalability.