Add datasketches extension
rustyconover authored and carlopi committed Dec 31, 2024
1 parent 3e9ced3 commit 2139bb0
Showing 2 changed files with 80 additions and 0 deletions.
77 changes: 77 additions & 0 deletions extensions/datasketches/description.yml
@@ -0,0 +1,77 @@
docs:
extended_description: |
This extension provides an interface to the [Apache DataSketches](https://datasketches.apache.org/) library. It lets users efficiently compute approximate results for large datasets directly within DuckDB, using state-of-the-art streaming algorithms for distinct counting, quantile estimation, and more.
## Why use this extension?
DuckDB already has great implementations of [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog) via `approx_count_distinct(x)` and [TDigest](https://arxiv.org/abs/1902.04023) via `approx_quantile(x, pos)`, but it neither exposes the internal state of these aggregates nor allows the user to tune all of the sketch parameters. This extension allows data sketches to be serialized as `BLOB`s, which can be stored and shared across different systems, processes, and environments without loss of fidelity. This makes data sketches highly useful in distributed data processing pipelines.
This extension implements the following sketches from Apache DataSketches:
- Quantile Estimation
  - [TDigest](https://datasketches.apache.org/docs/tdigest/tdigest.html)
  - [Classic Quantile](https://datasketches.apache.org/docs/Quantiles/ClassicQuantilesSketch.html)
  - [Relative Error Quantile (REQ)](https://datasketches.apache.org/docs/REQ/ReqSketch.html)
  - [KLL](https://datasketches.apache.org/docs/KLL/KLLSketch.html)
- Approximate Distinct Count
  - [Compressed Probability Counting (CPC)](https://datasketches.apache.org/docs/CPC/CpcSketches.html)
  - [HyperLogLog (HLL)](https://datasketches.apache.org/docs/HLL/HllSketches.html)
For more information and usage details, see the [README](https://github.com/rustyconover/duckdb-datasketches).
hello_world: |
-- This demonstrates a single sketch type;
-- see the README for more sketches.
--
-- Let's simulate a temperature sensor.
CREATE TABLE readings(temp integer);
INSERT INTO readings(temp) select unnest(generate_series(1, 10));
-- Create a sketch by aggregating temp over the readings table.
SELECT datasketch_tdigest_rank(datasketch_tdigest(10, temp), 5) from readings;
┌────────────────────────────────────────────────────────────┐
│ datasketch_tdigest_rank(datasketch_tdigest(10, "temp"), 5) │
│ double │
├────────────────────────────────────────────────────────────┤
│ 0.45 │
└────────────────────────────────────────────────────────────┘
-- Put some more readings in at the high end.
INSERT INTO readings(temp) values (10), (10), (10), (10);
-- Now the rank of 5 has moved down.
SELECT datasketch_tdigest_rank(datasketch_tdigest(10, temp), 5) from readings;
┌────────────────────────────────────────────────────────────┐
│ datasketch_tdigest_rank(datasketch_tdigest(10, "temp"), 5) │
│ double │
├────────────────────────────────────────────────────────────┤
│ 0.32142857142857145 │
└────────────────────────────────────────────────────────────┘
-- Let's get the cumulative distribution function from the sketch.
SELECT datasketch_tdigest_cdf(datasketch_tdigest(10, temp), [1,5,9]) from readings;
┌──────────────────────────────────────────────────────────────────────────────────┐
│ datasketch_tdigest_cdf(datasketch_tdigest(10, "temp"), main.list_value(1, 5, 9)) │
│ double[] │
├──────────────────────────────────────────────────────────────────────────────────┤
│ [0.03571428571428571, 0.32142857142857145, 0.6071428571428571, 1.0] │
└──────────────────────────────────────────────────────────────────────────────────┘
-- The sketch can be persisted and updated later when more data
-- arrives without having to rescan the previously aggregated data.
SELECT datasketch_tdigest(10, temp) from readings;
datasketch_tdigest(10, "temp") = \x02\x01\x14\x0A\x00\x04\x00...
extension:
build: cmake
description: Uses the Apache DataSketches library to efficiently compute approximate distinct counts and quantile estimates, with sketches that can be serialized.
language: C++
license: MIT
maintainers:
- rustyconover
name: datasketches
version: 0.0.1
repo:
github: rustyconover/duckdb-datasketches
ref: 23815cae5260957439a1ab5c59d23a8850d85424
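
The rank and CDF values printed in the `hello_world` example can be reproduced by hand. The sketch below is plain Python, not the extension's API: `midpoint_rank` and `midpoint_cdf` are hypothetical helper names, and the midpoint-rank convention (values below the query point, plus ties at half weight, divided by the total count) is an assumption inferred from the outputs shown above rather than a statement about the library's internals. With only 10 to 14 readings, the approximate results happen to coincide with this exact calculation.

```python
from typing import List, Sequence


def midpoint_rank(values: Sequence[float], x: float) -> float:
    """Fraction of values below x, counting ties with x at half weight."""
    below = sum(1 for v in values if v < x)
    equal = sum(1 for v in values if v == x)
    return (below + 0.5 * equal) / len(values)


def midpoint_cdf(values: Sequence[float], split_points: Sequence[float]) -> List[float]:
    """Midpoint rank at each split point, plus a trailing 1.0 for the last bucket."""
    return [midpoint_rank(values, p) for p in split_points] + [1.0]


# The ten initial readings (1 through 10 inclusive).
readings = list(range(1, 11))
print(midpoint_rank(readings, 5))          # (4 + 0.5) / 10 = 0.45

# Four more readings of 10 arrive at the high end.
readings += [10, 10, 10, 10]
print(midpoint_rank(readings, 5))          # (4 + 0.5) / 14
print(midpoint_cdf(readings, [1, 5, 9]))   # 0.5/14, 4.5/14, 8.5/14, then 1.0
```

This also accounts for the CDF query returning four values for three split points: the final entry covers everything above the last split point and is therefore 1.0.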
3 changes: 3 additions & 0 deletions extensions/datasketches/docs/function_descriptions.csv
@@ -0,0 +1,3 @@
function,description,comment,example
"crypto_hash","Apply a cryptographic hash function specified as the first argument to the data supplied as the second argument.","","SELECT crypto_hash('md5', 'test');"
"crypto_hmac","Calculate a HMAC value","","SELECT crypto_hmac('sha2-256', 'secret key', 'secret message');"
