[doc](agg_state) add agg_state in data-model (apache#24348)

add agg_state in data-model
wyxxxcat · Sep 15, 2023 · f57c75f
1 parent 15c8ff1
commit f57c75f

Showing 2 changed files with 219 additions and 2 deletions.
diff --git a/docs/en/docs/data-table/data-model.md b/docs/en/docs/data-table/data-model.md
@@ -88,12 +88,17 @@ As you can see, this is a typical fact table of user information and visit behav
 
 The columns in the table are divided into Key (dimension) columns and Value (indicator columns) based on whether they are set with an `AggregationType`. **Key** columns are not set with an  `AggregationType`, such as `user_id`, `date`, and  `age`, while **Value** columns are.
 
-When data are imported, rows with the same contents in the Key columns will be aggregated into one row, and their values in the Value columns will be aggregated as their `AggregationType` specify. Currently, their are four aggregation types:
+When data are imported, rows with the same contents in the Key columns will be aggregated into one row, and their values in the Value columns will be aggregated as their `AggregationType` specifies. Currently, the following aggregation methods are available, in addition to the "agg_state" type:
 
 1. SUM: Accumulate the values in multiple rows.
 2. REPLACE: The newly imported value will replace the previous value.
 3. MAX: Keep the maximum value.
 4. MIN: Keep the minimum value.
+5. REPLACE_IF_NOT_NULL: Non-null value replacement. Unlike REPLACE, it does not replace null values.
+6. HLL_UNION: Aggregation method for columns of HLL type, using the HyperLogLog algorithm for aggregation.
+7. BITMAP_UNION: Aggregation method for columns of BITMAP type, performing a union aggregation of bitmaps.
+
+If these aggregation methods cannot meet the requirements, you can choose to use the "agg_state" type.
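To make the difference between the simpler aggregation types concrete, here is a minimal Python sketch of how rows that share a key are folded together. It is a toy model, not Doris code: the `aggregate` helper, the `REDUCERS` table, and the row layout are all hypothetical, and HLL_UNION and BITMAP_UNION are left out.

```python
# Toy model of Aggregate-model semantics; this is NOT Doris's implementation.
# Each reducer takes (stored_value, incoming_value) and returns the new stored value.
REDUCERS = {
    "SUM": lambda old, new: old + new,
    "REPLACE": lambda old, new: new,
    "MAX": lambda old, new: max(old, new),
    "MIN": lambda old, new: min(old, new),
    "REPLACE_IF_NOT_NULL": lambda old, new: old if new is None else new,
}


def aggregate(rows, agg_types):
    """Fold (key, values) rows in import order; agg_types holds one type per value column."""
    table = {}
    for key, values in rows:
        if key not in table:
            # First row for this key is stored as-is.
            table[key] = list(values)
            continue
        stored = table[key]
        for i, agg in enumerate(agg_types):
            stored[i] = REDUCERS[agg](stored[i], values[i])
    return table


# Hypothetical import data: three rows with the same key.
rows = [
    ("u1", (10, "Beijing", 5, None)),
    ("u1", (20, "Shanghai", 3, "x")),
    ("u1", (5, "Shenzhen", 9, None)),
]
result = aggregate(rows, ["SUM", "REPLACE", "MAX", "REPLACE_IF_NOT_NULL"])
print(result)  # {'u1': [35, 'Shenzhen', 9, 'x']}
```

Note how the last column keeps `'x'` even though the final row carries a null: that is the difference between REPLACE_IF_NOT_NULL and REPLACE.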
 
 Suppose that you have the following import data (raw data):
 
@@ -268,6 +273,110 @@ In Doris, data aggregation happens in the following 3 stages:
 
 At different stages, data will be aggregated to varying degrees. For example, when a batch of data is just imported, it may not be aggregated with the existing data. But for users, they **can only query aggregated data**. That is, what users see are the aggregated data, and they **should not assume that what they have seen are not or partly aggregated**. (See the [Limitations of Aggregate Model](#Limitations of Aggregate Model) section for more details.)
 
+### agg_state
+
+`AGG_STATE` cannot be used as a key column. When creating a table, you must declare the signature of the aggregation function; you do not need to specify a length or default value. The actual storage size of the data depends on the function implementation.
+
+Create the table:
+```sql
+set enable_agg_state=true;
+create table aggstate(
+    k1 int null,
+    k2 agg_state sum(int),
+    k3 agg_state group_concat(string)
+)
+aggregate key (k1)
+distributed BY hash(k1) buckets 3
+properties("replication_num" = "1");
+```
+
+
+Here, "agg_state" declares the data type as "agg_state," and "sum/group_concat" are the signatures of the aggregation functions.
+
+Please note that "agg_state" is a data type, similar to "int," "array," or "string."
+
+
+"agg_state" can only be used in conjunction with the [state](../sql-manual/sql-functions/combinators/state.md)/[merge](../sql-manual/sql-functions/combinators/merge.md)/[union](../sql-manual/sql-functions/combinators/union.md) function combinators.
+
+"agg_state" represents an intermediate result of an aggregation function. For example, with the aggregation function "sum," "agg_state" can represent the intermediate state of summing values like sum(1, 2, 3, 4, 5), rather than the final result.
+
+The "agg_state" type needs to be generated using the "state" function. For the current table, it would be "sum_state" and "group_concat_state" for the "sum" and "group_concat" aggregation functions, respectively.
+
+```sql
+insert into aggstate values(1,sum_state(1),group_concat_state('a'));
+insert into aggstate values(1,sum_state(2),group_concat_state('b'));
+insert into aggstate values(1,sum_state(3),group_concat_state('c'));
+```
+
+At this point, the table contains only one row. Note that the table below is illustrative only; it cannot be selected and displayed directly:
+| k1 | k2 | k3 |
+| --- | --- | --- |
+| 1 | sum(1,2,3) | group_concat_state(a,b,c) |
+
+Insert another record.
+
+```sql
+insert into aggstate values(2,sum_state(4),group_concat_state('d'));
+```
+The table now contains two rows:
+| k1 | k2 | k3 |
+| --- | --- | --- |
+| 1 | sum(1,2,3) | group_concat_state(a,b,c) |
+| 2 | sum(4) | group_concat_state(d) |
+
+We can use the 'merge' operation to combine multiple states and return the final result calculated by the aggregation function.
+
+```
+mysql> select sum_merge(k2) from aggstate;
++---------------+
+| sum_merge(k2) |
++---------------+
+|            10 |
++---------------+
+```
+`sum_merge` will first combine sum(1,2,3) and sum(4) into sum(1,2,3,4), and return the calculated result.
+
+Because the result of `group_concat` depends on the order of its inputs, the merged result is not stable.
+```
+mysql> select group_concat_merge(k3) from aggstate;
++------------------------+
+| group_concat_merge(k3) |
++------------------------+
+| c,b,a,d                |
++------------------------+
+```
+
+If you do not want the final aggregation result, you can use 'union' to combine multiple intermediate aggregation results and generate a new intermediate result.
+```sql
+insert into aggstate select 3,sum_union(k2),group_concat_union(k3) from aggstate ;
+```
+After this insert, the table contains three rows:
+| k1 | k2 | k3 |
+| --- | --- | --- |
+| 1 | sum(1,2,3) | group_concat_state(a,b,c) |
+| 2 | sum(4) | group_concat_state(d) |
+| 3 | sum(1,2,3,4) | group_concat_state(a,b,c,d) |
+
+Querying the table now merges the states of all matching rows, including the new one:
+```
+mysql> select sum_merge(k2), group_concat_merge(k3) from aggstate;
++---------------+------------------------+
+| sum_merge(k2) | group_concat_merge(k3) |
++---------------+------------------------+
+|            20 | c,b,a,d,c,b,a,d        |
++---------------+------------------------+
+
+mysql> select sum_merge(k2), group_concat_merge(k3) from aggstate where k1 != 2;
++---------------+------------------------+
+| sum_merge(k2) | group_concat_merge(k3) |
++---------------+------------------------+
+|            16 | c,b,a,d,c,b,a          |
++---------------+------------------------+
+```
+With `agg_state`, users can perform finer-grained aggregation operations than the built-in aggregation types allow.
+
+Please note that `agg_state` comes with a certain performance overhead.
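The relationship between the `state`, `union`, and `merge` combinators can also be sketched outside of Doris. The Python model below replays the `sum` column of the walkthrough above. It is a toy under stated assumptions, not how Doris stores or computes states: `SumState`, `insert`, and the dictionary-backed `table` are hypothetical names, and the state is reduced to a running total.

```python
# Toy model of the state / union / merge combinators for sum.
# This is NOT Doris's implementation; it only mirrors the semantics described above.
from dataclasses import dataclass


@dataclass(frozen=True)
class SumState:
    """Intermediate state of sum; modeled here as just the running total."""
    total: int = 0


def sum_state(value):
    """state: wrap one input value into an intermediate state."""
    return SumState(value)


def sum_union(states):
    """union: combine several states into a new intermediate state."""
    total = 0
    for s in states:
        total += s.total
    return SumState(total)


def sum_merge(states):
    """merge: combine several states and return the final result."""
    return sum_union(states).total


table = {}  # k1 -> SumState, standing in for the aggregate-key table


def insert(k1, state):
    # Rows with the same key are combined, as the AGGREGATE KEY table does on import.
    table[k1] = sum_union([table[k1], state]) if k1 in table else state


insert(1, sum_state(1))
insert(1, sum_state(2))
insert(1, sum_state(3))
insert(2, sum_state(4))
total_before = sum_merge(table.values())      # 10

insert(3, sum_union(list(table.values())))    # store a new intermediate state
total_after = sum_merge(table.values())       # 20
filtered = sum_merge(s for k, s in table.items() if k != 2)  # 16

print(total_before, total_after, filtered)    # 10 20 16
```

The three printed values match the `sum_merge` results shown in the queries above: `union` keeps the data in state form so it can be stored again, while `merge` finalizes it into a plain value.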
+
 ## Unique Model
 
 In some multi-dimensional analysis scenarios, users are highly concerned about how to ensure the uniqueness of the Key, that is, how to create uniqueness constraints for the Primary Key. Therefore, we introduce the Unique Model. Prior to Doris 1.2, the Unique Model was essentially a special case of the Aggregate Model and a simplified representation of table schema. The Aggregate Model is implemented by Merge on Read, so it might not deliver high performance in some aggregation queries (see the [Limitations of Aggregate Model] (#Limitations of Aggregate Model) section). In Doris 1.2, we have introduced a new implementation for the Unique Model--Merge on Write, which can help achieve optimal query performance. For now, Merge on Read and Merge on Write will coexist in the Unique Model for a while, but in the future, we plan to make Merge on Write the default implementation of the Unique Model. The following will illustrate the two implementations with examples.

diff --git a/docs/zh-CN/docs/data-table/data-model.md b/docs/zh-CN/docs/data-table/data-model.md
@@ -90,12 +90,18 @@ PROPERTIES (
 
 表中的列按照是否设置了 `AggregationType`，分为 Key (维度列) 和 Value（指标列）。没有设置 `AggregationType` 的，如 `user_id`、`date`、`age` ... 等称为 **Key**，而设置了 `AggregationType` 的称为 **Value**。
 
-当我们导入数据时，对于 Key 列相同的行会聚合成一行，而 Value 列会按照设置的 `AggregationType` 进行聚合。 `AggregationType` 目前有以下四种聚合方式：
+当我们导入数据时，对于 Key 列相同的行会聚合成一行，而 Value 列会按照设置的 `AggregationType` 进行聚合。 `AggregationType` 目前有以下几种聚合方式和agg_state：
 
 1. SUM：求和，多行的 Value 进行累加。
 2. REPLACE：替代，下一批数据中的 Value 会替换之前导入过的行中的 Value。
 3. MAX：保留最大值。
 4. MIN：保留最小值。
+5. REPLACE_IF_NOT_NULL：非空值替换。和 REPLACE 的区别在于对于null值，不做替换。
+6. HLL_UNION：HLL 类型的列的聚合方式，通过 HyperLogLog 算法聚合。
+7. BITMAP_UNION：BITMAP 类型的列的聚合方式，进行位图的并集聚合。
+
+如果这几种聚合方式无法满足需求，则可以选择使用agg_state类型。
+
 
 假设我们有以下导入数据（原始数据）：
 
@@ -269,6 +275,108 @@ insert into example_db.example_tbl values
 
 数据在不同时间，可能聚合的程度不一致。比如一批数据刚导入时，可能还未与之前已存在的数据进行聚合。但是对于用户而言，用户**只能查询到**聚合后的数据。即不同的聚合程度对于用户查询而言是透明的。用户需始终认为数据以**最终的完成的聚合程度**存在，而**不应假设某些聚合还未发生**。（可参阅**聚合模型的局限性**一节获得更多详情。）
 
+### agg_state
+
+AGG_STATE不能作为key列使用，建表时需要同时声明聚合函数的签名。
+用户不需要指定长度和默认值。实际存储的数据大小与函数实现有关。
+
+建表
+```sql
+set enable_agg_state=true;
+create table aggstate(
+    k1 int null,
+    k2 agg_state sum(int),
+    k3 agg_state group_concat(string)
+)
+aggregate key (k1)
+distributed BY hash(k1) buckets 3
+properties("replication_num" = "1");
+```
+
+其中agg_state用于声明数据类型为agg_state，sum/group_concat为聚合函数的签名。
+注意agg_state是一种数据类型，同int/array/string
+
+agg_state只能配合[state](../sql-manual/sql-functions/combinators/state.md)
+    /[merge](../sql-manual/sql-functions/combinators/merge.md)/[union](../sql-manual/sql-functions/combinators/union.md)函数组合器使用。
+
+agg_state是聚合函数的中间结果，例如，聚合函数sum ， 则agg_state可以表示sum(1,2,3,4,5)的这个中间状态，而不是最终的结果。
+
+agg_state类型需要使用state函数来生成，对于当前的这个表，则为`sum_state`,`group_concat_state`。
+
+```sql
+insert into aggstate values(1,sum_state(1),group_concat_state('a'));
+insert into aggstate values(1,sum_state(2),group_concat_state('b'));
+insert into aggstate values(1,sum_state(3),group_concat_state('c'));
+```
+此时表只有一行 ( 注意，下面的表只是示意图，不是真的可以select显示出来)
+| k1      | k2        | k3 |               
+| --------------- | ----------- | --------------- | 
+| 1         | sum(1,2,3)    |   group_concat_state(a,b,c)              | 
+
+再插入一条数据
+
+```sql
+insert into aggstate values(2,sum_state(4),group_concat_state('d'));
+```
+此时表的结构为
+| k1      | k2        | k3 |               
+| --------------- | ----------- | --------------- | 
+| 1         | sum(1,2,3)    |   group_concat_state(a,b,c)              | 
+| 2         | sum(4)    |   group_concat_state(d)              |
+
+我们可以通过merge操作来合并多个state，并且返回最终聚合函数计算的结果
+
+```
+mysql> select sum_merge(k2) from aggstate;
++---------------+
+| sum_merge(k2) |
++---------------+
+|            10 |
++---------------+
+```
+`sum_merge` 会先把sum(1,2,3) 和 sum(4) 合并成 sum(1,2,3,4) ，并返回计算的结果。
+
+因为group_concat对于顺序有要求，所以结果是不稳定的。
+```
+mysql> select group_concat_merge(k3) from aggstate;
++------------------------+
+| group_concat_merge(k3) |
++------------------------+
+| c,b,a,d                |
++------------------------+
+```
+
+如果不想要聚合的最终结果，可以使用union来合并多个聚合的中间结果，生成一个新的中间结果。
+```sql
+insert into aggstate select 3,sum_union(k2),group_concat_union(k3) from aggstate ;
+```
+此时的表结构为
+| k1      | k2        | k3 |               
+| --------------- | ----------- | --------------- | 
+| 1         | sum(1,2,3)    |   group_concat_state(a,b,c)              | 
+| 2         | sum(4)    |   group_concat_state(d)              |
+| 3         | sum(1,2,3,4)    |   group_concat_state(a,b,c,d)              |
+
+可以通过查询
+```
+mysql> select sum_merge(k2), group_concat_merge(k3) from aggstate;
++---------------+------------------------+
+| sum_merge(k2) | group_concat_merge(k3) |
++---------------+------------------------+
+|            20 | c,b,a,d,c,b,a,d        |
++---------------+------------------------+
+
+mysql> select sum_merge(k2), group_concat_merge(k3) from aggstate where k1 != 2;
++---------------+------------------------+
+| sum_merge(k2) | group_concat_merge(k3) |
++---------------+------------------------+
+|            16 | c,b,a,d,c,b,a          |
++---------------+------------------------+
+```
+用户可以通过agg_state做出更细致的聚合函数操作。
+
+注意 agg_state 存在一定的性能开销
+
 ## Unique 模型
 
 在某些多维分析场景下，用户更关注的是如何保证 Key 的唯一性，即如何获得 Primary Key 唯一性约束。因此，我们引入了 Unique 数据模型。在1.2版本之前，该模型本质上是聚合模型的一个特例，也是一种简化的表结构表示方式。由于聚合模型的实现方式是读时合并（merge on read)，因此在一些聚合查询上性能不佳（参考后续章节[聚合模型的局限性](#聚合模型的局限性)的描述），在1.2版本我们引入了Unique模型新的实现方式，写时合并（merge on write），通过在写入时做一些额外的工作，实现了最优的查询性能。写时合并将在未来替换读时合并成为Unique模型的默认实现方式，两者将会短暂的共存一段时间。下面将对两种实现方式分别举例进行说明。