From 816d815fb377f26b8faa030ca8768dd858f4915c Mon Sep 17 00:00:00 2001 From: luoyuxia Date: Wed, 25 Dec 2024 14:50:29 +0800 Subject: [PATCH] [docs] Add doc about first row merge engin --- .../docs/table-design/table-types/pk-table.md | 91 -------------- .../table-types/pk-table/_category_.json | 4 + .../pk-table/merge-engine/_category_.json | 4 + .../pk-table/merge-engine/first-row.md | 16 +++ .../pk-table/merge-engine/overview.md | 15 +++ .../table-types/pk-table/overview.md | 114 ++++++++++++++++++ 6 files changed, 153 insertions(+), 91 deletions(-) delete mode 100644 website/docs/table-design/table-types/pk-table.md create mode 100644 website/docs/table-design/table-types/pk-table/_category_.json create mode 100644 website/docs/table-design/table-types/pk-table/merge-engine/_category_.json create mode 100644 website/docs/table-design/table-types/pk-table/merge-engine/first-row.md create mode 100644 website/docs/table-design/table-types/pk-table/merge-engine/overview.md create mode 100644 website/docs/table-design/table-types/pk-table/overview.md diff --git a/website/docs/table-design/table-types/pk-table.md b/website/docs/table-design/table-types/pk-table.md deleted file mode 100644 index f3bc4750..00000000 --- a/website/docs/table-design/table-types/pk-table.md +++ /dev/null @@ -1,91 +0,0 @@ ---- -sidebar_position: 2 ---- - -# PrimaryKey Table - -## Basic Concept - -PrimaryKey Table in Fluss ensure the uniqueness of the specified primary key and supports `INSERT`, `UPDATE`, and `DELETE` operations. - -A PrimaryKey Table is created by specifying a `PRIMARY KEY` clause in the `CREATE TABLE` statement. For example, the following Flink SQL statement creates a PrimaryKey Table with `shop_id` and `user_id` as the primary key and distributes the data into 4 buckets: - -```sql title="Flink SQL" -CREATE TABLE pk_table ( - shop_id BIGINT, - user_id BIGINT, - num_orders INT, - total_amount INT, - PRIMARY KEY (shop_id, user_id) NOT ENFORCED -) WITH ( - 'bucket.num' = '4' -); -``` - -In Fluss primary key table, each row of data has a unique primary key. -If multiple entries with the same primary key are written to the Fluss primary key table, only the last entry will be retained. - -For [Partitioned PrimaryKey Table](table-design/data-distribution/partitioning.md), the primary key must contain the partition key. - -## Bucket Assigning - -For primary key tables, Fluss always determines which bucket the data belongs to based on the hash value of the primary key for each record. -Data with the same hash value will be distributed to the same bucket. - -## Partial Update -For primary key tables, Fluss supports partial column updates, allowing you to write only a subset of columns to incrementally update the data and ultimately achieve complete data. Note that the columns being written must include the primary key column. - -For example, consider the following Fluss primary key table: -```sql title="Flink SQL" -CREATE TABLE T ( - k INT, - v1 DOUBLE, - v2 STRING, - PRIMARY KEY (k) NOT ENFORCED -); -``` - -Assuming that at the beginning, only the `k` and `v1` columns are written with the data `+I(1, 2.0)`, `+I(2, 3.0)`, the data in T is as follows: - -| k | v1 | v2 | -|---|-----|------| -| 1 | 2.0 | null | -| 2 | 3.0 | null | - -Then write to the `k` and `v2` columns with the data `+I(1, 't1')`, `+I(2, 't2')`, resulting in the data in T as follows: - -| k | v1 | v2 | -|---|-----|----| -| 1 | 2.0 | t1 | -| 2 | 3.0 | t2 | - -## Data Queries - -For primary key tables, Fluss supports querying data directly based on the key. Please refer to the [Flink Reads](../../engine-flink/reads.md) for detailed instructions. - -## Changelog Generation - -Fluss will capture the changes when inserting, updating, deleting records on the primary-key table, which is known as the changelog. Downstream consumers can directly consume the changelog to obtain the changes in the table. For example, consider the following primary key table in Fluss: - -```sql title="Flink SQL" -CREATE TABLE T ( - k INT, - v1 DOUBLE, - v2 STRING, - PRIMARY KEY (k) NOT ENFORCED -); -``` - -If the data written to the primary-key table is sequentially `+I(1, 2.0, 'apple')`, `+I(1, 4.0, 'banana')`, `-D(1, 4.0, 'banana')`, then the following change data will be generated. - -```text -+I(1, 2.0, 'apple') --U(1, 2.0, 'apple') -+U(1, 4.0, 'banana') --D(1, 4.0, 'banana') -``` - -## Data Consumption -For a primary key table, the default consumption method is a full snapshot followed by incremental data. First, the snapshot data of the table is consumed, followed by the binlog data of the table. - -It is also possible to only consume the binlog data of the table. For more details, please refer to the [Flink Reads](../../engine-flink/reads.md) diff --git a/website/docs/table-design/table-types/pk-table/_category_.json b/website/docs/table-design/table-types/pk-table/_category_.json new file mode 100644 index 00000000..2aade25d --- /dev/null +++ b/website/docs/table-design/table-types/pk-table/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "PrimaryKey Table", + "position": 1 +} diff --git a/website/docs/table-design/table-types/pk-table/merge-engine/_category_.json b/website/docs/table-design/table-types/pk-table/merge-engine/_category_.json new file mode 100644 index 00000000..66d04b3c --- /dev/null +++ b/website/docs/table-design/table-types/pk-table/merge-engine/_category_.json @@ -0,0 +1,4 @@ +{ + "label": "Merge Engine", + "position": 2 +} diff --git a/website/docs/table-design/table-types/pk-table/merge-engine/first-row.md b/website/docs/table-design/table-types/pk-table/merge-engine/first-row.md new file mode 100644 index 00000000..7baa29f3 --- /dev/null +++ b/website/docs/table-design/table-types/pk-table/merge-engine/first-row.md @@ -0,0 +1,16 @@ +--- +sidebar_label: First Row +sidebar_position: 2 +--- + +# First Row + +By specifying `'table.merge-engine' = 'first_row'`, users can keep the first row of the same primary key. It'll only +generate insert only change log, so that the downstream table of it can be append-only table. + +:::note +When using `first_row` merge engine, there are the following limits: + +- `UPDATE` and `DELETE` statements are not supported +- Partial update is not supported + ::: \ No newline at end of file diff --git a/website/docs/table-design/table-types/pk-table/merge-engine/overview.md b/website/docs/table-design/table-types/pk-table/merge-engine/overview.md new file mode 100644 index 00000000..4375ddc5 --- /dev/null +++ b/website/docs/table-design/table-types/pk-table/merge-engine/overview.md @@ -0,0 +1,15 @@ +--- +sidebar_label: Overview +sidebar_position: 1 +--- + +# Overview + +When Fluss sink receives two or more rows with the same primary key for primary key table, it will merge them into +one row to keep primary key unique. +By default, it will only keep the latest row in the table. But by specifying the `table.merge-engine` table property, +users can choose how rows are merge into one row. + +The following merge engines are supported: + +1. [First Row](table-design/table-types/pk-table/merge-engine/first-row.md) \ No newline at end of file diff --git a/website/docs/table-design/table-types/pk-table/overview.md b/website/docs/table-design/table-types/pk-table/overview.md new file mode 100644 index 00000000..855efac4 --- /dev/null +++ b/website/docs/table-design/table-types/pk-table/overview.md @@ -0,0 +1,114 @@ +--- +sidebar_position: 1 +--- + +# Overview + +## Basic Concept + +PrimaryKey Table in Fluss ensure the uniqueness of the specified primary key and supports `INSERT`, `UPDATE`, +and `DELETE` operations. + +A PrimaryKey Table is created by specifying a `PRIMARY KEY` clause in the `CREATE TABLE` statement. For example, the +following Flink SQL statement creates a PrimaryKey Table with `shop_id` and `user_id` as the primary key and distributes +the data into 4 buckets: + +```sql title="Flink SQL" +CREATE TABLE pk_table +( + shop_id BIGINT, + user_id BIGINT, + num_orders INT, + total_amount INT, + PRIMARY KEY (shop_id, user_id) NOT ENFORCED +) WITH ( + 'bucket.num' = '4' + ); +``` + +In Fluss primary key table, each row of data has a unique primary key. +If multiple entries with the same primary key are written to the Fluss primary key table, only the last entry will be +retained. + +For [Partitioned PrimaryKey Table](table-design/data-distribution/partitioning.md), the primary key must contain the +partition key. + +## Bucket Assigning + +For primary key tables, Fluss always determines which bucket the data belongs to based on the hash value of the primary +key for each record. +Data with the same hash value will be distributed to the same bucket. + +## Partial Update + +For primary key tables, Fluss supports partial column updates, allowing you to write only a subset of columns to +incrementally update the data and ultimately achieve complete data. Note that the columns being written must include the +primary key column. + +For example, consider the following Fluss primary key table: + +```sql title="Flink SQL" +CREATE TABLE T +( + k INT, + v1 DOUBLE, + v2 STRING, + PRIMARY KEY (k) NOT ENFORCED +); +``` + +Assuming that at the beginning, only the `k` and `v1` columns are written with the data `+I(1, 2.0)`, `+I(2, 3.0)`, the +data in T is as follows: + +| k | v1 | v2 | +|---|-----|------| +| 1 | 2.0 | null | +| 2 | 3.0 | null | + +Then write to the `k` and `v2` columns with the data `+I(1, 't1')`, `+I(2, 't2')`, resulting in the data in T as +follows: + +| k | v1 | v2 | +|---|-----|----| +| 1 | 2.0 | t1 | +| 2 | 3.0 | t2 | + +## Data Queries + +For primary key tables, Fluss supports querying data directly based on the key. Please refer to +the [Flink Reads](../../../engine-flink/reads.md) for detailed instructions. + +## Changelog Generation + +Fluss will capture the changes when inserting, updating, deleting records on the primary-key table, which is known as +the changelog. Downstream consumers can directly consume the changelog to obtain the changes in the table. For example, +consider the following primary key table in Fluss: + +```sql title="Flink SQL" +CREATE TABLE T +( + k INT, + v1 DOUBLE, + v2 STRING, + PRIMARY KEY (k) NOT ENFORCED +); +``` + +If the data written to the primary-key table is +sequentially `+I(1, 2.0, 'apple')`, `+I(1, 4.0, 'banana')`, `-D(1, 4.0, 'banana')`, then the following change data will +be generated. + +```text ++I(1, 2.0, 'apple') +-U(1, 2.0, 'apple') ++U(1, 4.0, 'banana') +-D(1, 4.0, 'banana') +``` + +## Data Consumption + +For a primary key table, the default consumption method is a full snapshot followed by incremental data. First, the +snapshot data of the table is consumed, followed by the binlog data of the table. + +It is also possible to only consume the binlog data of the table. For more details, please refer to +the [Flink Reads](../../../engine-flink/reads.md)