[docs] Add doc about first row merge engin

alibaba · Dec 25, 2024 · 816d815 · 816d815
1 parent 7f37bbb
commit 816d815
Show file tree

Hide file tree

Showing 6 changed files with 153 additions and 91 deletions.
diff --git a/website/docs/table-design/table-types/pk-table.md b/website/docs/table-design/table-types/pk-table.md
diff --git a/website/docs/table-design/table-types/pk-table/_category_.json b/website/docs/table-design/table-types/pk-table/_category_.json
@@ -0,0 +1,4 @@
+{
+  "label": "PrimaryKey Table",
+  "position": 1
+}
diff --git a/website/docs/table-design/table-types/pk-table/merge-engine/_category_.json b/website/docs/table-design/table-types/pk-table/merge-engine/_category_.json
@@ -0,0 +1,4 @@
+{
+  "label": "Merge Engine",
+  "position": 2
+}
diff --git a/website/docs/table-design/table-types/pk-table/merge-engine/first-row.md b/website/docs/table-design/table-types/pk-table/merge-engine/first-row.md
@@ -0,0 +1,16 @@
+---
+sidebar_label: First Row
+sidebar_position: 2
+---
+
+# First Row
+
+By specifying `'table.merge-engine' = 'first_row'`, users can keep the first row of the same primary key. It'll only
+generate insert only change log, so that the downstream table of it can be append-only table.
+
+:::note
+When using `first_row` merge engine, there are the following limits:
+
+- `UPDATE` and `DELETE` statements are not supported
+- Partial update is not supported
+  :::
diff --git a/website/docs/table-design/table-types/pk-table/merge-engine/overview.md b/website/docs/table-design/table-types/pk-table/merge-engine/overview.md
@@ -0,0 +1,15 @@
+---
+sidebar_label: Overview
+sidebar_position: 1
+---
+
+# Overview
+
+When Fluss sink receives two or more rows with the same primary key for primary key table, it will merge them into
+one row to keep primary key unique.
+By default, it will only keep the latest row in the table. But by specifying the `table.merge-engine` table property,
+users can choose how rows are merge into one row.
+
+The following merge engines are supported:
+
+1. [First Row](table-design/table-types/pk-table/merge-engine/first-row.md)
diff --git a/website/docs/table-design/table-types/pk-table/overview.md b/website/docs/table-design/table-types/pk-table/overview.md
@@ -0,0 +1,114 @@
+---
+sidebar_position: 1
+---
+
+# Overview
+
+## Basic Concept
+
+PrimaryKey Table in Fluss ensure the uniqueness of the specified primary key and supports `INSERT`, `UPDATE`,
+and `DELETE` operations.
+
+A PrimaryKey Table is created by specifying a `PRIMARY KEY` clause in the `CREATE TABLE` statement. For example, the
+following Flink SQL statement creates a PrimaryKey Table with `shop_id` and `user_id` as the primary key and distributes
+the data into 4 buckets:
+
+```sql title="Flink SQL"
+CREATE TABLE pk_table
+(
+    shop_id      BIGINT,
+    user_id      BIGINT,
+    num_orders   INT,
+    total_amount INT,
+    PRIMARY KEY (shop_id, user_id) NOT ENFORCED
+) WITH (
+      'bucket.num' = '4'
+      );
+```
+
+In Fluss primary key table, each row of data has a unique primary key.
+If multiple entries with the same primary key are written to the Fluss primary key table, only the last entry will be
+retained.
+
+For [Partitioned PrimaryKey Table](table-design/data-distribution/partitioning.md), the primary key must contain the
+partition key.
+
+## Bucket Assigning
+
+For primary key tables, Fluss always determines which bucket the data belongs to based on the hash value of the primary
+key for each record.
+Data with the same hash value will be distributed to the same bucket.
+
+## Partial Update
+
+For primary key tables, Fluss supports partial column updates, allowing you to write only a subset of columns to
+incrementally update the data and ultimately achieve complete data. Note that the columns being written must include the
+primary key column.
+
+For example, consider the following Fluss primary key table:
+
+```sql title="Flink SQL"
+CREATE TABLE T
+(
+    k  INT,
+    v1 DOUBLE,
+    v2 STRING,
+    PRIMARY KEY (k) NOT ENFORCED
+);
+```
+
+Assuming that at the beginning, only the `k` and `v1` columns are written with the data `+I(1, 2.0)`, `+I(2, 3.0)`, the
+data in T is as follows:
+
+| k | v1  | v2   |
+|---|-----|------|
+| 1 | 2.0 | null |
+| 2 | 3.0 | null |
+
+Then write to the `k` and `v2` columns with the data `+I(1, 't1')`, `+I(2, 't2')`, resulting in the data in T as
+follows:
+
+| k | v1  | v2 |
+|---|-----|----|
+| 1 | 2.0 | t1 |
+| 2 | 3.0 | t2 |
+
+## Data Queries
+
+For primary key tables, Fluss supports querying data directly based on the key. Please refer to
+the [Flink Reads](../../../engine-flink/reads.md) for detailed instructions.
+
+## Changelog Generation
+
+Fluss will capture the changes when inserting, updating, deleting records on the primary-key table, which is known as
+the changelog. Downstream consumers can directly consume the changelog to obtain the changes in the table. For example,
+consider the following primary key table in Fluss:
+
+```sql title="Flink SQL"
+CREATE TABLE T
+(
+    k  INT,
+    v1 DOUBLE,
+    v2 STRING,
+    PRIMARY KEY (k) NOT ENFORCED
+);
+```
+
+If the data written to the primary-key table is
+sequentially `+I(1, 2.0, 'apple')`, `+I(1, 4.0, 'banana')`, `-D(1, 4.0, 'banana')`, then the following change data will
+be generated.
+
+```text
++I(1, 2.0, 'apple')
+-U(1, 2.0, 'apple')
++U(1, 4.0, 'banana')
+-D(1, 4.0, 'banana')
+```
+
+## Data Consumption
+
+For a primary key table, the default consumption method is a full snapshot followed by incremental data. First, the
+snapshot data of the table is consumed, followed by the binlog data of the table.
+
+It is also possible to only consume the binlog data of the table. For more details, please refer to
+the [Flink Reads](../../../engine-flink/reads.md)