
Commit

Resolve code conflicts.
zhuangchong committed Apr 26, 2024
2 parents 355c78e + 4e63f55 commit c6abd4c
Showing 394 changed files with 12,368 additions and 6,233 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/docs-tests.yml
@@ -35,7 +35,7 @@ jobs:
- name: Setup Hugo
uses: peaceiris/actions-hugo@v2
with:
hugo-version: 'latest'
hugo-version: '0.124.1'
extended: true

- name: Build
2 changes: 1 addition & 1 deletion docs/content/_index.md
@@ -46,7 +46,7 @@ Paimon offers the following core capabilities:
## Try Paimon

If you’re interested in playing around with Paimon, check out our
quick start guide with [Flink]({{< ref "engines/flink" >}}), [Spark]({{< ref "engines/spark" >}}) or [Hive]({{< ref "engines/hive" >}}). It provides a step by
quick start guide with [Flink]({{< ref "flink/quick-start" >}}) or [Spark]({{< ref "spark/quick-start" >}}). It provides a step by
step introduction to the APIs and guides you through real applications.

<--->
66 changes: 66 additions & 0 deletions docs/content/concepts/concurrency-control.md
@@ -0,0 +1,66 @@
---
title: "Concurrency Control"
weight: 3
type: docs
aliases:
- /concepts/concurrency-control.html
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Concurrency Control

Paimon supports optimistic concurrency for multiple concurrent write jobs.

Each job writes data at its own pace and generates a new snapshot based on the current snapshot by applying incremental
files (deleting or adding files) at the time of committing.

There may be two types of commit failures here:
1. Snapshot conflict: the snapshot ID has been preempted; another job has already generated a new snapshot for the table. In this case the job simply commits again.
2. Files conflict: a file that this job wants to delete has already been deleted by another job. At this point the commit can only fail. (A streaming job will fail and restart, intentionally triggering a single failover.)

## Snapshot conflict

Paimon's snapshot ID is unique, so as long as the job writes its snapshot file to the file system, it is considered successful.

{{< img src="/img/snapshot-conflict.png">}}

Paimon uses the file system's renaming mechanism to commit snapshots, which is secure for HDFS as it ensures
transactional and atomic renaming.

But for object stores such as OSS and S3, `'RENAME'` does not have atomic semantics. In that case you need to configure a Hive or
JDBC metastore and enable the `'lock.enabled'` option for the catalog. Otherwise, a snapshot may be lost.
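
As a rough illustration, a Flink SQL catalog definition with the lock enabled could look like the sketch below; the catalog name, metastore URI and warehouse path are placeholders:

```sql
-- Minimal sketch: a Paimon catalog backed by a Hive metastore with catalog locks enabled,
-- so that snapshot commits on object storage do not depend on atomic RENAME.
CREATE CATALOG my_locked_catalog WITH (
    'type' = 'paimon',
    'metastore' = 'hive',
    'uri' = 'thrift://<hive-metastore-host-name>:<port>',
    'warehouse' = 's3://<bucket>/warehouse',
    'lock.enabled' = 'true'
);

USE CATALOG my_locked_catalog;
```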

## Files conflict

When Paimon commits a file deletion (which is only a logical deletion), it checks for conflicts with the latest snapshot.
If there is a conflict (meaning the file has already been logically deleted), the commit cannot proceed on this node,
so the job can only trigger a failover intentionally and restart, picking up the latest state from the file system
in the hope that the conflict is resolved.

{{< img src="/img/files-conflict.png">}}

Paimon ensures that there is no data loss or duplication here, but if two streaming jobs write to the same table at the
same time and keep conflicting, you will see them restarting constantly, which is not a good thing.

The essence of the conflict is the (logical) deletion of files, and file deletions come from compaction. So as long as
you disable compaction in the writing jobs (set `'write-only'` to `true`) and start a separate job to perform the
compaction, the conflicts disappear.
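
As a rough sketch (the table names are placeholders), the writing jobs would carry the `'write-only'` option so that they never compact and therefore never delete files:

```sql
-- Minimal sketch: mark an existing table as write-only so that writers skip compaction.
ALTER TABLE my_table SET ('write-only' = 'true');

-- The option can also be declared when the table is created:
CREATE TABLE my_write_only_table (
    a INT,
    b STRING
) WITH (
    'write-only' = 'true'
);
```

A separate, dedicated job then performs the compaction for the table.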

See [dedicated compaction job]({{< ref "maintenance/dedicated-compaction#dedicated-compaction-job" >}}) for more info.
4 changes: 2 additions & 2 deletions docs/content/engines/_index.md
@@ -1,9 +1,9 @@
---
title: Engines
title: Engine Others
icon: <i class="fa fa-gear title maindish" aria-hidden="true"></i>
bold: true
bookCollapseSection: true
weight: 4
weight: 90
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
59 changes: 4 additions & 55 deletions docs/content/engines/hive.md
@@ -84,60 +84,6 @@ NOTE:
* If you are using HDFS, make sure that the environment variable `HADOOP_HOME` or `HADOOP_CONF_DIR` is set.
* With Hive CBO enabled, some queries may return incorrect results, for example querying a `struct` type with a `not null` predicate. You can disable CBO with the `set hive.cbo.enable=false;` command.

## Flink SQL: with Paimon Hive Catalog

By using a Paimon Hive catalog, you can create, drop, select from and insert into Paimon tables from Flink. These operations directly affect the corresponding Hive metastore. Tables created in this way can also be accessed directly from Hive.

**Step 1: Prepare Flink Hive Connector Bundled Jar**

See [creating a catalog with Hive metastore]({{< ref "how-to/creating-catalogs#creating-a-catalog-with-hive-metastore" >}}).

**Step 2: Create Test Data with Flink SQL**

Execute the following Flink SQL script in Flink SQL client to define a Paimon Hive catalog and create a table.

```sql
-- Flink SQL CLI
-- Define paimon Hive catalog

CREATE CATALOG my_hive WITH (
'type' = 'paimon',
'metastore' = 'hive',
-- 'uri' = 'thrift://<hive-metastore-host-name>:<port>', default use 'hive.metastore.uris' in HiveConf
-- 'hive-conf-dir' = '...', this is recommended in the kerberos environment
-- 'hadoop-conf-dir' = '...', this is recommended in the kerberos environment
-- 'warehouse' = 'hdfs:///path/to/table/store/warehouse', default use 'hive.metastore.warehouse.dir' in HiveConf
);

-- Use paimon Hive catalog

USE CATALOG my_hive;

-- Create a table in paimon Hive catalog (use "default" database by default)

CREATE TABLE test_table (
a int,
b string
);

-- Insert records into test table

INSERT INTO test_table VALUES (1, 'Table'), (2, 'Store');

-- Read records from test table

SELECT * FROM test_table;

/*
+---+-------+
| a | b |
+---+-------+
| 1 | Table |
| 2 | Store |
+---+-------+
*/
```

## Hive SQL: access Paimon Tables already in Hive metastore

Run the following Hive SQL in Hive CLI to access the created table.
@@ -165,7 +111,10 @@ OK
*/

-- Insert records into test table
-- Note: the Tez engine does not support Hive writes; only the Hive engine is supported.
-- Limitations:
--   Only INSERT INTO is supported; INSERT OVERWRITE is not supported.
--   Writing to a non-primary-key table is recommended, since writing to a
--   primary key table may produce a large number of small files.

INSERT INTO test_table VALUES (3, 'Paimon');

65 changes: 48 additions & 17 deletions docs/content/engines/overview.md
@@ -26,25 +26,56 @@ under the License.

# Overview

Paimon not only supports Flink SQL writes and queries natively,
but can also be queried from other popular engines, such as
Apache Spark and Apache Hive.

## Compatibility Matrix

| Engine | Version | Batch Read | Batch Write | Create Table | Alter Table | Streaming Write | Streaming Read | Batch Overwrite |
|:-------------------------------------------------------------------------------:|:-------------:|:----------:|:-----------:|:------------:|:-----------:|:---------------:|:--------------:|:---------------:|
| Flink | 1.15 - 1.19 |||| ✅(1.17+) ||||
| Spark | 3.1 - 3.5 |||||| ✅(3.3+) ||
| Hive | 2.1 - 3.1 ||||||||
| Spark | 2.4 ||||||||
| Trino | 422 - 426 ||||||||
| Trino | 427 - 439 ||||||||
| Presto | 0.236 - 0.280 ||||||||
| [StarRocks](https://docs.starrocks.io/docs/data_source/catalog/paimon_catalog/) | 3.1+ ||||||||
| [Doris](https://doris.apache.org/docs/lakehouse/multi-catalog/paimon/) | 2.0+ ||||||||

Recommended versions are Flink 1.17.2, Spark 3.5.0, Hive 2.3.9
| Engine | Version | Batch Read | Batch Write | Create Table | Alter Table | Streaming Write | Streaming Read | Batch Overwrite | DELETE & UPDATE | MERGE INTO |
|:-------------------------------------------------------------------------------:|:-------------:|:-----------:|:------------:|:------------:|:------------:|:----------------:|:--------------:|:----------------:|:----------------:|:-----------:|
| Flink | 1.15 - 1.19 |||| ✅(1.17+) |||| ✅(1.17+) ||
| Spark | 3.1 - 3.5 || ✅(3.3+) ||| ✅(3.3+) | ✅(3.3+) | ✅(3.3+) | ✅(3.2+) | ✅(3.2+) |
| Hive | 2.1 - 3.1 ||||||||||
| Trino | 420 - 426 ||||||||||
| Trino | 427 - 439 ||||||||||
| Presto | 0.236 - 0.280 ||||||||||
| [StarRocks](https://docs.starrocks.io/docs/data_source/catalog/paimon_catalog/) | 3.1+ ||||||||||
| [Doris](https://doris.apache.org/docs/lakehouse/multi-catalog/paimon/) | 2.0+ ||||||||||

## Streaming Engines

### Flink Streaming

Flink is the most comprehensive streaming computing engine and is widely used for CDC data ingestion and for building
streaming pipelines.

Recommended version is Flink 1.17.2.

### Spark Streaming

You can also use Spark Streaming to build a streaming pipeline. Schema evolution is better supported with Spark, but
you must accept its micro-batch mechanism.

## Batch Engines

### Spark Batch

Spark Batch is the most widely used batch computing engine.

Recommended version is Spark 3.4.3.

### Flink Batch

Flink batch mode is also available, and it makes your pipeline more unified across streaming and batch.

## OLAP Engines

### StarRocks

StarRocks is the most recommended OLAP engine with the most advanced integration.

Recommended version is StarRocks 3.2.6.

### Other OLAP

You can also use Doris, Trino, or Presto, or simply use Spark, Flink, and Hive to query Paimon tables.

## Download

2 changes: 1 addition & 1 deletion docs/content/engines/presto.md
@@ -1,6 +1,6 @@
---
title: "Presto"
weight: 5
weight: 6
type: docs
aliases:
- /engines/presto.html