
Commit

Resolve code conflicts.
zhuangchong committed Apr 26, 2024
2 parents 355c78e + 4e63f55 commit c6abd4c
Showing 394 changed files with 12,368 additions and 6,233 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/docs-tests.yml
@@ -35,7 +35,7 @@ jobs:
- name: Setup Hugo
uses: peaceiris/actions-hugo@v2
with:
hugo-version: 'latest'
hugo-version: '0.124.1'
extended: true

- name: Build
2 changes: 1 addition & 1 deletion docs/content/_index.md
@@ -46,7 +46,7 @@ Paimon offers the following core capabilities:
## Try Paimon

If you’re interested in playing around with Paimon, check out our
quick start guide with [Flink]({{< ref "engines/flink" >}}), [Spark]({{< ref "engines/spark" >}}) or [Hive]({{< ref "engines/hive" >}}). It provides a step by
quick start guide with [Flink]({{< ref "flink/quick-start" >}}) or [Spark]({{< ref "spark/quick-start" >}}). It provides a step by
step introduction to the APIs and guides you through real applications.

<--->
66 changes: 66 additions & 0 deletions docs/content/concepts/concurrency-control.md
@@ -0,0 +1,66 @@
---
title: "Concurrency Control"
weight: 3
type: docs
aliases:
- /concepts/concurrency-control.html
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Concurrency Control

Paimon supports optimistic concurrency for multiple concurrent write jobs.

Each job writes data at its own pace and generates a new snapshot based on the current snapshot by applying incremental
files (deleting or adding files) at the time of committing.

There may be two types of commit failures here:
1. Snapshot conflict: the snapshot ID has been preempted; another job has already generated a new snapshot for the table. In this case the job simply commits again.
2. Files conflict: a file that this job wants to delete has already been deleted by another job. At this point the commit can only fail. (A streaming job will fail and restart, intentionally triggering a single failover.)

## Snapshot conflict

Paimon's snapshot ID is unique, so as long as the job writes its snapshot file to the file system, it is considered successful.

{{< img src="/img/snapshot-conflict.png">}}

Paimon uses the file system's renaming mechanism to commit snapshots, which is secure for HDFS as it ensures
transactional and atomic renaming.

But for object stores such as OSS and S3, `'RENAME'` does not have atomic semantics. In that case you need to configure a Hive or
JDBC metastore and enable the `'lock.enabled'` option for the catalog. Otherwise, a snapshot may be lost.
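
As a rough illustration, a Flink SQL catalog definition with the lock enabled could look like the sketch below; the catalog name, metastore URI and warehouse path are placeholders:

```sql
-- Minimal sketch: a Paimon catalog backed by a Hive metastore with catalog locks enabled,
-- so that snapshot commits on object storage do not depend on atomic RENAME.
CREATE CATALOG my_locked_catalog WITH (
    'type' = 'paimon',
    'metastore' = 'hive',
    'uri' = 'thrift://<hive-metastore-host-name>:<port>',
    'warehouse' = 's3://<bucket>/warehouse',
    'lock.enabled' = 'true'
);

USE CATALOG my_locked_catalog;
```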

## Files conflict

When Paimon commits a file deletion (which is only a logical deletion), it checks for conflicts with the latest snapshot.
If there is a conflict (meaning the file has already been logically deleted), the commit cannot proceed on this node,
so the job can only trigger a failover intentionally and restart, picking up the latest state from the file system
in the hope that the conflict is resolved.

{{< img src="/img/files-conflict.png">}}

Paimon ensures that there is no data loss or duplication here, but if two streaming jobs write to the same table at the
same time and keep conflicting, you will see them restarting constantly, which is not a good thing.

The essence of the conflict is the (logical) deletion of files, and file deletions come from compaction. So as long as
you disable compaction in the writing jobs (set `'write-only'` to `true`) and start a separate job to perform the
compaction, the conflicts disappear.
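
As a rough sketch (the table names are placeholders), the writing jobs would carry the `'write-only'` option so that they never compact and therefore never delete files:

```sql
-- Minimal sketch: mark an existing table as write-only so that writers skip compaction.
ALTER TABLE my_table SET ('write-only' = 'true');

-- The option can also be declared when the table is created:
CREATE TABLE my_write_only_table (
    a INT,
    b STRING
) WITH (
    'write-only' = 'true'
);
```

A separate, dedicated job then performs the compaction for the table.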

See [dedicated compaction job]({{< ref "maintenance/dedicated-compaction#dedicated-compaction-job" >}}) for more info.
4 changes: 2 additions & 2 deletions docs/content/engines/_index.md
@@ -1,9 +1,9 @@
---
title: Engines
title: Engine Others
icon: <i class="fa fa-gear title maindish" aria-hidden="true"></i>
bold: true
bookCollapseSection: true
weight: 4
weight: 90
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
59 changes: 4 additions & 55 deletions docs/content/engines/hive.md
@@ -84,60 +84,6 @@ NOTE:
* If you are using HDFS, make sure that the environment variable `HADOOP_HOME` or `HADOOP_CONF_DIR` is set.
* With Hive CBO enabled, some queries may return incorrect results, for example querying a `struct` type with a `not null` predicate. You can disable CBO with the `set hive.cbo.enable=false;` command.

## Flink SQL: with Paimon Hive Catalog

By using a Paimon Hive catalog, you can create, drop, select from and insert into Paimon tables from Flink. These operations directly affect the corresponding Hive metastore. Tables created in this way can also be accessed directly from Hive.

**Step 1: Prepare Flink Hive Connector Bundled Jar**

See [creating a catalog with Hive metastore]({{< ref "how-to/creating-catalogs#creating-a-catalog-with-hive-metastore" >}}).

**Step 2: Create Test Data with Flink SQL**

Execute the following Flink SQL script in Flink SQL client to define a Paimon Hive catalog and create a table.

```sql
-- Flink SQL CLI
-- Define paimon Hive catalog

CREATE CATALOG my_hive WITH (
'type' = 'paimon',
'metastore' = 'hive',
-- 'uri' = 'thrift://<hive-metastore-host-name>:<port>', default use 'hive.metastore.uris' in HiveConf
-- 'hive-conf-dir' = '...', this is recommended in the kerberos environment
-- 'hadoop-conf-dir' = '...', this is recommended in the kerberos environment
-- 'warehouse' = 'hdfs:///path/to/table/store/warehouse', default use 'hive.metastore.warehouse.dir' in HiveConf
);

-- Use paimon Hive catalog

USE CATALOG my_hive;

-- Create a table in paimon Hive catalog (use "default" database by default)

CREATE TABLE test_table (
a int,
b string
);

-- Insert records into test table

INSERT INTO test_table VALUES (1, 'Table'), (2, 'Store');

-- Read records from test table

SELECT * FROM test_table;

/*
+---+-------+
| a | b |
+---+-------+
| 1 | Table |
| 2 | Store |
+---+-------+
*/
```

## Hive SQL: access Paimon Tables already in Hive metastore

Run the following Hive SQL in Hive CLI to access the created table.
@@ -165,7 +111,10 @@ OK
*/

-- Insert records into test table
-- Note: the Tez engine does not support Hive writes; only the Hive engine is supported.
-- Limitations:
--   Only INSERT INTO is supported; INSERT OVERWRITE is not supported.
--   Writing to a non-primary-key table is recommended, since writing to a
--   primary key table may produce a large number of small files.

INSERT INTO test_table VALUES (3, 'Paimon');

65 changes: 48 additions & 17 deletions docs/content/engines/overview.md
@@ -26,25 +26,56 @@ under the License.

# Overview

Paimon not only supports Flink SQL writes and queries natively,
but can also be queried from other popular engines, such as
Apache Spark and Apache Hive.

## Compatibility Matrix

| Engine | Version | Batch Read | Batch Write | Create Table | Alter Table | Streaming Write | Streaming Read | Batch Overwrite |
|:-------------------------------------------------------------------------------:|:-------------:|:----------:|:-----------:|:------------:|:-----------:|:---------------:|:--------------:|:---------------:|
| Flink | 1.15 - 1.19 |||| ✅(1.17+) ||||
| Spark | 3.1 - 3.5 |||||| ✅(3.3+) ||
| Hive | 2.1 - 3.1 ||||||||
| Spark | 2.4 ||||||||
| Trino | 422 - 426 ||||||||
| Trino | 427 - 439 ||||||||
| Presto | 0.236 - 0.280 ||||||||
| [StarRocks](https://docs.starrocks.io/docs/data_source/catalog/paimon_catalog/) | 3.1+ ||||||||
| [Doris](https://doris.apache.org/docs/lakehouse/multi-catalog/paimon/) | 2.0+ ||||||||

Recommended versions are Flink 1.17.2, Spark 3.5.0, Hive 2.3.9
| Engine | Version | Batch Read | Batch Write | Create Table | Alter Table | Streaming Write | Streaming Read | Batch Overwrite | DELETE & UPDATE | MERGE INTO |
|:-------------------------------------------------------------------------------:|:-------------:|:-----------:|:------------:|:------------:|:------------:|:----------------:|:--------------:|:----------------:|:----------------:|:-----------:|
| Flink | 1.15 - 1.19 |||| ✅(1.17+) |||| ✅(1.17+) ||
| Spark | 3.1 - 3.5 || ✅(3.3+) ||| ✅(3.3+) | ✅(3.3+) | ✅(3.3+) | ✅(3.2+) | ✅(3.2+) |
| Hive | 2.1 - 3.1 ||||||||||
| Trino | 420 - 426 ||||||||||
| Trino | 427 - 439 ||||||||||
| Presto | 0.236 - 0.280 ||||||||||
| [StarRocks](https://docs.starrocks.io/docs/data_source/catalog/paimon_catalog/) | 3.1+ ||||||||||
| [Doris](https://doris.apache.org/docs/lakehouse/multi-catalog/paimon/) | 2.0+ ||||||||||

## Streaming Engines

### Flink Streaming

Flink is the most comprehensive streaming computing engine and is widely used for CDC data ingestion and for building
streaming pipelines.

Recommended version is Flink 1.17.2.

### Spark Streaming

You can also use Spark Streaming to build a streaming pipeline. Schema evolution is better supported with Spark, but
you must accept its micro-batch mechanism.

## Batch Engines

### Spark Batch

Spark Batch is the most widely used batch computing engine.

Recommended version is Spark 3.4.3.

### Flink Batch

Flink batch mode is also available, and it makes your pipeline more unified across streaming and batch.

## OLAP Engines

### StarRocks

StarRocks is the most recommended OLAP engine with the most advanced integration.

Recommended version is StarRocks 3.2.6.

### Other OLAP

You can also use Doris, Trino, or Presto, or simply use Spark, Flink, and Hive to query Paimon tables.

## Download

2 changes: 1 addition & 1 deletion docs/content/engines/presto.md
@@ -1,6 +1,6 @@
---
title: "Presto"
weight: 5
weight: 6
type: docs
aliases:
- /engines/presto.html