Skip to content

Commit

Permalink
[docs](partition) Improve auto partitions document (apache#27662)
Browse files Browse the repository at this point in the history
  • Loading branch information
zclllyybb authored Nov 29, 2023
1 parent d9d5468 commit c30299c
Show file tree
Hide file tree
Showing 7 changed files with 149 additions and 121 deletions.
22 changes: 11 additions & 11 deletions docs/en/docs/admin-manual/config/be-config.md
Original file line number Diff line number Diff line change
Expand Up @@ -824,6 +824,16 @@ BaseCompaction:546859:
* Default value: 100
* Dynamically modifiable: Yes
#### `olap_table_sink_send_interval_microseconds`
* Description: While loading data, there's a polling thread keep sending data to corresponding BE from Coordinator's sink node. This thread will check whether there's data to send every `olap_table_sink_send_interval_microseconds` microseconds.
* Default value: 1000
#### `olap_table_sink_send_interval_auto_partition_factor`
* Description: If we load data to a table which enabled auto partition. the interval of `olap_table_sink_send_interval_microseconds` is too slow. In that case the real interval will multiply this factor.
* Default value: 0.001
### Thread
#### `delete_worker_count`
Expand Down Expand Up @@ -1290,7 +1300,7 @@ BaseCompaction:546859:
* Description: The number of threads making schema changes
* Default value: 3
### `alter_index_worker_count`
#### `alter_index_worker_count`
* Description: The number of threads making index change
* Default value: 3
Expand Down Expand Up @@ -1495,13 +1505,3 @@ Indicates how many tablets failed to load in the data directory. At the same tim
* Description: BE Whether to enable the use of java-jni. When enabled, mutual calls between c++ and java are allowed. Currently supports hudi, java-udf, jdbc, max-compute, paimon, preload, avro
* Default value: true
#### `olap_table_sink_send_interval_microseconds`
* Description: While loading data, there's a polling thread keep sending data to corresponding BE from Coordinator's sink node. This thread will check whether there's data to send every `olap_table_sink_send_interval_microseconds` microseconds.
* Default value: 1000
#### `olap_table_sink_send_interval_auto_partition_factor`
* Description: If we load data to a table which enabled auto partition. the interval of `olap_table_sink_send_interval_microseconds` is too slow. In that case the real interval will multiply this factor.
* Default value: 0.001
106 changes: 57 additions & 49 deletions docs/en/docs/advanced/partition/auto-partition.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,9 +32,58 @@ under the License.

The Auto Partitioning feature supports automatic detection of whether the corresponding partition exists during the data import process. If it does not exist, the partition will be created automatically and imported normally.

## Usage Scenarios

The auto partition function mainly solves the problem that the user expects to partition the table based on a certain column, but the data distribution of the column is scattered or unpredictable, so it is difficult to accurately create the required partitions when building or adjusting the structure of the table, or the number of partitions is so large that it is too cumbersome to create them manually.

Take the time type partition column as an example, in the [Dynamic Partition](./dynamic-partition) function, we support the automatic creation of new partitions to accommodate real-time data at specific time periods. For real-time user behaviour logs and other scenarios, this feature basically meets the requirements. However, in more complex scenarios, such as dealing with non-real-time data, the partition column is independent of the current system time and contains a large number of discrete values. At this time to improve efficiency we want to partition the data based on this column, but the data may actually involve the partition can not be grasped in advance, or the expected number of required partitions is too large. In this case, dynamic partitioning or manually created partitions can not meet our needs, automatic partitioning function is very good to cover such needs.

Suppose our table DDL is as follows:

```sql
CREATE TABLE `DAILY_TRADE_VALUE`
(
`TRADE_DATE` datev2 NULL COMMENT '交易日期',
`TRADE_ID` varchar(40) NULL COMMENT '交易编号',
......
)
UNIQUE KEY(`TRADE_DATE`, `TRADE_ID`)
PARTITION BY RANGE(`TRADE_DATE`)
(
PARTITION p_2000 VALUES [('2000-01-01'), ('2001-01-01')),
PARTITION p_2001 VALUES [('2001-01-01'), ('2002-01-01')),
PARTITION p_2002 VALUES [('2002-01-01'), ('2003-01-01')),
PARTITION p_2003 VALUES [('2003-01-01'), ('2004-01-01')),
PARTITION p_2004 VALUES [('2004-01-01'), ('2005-01-01')),
PARTITION p_2005 VALUES [('2005-01-01'), ('2006-01-01')),
PARTITION p_2006 VALUES [('2006-01-01'), ('2007-01-01')),
PARTITION p_2007 VALUES [('2007-01-01'), ('2008-01-01')),
PARTITION p_2008 VALUES [('2008-01-01'), ('2009-01-01')),
PARTITION p_2009 VALUES [('2009-01-01'), ('2010-01-01')),
PARTITION p_2010 VALUES [('2010-01-01'), ('2011-01-01')),
PARTITION p_2011 VALUES [('2011-01-01'), ('2012-01-01')),
PARTITION p_2012 VALUES [('2012-01-01'), ('2013-01-01')),
PARTITION p_2013 VALUES [('2013-01-01'), ('2014-01-01')),
PARTITION p_2014 VALUES [('2014-01-01'), ('2015-01-01')),
PARTITION p_2015 VALUES [('2015-01-01'), ('2016-01-01')),
PARTITION p_2016 VALUES [('2016-01-01'), ('2017-01-01')),
PARTITION p_2017 VALUES [('2017-01-01'), ('2018-01-01')),
PARTITION p_2018 VALUES [('2018-01-01'), ('2019-01-01')),
PARTITION p_2019 VALUES [('2019-01-01'), ('2020-01-01')),
PARTITION p_2020 VALUES [('2020-01-01'), ('2021-01-01')),
PARTITION p_2021 VALUES [('2021-01-01'), ('2022-01-01'))
)
DISTRIBUTED BY HASH(`TRADE_DATE`) BUCKETS 10
PROPERTIES (
"replication_num" = "1"
);
```

The table stores a large amount of business history data, partitioned based on the date the transaction occurred. As you can see when building the table, we need to manually create the partitions in advance. If the data range of the partitioned columns changes, for example, 2022 is added to the above table, we need to create a partition by [ALTER-TABLE-PARTITION](../../sql-manual/sql-reference/Data-Definition-Statements/Alter/ALTER-TABLE-PARTITION) to make changes to the table partition. If such partitions need to be changed, or subdivided at a finer level of granularity, it is very tedious to modify them. At this point we can rewrite the table DDL using AUTO PARTITION.

## Grammer

When building a table, use the following syntax to populate [CREATE-TABLE](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE.md) with the `partition_info` section:
When building a table, use the following syntax to populate [CREATE-TABLE](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE) with the `partition_info` section:

1. AUTO RANGE PARTITION:

Expand Down Expand Up @@ -94,55 +143,12 @@ When building a table, use the following syntax to populate [CREATE-TABLE](../..

1. Currently the AUTO RANGE PARTITION function supports only one partition column;
2. In AUTO RANGE PARTITION, the partition function supports only `date_trunc` and the partition column supports only `DATEV2` or `DATETIMEV2` format;
3. In AUTO LIST PARTITION, function calls are not supported. Partitioned columns support BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, LARGEINT, DATE, DATETIME, CHAR, VARCHAR datatypes, and partitioned values are enum values.
3. In AUTO LIST PARTITION, function calls are not supported. Partitioned columns support `BOOLEAN`, `TINYINT`, `SMALLINT`, `INT`, `BIGINT`, `LARGEINT`, `DATE`, `DATETIME`, `CHAR`, `VARCHAR` datatypes, and partitioned values are enum values.
4. In AUTO LIST PARTITION, a separate new PARTITION is created for each fetch of a partition column for which the corresponding partition does not currently exist.

## Sample Scenarios

In the [Dynamic Partitioning](./dynamic-partition.md) feature, we support the automatic creation of new partitions to accommodate real-time data at specific time periods. However, in more complex scenarios, such as processing non-real-time data, the partition columns are independent of the current system time. In this case, if you need to partition the data, you need to manually organise the partitions you belong to and create them before importing the data. This is cumbersome when the number of partition columns is large. The automatic partition function solves this problem.

For example, we have a table as follows:

```sql
CREATE TABLE `DAILY_TRADE_VALUE`
(
`TRADE_DATE` datev2 NULL
`TRADE_ID` varchar(40) NULL,
......
)
UNIQUE KEY(`TRADE_DATE`, `TRADE_ID`)
PARTITION BY RANGE(`TRADE_DATE`)
(
PARTITION p_2000 VALUES [('2000-01-01'), ('2001-01-01')),
PARTITION p_2001 VALUES [('2001-01-01'), ('2002-01-01')),
PARTITION p_2002 VALUES [('2002-01-01'), ('2003-01-01')),
PARTITION p_2003 VALUES [('2003-01-01'), ('2004-01-01')),
PARTITION p_2004 VALUES [('2004-01-01'), ('2005-01-01')),
PARTITION p_2005 VALUES [('2005-01-01'), ('2006-01-01')),
PARTITION p_2006 VALUES [('2006-01-01'), ('2007-01-01')),
PARTITION p_2007 VALUES [('2007-01-01'), ('2008-01-01')),
PARTITION p_2008 VALUES [('2008-01-01'), ('2009-01-01')),
PARTITION p_2009 VALUES [('2009-01-01'), ('2010-01-01')),
PARTITION p_2010 VALUES [('2010-01-01'), ('2011-01-01')),
PARTITION p_2011 VALUES [('2011-01-01'), ('2012-01-01')),
PARTITION p_2012 VALUES [('2012-01-01'), ('2013-01-01')),
PARTITION p_2013 VALUES [('2013-01-01'), ('2014-01-01')),
PARTITION p_2014 VALUES [('2014-01-01'), ('2015-01-01')),
PARTITION p_2015 VALUES [('2015-01-01'), ('2016-01-01')),
PARTITION p_2016 VALUES [('2016-01-01'), ('2017-01-01')),
PARTITION p_2017 VALUES [('2017-01-01'), ('2018-01-01')),
PARTITION p_2018 VALUES [('2018-01-01'), ('2019-01-01')),
PARTITION p_2019 VALUES [('2019-01-01'), ('2020-01-01')),
PARTITION p_2020 VALUES [('2020-01-01'), ('2021-01-01')),
PARTITION p_2021 VALUES [('2021-01-01'), ('2022-01-01'))
)
DISTRIBUTED BY HASH(`TRADE_DATE`) BUCKETS 10
PROPERTIES (
"replication_num" = "1"
);
```

The table stores a large amount of business history data, partitioned based on the date the transaction occurred. As you can see when building the table, we need to manually create the partitions in advance. If the data range of the partitioned columns changes, for example, 2022 is added to the above table, we need to create a partition by [ALTER-TABLE-PARTITION](../../sql-manual/sql-reference/Data-Definition-Statements/Alter/ALTER-TABLE-PARTITION.md) to make changes to the table partition. After using AUTO PARTITION, the table DDL can be changed to:
In the example in the Usage Scenarios section, the table DDL can be rewritten after using AUTO PARTITION:

```sql
CREATE TABLE `DAILY_TRADE_VALUE`
Expand Down Expand Up @@ -171,7 +177,6 @@ After inserting the data and then viewing it again, we could found that the tabl
```sql
mysql> insert into `DAILY_TRADE_VALUE` values ('2012-12-13', 1), ('2008-02-03', 2), ('2014-11-11', 3);
Query OK, 3 rows affected (0.88 sec)
{'label':'insert_754e2a3926a345ea_854793fb2638f0ec', 'status':'VISIBLE', 'txnId':'20014'}

mysql> show partitions from `DAILY_TRADE_VALUE`;
+-------------+-----------------+----------------+---------------------+--------+--------------+--------------------------------------------------------------------------------+-----------------+---------+----------------+---------------+---------------------+---------------------+--------------------------+----------+------------+-------------------------+-----------+
Expand All @@ -184,8 +189,11 @@ mysql> show partitions from `DAILY_TRADE_VALUE`;
3 rows in set (0.12 sec)
```

A partition created by the AUTO PARTITION function has the exact same functional properties as a manually created partition.

## caveat

- If a partition is created during the insertion or importation of data and the process eventually fails, the created partition is not automatically deleted.
- If a partition is created during the insertion or import of data and the entire import process does not complete (fails or is cancelled), the created partition is not automatically deleted.
- Tables that use AUTO PARTITION only have their partitions created automatically instead of manually. The original use of the table and the partitions it creates is the same as for non-AUTO PARTITION tables or partitions.
- When importing data to a table with AUTO PARTITION enabled, the polling interval for data sent by the Coordinator is different from that of a normal table. For details, see `olap_table_sink_send_interval_auto_partition_factor` in [BE Configuration](../../admin-manual/config/be-config.md).
- To prevent accidental creation of too many partitions, we use the [FE Configuration](../../admin-manual/config/fe-config) `max_auto_partition_num` controls the maximum number of partitions an AUTO PARTITION table can hold. This value can be adjusted if necessary
- When importing data to a table with AUTO PARTITION enabled, the polling interval for data sent by the Coordinator is different from that of a normal table. For details, see `olap_table_sink_send_interval_auto_partition_factor` in [BE Configuration](../../admin-manual/config/be-config).
5 changes: 5 additions & 0 deletions docs/en/docs/faq/data-faq.md
Original file line number Diff line number Diff line change
Expand Up @@ -169,9 +169,14 @@ FROM kafka
"property.group.id" = "xxx"
);
```

### Q12. ERROR 1105 (HY000): errCode = 2, detailMessage = (192.168.90.91)[CANCELLED][INTERNAL_ERROR]error setting certificate verify locations: CAfile: /etc/ssl/certs/ca-certificates.crt CApath: none

```
yum install -y ca-certificates
ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-certificates.crt
```

### Q13. create partition failed. partition numbers will exceed limit variable max_auto_partition_num

To prevent accidental creation of too many partitions when importing data for auto-partitioned tables, we use the FE configuration item `max_auto_partition_num` to control the maximum number of partitions to be created automatically for such tables. If you really need to create more partitions, please modify this config item of FE Master node.
20 changes: 10 additions & 10 deletions docs/zh-CN/docs/admin-manual/config/be-config.md
Original file line number Diff line number Diff line change
Expand Up @@ -849,6 +849,16 @@ BaseCompaction:546859:
* 默认值: 100
* 可动态修改:是

#### `olap_table_sink_send_interval_microseconds`.

* 描述: 数据导入时,Coordinator 的 sink 节点有一个轮询线程持续向对应BE发送数据。该线程将每隔 `olap_table_sink_send_interval_microseconds` 微秒检查是否有数据要发送。
* 默认值:1000

#### `olap_table_sink_send_interval_auto_partition_factor`.

* 描述: 如果我们向一个启用了自动分区的表导入数据,那么 `olap_table_sink_send_interval_microseconds` 的时间间隔就会太慢。在这种情况下,实际间隔将乘以该系数。
* 默认值:0.001

### 线程

#### `delete_worker_count`
Expand Down Expand Up @@ -1524,13 +1534,3 @@ load tablets from header failed, failed tablets size: xxx, path=xxx

* 描述: BE 是否开启使用java-jni,开启后允许 c++ 与 java 之间的相互调用。目前已经支持hudi、java-udf、jdbc、max-compute、paimon、preload、avro
* 默认值: true

#### `olap_table_sink_send_interval_microseconds`.

* 描述: 数据导入时,Coordinator 的 sink 节点有一个轮询线程持续向对应BE发送数据。该线程将每隔 `olap_table_sink_send_interval_microseconds` 微秒检查是否有数据要发送。
* 默认值:1000

#### `olap_table_sink_send_interval_auto_partition_factor`.

* 描述: 如果我们向一个启用了自动分区的表导入数据,那么 `olap_table_sink_send_interval_microseconds` 的时间间隔就会太慢。在这种情况下,实际间隔将乘以该系数。
* 默认值:0.001
Loading

0 comments on commit c30299c

Please sign in to comment.