From 8e57f8404d86e9bf71e077d633e767488feee7ae Mon Sep 17 00:00:00 2001 From: Hu Yanjun <100749531+httpshirley@users.noreply.github.com> Date: Mon, 1 Apr 2024 21:56:40 +0800 Subject: [PATCH] [docs](doc) update en doc (#503) --- .../data-operate/import/insert-into-manual.md | 487 ++++++++++++------ .../data-operate/import/mysql-load-manual.md | 287 ++++++++--- .../table-design/auto-increment.md | 421 ++++----------- 3 files changed, 624 insertions(+), 571 deletions(-) diff --git a/versioned_docs/version-2.0/data-operate/import/insert-into-manual.md b/versioned_docs/version-2.0/data-operate/import/insert-into-manual.md index f05200ad5b06e..453ed86214f9a 100644 --- a/versioned_docs/version-2.0/data-operate/import/insert-into-manual.md +++ b/versioned_docs/version-2.0/data-operate/import/insert-into-manual.md @@ -24,245 +24,414 @@ specific language governing permissions and limitations under the License. --> -# Insert Into +The INSERT INTO statement supports importing the results of a Doris query into another table. INSERT INTO is a synchronous import method, where the import result is returned after the import is executed. Whether the import is successful can be determined based on the returned result. INSERT INTO ensures the atomicity of the import task, meaning that either all the data is imported successfully or none of it is imported. -The use of Insert Into statements is similar to that of Insert Into statements in databases such as MySQL. But in Doris, all data writing is a separate import job. So Insert Into is also introduced here as an import method. +There are primarily two main forms of the INSERT INTO command: -The main Insert Into command contains the following two kinds; +- INSERT INTO tbl SELECT... +- INSERT INTO tbl (col1, col2, ...) VALUES (1, 2, ...), (1,3, ...) -- INSERT INTO tbl SELECT ... -- INSERT INTO tbl (col1, col2, ...) VALUES (1, 2, ...), (1,3, ...); +## Applicable scenarios -The second command is for Demo only, not in a test or production environment. +1. If a user wants to import only a few test data records to verify the functionality of the Doris system, the INSERT INTO VALUES syntax is applicable. It is similar to the MySQL syntax. However, it is not recommended to use INSERT INTO VALUES in a production environment. +2. If a user wants to perform ETL on existing data in a Doris table and then import it into a new Doris table, the INSERT INTO SELECT syntax is applicable. +3. In conjunction with the Multi-Catalog external table mechanism, tables from MySQL or Hive systems can be mapped via Multi-Catalog. Then, data from external tables can be imported into Doris tables using the INSERT INTO SELECT syntax. +4. Utilizing the Table Value Functions (TVFs), users can directly query data stored in object storage or files on HDFS as tables, with automatic column type inference. Then, data from external tables can be imported into Doris tables using the INSERT INTO SELECT syntax. -## Import operations and load results +## Implementation -The Insert Into command needs to be submitted through MySQL protocol. Creating an import request returns the import result synchronously. +When using INSERT INTO, the import job needs to be initiated and submitted to the FE node using the MySQL protocol. The FE generates an execution plan, which includes query-related operators, with the last operator being the OlapTableSink. The OlapTableSink operator is responsible for writing the query result to the target table. The execution plan is then sent to the BE nodes for execution. Doris designates one BE node as the Coordinator, which receives and distributes the data to other BE nodes. -The following are examples of the use of two Insert Intos: +## Get started -```sql -INSERT INTO tbl2 WITH LABEL label1 SELECT * FROM tbl3; -INSERT INTO tbl1 VALUES ("qweasdzxcqweasdzxc"), ("a"); +An INSERT INTO job is submitted and transmitted using the MySQL protocol. The following example demonstrates submitting an import job using INSERT INTO through the MySQL command-line interface. + +Detailed syntax can be found in the INSERT INTO documentation. + +### Preparation + +INSERT INTO requires INSERT permissions on the target table. You can grant permissions to user accounts using the GRANT command. + +### Create an INSERT INTO job + +**INSERT INTO VALUES** + +1. Create a source table + +```SQL +CREATE TABLE testdb.test_table( + user_id BIGINT NOT NULL COMMENT "User ID", + name VARCHAR(20) COMMENT "User name", + age INT COMMENT "User age" +) +DUPLICATE KEY(user_id) +DISTRIBUTED BY HASH(user_id) BUCKETS 10; +``` + +2. Import data into the source table using `INSERT INTO VALUES` (not recommended for production environments). + +```SQL +INSERT INTO testdb.test_table (user_id, name, age) +VALUES (1, "Emily", 25), + (2, "Benjamin", 35), + (3, "Olivia", 28), + (4, "Alexander", 60), + (5, "Ava", 17); +``` + +INSERT INTO is a synchronous import method, where the import result is directly returned to the user. + +```JSON +Query OK, 5 rows affected (0.308 sec) +{'label':'label_3e52da787aab4222_9126d2fce8f6d1e5', 'status':'VISIBLE', 'txnId':'9081'} +``` + +3. View imported data. + +```SQL +MySQL> SELECT COUNT(*) FROM testdb.test_table; ++----------+ +| count(*) | ++----------+ +| 5 | ++----------+ +1 row in set (0.179 sec) ``` -> Note: When you need to use `CTE(Common Table Expressions)` as the query part in an insert operation, you must specify the `WITH LABEL` and column list parts or wrap `CTE`. Example: -> -> ```sql -> INSERT INTO tbl1 WITH LABEL label1 -> WITH cte1 AS (SELECT * FROM tbl1), cte2 AS (SELECT * FROM tbl2) -> SELECT k1 FROM cte1 JOIN cte2 WHERE cte1.k1 = 1; -> -> -> INSERT INTO tbl1 (k1) -> WITH cte1 AS (SELECT * FROM tbl1), cte2 AS (SELECT * FROM tbl2) -> SELECT k1 FROM cte1 JOIN cte2 WHERE cte1.k1 = 1; -> -> INSERT INTO tbl1 (k1) -> select * from ( -> WITH cte1 AS (SELECT * FROM tbl1), cte2 AS (SELECT * FROM tbl2) -> SELECT k1 FROM cte1 JOIN cte2 WHERE cte1.k1 = 1) as ret -> ``` +**INSERT INTO SELECT** -For specific parameter description, you can refer to [INSERT INTO](../../../sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/INSERT.md) command or execute `HELP INSERT ` to see its help documentation for better use of this import method. +1. Building upon the above operations, create a new table as the target table (with the same schema as the source table). +```SQL +CREATE TABLE testdb.test_table2 LIKE testdb.test_table; +``` -Insert Into itself is a SQL command, and the return result is divided into the following types according to the different execution results: +2. Ingest data into the new table using `INSERT INTO SELECT`. -1. Result set is empty +```SQL +INSERT INTO testdb.test_table2 +SELECT * FROM testdb.test_table WHERE age < 30; +Query OK, 3 rows affected (0.544 sec) +{'label':'label_9c2bae970023407d_b2c5b78b368e78a7', 'status':'VISIBLE', 'txnId':'9084'} +``` + +3. View imported data. + +```SQL +MySQL> SELECT COUNT(*) FROM testdb.test_table2; ++----------+ +| count(*) | ++----------+ +| 3 | ++----------+ +1 row in set (0.071 sec) +``` - If the result set of the insert corresponding SELECT statement is empty, it is returned as follows: +### View INSERT INTO jobs - ```text - mysql> insert into tbl1 select * from empty_tbl; - Query OK, 0 rows affected (0.02 sec) - ``` +You can use the `SHOW LOAD` command to view the completed INSERT INTO tasks. - `Query OK` indicates successful execution. `0 rows affected` means that no data was loaded. +```SQL +MySQL> SHOW LOAD FROM testdb; ++--------+-----------------------------------------+----------+--------------------+--------+---------+----------------------------------------------------------------------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+------+-----------------------------------------------------------------------------------------------------------------------+---------------+--------------+------+---------+ +| JobId | Label | State | Progress | Type | EtlInfo | TaskInfo | ErrorMsg | CreateTime | EtlStartTime | EtlFinishTime | LoadStartTime | LoadFinishTime | URL | JobDetails | TransactionId | ErrorTablets | User | Comment | ++--------+-----------------------------------------+----------+--------------------+--------+---------+----------------------------------------------------------------------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+------+-----------------------------------------------------------------------------------------------------------------------+---------------+--------------+------+---------+ +| 376416 | label_3e52da787aab4222_9126d2fce8f6d1e5 | FINISHED | Unknown id: 376416 | INSERT | NULL | cluster:N/A; timeout(s):26200; max_filter_ratio:0.0; priority:NORMAL | NULL | 2024-02-27 01:22:17 | 2024-02-27 01:22:17 | 2024-02-27 01:22:17 | 2024-02-27 01:22:17 | 2024-02-27 01:22:18 | | {"Unfinished backends":{},"ScannedRows":0,"TaskNumber":0,"LoadBytes":0,"All backends":{},"FileNumber":0,"FileSize":0} | 9081 | {} | root | | +| 376664 | label_9c2bae970023407d_b2c5b78b368e78a7 | FINISHED | Unknown id: 376664 | INSERT | NULL | cluster:N/A; timeout(s):26200; max_filter_ratio:0.0; priority:NORMAL | NULL | 2024-02-27 01:39:37 | 2024-02-27 01:39:37 | 2024-02-27 01:39:37 | 2024-02-27 01:39:37 | 2024-02-27 01:39:38 | | {"Unfinished backends":{},"ScannedRows":0,"TaskNumber":0,"LoadBytes":0,"All backends":{},"FileNumber":0,"FileSize":0} | 9084 | {} | root | | ++--------+-----------------------------------------+----------+--------------------+--------+---------+----------------------------------------------------------------------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+------+-----------------------------------------------------------------------------------------------------------------------+---------------+--------------+------+---------+ +``` -2. The result set is not empty +### Cancel INSERT INTO jobs - In the case where the result set is not empty. The returned results are divided into the following situations: +You can cancel the currently executing INSERT INTO job via Ctrl-C. - 1. Insert is successful and data is visible: +## Manual - ```text - mysql> insert into tbl1 select * from tbl2; - Query OK, 4 rows affected (0.38 sec) - {'label': 'insert_8510c568-9eda-4173-9e36-6adc7d35291c', 'status': 'visible', 'txnId': '4005'} - - mysql> insert into tbl1 with label my_label1 select * from tbl2; - Query OK, 4 rows affected (0.38 sec) - {'label': 'my_label1', 'status': 'visible', 'txnId': '4005'} - - mysql> insert into tbl1 select * from tbl2; - Query OK, 2 rows affected, 2 warnings (0.31 sec) - {'label': 'insert_f0747f0e-7a35-46e2-affa-13a235f4020d', 'status': 'visible', 'txnId': '4005'} - - mysql> insert into tbl1 select * from tbl2; - Query OK, 2 rows affected, 2 warnings (0.31 sec) - {'label': 'insert_f0747f0e-7a35-46e2-affa-13a235f4020d', 'status': 'committed', 'txnId': '4005'} - ``` +### Syntax - `Query OK` indicates successful execution. `4 rows affected` means that a total of 4 rows of data were imported. `2 warnings` indicates the number of lines to be filtered. +The syntax of INSERT INTO is as follows: - Also returns a json string: +1. INSERT INTO SELECT - ```text - {'label': 'my_label1', 'status': 'visible', 'txnId': '4005'} - {'label': 'insert_f0747f0e-7a35-46e2-affa-13a235f4020d', 'status': 'committed', 'txnId': '4005'} - {'label': 'my_label1', 'status': 'visible', 'txnId': '4005', 'err': 'some other error'} - ``` +INSERT INTO SELECT is used to write the query results to the target table. - `label` is a user-specified label or an automatically generated label. Label is the ID of this Insert Into load job. Each load job has a label that is unique within a single database. +```SQL +INSERT INTO target_table SELECT ... FROM source_table; +``` - `status` indicates whether the loaded data is visible. If visible, show `visible`, if not, show`committed`. +The SELECT statement above is similar to a regular SELECT query, allowing operations such as WHERE and JOIN. - `txnId` is the id of the load transaction corresponding to this insert. +2. INSERT INTO VALUES - The `err` field displays some other unexpected errors. +INSERT INTO VALUES is typically used for testing purposes. It is not recommended for production environments. - When user need to view the filtered rows, the user can use the following statement +```SQL +INSERT INTO target_table (col1, col2, ...) +VALUES (val1, val2, ...), (val3, val4, ...), ...; +``` - ```text - show load where label = "xxx"; - ``` +### Parameter configuration - The URL in the returned result can be used to query the wrong data. For details, see the following **View Error Lines** Summary. **"Data is not visible" is a temporary status, this batch of data must be visible eventually** +**FE** **configuration** - You can view the visible status of this batch of data with the following statement: +insert_load_default_timeout_second - ```text - show transaction where id = 4005; - ``` +- Default value: 14400s (4 hours) +- Description: Timeout for import tasks, measured in seconds. If the import task does not complete within this timeout period, it will be canceled by the system and marked as CANCELLED. - If the `TransactionStatus` column in the returned result is `visible`, the data is visible. +**Environment parameters** - 2. Insert fails +insert_timeout - Execution failure indicates that no data was successfully loaded, and returns as follows: +- Default value: 14400s (4 hours) +- Description: Timeout for INSERT INTO as an SQL statement, measured in seconds. - ```text - mysql> insert into tbl1 select * from tbl2 where k1 = "a"; - ERROR 1064 (HY000): all partitions have no load data. Url: http://10.74.167.16:8042/api/_load_error_log?file=__shard_2/error_log_insert_stmt_ba8bb9e158e4879-ae8de8507c0bf8a2_ba8bb9e158e4879_ae8de850e8de850 - ``` +enable_insert_strict - Where `ERROR 1064 (HY000): all partitions have no load data` shows the reason for the failure. The latter url can be used to query the wrong data. For details, see the following **View Error Lines** Summary. +- Default value: true +- Description: If this is set to true, INSERT INTO will fail when the task involves invalid data. If set to false, INSERT INTO will ignore invalid rows, and the import will be considered successful as long as at least one row is imported successfully. +- Explanation: INSERT INTO cannot control the error rate, so this parameter is used to either strictly check data quality or completely ignore invalid data. Common reasons for data invalidity include: source data column length exceeding destination column length, column type mismatch, partition mismatch, and column order mismatch. -**In summary, the correct processing logic for the results returned by the insert operation should be:** +### Return values -1. If the returned result is `ERROR 1064 (HY000)`, it means that the import failed. -2. If the returned result is `Query OK`, it means the execution was successful. - 1. If `rows affected` is 0, the result set is empty and no data is loaded. - 2. If`rows affected` is greater than 0: - 1. If `status` is`committed`, the data is not yet visible. You need to check the status through the `show transaction` statement until `visible`. - 2. If `status` is`visible`, the data is loaded successfully. - 3. If `warnings` is greater than 0, it means that some data is filtered. You can get the url through the `show load` statement to see the filtered rows. +INSERT INTO an SQL statement, and it returns different results based on different query outcomes: -In the previous section, we described how to follow up on the results of insert operations. However, it is difficult to get the json string of the returned result in some mysql libraries. Therefore, Doris also provides the `SHOW LAST INSERT` command to explicitly retrieve the results of the last insert operation. +**Empty result set** -After executing an insert operation, you can execute `SHOW LAST INSERT` on the same session connection. This command returns the result of the most recent insert operation, e.g. +If the query result set of the SELECT statement in INSERT INTO is empty, the returned value will be as follows: -```sql -mysql> show last insert\G -*************************** 1. row *************************** - TransactionId: 64067 - Label: insert_ba8f33aea9544866-8ed77e2844d0cc9b - Database: default_cluster:db1 - Table: t1 -TransactionStatus: VISIBLE - LoadedRows: 2 - FilteredRows: 0 +```SQL +mysql> INSERT INTO tbl1 SELECT * FROM empty_tbl; +Query OK, 0 rows affected (0.02 sec) ``` -This command returns the insert results and the details of the corresponding transaction. Therefore, you can continue to execute the `show last insert` command after each insert operation to get the insert results. +`Query OK` indicates successful execution. `0 rows affected` means no data was imported. -> Note: This command will only return the results of the last insert operation within the same session connection. If the connection is broken or replaced with a new one, the empty set will be returned. +**Non-empty result set and successful INSERT** -## Relevant System Configuration +```SQL +mysql> INSERT INTO tbl1 SELECT * FROM tbl2; +Query OK, 4 rows affected (0.38 sec) +{'label':'INSERT_8510c568-9eda-4173-9e36-6adc7d35291c', 'status':'visible', 'txnId':'4005'} -### FE configuration +mysql> INSERT INTO tbl1 WITH LABEL my_label1 SELECT * FROM tbl2; +Query OK, 4 rows affected (0.38 sec) +{'label':'my_label1', 'status':'visible', 'txnId':'4005'} -- time out +mysql> INSERT INTO tbl1 SELECT * FROM tbl2; +Query OK, 2 rows affected, 2 warnings (0.31 sec) +{'label':'INSERT_f0747f0e-7a35-46e2-affa-13a235f4020d', 'status':'visible', 'txnId':'4005'} - The timeout time of the import task (in seconds) will be cancelled by the system if the import task is not completed within the set timeout time, and will become CANCELLED. +mysql> INSERT INTO tbl1 SELECT * FROM tbl2; +Query OK, 2 rows affected, 2 warnings (0.31 sec) +{'label':'INSERT_f0747f0e-7a35-46e2-affa-13a235f4020d', 'status':'committed', 'txnId':'4005'} +``` - At present, Insert Into does not support custom import timeout time. All Insert Into imports have a uniform timeout time. The default timeout time is 4 hours. If the imported source file cannot complete the import within the specified time, the parameter `insert_load_default_timeout_second` of FE needs to be adjusted. +`Query OK` indicates successful execution. `4 rows affected` indicates that a total of 4 rows of data have been imported. `2 warnings` indicates the number of rows that were filtered out. - - At the same time, the Insert Into statement receives the restriction of the Session variable `insert_timeout`. You can increase the timeout time by `SET insert_timeout = xxx;` in seconds. +Additionally, a JSON string is returned: -### Session Variables +```Plain +{'label':'my_label1', 'status':'visible', 'txnId':'4005'} +{'label':'INSERT_f0747f0e-7a35-46e2-affa-13a235f4020d', 'status':'committed', 'txnId':'4005'} +{'label':'my_label1', 'status':'visible', 'txnId':'4005', 'err':'some other error'} +``` -- enable_insert_strict +Parameter description: - The Insert Into import itself cannot control the tolerable error rate of the import. Users can only use the Session parameter `enable_insert_strict`. When this parameter is set to false, it indicates that at least one data has been imported correctly, and then it returns successfully. When this parameter is set to true, the import fails if there is a data error. The default is false. It can be set by `SET enable_insert_strict = true;`. +| Parameter | Description | +| --------- | ------------------------------------------------------------ | +| TxnId | ID of the import transaction | +| Label | Label of the import job: can be specified using "INSERT INTO tbl WITH LABEL label..." | +| Status | Visibility of the imported data: If it is visible, it will be displayed as "visible." If not, it will be displayed as "committed." In the "committed" state, the import is completed, but the data may be delayed in becoming visible. There is no need to retry in this case.`visible`: The import is successful and the data is visible.`committed`: The import is completed, but the data may be delayed in becoming visible. There is no need to retry in this case.Label Already Exists: The specified label already exists and needs to be changed to a different one.Fail: The import fails. | +| Err | Error message | -- query_timeout +You can use the [SHOW LOAD](https://doris.apache.org/docs/sql-manual/sql-reference/Show-Statements/SHOW-LOAD/) statement to view the filtered rows. - Insert Into itself is also an SQL command, and the Insert Into statement is restricted by the Session variable `insert_timeout`. You can increase the timeout time by `SET insert_timeout = xxx;` in seconds. +```SQL +SHOW LOAD WHERE label="xxx"; +``` -- enable_nereids_dml_with_pipeline +The result of this statement will include a URL that can be used to query the error data. For more details, refer to the "View error rows" section below. - When set to `true`, the `insert into` statement will attempt to be executed through the Pipeline engine. See the [import](./load-manual) documentation for details. +The invisible state of data is temporary, and the data will eventually become visible. -## Best Practices +You can check the visibility status of a batch of data using the [SHOW TRANSACTION](https://doris.apache.org/docs/sql-manual/sql-reference/Show-Statements/SHOW-TRANSACTION/) statement. -### Application scenarios +```SQL +SHOW TRANSACTION WHERE id=4005; +``` -1. Users want to import only a few false data to verify the functionality of Doris system. The grammar of INSERT INTO VALUES is suitable at this time. -2. Users want to convert the data already in the Doris table into ETL and import it into a new Doris table, which is suitable for using INSERT INTO SELECT grammar. -3. Users can create an external table, such as MySQL external table mapping a table in MySQL system. Or create Broker external tables to map data files on HDFS. Then the data from the external table is imported into the Doris table for storage through the INSERT INTO SELECT grammar. +If the `TransactionStatus` column in the result is `visible`, it indicates that the data is visible. -### Data volume +```SQL +{'label':'my_label1', 'status':'visible', 'txnId':'4005'} +{'label':'INSERT_f0747f0e-7a35-46e2-affa-13a235f4020d', 'status':'committed', 'txnId':'4005'} +{'label':'my_label1', 'status':'visible', 'txnId':'4005', 'err':'some other error'} +``` + +**Non-empty result set but failed INSERT** -Insert Into has no limitation on the amount of data, and large data imports can also be supported. However, Insert Into has a default timeout time, and the amount of imported data estimated by users is too large, so it is necessary to modify the system's Insert Into import timeout time. +Failed execution means that no data was successfully imported. An error message will be returned: -```text -Import data volume = 36G or less than 3600s*10M/s -Among them, 10M/s is the maximum import speed limit. Users need to calculate the average import speed according to the current cluster situation to replace 10M/s in the formula. +```SQL +mysql> INSERT INTO tbl1 SELECT * FROM tbl2 WHERE k1 = "a"; +ERROR 1064 (HY000): all partitions have no load data. url: http://10.74.167.16:8042/api/_load_error_log?file=_shard_2/error_loginsert_stmt_ba8bb9e158e4879-ae8de8507c0bf8a2_ba8bb9e158e4879_ae8de8507c0bf8a2 ``` -### Complete examples +`ERROR 1064 (HY000): all partitions have no load data` indicates the root cause for the failure. The URL provided in the error message can be used to locate the error data. For more details, refer to the "View error rows" section below. + +## Best practice + +### Data size + +INSERT INTO imposes no limitations on data volume and can support large-scale data imports. However, if you are importing a large amount of data, it is recommended to adjust the system's INSERT INTO timeout settings to ensure that `import timeout >= data volume ``/`` estimated import speed`. + +1. FE configuration parameter `insert_load_default_timeout_second` +2. Environment parameter `insert_timeout` + +### View error rows -Users have a table store sales in the database sales. Users create a table called bj store sales in the database sales. Users want to import the data recorded in the store sales into the new table bj store sales. The amount of data imported is about 10G. +When the INSERT INTO result includes a URL field, you can use the following command to view the error rows: -```text -large sales scheme -(id, total, user_id, sale_timestamp, region) +```SQL +SHOW LOAD WARNINGS ON "url"; +``` + +Example: -Order large sales schedule: -(id, total, user_id, sale_timestamp) +```SQL +SHOW LOAD WARNINGS ON "http://ip:port/api/_load_error_log?file=_shard_13/error_loginsert_stmt_d2cac0a0a16d482d-9041c949a4b71605_d2cac0a0a16d482d_9041c949a4b71605"; ``` -Cluster situation: The average import speed of current user cluster is about 5M/s +Common reasons for errors include: source data column length exceeding destination column length, column type mismatch, partition mismatch, and column order mismatch. + +You can control whether INSERT INTO ignores error rows by configuring the environment variable `enable_insert_strict`. + +## Ingest external data via Multi-Catalog -- Step1: Determine whether you want to modify the default timeout of Insert Into +Doris supports the creation of external tables. Once created, data from external tables can be imported into Doris internal tables using `INSERT INTO SELECT`, or queried directly using SELECT statements. - ```text - Calculate the approximate time of import - 10G / 5M /s = 2000s - - Modify FE configuration - insert_load_default_timeout_second = 2000 - ``` +With its Multi-Catalog feature, Doris supports connections to various mainstream data lakes and databases including Apache Hive, Apache Iceberg, Apache Hudi, Apache Paimon (Incubating), Elasticsearch, MySQL, Oracle, and SQL Server. -- Step2: Create Import Tasks +For more information on Multi-Catalog, please refer to [Lakehouse overview](https://doris.apache.org/docs/2.0/lakehouse/lakehouse-overview/#multi-catalog). - Since users want to ETL data from a table and import it into the target table, they should use the Insert in query\stmt mode to import it. +The followings illustrate importing data from a Hive external table into a Doris internal table. - ```text - INSERT INTO bj_store_sales SELECT id, total, user_id, sale_timestamp FROM store_sales where region = "bj"; - ``` +### Create Hive Catalog -## Common Questions +```SQL +CREATE CATALOG hive PROPERTIES ( + 'type'='hms', + 'hive.metastore.uris' = 'thrift://172.0.0.1:9083', + 'hadoop.username' = 'hive', + 'dfs.nameservices'='your-nameservice', + 'dfs.ha.namenodes.your-nameservice'='nn1,nn2', + 'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:8088', + 'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:8088', + 'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider' +); +``` + +### Ingest data + +1. Create a target table for the data import in Doris. + +```SQL +CREATE TABLE `target_tbl` ( + `k1` decimal(9, 3) NOT NULL COMMENT "", + `k2` char(10) NOT NULL COMMENT "", + `k3` datetime NOT NULL COMMENT "", + `k5` varchar(20) NOT NULL COMMENT "", + `k6` double NOT NULL COMMENT "" +) +COMMENT "Doris Table" +DISTRIBUTED BY HASH(k1) BUCKETS 2 +PROPERTIES ( + "replication_num" = "1" +); +``` -- View the wrong line +2. For detailed instructions on creating Doris tables, please refer to [CREATE TABLE](https://doris.apache.org/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE/). + +3. Importing data (from the `hive.db1.source_tbl` table into the `target_tbl` table). + +```SQL +INSERT INTO target_tbl SELECT k1,k2,k3 FROM hive.db1.source_tbl limit 100; +``` + +The INSERT command is a synchronous command. If it returns a result, that indicates successful import. + +### Notes + +- Ensure that the external data source and the Doris cluster can communicate, including mutual network accessibility between BE nodes and external data sources. + +## Ingest data by TVF + +Doris can directly query and analyze files stored in object storage or HDFS as tables through the Table Value Functions (TVFs), which supports automatic column type inference. For detailed information, please refer to the Lakehouse/TVF documentation. + +### Automatic column type inference + +```Plain +DESC FUNCTION s3 ( + "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet", + "s3.access_key"= "ak", + "s3.secret_key" = "sk", + "format" = "parquet", + "use_path_style"="true" +); ++---------------+--------------+------+-------+---------+-------+ +| Field | Type | Null | Key | Default | Extra | ++---------------+--------------+------+-------+---------+-------+ +| p_partkey | INT | Yes | false | NULL | NONE | +| p_name | TEXT | Yes | false | NULL | NONE | +| p_mfgr | TEXT | Yes | false | NULL | NONE | +| p_brand | TEXT | Yes | false | NULL | NONE | +| p_type | TEXT | Yes | false | NULL | NONE | +| p_size | INT | Yes | false | NULL | NONE | +| p_container | TEXT | Yes | false | NULL | NONE | +| p_retailprice | DECIMAL(9,0) | Yes | false | NULL | NONE | +| p_comment | TEXT | Yes | false | NULL | NONE | ++---------------+--------------+------+-------+---------+-------+ +``` + +In this example of an S3 TVF, the file path, connection information, and authentication information are specified. + +You can use the `DESC FUNCTION` syntax to view the schema of this file. + +It can be seen that for Parquet files, Doris automatically infers column types based on the metadata within the file. + +Currently, Doris supports analysis and column type inference for Parquet, ORC, CSV, and JSON formats. + +It can be used in combination with the `INSERT INTO SELECT` syntax to quickly import files into Doris tables for faster analysis. + +```Plain +// 1. Create Doris internal table +CREATE TABLE IF NOT EXISTS test_table +( + id int, + name varchar(50), + age int +) +DISTRIBUTED BY HASH(id) BUCKETS 4 +PROPERTIES("replication_num" = "1"); + +// 2. Insert data by S3 Table Value Function +INSERT INTO test_table (id,name,age) +SELECT cast(id as INT) as id, name, cast (age as INT) as age +FROM s3( + "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet", + "s3.access_key"= "ak", + "s3.secret_key" = "sk", + "format" = "parquet", + "use_path_style" = "true"); +``` - Because Insert Into can't control the error rate, it can only tolerate or ignore the error data completely by `enable_insert_strict`. So if `enable_insert_strict` is set to true, Insert Into may fail. If `enable_insert_strict` is set to false, then only some qualified data may be imported. However, in either case, Doris is currently unable to provide the ability to view substandard data rows. Therefore, the user cannot view the specific import error through the Insert Into statement. +### Notes - The causes of errors are usually: source data column length exceeds destination data column length, column type mismatch, partition mismatch, column order mismatch, etc. When it's still impossible to check for problems. At present, it is only recommended that the SELECT command in the Insert Into statement be run to export the data to a file, and then import the file through Stream load to see the specific errors. +- If the URI specified in the `S3 / hdfs` TVF does not match any files, or if all matched files are empty, the `S3 / hdfs` TVF will return an empty result set. In such cases, if you use `DESC FUNCTION` to view the schema of the file, you will see a dummy column `__dummy_col`, which can be ignored. +- If the format specified for the TVF is CSV and the file being read is not empty but the first line of the file is empty, an error will be prompted: `The first line is empty, can not parse column numbers`. This is because the schema cannot be parsed from the first line of the file. -## more help +## More help -For more detailed syntax and best practices used by insert into, see [insert](../../../sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/INSERT.md) command manual, you can also enter `HELP INSERT` in the MySql client command line for more help information. +For more detailed syntax on INSERT INTO, refer to the [INSERT INTO](https://doris.apache.org/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/INSERT/) command manual. You can also type `HELP INSERT` at the MySQL client command line for further information. diff --git a/versioned_docs/version-2.0/data-operate/import/mysql-load-manual.md b/versioned_docs/version-2.0/data-operate/import/mysql-load-manual.md index 06592878c8954..a43a1640f4175 100644 --- a/versioned_docs/version-2.0/data-operate/import/mysql-load-manual.md +++ b/versioned_docs/version-2.0/data-operate/import/mysql-load-manual.md @@ -24,129 +24,246 @@ specific language governing permissions and limitations under the License. --> -# Mysql Load - +Apache Doris is compatible with the MySQL protocol and supports the use of the standard MySQL [LOAD DATA](https://dev.mysql.com/doc/refman/8.0/en/load-data.html) syntax to import local files. MySQL Load is a synchronous import method where the import result is returned upon completion. That means users can tell whether the import suceeds from the returned result. Generally, the MySQL Load method can be used to ingest files smaller than 10GB in size. For files larger than 10GB, it is recommended to split them into smaller ones. MySQL Load ensures the atomicity of a batch of import tasks, meaning that either all imports succeed or all imports fail. -This is an stand syntax of MySql [LOAD DATA](https://dev.mysql.com/doc/refman/8.0/en/load-data.html) for user to load local file. +## Applicable scenarios -MySql load synchronously executes the import and returns the import result. The return information will show whether the import is successful for user. +**Supported format** -MySql load is mainly suitable for importing local files on the client side, or importing data from a data stream through a program. +MySQL Load is primarily designed for importing CSV files from the client's local machine or importing data from data streams through programs. - +**Restrictions** -## Basic Principles +When importing CSV files, it is important to differentiate between null values and empty strings (''): -The MySql Load are similar with Stream Load. Both import local files into the Doris cluster, so the MySQL Load will reuses StreamLoad: +- Null values are represented by the escape sequence \N. For example, in `a,\N,b`, the middle column represents a null value. +- Empty strings are represented directly as empty, such as `a, ,b`, where the middle column represents an empty string. - 1. FE receives the MySQL Load request executed by the client and then analyse the SQL +## Implementation - 2. FE build the MySql Load request as a StreamLoad request. +MySQL Load is similar to Stream Load in terms of functionality. They both involve importing local files into the Doris cluster. Therefore, the implementation of MySQL Load reuses the basic import capabilities of Stream Load. - 3. FE selects a BE node to send a StreamLoad request +The main processes of MySQL Load include: - 4. When sending the request, FE will read the local file data from the MySQL client side streamingly, and send it to the HTTP request of StreamLoad asynchronously. +1. The user submits a LOAD DATA request to the frontend (FE), which performs the parsing and encapsulates the request into a Stream Load task. +2. The FE selects a backend (BE) node and sends the Stream Load request to it. +3. Meanwhile, the FE reads the local file data from the MySQL client in an asynchronous and streaming manner and sends it in real time to the HTTP request of the Stream Load. +4. Once the data transfer from the MySQL client is complete, the FE waits for the Stream Load to finish and displays the import result (success or failure) to the client. - 5. After the data transfer on the MySQL client side is completed, FE waits for the StreamLoad to complete, and displays the import success or failure information to the client side. +## Get started +### Preparations -## Support data format +MySQL Load requires INSERT permission on the target table. You can grant permissions to user account using the GRANT command. -MySql Load currently only supports data formats: CSV (text). +### Create a MySQL Load job -## Basic operations +1. Prepare the test data -### client connection -```bash -mysql --local-infile -h 127.0.0.1 -P 9030 -u root -D testdb +Create a data file `client_local.csv` containing the following sample data: + +```SQL +1,10 +2,20 +3,30 +4,40 +5,50 +6,60 ``` -Notice that if you wants to use mysql load, you must connect doris server with `--local-infile` in client command. -If you're use jdbc to connect doris, you must add property named `allowLoadLocalInfile=true` in jdbc url. +2. Connect to MySQL client +Connect to the MySQL client before executing the LOAD DATA command: -### Create test table -```sql - CREATE TABLE testdb.t1 (pk INT, v1 INT SUM) AGGREGATE KEY (pk) DISTRIBUTED BY hash (pk) PROPERTIES ('replication_num' = '1'); - ``` - ### import file from client node - Suppose there is a CSV file named 'client_local.csv 'on the current path of the client side, which will be imported into the test table'testdb.t1' using the MySQL LOAD syntax. +```Shell +mysql --local-infile -h -P -u root -D testdb +``` -```sql -LOAD DATA LOCAL -INFILE 'client_local.csv ' -INTO TABLE testdb.t1 -PARTITION (partition_a, partition_b, partition_c, partition_d) -COLUMNS TERMINATED BY '\ t' -LINES TERMINATED BY '\ n' -IGNORE 1 LINES -(K1, k2, v2, v10, v11) -SET (c1 = k1, c2 = k2, c3 = v10, c4 = v11) -PROPERTIES ("strict_mode" = "true") +:::caution +Specific parameter options need to be used during the connection: + +1. When connecting to the MySQL client, `--local-infile` must be included, otherwise an error might be thrown. +2. When connecting via JDBC, `allowLoadLocalInfile=true` must be specified in the URL. +::: + +3. Create the test table + +Create a table as follows in Doris: + +```SQL +CREATE TABLE testdb.t1 ( + pk INT, + v1 INT SUM +) AGGREGATE KEY (pk) +DISTRIBUTED BY hash (pk); ``` -1. MySQL Load starts with the syntax `LOAD DATA`, and specifying `LOCAL` means reading client side files. -2. The local fill path will be filled after `INFILE`, which can be a relative path or an absolute path. Currently only a single file is supported, and multiple files are not supported -3. The table name after `INTO TABLE` can specify the database name, as shown in the case. It can also be omitted, and the current database for the user will be used. -4. `PARTITION` syntax supports specified partition to import -5. `COLUMNS TERMINATED BY` specifies the column separator -6. `LINES TERMINATED BY` specifies the line separator -7. `IGNORE num LINES` skips the num header of the CSV. -8. Column mapping syntax, see the column mapping chapter of [Imported Data Transformation](../import-scenes/load-data-convert.md) for specific parameters -9. `PROPERTIES` is the configuration of import, please refer to the [MySQL Load](../../../sql-manual/sql-reference/Data-Manipulation-Statements/Load/MYSQL-LOAD.md) command manual for specific properties. - -### import file from fe server node -Assuming that the '/root/server_local.csv' path on the FE node is a CSV file, use the MySQL client side to connect to the corresponding FE node, and then execute the following command to import data into the test table. - -```sql -LOAD DATA -INFILE '/root/server_local.csv' + +4. Run the LOAD DATA command + +After connecting to the MySQL client, create a Load job. The command is as follows: + +```SQL +LOAD DATA LOCAL +INFILE 'client_local.csv' INTO TABLE testdb.t1 -PARTITION (partition_a, partition_b, partition_c, partition_d) -COLUMNS TERMINATED BY '\ t' -LINES TERMINATED BY '\ n' -IGNORE 1 LINES -(K1, k2, v2, v10, v11) -SET (c1 = k1, c2 = k2, c3 = v10, c4 = v11) -PROPERTIES ("strict_mode" = "true") +COLUMNS TERMINATED BY ',' +LINES TERMINATED BY '\n'; ``` -1. The only difference between the syntax of importing server level local files and importing client side syntax is whether the'LOCAL 'keyword is added after the'LOAD DATA' keyword. -2. FE will have multi-nodes, and importing server level files can only import FE nodes connected by the client side, and cannot import files local to other FE nodes. -3. Server side load was disabled by default. Enable it by setting `mysql_load_server_secure_path` with a secure path. All the load file should be under this path. -### Return result -Since MySQL load is a synchronous import method, the imported results are returned to the user through SQL syntax. -If the import fails, a specific error message will be displayed. If the import is successful, the number of imported rows will be displayed. +### View results -```Text -Query OK, 1 row affected (0.17 sec) -Records: 1 Deleted: 0 Skipped: 0 Warnings: 0 +MySQL Load is a synchronous import method, and the results of the import are returned to the user in the command-line interface. If the import execution fails, it will display specific error messages. + +Below is an example of a successful import result, which returns the number of imported rows: + +```SQL +Query OK, 6 row affected (0.17 sec) +Records: 6 Deleted: 0 Skipped: 0 Warnings: 0 ``` -### Error result -If mysql load process goes wrong, it will show the error in the client as below: -```text +When there are exceptions during the import, the corresponding error will be displayed on the client: + +```SQL ERROR 1105 (HY000): errCode = 2, detailMessage = [DATA_QUALITY_ERROR]too many filtered rows with load id b612907c-ccf4-4ac2-82fe-107ece655f0f ``` -If you meets this error, you can extract the `loadId` and use it in the `show load warnings` command to get more detail message. -```sql +The `loadId` is included in the error message, based on which you can view the detailed information via the `show load warnings` command: + +```SQL show load warnings where label='b612907c-ccf4-4ac2-82fe-107ece655f0f'; ``` -The loadId was the label in this case. +### Cancel a MySQL Load job + +Doris does not allow manual cancellation of MySQL Load jobs. In the event of a timeout or import error, the corresponding MySQL Load job will be automatically cancelled by the system. + +## Manual + +### Syntax + +The syntax for LOAD DATA is as follows: + +```SQL +LOAD DATA LOCAL +INFILE '' +INTO TABLE [.] +[PARTITION (partition_name [, partition_name] ...)] +[COLUMNS TERMINATED BY ''] +[LINES TERMINATED BY ''] +[IGNORE LINES] +[(col_name_or_user_var[, col_name_or_user_var] ...)] +[SET col_name={expr | DEFAULT}[, col_name={expr | DEFAULT}] ...] +[PROPERTIES (key1 = value1 [, key2=value2]) ] +``` +Descriptions of modules in the Load job: -### Configuration -1. `mysql_load_thread_pool`: the thread pool size for singe FE node, set 4 thread by default. The block queue size is 5 times of `mysql_load_thread_pool`. So FE can accept 4 + 4\*5 = 24 requests in one time. Increase this configuration if the parallelism are larger than 24. -2. `mysql_load_server_secure_path`: the secure path for load data from server. Empty path by default means that it's not allowed for server load. Recommend to create a `local_import_data` directory under `DORIS_HOME` to load data if you want enable it. -3. `mysql_load_in_memory_record` The failed mysql load record size. The record was keep in memory and only have 20 records by default. If you want to track more records, you can rise the config but be careful about the fe memory. This record will expired after one day and there is a async thread to clean it in every day. +| Module | Description | +| --------------------- | ------------------------------------------------------------ | +| INFILE | This specifies the local file path, which can be either a relative path or an absolute path.Currently, load_data_file only supports a single file. | +| INTO TABLE | This specifies the database and table, and the database name can be omitted. | +| PARTITION | This specifies the target partition. If the user can determine the partition corresponding to the data, it is recommended to specify this. Data that does not fit into the specified partitions will be filtered out. | +| COLUMNS TERMINATED BY | This specifies the column delimiter. | +| LINE TERMINATED BY | This specifies the row delimiter. | +| IGNORE num LINES | This specifies the number of rows to skip in the CSV import, typically specified as 1 to skip the header. | +| col_name_or_user_var | This specifies the column mapping syntax. For more information, refer to [Column Mapping](https://doris.apache.org/docs/2.0/data-operate/import/load-data-convert#column-mapping). | +| PROPERTIES | Parameters for the Load. | +### Parameters -## Notice +By the `PROPERTIES (key1 = value1 [, key2=value2])` syntax, you can configure the parameters for the Load. + +| Parameter | Description | +| ------------------ | ------------------------------------------------------------ | +| max_filter_ratio | The maximum filtering rate allowed. Must be between 0 and 1, inclusive. The default value is 0, indicating no tolerance for any error rows. | +| timeout | The import timeout, measured in seconds. The default value is 600. The range allowed is from 1s to 259200s. | +| strict_mode | Whether to enable strict mode for this import. Disabled by default. | +| timezone | The time zone for this import. The default time zone is UTC+8. This parameter will affect the results of any time zone-related functions involved in the import. | +| exec_mem_limit | Memory limit for the import, defaults to 2GB, measured in bytes. | +| trim_double_quotes | Boolean, defaults to false. If this is set to true, the outermost double quotes will be trimmed from each field in the import file. | +| enclose | This specifies the enclosure character. When a CSV data field contains line breaks or column delimiters, you can specify a single-byte character as the enclosure character to prevent accidental truncation.For example, if the column delimiter is ",", and the enclosure character is "'", in data "a,'b,c'", "b,c" will be parsed as one field. | +| escape | This specifies the escape character. This is used when the data contains the same character as the enclosure character, which needs to be treated as part of the field.For example, if the data is "a,'b,'c'", the enclosure character is "'", and you want "b,'c" to be parsed as one field, you need to specify a single-byte escape character, such as "\", to modify the data to "a,'b,\'c'". | + +## Example + +### Specify load timeout + +You can adjust the import timeout by specifying `timeout` in PROPERTIES. For example, set it to 100s: + +```SQL +LOAD DATA LOCAL +INFILE 'testData' +INTO TABLE testDb.testTbl +PROPERTIES ("timeout"="100"); +``` -1. If you see this `LOAD DATA LOCAL INFILE file request rejected due to restrictions on access` message, you should connet mysql with `mysql --local-infile=1` command to enable client to load local file. -2. The configuration for StreamLoad will also affect MySQL Load. Such as the configurate in be named `streaming_load_max_mb`, it's 10GB by default and it will control the max size for one load. +### Specify allowable error rate + +You can adjust the allowable error rate by specifying `max_filter_ratio` in PROPERTIES. For example, set it to 20%: + +```SQL +LOAD DATA LOCAL +INFILE 'testData' +INTO TABLE testDb.testTbl +PROPERTIES ("max_filter_ratio"="0.2"); +``` + +### Import column mapping + +The following example adjusts the order of columns in the CSV file. + +```SQL +LOAD DATA LOCAL +INFILE 'testData' +INTO TABLE testDb.testTbl +(k2, k1, v1); +``` + +### Specify column and row delimiters + +You can specify the column and row delimiters using the `COLUMNS TERMINATED BY` and `LINES TERMINATED BY` clauses. In the following example, (,) and (\n) are used as the column and row delimiters, respectively. + +```SQL +LOAD DATA LOCAL +INFILE 'testData' +COLUMNS TERMINATED BY ',' +LINES TERMINATED BY '\n' +INTO TABLE testDb.testTbl; +``` + +### Specify target partition + +You can specify the target partition for the import using the `PARTITION` clause. In the following example, data will be loaded into the specified partitions 'p1' and 'p2', and any data that does not belong to these two partitions will be filtered out: + +```SQL +LOAD DATA LOCAL +INFILE 'testData' +INTO TABLE testDb.testTbl +PARTITION (p1, p2); +``` + +### Specify time zone + +You can specify the `timezone` in PROPERTIES. In the following example, the timezone is set to Africa/Abidjan: + +```SQL +LOAD DATA LOCAL +INFILE 'testData' +INTO TABLE testDb.testTbl +PROPERTIES ("timezone"="Africa/Abidjan"); +``` + +### Specify the memory limit for the import + +You can specify the memory limit for the import by the `exec_mem_limit` parameter in PROPERTIES. In the following example, the memory limit is set to 10G: + +```SQL +LOAD DATA LOCAL +INFILE 'testData' +INTO TABLE testDb.testTbl +PROPERTIES ("exec_mem_limit"="10737418240"); +``` -## More Help +## More help -1. For more detailed syntax and best practices for using MySQL Load, see the [MySQL Load](../../../sql-manual/sql-reference/Data-Manipulation-Statements/Load/MYSQL-LOAD.md) command manual. +For more detailed syntax and best practices related to MySQL Load, refer to the [MySQL Load](https://doris.apache.org/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/MYSQL-LOAD/) command manual. diff --git a/versioned_docs/version-2.0/table-design/auto-increment.md b/versioned_docs/version-2.0/table-design/auto-increment.md index 9cc51cae8c045..e8031ce39d715 100644 --- a/versioned_docs/version-2.0/table-design/auto-increment.md +++ b/versioned_docs/version-2.0/table-design/auto-increment.md @@ -24,126 +24,66 @@ specific language governing permissions and limitations under the License. --> -# AUTO_INCREMENT Column +:::tip +Auto-increment column is supported in Doris since version 2.1. The documentation of V2.0 only provides an introduction to its functionality and use cases. If you are interested in this feature, please upgrade to Doris 2.1 and refer to the relevant part in the documentation of V2.1. +::: - +For tables that enable an AUTO_INCREMENT Column, Doris assigns a table-unique value to rows that do not have a specified value in the auto-increment column. - +## Description -When importing data, Doris assigns a table-unique value to rows that do not have specified values in the auto-increment column. +For tables that contain an auto-increment column, during data ingestion: -## Functionality - -For tables containing an auto-increment column, during data import: -- If the target columns don't include the auto-increment column, Doris will populate the auto-increment column with generated values. -- If the target columns include the auto-increment column, null values in the imported data for that column will be replaced by values generated by Doris, while non-null values will remain unchanged. Note that **non-null values can disrupt the uniqueness of the auto-increment column values**. +- If the auto-increment column is **NOT** one of the target columns of this import, it will be auto-filled with values generated by Doris. The auto-filled values are unique for each row. +- If the auto-increment column is one of the target columns of this import, the imported null values in this column will be replaced with values generated by Doris, while non-null values will remain unchanged. Note that the existence of the imported non-null values might compromise the uniqueness of values in the auto-increment columns. ### Uniqueness -Doris ensures that values generated on the auto-increment column have **table-wide uniqueness**. However, it's important to note that **the uniqueness of the auto-increment column only guarantees uniqueness for values automatically filled by Doris and does not consider values provided by users**. If a user explicitly inserts user-provided values for this table by specifying the auto-increment column, this uniqueness cannot be guaranteed. - -### Density - -Doris ensures that the values generated on the auto-increment column are dense, but it **cannot guarantee** that the values automatically generated in the auto-increment column during an import will be entirely contiguous. Thus, there might be some jumps in the values generated by the auto-increment column during an import. This is because, for performance consideration, each BE caches a portion of pre-allocated auto-increment column values, and these cached values do not intersect between different BEs. Additionally, due to this caching mechanism, Doris cannot guarantee that the values automatically generated on the auto-increment column in a later import on the physical timeline will be larger than those from the previous import. Therefore, the values allocated by the auto-increment column cannot be used to determine the chronological order of imports. - -## Syntax +Doris ensures that the values generated on the auto-increment column are unique within the table. However, Doris can only guarantee the uniqueness of values generated by itself. If the user inserts values into the auto-increment column explicitly, it does not guarantee the table-wide uniqueness of such values. -To use auto-increment columns, you need to add the `AUTO_INCREMENT` attribute to the corresponding column during table creation ([CREATE-TABLE](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE)). To manually specify the starting value for an auto-increment column, you can do so by using the `AUTO_INCREMENT(start_value)` statement when creating the table. If not specified, the default starting value is 1. +### Sequentialness -### Examples +Doris ensures that the values generated on the auto-increment column are dense. However, it does not guarantee that the values populated by auto-increment during a single import will be completely continuous. This is because, for performance reasons, each backend (BE) caches a portion of pre-allocated auto-increment values, and the cached values do not overlap between different BEs. Additionally, due to the existence of caching, Doris cannot guarantee that the auto-increment values generated in a subsequent import will be larger than those generated in a previous import in terms of physical time. Therefore, the order of import cannot be determined based on the magnitude of auto-increment values allocated. -1. Creating a Duplicate table with one key column as an auto-increment column: - - ```sql - CREATE TABLE `demo`.`tbl` ( - `id` BIGINT NOT NULL AUTO_INCREMENT, - `value` BIGINT NOT NULL - ) ENGINE=OLAP - DUPLICATE KEY(`id`) - DISTRIBUTED BY HASH(`id`) BUCKETS 10 - PROPERTIES ( - "replication_allocation" = "tag.location.default: 3" - ); +## Usage -2. Creating a Duplicate table with one key column as an auto-increment column, and set start value is 100: +1. Create a table in the Duplicate Key model, with one of the key columns being an auto-increment column. - ```sql - CREATE TABLE `demo`.`tbl` ( - `id` BIGINT NOT NULL AUTO_INCREMENT(100), - `value` BIGINT NOT NULL - ) ENGINE=OLAP - DUPLICATE KEY(`id`) - DISTRIBUTED BY HASH(`id`) BUCKETS 10 - PROPERTIES ( - "replication_allocation" = "tag.location.default: 3" - ); - ``` +```SQL +CREATE TABLE `demo`.`tbl` ( + `id` BIGINT NOT NULL AUTO_INCREMENT, + `value` BIGINT NOT NULL +) ENGINE=OLAP +DUPLICATE KEY(`id`) +DISTRIBUTED BY HASH(`id`) BUCKETS 10 +PROPERTIES ( +"replication_allocation" = "tag.location.default: 3" +); +``` -3. Creating a Duplicate table with one value column as an auto-increment column: +2. Create a table in the Duplicate Key model, with one of the value columns being an auto-increment column. - ```sql - CREATE TABLE `demo`.`tbl` ( +```SQL +CREATE TABLE `demo`.`tbl` ( `uid` BIGINT NOT NULL, `name` BIGINT NOT NULL, `id` BIGINT NOT NULL AUTO_INCREMENT, `value` BIGINT NOT NULL - ) ENGINE=OLAP - DUPLICATE KEY(`uid`, `name`) - DISTRIBUTED BY HASH(`uid`) BUCKETS 10 - PROPERTIES ( - "replication_allocation" = "tag.location.default: 3" - ); - ``` - -4. Creating a Unique tbl table with one key column as an auto-increment column: - - ```sql - CREATE TABLE `demo`.`tbl` ( - `id` BIGINT NOT NULL AUTO_INCREMENT, - `name` varchar(65533) NOT NULL, - `value` int(11) NOT NULL - ) ENGINE=OLAP - UNIQUE KEY(`id`) - DISTRIBUTED BY HASH(`id`) BUCKETS 10 - PROPERTIES ( - "replication_allocation" = "tag.location.default: 3", - "enable_unique_key_merge_on_write" = "true" - ); - ``` - -5. Creating a Unique tbl table with one value column as an auto-increment column: - - ```sql - CREATE TABLE `demo`.`tbl` ( - `text` varchar(65533) NOT NULL, - `id` BIGINT NOT NULL AUTO_INCREMENT, - ) ENGINE=OLAP - UNIQUE KEY(`text`) - DISTRIBUTED BY HASH(`text`) BUCKETS 10 - PROPERTIES ( - "replication_allocation" = "tag.location.default: 3", - "enable_unique_key_merge_on_write" = "true" - ); - ``` - -### Constraints and Limitations - -- Only Duplicate model tables and Unique model tables can contain auto-increment columns. -- A table can contain at most one auto-increment column. -- The type of the auto-increment column must be BIGINT and must be NOT NULL. -- The manually specified starting value for an auto-increment column must be greater than or equal to 0. - -## Usage - -### Import +) ENGINE=OLAP +DUPLICATE KEY(`uid`, `name`) +DISTRIBUTED BY HASH(`uid`) BUCKETS 10 +PROPERTIES ( +"replication_allocation" = "tag.location.default: 3" +); +``` -Consider the following table: +3. Create a table in the Unique Key model, with one of the key columns being an auto-increment column. -```sql +```SQL CREATE TABLE `demo`.`tbl` ( - `id` BIGINT NOT NULL AUTO_INCREMENT, - `name` varchar(65533) NOT NULL, - `value` int(11) NOT NULL + `id` BIGINT NOT NULL AUTO_INCREMENT, + `name` varchar(65533) NOT NULL, + `value` int(11) NOT NULL ) ENGINE=OLAP UNIQUE KEY(`id`) DISTRIBUTED BY HASH(`id`) BUCKETS 10 @@ -153,203 +93,30 @@ PROPERTIES ( ); ``` -When using the insert into statement to import data without specifying the auto-increment column `id`, the `id` column will automatically be filled with generated values. - -```sql -mysql> insert into tbl(name, value) values("Bob", 10), ("Alice", 20), ("Jack", 30); -Query OK, 3 rows affected (0.09 sec) -{'label':'label_183babcb84ad4023_a2d6266ab73fb5aa', 'status':'VISIBLE', 'txnId':'7'} - -mysql> select * from tbl order by id; -+------+-------+-------+ -| id | name | value | -+------+-------+-------+ -| 1 | Bob | 10 | -| 2 | Alice | 20 | -| 3 | Jack | 30 | -+------+-------+-------+ -3 rows in set (0.05 sec) -``` - -Similarly, using stream load to import the file test.csv without specifying the auto-increment column `id` will result in the id column being automatically filled with generated values. +4. Create a table in the Unique Key model, with one of the value columns being an auto-increment column. -test.csv: -``` -Tom, 40 -John, 50 -``` - -``` -curl --location-trusted -u user:passwd -H "columns:name,value" -H "column_separator:," -T ./test1.csv http://{host}:{port}/api/{db}/tbl/_stream_load -``` - -```sql -mysql> select * from tbl order by id; -+------+-------+-------+ -| id | name | value | -+------+-------+-------+ -| 1 | Bob | 10 | -| 2 | Alice | 20 | -| 3 | Jack | 30 | -| 4 | Tom | 40 | -| 5 | John | 50 | -+------+-------+-------+ -5 rows in set (0.04 sec) -``` - -When importing using insert into statement while specifying the auto-increment column `id`, null values in the imported data for that column will be replaced by generated values. - -```sql -mysql> insert into tbl(id, name, value) values(null, "Doris", 60), (null, "Nereids", 70); -Query OK, 2 rows affected (0.07 sec) -{'label':'label_9cb0c01db1a0402c_a2b8b44c11ce4703', 'status':'VISIBLE', 'txnId':'10'} - -mysql> select * from tbl order by id; -+------+---------+-------+ -| id | name | value | -+------+---------+-------+ -| 1 | Bob | 10 | -| 2 | Alice | 20 | -| 3 | Jack | 30 | -| 4 | Tom | 40 | -| 5 | John | 50 | -| 6 | Doris | 60 | -| 7 | Nereids | 70 | -+------+---------+-------+ -7 rows in set (0.04 sec) -``` - -### Partial Update - -When performing a partial update on a merge-on-write Unique table containing an auto-increment column: - -If the auto-increment column is a key column, during partial updates, as users must explicitly specify the key column, the target columns for partial column updates must include the auto-increment column. In this scenario, the import behavior is similar to regular partial updates. - -```sql -mysql> CREATE TABLE `demo`.`tbl2` ( - -> `id` BIGINT NOT NULL AUTO_INCREMENT, - -> `name` varchar(65533) NOT NULL, - -> `value` int(11) NOT NULL DEFAULT "0" - -> ) ENGINE=OLAP - -> UNIQUE KEY(`id`) - -> DISTRIBUTED BY HASH(`id`) BUCKETS 10 - -> PROPERTIES ( - -> "replication_allocation" = "tag.location.default: 3", - -> "enable_unique_key_merge_on_write" = "true" - -> ); -Query OK, 0 rows affected (0.03 sec) - -mysql> insert into tbl2(id, name, value) values(1, "Bob", 10), (2, "Alice", 20), (3, "Jack", 30); -Query OK, 3 rows affected (0.14 sec) -{'label':'label_5538549c866240b6_bce75ef323ac22a0', 'status':'VISIBLE', 'txnId':'1004'} - -mysql> select * from tbl2 order by id; -+------+-------+-------+ -| id | name | value | -+------+-------+-------+ -| 1 | Bob | 10 | -| 2 | Alice | 20 | -| 3 | Jack | 30 | -+------+-------+-------+ -3 rows in set (0.08 sec) - -mysql> set enable_unique_key_partial_update=true; -Query OK, 0 rows affected (0.01 sec) - -mysql> set enable_insert_strict=false; -Query OK, 0 rows affected (0.00 sec) - -mysql> insert into tbl2(id, name) values(1, "modified"), (4, "added"); -Query OK, 2 rows affected (0.06 sec) -{'label':'label_3e68324cfd87457d_a6166cc0a878cfdc', 'status':'VISIBLE', 'txnId':'1005'} - -mysql> select * from tbl2 order by id; -+------+----------+-------+ -| id | name | value | -+------+----------+-------+ -| 1 | modified | 10 | -| 2 | Alice | 20 | -| 3 | Jack | 30 | -| 4 | added | 0 | -+------+----------+-------+ -4 rows in set (0.04 sec) -``` - -When the auto-increment column is a non-key column and users haven't specified the value for the auto-increment column, the value will be filled from existing data rows in the table. If users specify the auto-increment column, null values in the imported data for that column will be replaced by generated values, while non-null values will remain unchanged, and then these data will be loaded with the semantics of partial updates. - -```sql -mysql> CREATE TABLE `demo`.`tbl3` ( - -> `id` BIGINT NOT NULL, - -> `name` varchar(100) NOT NULL, - -> `score` BIGINT NOT NULL, - -> `aid` BIGINT NOT NULL AUTO_INCREMENT - -> ) ENGINE=OLAP - -> UNIQUE KEY(`id`) - -> DISTRIBUTED BY HASH(`id`) BUCKETS 1 - -> PROPERTIES ( - -> "replication_allocation" = "tag.location.default: 3", - -> "enable_unique_key_merge_on_write" = "true" - -> ); -Query OK, 0 rows affected (0.16 sec) - -mysql> insert into tbl3(id, name, score) values(1, "Doris", 100), (2, "Nereids", 200), (3, "Bob", 300); -Query OK, 3 rows affected (0.28 sec) -{'label':'label_c52b2c246e244dda_9b91ee5e27a31f9b', 'status':'VISIBLE', 'txnId':'2003'} - -mysql> select * from tbl3 order by id; -+------+---------+-------+------+ -| id | name | score | aid | -+------+---------+-------+------+ -| 1 | Doris | 100 | 0 | -| 2 | Nereids | 200 | 1 | -| 3 | Bob | 300 | 2 | -+------+---------+-------+------+ -3 rows in set (0.13 sec) - -mysql> set enable_unique_key_partial_update=true; -Query OK, 0 rows affected (0.00 sec) - -mysql> set enable_insert_strict=false; -Query OK, 0 rows affected (0.00 sec) - -mysql> insert into tbl3(id, score) values(1, 999), (2, 888); -Query OK, 2 rows affected (0.07 sec) -{'label':'label_dfec927d7a4343ca_9f9ade581391de97', 'status':'VISIBLE', 'txnId':'2004'} - -mysql> select * from tbl3 order by id; -+------+---------+-------+------+ -| id | name | score | aid | -+------+---------+-------+------+ -| 1 | Doris | 999 | 0 | -| 2 | Nereids | 888 | 1 | -| 3 | Bob | 300 | 2 | -+------+---------+-------+------+ -3 rows in set (0.06 sec) - -mysql> insert into tbl3(id, aid) values(1, 1000), (3, 500); -Query OK, 2 rows affected (0.07 sec) -{'label':'label_b26012959f714f60_abe23c87a06aa0bf', 'status':'VISIBLE', 'txnId':'2005'} - -mysql> select * from tbl3 order by id; -+------+---------+-------+------+ -| id | name | score | aid | -+------+---------+-------+------+ -| 1 | Doris | 999 | 1000 | -| 2 | Nereids | 888 | 1 | -| 3 | Bob | 300 | 500 | -+------+---------+-------+------+ -3 rows in set (0.06 sec) +```SQL +CREATE TABLE `demo`.`tbl` ( + `text` varchar(65533) NOT NULL, + `id` BIGINT NOT NULL AUTO_INCREMENT, +) ENGINE=OLAP +UNIQUE KEY(`text`) +DISTRIBUTED BY HASH(`text`) BUCKETS 10 +PROPERTIES ( +"replication_allocation" = "tag.location.default: 3", +"enable_unique_key_merge_on_write" = "true" +); ``` -## Usage Scenarios +## Applicable scenarios -### Dictionary Encoding +### Dictionary encoding -Using bitmaps for audience analysis in user profile requires building a user dictionary where each user corresponds to a unique integer dictionary value. Aggregating these dictionary values can improve the performance of bitmap. +In user profiling, where bitmap is utilized for user group analysis, analysts often need to build a dictionary, in which each user corresponds to a unique integer value. Sequential dictionary values can improve bitmap performance. -Taking the offline UV and PV analysis scenario as an example, assuming there's a detailed user behavior table: +Take offline UV (Unique Visitors) and PV (Page Views) analysis as an example, assume that all the detailed data is stored in the following user behavior table: -```sql +```SQL CREATE TABLE `demo`.`dwd_dup_tbl` ( `user_id` varchar(50) NOT NULL, `dim1` varchar(50) NOT NULL, @@ -366,10 +133,9 @@ PROPERTIES ( ); ``` -Using the auto-incrementa column to create the following dictionary table: - +Create a dictionary table using AUTO_INCREMENT: -```sql +```SQL CREATE TABLE `demo`.`dictionary_tbl` ( `user_id` varchar(50) NOT NULL, `aid` BIGINT NOT NULL AUTO_INCREMENT @@ -382,27 +148,26 @@ PROPERTIES ( ); ``` -Import the value of `user_id` from existing data into the dictionary table, establishing the mapping of `user_id` to integer values: +Load the existing `user_id` into the dictionary table, and create mappings from `user_id` to integer values: -```sql +```SQL insert into dit_tbl(user_id) select user_id from dwd_dup_tbl group by user_id; ``` -Or import only the value of `user_id` in incrementa data into the dictionary table alternatively: +Or load the incremental `user_id` into the dictionary table: - -```sql +```SQL insert into dit_tbl(user_id) select dwd_dup_tbl.user_id from dwd_dup_tbl left join dictionary_tbl -on dwd_dup_tbl.user_id = dictionary_tbl.user_id where dwd_dup_tbl.visit_time > '2023-12-10' and dictionary_tbl.user_id is NULL; +on dwd_dup_tbl.user_id = dictionary_tbl.user_id where dwd_dup_tbl.visit_time '2023-12-10' and dictionary_tbl.user_id is NULL; ``` -In real-world scenarios, Flink connectors can also be employed to write data into Doris. +You can also use the Flink-Doris-Connector to load data into Doris. -Assuming `dim1`, `dim3`, `dim5` represent statistical dimensions of interest to us, create the following table to store aggregated results: +Suppose that `dim1`, `dim3`, `dim5` are the analytic dimensions to be considered, create an Aggregate Key table to accommodate the results of data aggregation. -```sql +```SQL CREATE TABLE `demo`.`dws_agg_tbl` ( `dim1` varchar(50) NOT NULL, `dim3` varchar(50) NOT NULL, @@ -417,25 +182,25 @@ PROPERTIES ( ); ``` -Store the result of the data aggregation operations into the aggregation result table: +Load the aggregated results into the table: -```sql +```SQL insert into dws_tbl select dwd_dup_tbl.dim1, dwd_dup_tbl.dim3, dwd_dup_tbl.dim5, BITMAP_UNION(TO_BITMAP(dictionary_tbl.aid)), COUNT(1) from dwd_dup_tbl INNER JOIN dictionary_tbl on dwd_dup_tbl.user_id = dictionary_tbl.user_id; ``` -Perform UV and PV queries using the following statement: +Then query PV/UV using the following statement: -```sql +```SQL select dim1, dim3, dim5, user_id_bitmap as uv, pv from dws_agg_tbl; ``` -### Efficient Pagination +### Efficient pagination -When displaying data on a page, pagination is often necessary. Traditional pagination typically involves using `limit`, `offset`, and `order by` in SQL queries. For instance, consider the following business table intended for display: +Pagination is often required in data display. Typically, pagination is implemented by SQL statements like `limit` or `offset` + `order by`. Suppose that you need to display the following table: -```sql +```SQL CREATE TABLE `demo`.`records_tbl` ( `key` int(11) NOT NULL COMMENT "", `name` varchar(26) NOT NULL COMMENT "", @@ -452,23 +217,23 @@ PROPERTIES ( ); ``` -Assuming 100 records are displayed per page in pagination. To fetch the first page's data, the following SQL query can be used: +In pagination queries, if each page displays 100 results, you can retrieve the first page using the following SQL statement: -```sql -select * from records_tbl order by `key`, `name` limit 100; +```SQL +select * from records_tbl order by key, name limit 100; ``` -Fetching the data for the second page can be accomplished by: +You can retrieve the second page using the following SQL statement: -```sql -select * from records_tbl order by `key`, `name` limit 100, offset 100; +```SQL +select * from records_tbl order by key, name limit 100, offset 100; ``` -However, when performing deep pagination queries (with large offsets), even if the actual required data rows are few, this method still reads all data into memory for full sorting before subsequent processing, which is quite inefficient. Using an auto-incrementa column assigns a unique value to each row, allowing the use of where `unique_value` > x limit y to filter a significant amount of data beforehand, making pagination more efficient. +However, when performing deep pagination queries (with a large offset), even if only a few rows of data are actually requested, the previous method still reads all the data into memory, performs a full sorting, and then proceeds with further processing. For higher execution efficiency, you can assign a unique value to each row of data using an auto-increment column. This method records the maximum value of the `unique_value` column from the previous page and uses the `where unique_value > max_value limit rows_per_page` clause to push down the predicate and filter out a significant amount of data in advance, thereby achieving more efficient pagination. -Continuing with the aforementioned business table, an auto-increment column is added to the table to give each row a unique identifier: +In the above example table, you can add an auto-increment column to the table to assign a unique identifier to each row: -```sql +```SQL CREATE TABLE `demo`.`records_tbl2` ( `key` int(11) NOT NULL COMMENT "", `name` varchar(26) NOT NULL COMMENT "", @@ -486,23 +251,25 @@ PROPERTIES ( ); ``` -For pagination displaying 100 records per page, to fetch the first page's data, the following SQL query can be used: +In pagination queries, suppose that each page displays 100 results, you can retrieve the first page by: -```sql +```SQL select * from records_tbl2 order by unique_value limit 100; ``` -By recording the maximum value of unique_value in the returned results, let's assume it's 99. The following query can then fetch data for the second page: +Use programs to record the maximum `unique_value` in the returned result. Suppose that the maximum is 99, you can query data from the second page by: -```sql -select * from records_tbl2 where unique_value > 99 order by unique_value limit 100; +```SQL +select * from records_tbl2 where unique_value 99 order by unique_value limit 100; ``` -If directly querying contents from a later page and it's inconvenient to directly obtain the maximum value of `unique_value` from the preceding page's data (for instance, directly obtaining contents from the 101st page), the following query can be used: +If you need to query data from a deeper page, for example, page 101, which means it's hard to get the maximum `unique_value` from the previous page directly, then you can use the statement as follows: -```sql +```SQL select key, name, address, city, nation, region, phone, mktsegment -from records_tbl2, (select unique_value as max_value from records_tbl2 order by unique_value limit 1 offset 9999) as previous_data -where records_tbl2.unique_value > previous_data.max_value -order by records_tbl2.unique_value limit 100; +from records_tbl2, (select uniuqe_value as max_value from records_tbl2 order by uniuqe_value limit 1 offset 9999) as previous_data +where records_tbl2.uniuqe_value previous_data.max_value +order by unique_value limit 100; ``` + +