[local-tvf] add shared_storage parameter doc (apache#494)
morningman authored Mar 29, 2024
1 parent ad7b273 commit 4e4fecb
Showing 4 changed files with 262 additions and 106 deletions.
87 changes: 60 additions & 27 deletions docs/sql-manual/sql-functions/table-functions/local.md
@@ -28,12 +28,8 @@ under the License.

### Name

<version since="dev">

local

</version>

### Description

The local table-valued function (tvf) allows users to read and access the contents of local files on a BE node as if they were a relational table. It currently supports the `csv/csv_with_names/csv_with_names_and_types/json/parquet/orc` file formats.
@@ -54,38 +50,50 @@ local(

**parameter description**

Parameters for accessing local files on a BE node:
- Parameters for accessing local files on a BE node:

- `file_path`:

(required) The path of the file to read, relative to the `user_files_secure_path` directory. The `user_files_secure_path` parameter [can be configured on the BE](../../../admin-manual/config/be-config.md).

The path cannot contain `..`. Glob syntax is supported for matching multiple files, such as `log/*.log`

- Parameters related to the execution method:

- `file_path`:
In versions prior to 2.1.1, Doris only supports specifying a single BE node and reading local data files on that node.

(required) The path of the file to read, relative to the `user_files_secure_path` directory. The `user_files_secure_path` parameter [can be configured on the BE](../../../admin-manual/config/be-config.md).
- `backend_id`:

The path cannot contain `..`. Glob syntax is supported for matching multiple files, such as `log/*.log`
The ID of the BE node where the file is located. `backend_id` can be obtained with the `show backends` command.

- `backend_id`:
Starting from version 2.1.2, Doris adds a new parameter, `shared_storage`.

(required) The ID of the BE node where the file resides. `backend_id` can be obtained with the `show backends` command.
- `shared_storage`

File format parameters:
Default is false. If true, the specified file resides on shared storage (such as NAS). The shared storage must be compatible with the POSIX file interface and mounted on all BE nodes at the same time.

- `format`: (required) Currently supports `csv/csv_with_names/csv_with_names_and_types/json/parquet/orc`
- `column_separator`: (optional) default `,`.
- `line_delimiter`: (optional) default `\n`.
- `compress_type`: (optional) Currently supports `UNKNOWN/PLAIN/GZ/LZO/BZ2/LZ4FRAME/DEFLATE`. The default is `UNKNOWN`, in which case the type is inferred automatically from the suffix of `uri`.
When `shared_storage` is true, `backend_id` does not need to be set, and Doris may use all BE nodes for data access. If `backend_id` is set, the query still executes only on the specified BE node.

The following 6 parameters are used for loading in json format. For specific usage methods, please refer to: [Json Load](../../../data-operate/import/import-way/load-json-format.md)
- File format parameters:

- `read_json_by_line`: (optional) default `"true"`
- `strip_outer_array`: (optional) default `"false"`
- `json_root`: (optional) default `""`
- `json_paths`: (optional) default `""`
- `num_as_string`: (optional) default `false`
- `fuzzy_parse`: (optional) default `false`
- `format`: (required) Currently supports `csv/csv_with_names/csv_with_names_and_types/json/parquet/orc`
- `column_separator`: (optional) default `,`.
- `line_delimiter`: (optional) default `\n`.
- `compress_type`: (optional) Currently supports `UNKNOWN/PLAIN/GZ/LZO/BZ2/LZ4FRAME/DEFLATE`. The default is `UNKNOWN`, in which case the type is inferred automatically from the suffix of `uri`.

<version since="dev">The following 2 parameters are used for loading in csv format</version>
- The following parameters are used for loading in json format. For specific usage methods, please refer to: [Json Load](../../../data-operate/import/import-way/load-json-format.md)

- `trim_double_quotes`: Boolean type (optional), the default value is `false`. When `true`, the outermost double quotes of each field in the csv file are trimmed.
- `skip_lines`: Integer type (optional), the default value is 0. Skips the given number of lines at the head of the csv file. It has no effect when the format is `csv_with_names` or `csv_with_names_and_types`.
- `read_json_by_line`: (optional) default `"true"`
- `strip_outer_array`: (optional) default `"false"`
- `json_root`: (optional) default `""`
- `json_paths`: (optional) default `""`
- `num_as_string`: (optional) default `false`
- `fuzzy_parse`: (optional) default `false`

- The following parameters are used for loading in csv format

- `trim_double_quotes`: Boolean type (optional), the default value is `false`. When `true`, the outermost double quotes of each field in the csv file are trimmed.
- `skip_lines`: Integer type (optional), the default value is 0. Skips the given number of lines at the head of the csv file. It has no effect when the format is `csv_with_names` or `csv_with_names_and_types`.
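
As an illustration of the csv parameters above, the following sketch reads a csv file that has a single header line and double-quoted fields. The file path `data/quoted.csv` and the backend id `10003` are hypothetical and only stand in for values from a real cluster. The parameter names come from the list above.

```sql
-- Hypothetical file and backend id; substitute values from your own cluster.
select * from local(
    "file_path" = "data/quoted.csv",
    "backend_id" = "10003",
    "format" = "csv",
    "column_separator" = ",",
    "skip_lines" = "1",            -- skip the single header line
    "trim_double_quotes" = "true"  -- trim the outermost quotes of each field
);
```

Because `skip_lines` has no effect for `csv_with_names` and `csv_with_names_and_types`, this sketch uses the plain `csv` format.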

### Examples

@@ -125,6 +133,25 @@ mysql> select * from local(
+------+---------+--------+
```

Query files on NAS:

```sql
mysql> select * from local(
"file_path" = "/mnt/doris/prefix_*.txt",
"format" = "csv",
"column_separator" =",",
"shared_storage" = "true");
+------+------+------+
| c1 | c2 | c3 |
+------+------+------+
| 1 | 2 | 3 |
| 1 | 2 | 3 |
| 1 | 2 | 3 |
| 1 | 2 | 3 |
| 1 | 2 | 3 |
+------+------+------+
```

It can be used with `desc function`:

```sql
@@ -143,8 +170,14 @@ mysql> desc function local(

### Keywords

local, table-valued-function, tvf
local, table-valued-function, tvf

### Best Practice

For more detailed usage of the local tvf, refer to the [S3](./s3.md) tvf. The only difference between them is how the storage system is accessed.
- For more detailed usage of the local tvf, refer to the [S3](./s3.md) tvf. The only difference between them is how the storage system is accessed.

- Access data on NAS through the local tvf

NAS shared storage can be mounted on multiple nodes at the same time. Each node can access files on the shared storage just like local files. Therefore, the NAS can be treated as a local file system and accessed through the local tvf.

When `"shared_storage" = "true"` is set, Doris assumes that the specified files can be accessed from any BE node. When a set of files is specified using wildcards, Doris distributes the file access requests across multiple BE nodes, so that multiple nodes perform a distributed file scan and improve query performance.
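
The two execution modes described in this document can be compared with a sketch like the one below. It reuses the NAS path from the example above; the backend id `10003` is hypothetical and would normally be taken from the `BackendId` column of `show backends`.

```sql
-- List the BE nodes; the BackendId column supplies the value for "backend_id".
show backends;

-- Distributed scan: "backend_id" is omitted, so Doris may use all BE nodes.
select count(*) from local(
    "file_path" = "/mnt/doris/prefix_*.txt",
    "format" = "csv",
    "shared_storage" = "true");

-- Pinned scan: "backend_id" is set, so only the specified BE node executes.
select count(*) from local(
    "file_path" = "/mnt/doris/prefix_*.txt",
    "format" = "csv",
    "shared_storage" = "true",
    "backend_id" = "10003");  -- hypothetical id
```

Both queries should return the same count if the share is mounted identically on every BE node, which makes the pair a simple consistency check before relying on distributed scans.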
@@ -28,19 +28,16 @@ under the License.

### Name

<version since="dev">

local

</version>

### Description

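The local table function (table-valued-function, tvf) lets users read and access file contents on a BE as if they were accessing relational table data. It currently supports the `csv/csv_with_names/csv_with_names_and_types/json/parquet/orc` file formats.

This function requires ADMIN privileges.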

#### syntax

```sql
local(
"file_path" = "path/to/file.txt",
@@ -53,36 +50,50 @@ local(

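**Parameter description**

Parameters related to accessing local files:
- `file_path`
- Parameters related to accessing local files:

(required) The path of the file to read, relative to the `user_files_secure_path` directory, where `user_files_secure_path` is [a BE configuration item](../../../admin-manual/config/be-config.md)
- `file_path`

(required) The path of the file to read, relative to the `user_files_secure_path` directory, where `user_files_secure_path` is [a BE configuration item](../../../admin-manual/config/be-config.md)

The path cannot contain `..`. Glob syntax can be used for pattern matching, for example `logs/*.log`

The path cannot contain `..`. Glob syntax can be used for pattern matching, for example `logs/*.log`
- Parameters related to the execution method:

- `backend_id`:
In versions prior to 2.1.1, Doris only supports specifying a single BE node and reading local data files on that node.

- `backend_id`:

The ID of the BE where the file is located. `backend_id` can be obtained with the `show backends` command.

Starting from version 2.1.2, Doris adds a new parameter, `shared_storage`.

- `shared_storage`

Defaults to false. If true, the specified file resides on shared storage (such as NAS). The shared storage must be compatible with the POSIX file interface and be mounted on all BE nodes at the same time.

When `shared_storage` is true, `backend_id` does not need to be set, and Doris may use all BE nodes for data access. If `backend_id` is set, execution still takes place only on the specified BE node.

(required) The ID of the BE where the file is located. `backend_id` can be obtained with the `show backends` command.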
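- File format parameters:

File format parameters
- `format`: (required) Currently supports `csv/csv_with_names/csv_with_names_and_types/json/parquet/orc`
- `column_separator`: (optional) Column separator, defaults to `,`
- `line_delimiter`: (optional) Line delimiter, defaults to `\n`
- `compress_type`: (optional) Currently supports `UNKNOWN/PLAIN/GZ/LZO/BZ2/LZ4FRAME/DEFLATE`. The default is `UNKNOWN`, in which case the type is inferred automatically from the suffix of `uri`.
- `format`: (required) Currently supports `csv/csv_with_names/csv_with_names_and_types/json/parquet/orc`
- `column_separator`: (optional) Column separator, defaults to `,`
- `line_delimiter`: (optional) Line delimiter, defaults to `\n`
- `compress_type`: (optional) Currently supports `UNKNOWN/PLAIN/GZ/LZO/BZ2/LZ4FRAME/DEFLATE`. The default is `UNKNOWN`, in which case the type is inferred automatically from the suffix of `uri`.

The following 6 parameters are used for loading data in json format. For specific usage, refer to: [Json Load](../../../data-operate/import/import-way/load-json-format.md)
- The following parameters apply to loading data in json format. For specific usage, refer to: [Json Load](../../../data-operate/import/import-way/load-json-format.md)

- `read_json_by_line`: (optional) defaults to `"true"`
- `strip_outer_array`: (optional) defaults to `"false"`
- `json_root`: (optional) empty by default
- `json_paths`: (optional) empty by default
- `num_as_string`: (optional) defaults to `false`
- `fuzzy_parse`: (optional) defaults to `false`
- `read_json_by_line`: (optional) defaults to `"true"`
- `strip_outer_array`: (optional) defaults to `"false"`
- `json_root`: (optional) empty by default
- `json_paths`: (optional) empty by default
- `num_as_string`: (optional) defaults to `false`
- `fuzzy_parse`: (optional) defaults to `false`

<version since="dev">The following 2 parameters are used for loading data in csv format</version>
- The following parameters apply to loading data in csv format:

- `trim_double_quotes`: Boolean type, optional, default `false`. When `true`, the outermost double quotes of each field in the csv file are trimmed
- `skip_lines`: Integer type, optional, default 0. Skips the first few lines of the csv file. This parameter has no effect when format is set to `csv_with_names` or `csv_with_names_and_types`
- `trim_double_quotes`: Boolean type, optional, default `false`. When `true`, the outermost double quotes of each field in the csv file are trimmed
- `skip_lines`: Integer type, optional, default 0. Skips the first few lines of the csv file. This parameter has no effect when format is set to `csv_with_names` or `csv_with_names_and_types`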

### Examples

@@ -122,6 +133,25 @@ mysql> select * from local(
+------+---------+--------+
```

Access shared data on NAS:

```sql
mysql> select * from local(
"file_path" = "/mnt/doris/prefix_*.txt",
"format" = "csv",
"column_separator" =",",
"shared_storage" = "true");
+------+------+------+
| c1 | c2 | c3 |
+------+------+------+
| 1 | 2 | 3 |
| 1 | 2 | 3 |
| 1 | 2 | 3 |
| 1 | 2 | 3 |
| 1 | 2 | 3 |
+------+------+------+
```

It can be used together with `desc function`:

```sql
@@ -144,4 +174,19 @@ mysql> desc function local(

### Best Practice

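For more detailed usage of the local tvf, refer to the [S3](./s3.md) tvf. The only difference is how the storage system is accessed.
- For more detailed usage of the local tvf, refer to the [S3](./s3.md) tvf. The only difference is how the storage system is accessed.

- Access data on NAS through the local tvf

NAS shared storage can be mounted on multiple nodes at the same time. Each node can access files on the shared storage just like local files. Therefore, the NAS can be treated as a local file system and accessed through the local tvf.

When `"shared_storage" = "true"` is set, Doris assumes that the specified files can be accessed from any BE node. When a set of files is specified with wildcards, Doris distributes the file access requests across multiple BE nodes, so that multiple nodes perform a distributed file scan and improve query performance.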
