From d2a6f4c925435fa8ac477752b712290a76c4e1c6 Mon Sep 17 00:00:00 2001
From: Jingsong
Date: Thu, 7 Nov 2024 19:32:08 +0800
Subject: [PATCH] [doc] Add catalog and table types in concept

---
 docs/content/concepts/catalog.md     |  90 ++++++++++++++++
 docs/content/concepts/spec/_index.md |   2 +-
 docs/content/concepts/table-types.md | 149 +++++++++++++++++++++++++++
 docs/content/flink/_index.md         |   2 +-
 docs/content/spark/_index.md         |   2 +-
 5 files changed, 242 insertions(+), 3 deletions(-)
 create mode 100644 docs/content/concepts/catalog.md
 create mode 100644 docs/content/concepts/table-types.md

diff --git a/docs/content/concepts/catalog.md b/docs/content/concepts/catalog.md
new file mode 100644
index 000000000000..9775113a6ef1
--- /dev/null
+++ b/docs/content/concepts/catalog.md
@@ -0,0 +1,90 @@
+---
+title: "Catalog"
+weight: 4
+type: docs
+aliases:
+- /concepts/catalog.html
+---
+
+# Catalog
+
+Paimon provides a Catalog abstraction to manage the table of contents and metadata. The Catalog abstraction gives
+computing engines a consistent way to discover and manage Paimon tables. We recommend that you always access Paimon
+tables through a Catalog.
+
+## Catalogs
+
+Paimon catalogs currently support three types of metastores:
+
+* `filesystem` metastore (default), which stores both metadata and table files in filesystems.
+* `hive` metastore, which additionally stores metadata in the Hive metastore. Users can access the tables directly from Hive.
+* `jdbc` metastore, which additionally stores metadata in relational databases such as MySQL, Postgres, etc.
+
+## Filesystem Catalog
+
+Metadata and table files are stored under `hdfs:///path/to/warehouse`.
+
+```sql
+-- Flink SQL
+CREATE CATALOG my_catalog WITH (
+    'type' = 'paimon',
+    'warehouse' = 'hdfs:///path/to/warehouse'
+);
+```
+
+## Hive Catalog
+
+By using the Paimon Hive catalog, changes to the catalog will directly affect the corresponding Hive metastore.
+Tables created in such a catalog can also be accessed directly from Hive. Metadata and table files are stored under
+`hdfs:///path/to/warehouse`. In addition, the schema is also stored in the Hive metastore.
+
+```sql
+-- Flink SQL
+CREATE CATALOG my_hive WITH (
+    'type' = 'paimon',
+    'metastore' = 'hive',
+    -- 'warehouse' = 'hdfs:///path/to/warehouse', default use 'hive.metastore.warehouse.dir' in HiveConf
+);
+```
+
+By default, Paimon does not synchronize newly created partitions into the Hive metastore, so users will see an
+unpartitioned table in Hive; partition pruning is carried out by filter push-down instead.
+
+If you want to see a partitioned table in Hive and also synchronize newly created partitions into the Hive metastore,
+please set the table option `metastore.partitioned-table` to `true`.
+
+## JDBC Catalog
+
+By using the Paimon JDBC catalog, changes to the catalog will be directly stored in relational databases such as
+SQLite, MySQL, Postgres, etc.
+
+```sql
+-- Flink SQL
+CREATE CATALOG my_jdbc WITH (
+    'type' = 'paimon',
+    'metastore' = 'jdbc',
+    'uri' = 'jdbc:mysql://<host>:<port>/<database_name>',
+    'jdbc.user' = '...',
+    'jdbc.password' = '...',
+    'catalog-key' = 'jdbc',
+    'warehouse' = 'hdfs:///path/to/warehouse'
+);
+```
diff --git a/docs/content/concepts/spec/_index.md b/docs/content/concepts/spec/_index.md
index 3bd8e657ffbc..166ce4eeaa54 100644
--- a/docs/content/concepts/spec/_index.md
+++ b/docs/content/concepts/spec/_index.md
@@ -1,7 +1,7 @@
 ---
 title: Specification
 bookCollapseSection: true
-weight: 4
+weight: 6
 ---
diff --git a/docs/content/concepts/table-types.md b/docs/content/concepts/table-types.md
new file mode 100644
--- /dev/null
+++ b/docs/content/concepts/table-types.md
@@ -0,0 +1,149 @@
+---
+title: "Table Types"
+weight: 5
+type: docs
+aliases:
+- /concepts/table-types.html
+---
+
+# Table Types
+
+Paimon supports the following table types:
+
+1. table with pk: Paimon data table with primary key
+2. table w/o pk: Paimon data table without primary key
+3. view: requires a metastore; views in SQL are a kind of virtual table
+4. format-table: a file format table refers to a directory that contains multiple files of the same format;
+   operations on this table read or write those files, and it is compatible with Hive tables
+5. materialized-table: aims to simplify both batch and streaming data pipelines, providing a consistent development
+   experience, see [Flink Materialized Table](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/materialized-table/overview/)
+
+## Table with PK
+
+See [Paimon with Primary key]({{< ref "primary-key-table/overview" >}}).
+
+Primary keys consist of a set of columns that contain unique values for each record. Paimon enforces data ordering by
+sorting on the primary key within each bucket, enabling streaming updates and streaming changelog reads.
+
+The definition of a primary key is similar to that of standard SQL: it ensures that batch queries see at most one row
+for each primary key.
+
+## Table w/o PK
+
+See [Paimon w/o Primary key]({{< ref "append-table/overview" >}}).
+
+If a table does not have a primary key defined, it is an append table. Compared to a primary key table, it cannot
+directly receive changelogs and cannot be updated through streaming upsert; it can only receive appended data.
+
+However, it still supports batch SQL: DELETE, UPDATE, and MERGE INTO.
+
+## View
+
+Views are supported when the metastore supports them, for example, the Hive metastore.
+
+A view currently stores the original SQL text. If you need to use a view across engines, write SQL statements that
+are valid in all of them. For example:
+
+```sql
+CREATE VIEW my_view AS SELECT a + 1, b FROM my_db.my_source;
+```
+
+## Format Table
+
+Format tables are supported when the metastore supports them, for example, the Hive metastore. Hive tables inside the
+metastore are mapped to Paimon format tables so that computing engines (Spark, Hive, Flink) can read and write them.
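+
+As an illustrative sketch (the catalog name here is hypothetical), this mapping is controlled when creating the
+catalog, since `'format-table.enabled'` is a catalog option:
+
+```sql
+-- Flink SQL: format tables are on by default; setting 'format-table.enabled' = 'false'
+-- would stop mapping plain Hive tables into the Paimon catalog
+CREATE CATALOG my_hive_with_formats WITH (
+    'type' = 'paimon',
+    'metastore' = 'hive',
+    'format-table.enabled' = 'true',
+    'warehouse' = 'hdfs:///path/to/warehouse'
+);
+```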
+
+A format table refers to a directory that contains multiple files of the same format, where operations on this table
+read or write those files, facilitating the retrieval of existing data and the addition of new files.
+
+A partitioned file format table works just like a standard Hive table: partitions are discovered and inferred from
+the directory structure.
+
+Format tables are enabled by default; you can disable them with the catalog option `'format-table.enabled'`.
+
+Currently, only the `CSV`, `Parquet`, and `ORC` formats are supported.
+
+### CSV
+
+{{< tabs "format-table-csv" >}}
+{{< tab "Flink SQL" >}}
+
+```sql
+CREATE TABLE my_csv_table (
+    a INT,
+    b STRING
+) WITH (
+    'type' = 'format-table',
+    'file.format' = 'csv',
+    'field-delimiter' = ','
+)
+```
+{{< /tab >}}
+
+{{< tab "Spark SQL" >}}
+
+```sql
+CREATE TABLE my_csv_table (
+    a INT,
+    b STRING
+) USING csv OPTIONS ('field-delimiter' ',')
+```
+
+{{< /tab >}}
+{{< /tabs >}}
+
+Currently, only the `'field-delimiter'` option is supported.
+
+### Parquet & ORC
+
+{{< tabs "format-table-parquet" >}}
+{{< tab "Flink SQL" >}}
+
+```sql
+CREATE TABLE my_parquet_table (
+    a INT,
+    b STRING
+) WITH (
+    'type' = 'format-table',
+    'file.format' = 'parquet'
+)
+```
+{{< /tab >}}
+
+{{< tab "Spark SQL" >}}
+
+```sql
+CREATE TABLE my_parquet_table (
+    a INT,
+    b STRING
+) USING parquet
+```
+
+{{< /tab >}}
+{{< /tabs >}}
+
+## Materialized Table
+
+Materialized tables aim to simplify both batch and streaming data pipelines, providing a consistent development
+experience; see [Flink Materialized Table](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/materialized-table/overview/).
+
+Currently, only Flink SQL integrates with materialized tables; we plan to support them in Spark SQL as well.
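+
+As a sketch of what this looks like (following the Flink materialized table syntax linked above; the table, source,
+and freshness interval here are hypothetical):
+
+```sql
+-- Flink SQL: a continuously maintained aggregate with a target freshness of 30 seconds
+CREATE MATERIALIZED TABLE my_agg
+FRESHNESS = INTERVAL '30' SECOND
+AS SELECT ds, COUNT(*) AS cnt
+FROM my_db.my_source
+GROUP BY ds;
+```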
diff --git a/docs/content/flink/_index.md b/docs/content/flink/_index.md
index c39ff01d8760..6ec757fa520f 100644
--- a/docs/content/flink/_index.md
+++ b/docs/content/flink/_index.md
@@ -3,7 +3,7 @@ title: Engine Flink
 icon:
 bold: true
 bookCollapseSection: true
-weight: 4
+weight: 5
 ---