Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hive] Introduce metastore.tag-to-partition for Hive metastore #2134

Merged
merged 5 commits into from
Oct 25, 2023

Conversation

JingsongLi
Copy link
Contributor

@JingsongLi JingsongLi commented Oct 16, 2023

Purpose

Linked issue: close #2125

The Tag will maintain the manifests and data files of the snapshot.
A typical usage is creating tags daily, then you can maintain the historical data of each day for batch reading.

When using primary key tables, a non-partitioned approach is often used to maintain updates, in order to mirror and
synchronize tables from upstream database tables. This allows users to query the latest data. The tradition of Hive
data warehouses is not like this. Offline data warehouses require an immutable view every day to ensure the idempotence
of calculations. So we created a Tag mechanism to output these views.

However, the traditional use of Hive data warehouses is more accustomed to using partitions to specify the query's Tag,
and is more accustomed to using Hive computing engines.

So, we introduce 'metastore.tag-to-partition' to mapping a non-partitioned primary key table to the partition table
in Hive metastore, and mapping the partition field to the name of the Tag to be fully compatible with Hive.

Example

Step 1: Create table and tag in Flink SQL

CREATE CATALOG my_hive WITH (
    'type' = 'paimon',
    'metastore' = 'hive',
    'uri' = 'thrift://<hive-metastore-host-name>:<port>',
    -- 'hive-conf-dir' = '...', this is recommended in the kerberos environment
    -- 'hadoop-conf-dir' = '...', this is recommended in the kerberos environment
    'warehouse' = 'hdfs:///path/to/warehouse'
);

USE CATALOG my_hive;

CREATE TABLE mydb.T (
    pk INT,
    col1 STRING,
    col2 STRING
) WITH (
    'bucket' = '-1',
    'metastore.tag-to-partition' = 'dt'
);

INSERT INTO t VALUES (1, '10', '100'), (2, '20', '200');

-- create tag '2023-10-16' for snapshot 1
CALL my_hive.system.create_tag('mydb.T', '2023-10-16', 1);

Step 2: Query table in Hive with Partition Pruning

SHOW PARTITIONS T;
/*
OK
dt=2023-10-16
*/

SELECT * FROM T WHERE dt='2023-10-16';
/*
OK
1 10 100 2023-10-16
2 20 200 2023-10-16
*/

@JingsongLi
Copy link
Contributor Author

JingsongLi commented Oct 16, 2023

Current question: How to query the data for today? The tag creation has not been completed yet.

The above example can only query tags that have already been created, but Paimon is a real-time data lake, and you also
need to query the latest data. Therefore, Paimon provides a preview feature:

Step 1: Create table and tag in Flink SQL

CREATE CATALOG my_hive WITH (
    'type' = 'paimon',
    'metastore' = 'hive',
    'uri' = 'thrift://<hive-metastore-host-name>:<port>',
    -- 'hive-conf-dir' = '...', this is recommended in the kerberos environment
    -- 'hadoop-conf-dir' = '...', this is recommended in the kerberos environment
    'warehouse' = 'hdfs:///path/to/warehouse'
);

USE CATALOG my_hive;

CREATE TABLE mydb.T (
    pk INT,
    col1 STRING,
    col2 STRING
) WITH (
    'bucket' = '-1',
    'metastore.tag-to-partition' = 'dt',
    -- preview tag creation mode process-time
    -- paimon will create partitions early based on process-time
    'metastore.tag-to-partition-preview' = 'process-time'
);

INSERT INTO t VALUES (1, '10', '100'), (2, '20', '200');

-- create tag '2023-10-16' for snapshot 1
CALL sys.create_tag('mydb.T', '2023-10-16', 1);

-- new data in '2023-10-17'
INSERT INTO t VALUES (3, '30', '300'), (4, '40', '400');

-- haven't finished writing the data for '2023-10-17' yet, so there's no need to create a tag for now
-- but the data is already visible for Hive

Step 2: Query table in Hive with Partition Pruning

SHOW PARTITIONS T;
/*
OK
dt=2023-10-16
dt=2023-10-17
*/

SELECT * FROM T WHERE dt='2023-10-17';
-- preview tag '2023-10-17'
/*
OK
1 10 100 2023-10-17
2 20 200 2023-10-17
3 30 300 2023-10-17
4 40 400 2023-10-17
*/

@leaves12138
Copy link
Contributor

+1

@JingsongLi JingsongLi merged commit 28ebec5 into apache:master Oct 25, 2023
@JingsongLi JingsongLi deleted the tagToPartitionField branch January 3, 2024 06:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Introduce metastore.tag-to-partition-field for Hive metastore
2 participants