[core]: support rename primary key columns and bucket key columns #3809

zhongyujiang · 2024-07-24T15:10:45Z

Purpose

Support renaming primary keys and bucket keys that are not partition keys.

Tests

Added schema evolution tests and Spark e2e tests

API and Format

Not affected

Documentation

zhongyujiang · 2024-07-24T15:12:33Z

cc @JingsongLi @FangYongs Could you please review this? Thanks

JingsongLi · 2024-07-28T03:19:26Z

CC @tsreaper

zhongyujiang · 2024-07-30T11:12:02Z

Gentle ping @JingsongLi @tsreaper, can you help review this when you have time? This can be useful to us, thanks!

JingsongLi

Hi @zhongyujiang , can you rebase master?

Can this support when table is empty?

zhongyujiang · 2024-07-31T14:18:13Z

Hi @zhongyujiang , can you rebase master?

Rebased.

Can this support when table is empty?

I guess this is a typo and you meant 'non-empty'?
It appears that the bucket key and primary key are also projected based on the column name mapping, so renaming them does not affect reading old data. To cover this, I have added an end-to-end test, testRenamePrimaryKey.

zhongyujiang · 2024-08-04T14:39:52Z

Can this support when table is empty?

I guess this is a typo and you meant 'non-empty'?
It appears that the bucket key and primary key are also projected based on the column name mapping, so renaming them does not affect reading old data. To cover this, I have added an end-to-end test, testRenamePrimaryKey.

org.apache.paimon.schema.SchemaEvolutionUtil#createDataProjection

I have investigated further to confirm this.
Before reading the data files, the reader maps the primary key (and other value columns) to the actual columns in the data files. Therefore, I believe that renaming the primary key should not affect reading historical data, as long as the partition columns are not renamed, because we rely on the partition columns when calculating the paths.
The relevant code can be found in org.apache.paimon.schema.SchemaEvolutionUtil#createDataProjection.
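
To make the argument above concrete, here is a minimal sketch of how such a projection can survive a rename. It assumes that each column carries a stable ID that does not change when the column is renamed; that ID-based matching, the `Field` class, and the `project` method are all hypothetical illustrations, not Paimon's actual API or the body of `createDataProjection`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RenameProjectionSketch {

    /** Hypothetical stand-in for a schema field: a stable ID plus a name that may change. */
    static final class Field {
        final int id;
        final String name;

        Field(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /**
     * For each field of the current table schema, find the position of the field
     * with the same ID in the schema the data file was written with (-1 if absent).
     * Names are never compared, so a rename does not change the result.
     */
    static int[] project(List<Field> tableFields, List<Field> dataFields) {
        Map<Integer, Integer> positionById = new HashMap<>();
        for (int i = 0; i < dataFields.size(); i++) {
            positionById.put(dataFields.get(i).id, i);
        }
        int[] projection = new int[tableFields.size()];
        for (int i = 0; i < tableFields.size(); i++) {
            Integer pos = positionById.get(tableFields.get(i).id);
            projection[i] = pos == null ? -1 : pos;
        }
        return projection;
    }

    public static void main(String[] args) {
        // Schema the old data file was written with: primary key "a", value "b".
        List<Field> oldSchema = Arrays.asList(new Field(0, "a"), new Field(1, "b"));
        // Current schema after renaming "a" to "a_": the IDs are unchanged.
        List<Field> newSchema = Arrays.asList(new Field(0, "a_"), new Field(1, "b"));
        // Prints [0, 1]: the renamed key still resolves to the old file's first column.
        System.out.println(Arrays.toString(project(newSchema, oldSchema)));
    }
}
```

Under this assumption, a renamed primary key needs no special handling on the read path; partition columns differ only because their values also appear in file paths.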

zhongyujiang · 2024-08-04T14:42:20Z

...rk/paimon-spark-common/src/test/java/org/apache/paimon/spark/SparkSchemaEvolutionITCase.java

+
+        assertThat(actual).containsExactlyInAnyOrder("[1,aaa]", "[2,bbb]");
+
+        spark.sql("INSERT INTO test_rename_primary_key_table VALUES(1, 'AAA'), (2, 'BBB')");


This is the actual column projection used when reading data files, as observed while debugging.

For data written before the rename, the projection columns are _KEY_a and a.

For data written after the rename, the projection columns are _KEY_a_ and a_.
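
The naming pattern in the two observations above can be sketched as follows. This assumes that the key copy of a primary-key column is stored under the column name prefixed with `_KEY_`, as the observed projections suggest; the class and method names are hypothetical and do not come from the Paimon code base.

```java
import java.util.Arrays;
import java.util.List;

public class KeyColumnNamesSketch {

    // Prefix seen in the projections above; treated here as an assumption.
    static final String KEY_PREFIX = "_KEY_";

    /** Returns the key column and the value column projected for one primary-key column. */
    static List<String> projectedColumns(String primaryKeyName) {
        return Arrays.asList(KEY_PREFIX + primaryKeyName, primaryKeyName);
    }

    public static void main(String[] args) {
        // File written before the rename: resolved with the schema that used "a".
        System.out.println(projectedColumns("a"));   // prints [_KEY_a, a]
        // File written after the rename: resolved with the schema that uses "a_".
        System.out.println(projectedColumns("a_"));  // prints [_KEY_a_, a_]
    }
}
```

Each file is read with the column names from the schema it was written with, which is why both the old and the new names show up in the same query.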

zhongyujiang · 2024-08-06T04:31:50Z

The CI failure seems to be caused by flaky tests and is unrelated to this change. I've filed an issue for that: #3908

JingsongLi · 2024-08-06T09:45:46Z

Hi @zhongyujiang , can you rebase master?

Rebased.

Can this support when table is empty?

I guess this is a typo and you meant 'non-empty'? It appears that the bucket key and primary key are also projected based on the column name mapping, so renaming them does not affect reading old data. To cover this, I have added an end-to-end test, testRenamePrimaryKey.

the bucket key and primary key are also projected based on the column name mapping.

It is true, but this is very dangerous and may lead to many bugs.

zhongyujiang · 2024-08-06T13:38:02Z

It is true, but this is very dangerous and may lead to many bugs.

Hi @JingsongLi, could you share your specific concerns? From what I've observed, renaming a primary key doesn't seem to differ from renaming other keys; it appears to be safe.

I believe renaming the primary key is a necessary part of full schema evolution, and we have such a requirement ourselves. So if you have any concerns, I would be more than happy to investigate further and try to address them. Thank you!

JingsongLi · 2024-10-30T06:53:52Z

Reopened to trigger tests.

JingsongLi · 2024-10-30T10:57:51Z

+1

JingsongLi reviewed Jul 30, 2024

zhongyujiang force-pushed the support-pk-and-bucket-key-rename branch 2 times, most recently from e5c7d3d to e1075ab Compare July 31, 2024 14:11

zhongyujiang commented Aug 4, 2024

zhongyujiang force-pushed the support-pk-and-bucket-key-rename branch from e1075ab to 8669d0c Compare August 5, 2024 13:24

JingsongLi closed this Oct 30, 2024

JingsongLi reopened this Oct 30, 2024

zhongyujiang and others added 6 commits October 30, 2024 16:47

Core: Support rename primary key columns and bucket key columns.

2f2ca6f

Add comments.

15a680f

Fix tests.

b2f041b

Fix tests.

4e6a790

Empty commit to trigger CI.

ffc6b0a

fix compile

9ca3ff8

JingsongLi force-pushed the support-pk-and-bucket-key-rename branch from 71d7d27 to 9ca3ff8 Compare October 30, 2024 08:52

JingsongLi merged commit 97b9b33 into apache:master Oct 30, 2024
12 checks passed

zhongyujiang deleted the support-pk-and-bucket-key-rename branch November 6, 2024 09:23

hang8929201 pushed a commit to hang8929201/paimon that referenced this pull request Nov 7, 2024

[core] support rename primary key columns and bucket key columns (apa…

6ade4de

…che#3809)
