Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core]: support rename primary key columns and bucket key columns #3809

Merged

Conversation

zhongyujiang
Copy link
Contributor

Purpose

Support renaming primary keys and bucket keys that are not partition keys.

Tests

Added schema evolution tests and Spark e2e tests

API and Format

Not affected

Documentation

@zhongyujiang
Copy link
Contributor Author

cc @JingsongLi @FangYongs Could you please review this? Thanks

@JingsongLi
Copy link
Contributor

CC @tsreaper

@zhongyujiang
Copy link
Contributor Author

Gentle ping @JingsongLi @tsreaper, can you help review this when you have time? This can be useful to us, thanks!

Copy link
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @zhongyujiang , can you rebase master?

Can this support when table is empty?

@zhongyujiang zhongyujiang force-pushed the support-pk-and-bucket-key-rename branch 2 times, most recently from e5c7d3d to e1075ab Compare July 31, 2024 14:11
@zhongyujiang
Copy link
Contributor Author

Hi @zhongyujiang , can you rebase master?

Rebased.

Can this support when table is empty?

I guess this is a typo, what you mean is 'non-empty'?
It appears that the bucket key and primary key are also projected based on the column name mapping, so renaming them does not affect the reading of old data. For this, I have added an end-to-end test testRenamePrimaryKey .

@zhongyujiang
Copy link
Contributor Author

Can this support when table is empty?

I guess this is a typo, what you mean is 'non-empty'?
It appears that the bucket key and primary key are also projected based on the column name mapping, so renaming them does not affect the reading of old data. For this, I have added an end-to-end test testRenamePrimaryKey .

org.apache.paimon.schema.SchemaEvolutionUtil#createDataProjection

I have investigate more to confirm this.
Before reading the data files, the Reader maps the primary key (and other value columns) to the actual columns in the data files. Therefore, I believe that renaming the primary key should not affect the reading of historical data, as long as the partition columns are not renamed. This is because we rely on them when calculating the paths.
The relevant code can be found in org.apache.paimon.schema.SchemaEvolutionUtil#createDataProjection.


assertThat(actual).containsExactlyInAnyOrder("[1,aaa]", "[2,bbb]");

spark.sql("INSERT INTO test_rename_primary_key_table VALUES(1, 'AAA'), (2, 'BBB')");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This refers to the actual column projection used when reading data files during the debugging process.

This refers to data written before rename, the projection columns are _KEY_a and a:
Screenshot 2024-08-04 at 21 51 16

This refers to data written after rename, the projection columns are _KEY_a_ and a_:
Screenshot 2024-08-04 at 22 11 52

@zhongyujiang zhongyujiang force-pushed the support-pk-and-bucket-key-rename branch from e1075ab to 8669d0c Compare August 5, 2024 13:24
@zhongyujiang
Copy link
Contributor Author

The CI failure seems caused by flaky tests, unreleated to this, I've filed a issue on that: #3908

@JingsongLi
Copy link
Contributor

Hi @zhongyujiang , can you rebase master?

Rebased.

Can this support when table is empty?

I guess this is a typo, what you mean is 'non-empty'? It appears that the bucket key and primary key are also projected based on the column name mapping, so renaming them does not affect the reading of old data. For this, I have added an end-to-end test testRenamePrimaryKey .

the bucket key and primary key are also projected based on the column name mapping.

It is true, but this is very dangerous and may lead to many bugs.

@zhongyujiang
Copy link
Contributor Author

It is true, but this is very dangerous and may lead to many bugs.

Hi @JingsongLi Could you share your specific concerns? From what I've observed, renaming a primary key doesn't seem to differ from renaming other keys; it appears to be safe.

I believe renaming the primary key is a necessary part of full schema evolution, and we have such a requirement ourselves. So if you have any concerns, I would be more than happy to investigate further and try to address them. Thank you!

@JingsongLi JingsongLi closed this Oct 30, 2024
@JingsongLi JingsongLi reopened this Oct 30, 2024
@JingsongLi
Copy link
Contributor

re-open to trigger tests.

@JingsongLi JingsongLi force-pushed the support-pk-and-bucket-key-rename branch from 71d7d27 to 9ca3ff8 Compare October 30, 2024 08:52
@JingsongLi
Copy link
Contributor

+1

@JingsongLi JingsongLi merged commit 97b9b33 into apache:master Oct 30, 2024
12 checks passed
@zhongyujiang zhongyujiang deleted the support-pk-and-bucket-key-rename branch November 6, 2024 09:23
hang8929201 pushed a commit to hang8929201/paimon that referenced this pull request Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants