Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Fix partition column generate wrong partition spec #4349

Merged
merged 2 commits into from
Oct 23, 2024

Conversation

ulysses-you
Copy link
Contributor

@ulysses-you ulysses-you commented Oct 18, 2024

Purpose

Paimon uses .toString to generate partition value, which is not accurate for some data types. like date/binary. Say, Spark engine would use a Cast to convert a partition object to string value. So this pr changes to use cast to generate partition value.

Add a new config partition.legacy-name to support switch to use previous toString behavior, and by default use the legacy behavior(.toString).

An example that using binary type partition column would cause failure.

CREATE TABLE pt (
    id BIGINT,
    c1 STRING
) using paimon
PARTITIONED BY (day binary);

insert into table pt values(1, 'a', cast('2021' as binary));
select * from pt;
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (192.168.0.102 executor driver): java.io.FileNotFoundException: File 'warehouse/default.db/pt/day=%5BB@4a045a11/bucket-0/data-91c064a3-a0a1-4042-9d5a-cc82a23af7ff-0.parquet' not found, Possible causes: 1.snapshot expires too fast, you can configure 'snapshot.time-retained' option with a larger value. 2.consumption is too slow, you can improve the performance of consumption (For example, increasing parallelism).

Tests

add test

API and Format

no

Documentation

added docs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move cast related files from core to common due to the InternalRowPartitionComputer is at common.

Copy link
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ulysses-you for the contribution.

But this is really dangerous to compatibility. It may be better to keep old style by default.

@ulysses-you
Copy link
Contributor Author

But this is really dangerous to compatibility. It may be better to keep old style by default.

@JingsongLi +1

Copy link
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@JingsongLi JingsongLi merged commit a9f2d2c into apache:master Oct 23, 2024
13 checks passed
@ulysses-you ulysses-you deleted the part-value branch October 24, 2024 05:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants