Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] The manifest schema should not be nullable #3962

Merged
merged 1 commit into from
Aug 15, 2024

Conversation

ulysses-you
Copy link
Contributor

Purpose

The manifest schema should not be nullable, otherwise it will be wrapped with union null in avro which can not be read from Spark.

select * from
`avro`.`/Users/cathy/Desktop/tmp/spark-3.5.1-bin-hadoop3/warehouse/default.db/t/manifest/manifest-list-f1f00e99-398c-43e4-846f-6d88dad0aeb7-1`;

before

Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type ["null",{"type":"record","name":"record","namespace":"org.apache.paimon.avro.generated","fields":[{"name":"_VERSION","type":"int"},{"name":"_KIND","type":"int"},{"name":"_PARTITION","type":"bytes"},{"name":"_BUCKET","type":"int"},{"name":"_TOTAL_BUCKETS","type":"int"},{"name":"_FILE","type":["null",{"type":"record","name":"record__FILE","fields":[{"name":"_FILE_NAME","type":"string"},{"name":"_FILE_SIZE","type":"long"},{"name":"_ROW_COUNT","type":"long"},{"name":"_MIN_KEY","type":"bytes"},{"name":"_MAX_KEY","type":"bytes"},{"name":"_KEY_STATS","type":["null",{"type":"record","name":"record__FILE__KEY_STATS","fields":[{"name":"_MIN_VALUES","type":"bytes"},{"name":"_MAX_VALUES","type":"bytes"},{"name":"_NULL_COUNTS","type":["null",{"type":"array","items":["null","long"]}],"default":null}]}],"default":null},{"name":"_VALUE_STATS","type":["null",{"type":"record","name":"record__FILE__VALUE_STATS","fields":[{"name":"_MIN_VALUES","type":"bytes"},{"name":"_MAX_VALUES","type":"bytes"},{"name":"_NULL_COUNTS","type":["null",{"type":"array","items":["null","long"]}],"default":null}]}],"default":null},{"name":"_MIN_SEQUENCE_NUMBER","type":"long"},{"name":"_MAX_SEQUENCE_NUMBER","type":"long"},{"name":"_SCHEMA_ID","type":"long"},{"name":"_LEVEL","type":"int"},{"name":"_EXTRA_FILES","type":{"type":"array","items":"string"}},{"name":"_CREATION_TIME","type":["null",{"type":"long","logicalType":"timestamp-millis"}],"default":null},{"name":"_DELETE_ROW_COUNT","type":["null","long"],"default":null},{"name":"_EMBEDDED_FILE_INDEX","type":["null","bytes"],"default":null},{"name":"_FILE_SOURCE","type":["null","int"],"default":null}]}],"default":null}]}] to SQL type STRUCT<_VERSION: INT, _KIND: INT, _PARTITION: BINARY, _BUCKET: INT, _TOTAL_BUCKETS: INT, _FILE: STRUCT<_FILE_NAME: STRING, _FILE_SIZE: BIGINT, _ROW_COUNT: BIGINT, _MIN_KEY: BINARY, _MAX_KEY: BINARY, _KEY_STATS: STRUCT<_MIN_VALUES: BINARY, _MAX_VALUES: BINARY, _NULL_COUNTS: ARRAY<BIGINT>>, _VALUE_STATS: STRUCT<_MIN_VALUES: BINARY, _MAX_VALUES: BINARY, _NULL_COUNTS: ARRAY<BIGINT>>, _MIN_SEQUENCE_NUMBER: BIGINT, _MAX_SEQUENCE_NUMBER: BIGINT, _SCHEMA_ID: BIGINT, _LEVEL: INT, _EXTRA_FILES: ARRAY<STRING>, _CREATION_TIME: TIMESTAMP, _DELETE_ROW_COUNT: BIGINT, _EMBEDDED_FILE_INDEX: BINARY, _FILE_SOURCE: INT>>.

after

2   manifest-9f355fcf-f2e1-41f0-9cb0-0116dfbfcc5f-0 1856    1   0   {"_MIN_VALUES":,"_MAX_VALUES":,"_NULL_COUNTS":[]}   0

Tests

Pass CI

API and Format

no

Documentation

@ulysses-you
Copy link
Contributor Author

cc @JingsongLi I'm not sure if there is any context about the nullability of schema, it seems the manifest record can not be null ?

@JingsongLi
Copy link
Contributor

Let me take a look to compatibility issue.
I am not sure old Paimon version can read these avro files.

@JingsongLi
Copy link
Contributor

In #1501 , I have refactor avro reader to read from non-union types. So it looks like we can get rid of top union now.

@ulysses-you Can you try to verify this in your local env? Use new version Paimon to write table, and use old version (maybe release 0.6) to read from this table.

@ulysses-you
Copy link
Contributor Author

@JingsongLi thank you for the confirming. I checked in my local and it works:

// new version manifest
select * from `avro`.`/Users/cathy/Desktop/tmp/spark-3.5.1-bin-hadoop3/warehouse/default.db/t/manifest/manifest-list-f1f00e99-398c-43e4-846f-6d88dad0aeb7-1`;
2	manifest-9f355fcf-f2e1-41f0-9cb0-0116dfbfcc5f-0	1856	1	0	{"_MIN_VALUES":,"_MAX_VALUES":,"_NULL_COUNTS":[]}	0

// old version manifest
select * from `avro`.`/Users/cathy/Desktop/tmp/spark-3.5.1-bin-hadoop3/warehouse/default.db/t/manifest/manifest-list-077e8e20-b52b-4258-9ad0-e764b7183ee5-1`;
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type ["null",{"type":"record","name":"record","namespace":"org.apache.paimon.avro.generated","fields":[{"name":"_VERSION","type":"int"},{"name":"_FILE_NAME","type":"string"},{"name":"_FILE_SIZE","type":"long"},{"name":"_NUM_ADDED_FILES","type":"long"},{"name":"_NUM_DELETED_FILES","type":"long"},{"name":"_PARTITION_STATS","type":["null",{"type":"record","name":"record__PARTITION_STATS","fields":[{"name":"_MIN_VALUES","type":"bytes"},{"name":"_MAX_VALUES","type":"bytes"},{"name":"_NULL_COUNTS","type":["null",{"type":"array","items":["null","long"]}],"default":null}]}],"default":null},{"name":"_SCHEMA_ID","type":"long"}]}] to SQL type STRUCT<_VERSION: INT, _FILE_NAME: STRING, _FILE_SIZE: BIGINT, _NUM_ADDED_FILES: BIGINT, _NUM_DELETED_FILES: BIGINT, _PARTITION_STATS: STRUCT<_MIN_VALUES: BINARY, _MAX_VALUES: BINARY, _NULL_COUNTS: ARRAY<BIGINT>>, _SCHEMA_ID: BIGINT>.

// work together (checked with 0.6.0/master without this pr/master with this pr)
SELECT * FROM t;
1	1	a
2	2	b

SELECT * FROM t VERSION AS OF 1;
1	1	a

SELECT * FROM t VERSION AS OF 2;
1	1	a
2	2	b

Copy link
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@JingsongLi
Copy link
Contributor

Thanks @ulysses-you , merging...

@JingsongLi JingsongLi merged commit e4d4bdb into apache:master Aug 15, 2024
10 checks passed
@ulysses-you ulysses-you deleted the schema-nullable branch August 15, 2024 01:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants