[core] fix parquet can not read empty row with first column is array. #4711
Thanks for your effort @Stephen0421! I tried the fix locally by reading some data with highly nested schemas. Some tests passed and some failed with the following exception. I'll share the table files with you for debugging.
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 2
at org.apache.paimon.data.columnar.heap.AbstractHeapVector.isNullAt(AbstractHeapVector.java:111)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:144)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readArray(NestedColumnReader.java:243)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:106)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readArray(NestedColumnReader.java:243)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:106)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readArray(NestedColumnReader.java:243)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:106)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readArray(NestedColumnReader.java:243)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:106)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readArray(NestedColumnReader.java:243)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:106)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readToVector(NestedColumnReader.java:90)
at org.apache.paimon.format.parquet.ParquetReaderFactory$ParquetReader.nextBatch(ParquetReaderFactory.java:406)
@@ -834,11 +840,14 @@ private Path createNestedDataByOriginWriter(int rowNum, File tmpDir, int rowGrou
MessageType schema =
ParquetSchemaConverter.convertToParquetMessageType(
"paimon-parquet", NESTED_ARRAY_MAP_TYPE);
String[] candidates = new String[] {"snappy", "zstd", "gzip"};
String compress = candidates[new Random().nextInt(3)];
I know this was intended to balance coverage and test running time, but using random in test cases is usually not ideal. Let's use a parameterized test here for these three formats.
Sorry for the late reply. This follows the approach used in the previous PR; I will change it to a parameterized test.
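For reference, the deterministic alternative requested above can be sketched without any test framework. In JUnit 5 this would normally be written with `@ParameterizedTest` and `@ValueSource(strings = {"snappy", "zstd", "gzip"})`; the plain-Java version below only illustrates the idea of exercising every codec on every run instead of one random pick. The class name, `runRoundTrip`, and `runAll` are hypothetical and do not come from the Paimon code base.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names here are illustrative, not taken from Paimon.
class CodecCoverageSketch {

    // The three codecs listed in the diff under review.
    static final String[] CANDIDATES = {"snappy", "zstd", "gzip"};

    // Stand-in for the real write-then-read round trip (assumption);
    // it simply reports which codec it was asked to use.
    static String runRoundTrip(String compress) {
        return compress;
    }

    // Deterministic: every run exercises every codec, in a fixed order,
    // so a failure for one codec is reproducible on the next run.
    static List<String> runAll() {
        List<String> exercised = new ArrayList<>();
        for (String codec : CANDIDATES) {
            exercised.add(runRoundTrip(codec));
        }
        return exercised;
    }

    public static void main(String[] args) {
        System.out.println(runAll());
    }
}
```

With a random pick, a codec-specific failure shows up in roughly one run out of three; enumerating the candidates makes each run cover all three.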
HeapBytesVector phbv = new HeapBytesVector(total, isNull);
return new ParquetDecimalVector(phbv, total);
}
default:
I wonder if all the existing types are covered.
Yes. Except for INT32 and INT64, the other primitive types should deserialize as HeapBytesVector.
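The rule stated in this reply can be written as a small dispatch. This is only a sketch of the described mapping, not the code in the PR: the enum, the method, and the returned names are hypothetical stand-ins, and the int and long vector names for the INT32 and INT64 branches are assumptions, since the thread only names HeapBytesVector.

```java
// Hypothetical sketch: the enum and method are illustrative stand-ins,
// not Paimon or Parquet APIs.
class DecimalBackingSketch {

    enum PhysicalType { INT32, INT64, BINARY, FIXED_LEN_BYTE_ARRAY }

    // Returns the name of the heap vector kind assumed to back the values.
    static String backingVectorFor(PhysicalType type) {
        switch (type) {
            case INT32:
                return "HeapIntVector";   // assumed name for the int-backed case
            case INT64:
                return "HeapLongVector";  // assumed name for the long-backed case
            default:
                // Per the reply: every other primitive type is read as bytes.
                return "HeapBytesVector";
        }
    }
}
```

The `default` branch is what the review question is about: it is only safe if every type that reaches it really is byte-backed.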
@@ -106,6 +107,8 @@ public class NestedPrimitiveColumnReader implements ColumnReader<WritableColumnV

private boolean isFirstRow = true;

private boolean cutLevel = false;
Could you add a comment for this boolean?
ok
Force-pushed from fa4cd3a to e0554db.
…is the same as other child vector.
Many thanks @Stephen0421, but we still found some corner cases, so we decided to revert the old changes. See: #4745
Closing this one now.
Purpose
Linked issue: close #4710
Tests
Added a unit test case and tested locally.
API and Format
Parquet
Documentation
No