[core] fix parquet can not read empty row with first column is array. #4711

Closed
Stephen0421 wants to merge 2 commits

Stephen0421
Contributor

Purpose

Linked issue: close #4710

Tests

Added a unit test case and tested locally.

API and Format

Parquet

Documentation

No

Member

xccui left a comment

Thanks for your effort @Stephen0421! I tried the fix locally by reading some data with highly nested schemas. Some tests passed and some failed with the following exception. I'll share the table files with you for debugging.

Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 2
	at org.apache.paimon.data.columnar.heap.AbstractHeapVector.isNullAt(AbstractHeapVector.java:111)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:144)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readArray(NestedColumnReader.java:243)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:106)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readArray(NestedColumnReader.java:243)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:106)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readArray(NestedColumnReader.java:243)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:106)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readArray(NestedColumnReader.java:243)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:106)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readArray(NestedColumnReader.java:243)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:106)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
	at org.apache.paimon.format.parquet.reader.NestedColumnReader.readToVector(NestedColumnReader.java:90)
	at org.apache.paimon.format.parquet.ParquetReaderFactory$ParquetReader.nextBatch(ParquetReaderFactory.java:406)

@@ -834,11 +840,14 @@ private Path createNestedDataByOriginWriter(int rowNum, File tmpDir, int rowGrou
MessageType schema =
ParquetSchemaConverter.convertToParquetMessageType(
"paimon-parquet", NESTED_ARRAY_MAP_TYPE);
String[] candidates = new String[] {"snappy", "zstd", "gzip"};
String compress = candidates[new Random().nextInt(3)];
Member

I know this was intended to balance coverage and test running time, but using random in test cases is usually not ideal. Let's use a parameterized test here for these three formats.

Contributor Author

Sorry for the late reply. This follows the approach used in a previous PR; I will change it to a parameterized test.
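The suggestion above can be sketched as follows. In JUnit 5 this would normally be a `@ParameterizedTest` with `@ValueSource(strings = {"snappy", "zstd", "gzip"})`; to stay dependency-free, the sketch below shows the same idea in plain Java by running the check once per codec in a fixed order instead of picking one at random. The class name, the `roundTrip` helper, and its body are hypothetical placeholders, not code from this PR.

```java
import java.util.ArrayList;
import java.util.List;

class CompressionCoverageSketch {

    // The three codecs named in the test under review.
    static final List<String> CODECS = List.of("snappy", "zstd", "gzip");

    // Hypothetical stand-in for the real "write a Parquet file with this codec,
    // read it back, compare" check. The real test would call the writer and reader here.
    static boolean roundTrip(String codec, int rowNum) {
        return CODECS.contains(codec) && rowNum >= 0;
    }

    // Runs the check for every codec deterministically and collects the failures,
    // so a broken codec is reported on every run rather than on roughly one run in three.
    static List<String> failingCodecs(int rowNum) {
        List<String> failures = new ArrayList<>();
        for (String codec : CODECS) {
            if (!roundTrip(codec, rowNum)) {
                failures.add(codec);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("failing codecs: " + failingCodecs(100));
    }
}
```

The trade-off is running time: every codec is exercised on every run, but failures become reproducible and are attributed to a specific codec.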

HeapBytesVector phbv = new HeapBytesVector(total, isNull);
return new ParquetDecimalVector(phbv, total);
}
default:
Member

I wonder if all the existing types are covered.

Contributor Author

Yes. Except for INT32 and INT64, the other primitive types should be deserialized as HeapBytesVector.
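One way to make that coverage explicit is to list every physical type instead of relying on `default`. The sketch below is a hypothetical illustration, not the PR's code: the enums and the `backingFor` method are invented names, and the mapping of INT32 and INT64 to int-backed and long-backed vectors is an assumption, since the thread only states that the remaining types use `HeapBytesVector`. The four physical types listed are the ones the Parquet format allows for the DECIMAL logical type.

```java
class DecimalBackingSketch {

    // Physical types that Parquet permits for DECIMAL values.
    enum PhysicalType { INT32, INT64, BINARY, FIXED_LEN_BYTE_ARRAY }

    // Hypothetical labels for the kind of heap vector that would hold the values.
    enum BackingVector { INT_BACKED, LONG_BACKED, BYTES_BACKED }

    // Every supported type gets its own case, so adding a new physical type
    // fails loudly instead of silently falling into the bytes-backed branch.
    static BackingVector backingFor(PhysicalType type) {
        switch (type) {
            case INT32:
                return BackingVector.INT_BACKED;   // assumption: int-backed vector
            case INT64:
                return BackingVector.LONG_BACKED;  // assumption: long-backed vector
            case BINARY:
            case FIXED_LEN_BYTE_ARRAY:
                return BackingVector.BYTES_BACKED; // per the reply: HeapBytesVector
            default:
                throw new IllegalArgumentException("Unsupported physical type: " + type);
        }
    }

    public static void main(String[] args) {
        for (PhysicalType type : PhysicalType.values()) {
            System.out.println(type + " -> " + backingFor(type));
        }
    }
}
```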

@@ -106,6 +107,8 @@ public class NestedPrimitiveColumnReader implements ColumnReader<WritableColumnV

private boolean isFirstRow = true;

private boolean cutLevel = false;
Member

Could you add a comment for this boolean?

Contributor Author

ok

@JingsongLi
Contributor

Many thanks, @Stephen0421, but we still found some corner cases, so we decided to revert the old changes. See: #4745

@JingsongLi
Contributor

Closing this one now.

JingsongLi closed this Dec 21, 2024
Successfully merging this pull request may close these issues.

[Bug] exception occur when reading empty row with first column is array in parquet.
3 participants