[spark] Support nested col pruning #4269

Zouxxyy · 2024-09-26T12:29:36Z

Purpose

Related to #4209. Supports nested column pruning, e.g.

CREATE TABLE students (
    name STRING,
    age INT,
    course STRUCT<course_name: STRING, grade: DOUBLE>
) USING paimon;

SELECT course.grade FROM students;

will read only course.grade from the columnar storage format (Parquet, ORC).
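
To make the idea concrete, here is a minimal, self-contained sketch of pruning a nested schema down to the requested dotted paths. It is not the code in this PR: the class `NestedPruningSketch`, the helper `studentsSchema()`, and the representation of a struct as a nested `Map` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a struct is a LinkedHashMap, a leaf is a type-name String.
final class NestedPruningSketch {

    /** Keeps only the fields of {@code schema} reachable through the dotted {@code paths}. */
    @SuppressWarnings("unchecked")
    static Map<String, Object> prune(Map<String, Object> schema, List<String> paths) {
        Map<String, Object> pruned = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : schema.entrySet()) {
            String name = field.getKey();
            Object type = field.getValue();
            if (paths.contains(name)) {
                pruned.put(name, type); // the whole column is requested
                continue;
            }
            // Collect the remainders of the paths that go through this field.
            List<String> subPaths = new ArrayList<>();
            for (String path : paths) {
                if (path.startsWith(name + ".")) {
                    subPaths.add(path.substring(name.length() + 1));
                }
            }
            if (!subPaths.isEmpty() && type instanceof Map) {
                Map<String, Object> child = prune((Map<String, Object>) type, subPaths);
                if (!child.isEmpty()) {
                    pruned.put(name, child); // keep the struct, but only its needed children
                }
            }
        }
        return pruned;
    }

    /** The students table from the example above, as a nested map. */
    static Map<String, Object> studentsSchema() {
        Map<String, Object> course = new LinkedHashMap<>();
        course.put("course_name", "STRING");
        course.put("grade", "DOUBLE");
        Map<String, Object> students = new LinkedHashMap<>();
        students.put("name", "STRING");
        students.put("age", "INT");
        students.put("course", course);
        return students;
    }

    public static void main(String[] args) {
        // SELECT course.grade FROM students;
        System.out.println(prune(studentsSchema(), List.of("course.grade")));
        // prints {course={grade=DOUBLE}}
    }
}
```

With the pruned schema, a columnar reader only needs to open the `course.grade` column chunk and can skip `name`, `age`, and `course.course_name`.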

Tests

API and Format

Documentation

JingsongLi

Is there a configuration to disable nested projection? I am concerned about bugs in nested projection; at the least, we should have an option to disable it.

Zouxxyy · 2024-10-08T03:30:37Z

Is there a configuration to disable nested projection? I am concerned about bugs in nested projection; at the least, we should have an option to disable it.

Yes, Spark has a conf to enable or disable nested schema pruning:

  val NESTED_SCHEMA_PRUNING_ENABLED =
    buildConf("spark.sql.optimizer.nestedSchemaPruning.enabled")
      .internal()
      .doc("Prune nested fields from a logical relation's output which are unnecessary in " +
        "satisfying a query. This optimization allows columnar file format readers to avoid " +
        "reading unnecessary nested column data. Currently Parquet and ORC are the " +
        "data sources that implement this optimization.")
      .version("2.4.1")
      .booleanConf
      .createWithDefault(true)

JingsongLi · 2024-10-08T09:20:35Z

paimon-format/src/main/java/org/apache/paimon/format/parquet/ParquetReaderFactory.java

+            String fieldName = field.name();
+            if (parquetGroup.containsField(fieldName)) {
+                Type type = parquetGroup.getType(fieldName);
+                if (type instanceof GroupType && field.type() instanceof RowType) {


Can Spark push down nested fields for array and map?

Zouxxyy · 2024-10-09

Yes, it is supported. Updated.
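
For readers following the thread, the recursion in the snippet above can be extended from rows to arrays and maps roughly as follows. This is a hedged sketch, not the code merged in this PR: the types `Leaf`, `Row`, `Array`, and `MapOf` and the method `clip` are hypothetical stand-ins for the Parquet and Paimon type classes, and skipping fields that are absent from the file is an assumption of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ClipSchemaSketch {

    // Hypothetical, minimal type model for illustration; not Paimon or Parquet classes.
    interface Type {}

    static final class Leaf implements Type {
        final String name;
        Leaf(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    static final class Row implements Type {
        final Map<String, Type> fields = new LinkedHashMap<>();
        Row with(String name, Type type) { fields.put(name, type); return this; }
        @Override public String toString() { return "ROW" + fields; }
    }

    static final class Array implements Type {
        final Type element;
        Array(Type element) { this.element = element; }
        @Override public String toString() { return "ARRAY<" + element + ">"; }
    }

    static final class MapOf implements Type {
        final Type key;
        final Type value;
        MapOf(Type key, Type value) { this.key = key; this.value = value; }
        @Override public String toString() { return "MAP<" + key + ", " + value + ">"; }
    }

    /** Clips the type stored in the file down to what the requested read type asks for. */
    static Type clip(Type file, Type requested) {
        if (file instanceof Row && requested instanceof Row) {
            Row clipped = new Row();
            for (Map.Entry<String, Type> wanted : ((Row) requested).fields.entrySet()) {
                Type stored = ((Row) file).fields.get(wanted.getKey());
                if (stored != null) { // assumption: fields absent from the file are skipped
                    clipped.with(wanted.getKey(), clip(stored, wanted.getValue()));
                }
            }
            return clipped;
        }
        if (file instanceof Array && requested instanceof Array) {
            // Recurse into the element type so array<struct> is pruned too.
            return new Array(clip(((Array) file).element, ((Array) requested).element));
        }
        if (file instanceof MapOf && requested instanceof MapOf) {
            // Keys are kept as stored; only the value type is clipped.
            MapOf stored = (MapOf) file;
            return new MapOf(stored.key, clip(stored.value, ((MapOf) requested).value));
        }
        return file; // leaf, or mismatched kinds: read the column as stored
    }

    static Type fileSchema() {
        return new Row()
                .with("name", new Leaf("STRING"))
                .with("courses", new Array(new Row()
                        .with("course_name", new Leaf("STRING"))
                        .with("grade", new Leaf("DOUBLE"))))
                .with("scores", new MapOf(new Leaf("STRING"), new Row()
                        .with("grade", new Leaf("DOUBLE"))
                        .with("rank", new Leaf("INT"))));
    }

    static Type requestedSchema() {
        return new Row()
                .with("courses", new Array(new Row().with("grade", new Leaf("DOUBLE"))))
                .with("scores", new MapOf(new Leaf("STRING"), new Row().with("rank", new Leaf("INT"))));
    }

    public static void main(String[] args) {
        System.out.println(clip(fileSchema(), requestedSchema()));
        // prints ROW{courses=ARRAY<ROW{grade=DOUBLE}>, scores=MAP<STRING, ROW{rank=INT}>}
    }
}
```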

JingsongLi

+1

Zouxxyy force-pushed the dev/spark-nested branch 3 times, most recently from c5711df to c0fdb15 Compare September 26, 2024 17:38

JingsongLi reviewed Oct 8, 2024

Zouxxyy force-pushed the dev/spark-nested branch from c0fdb15 to 35e3fd0 Compare October 8, 2024 16:46

1

9fd330b

Zouxxyy force-pushed the dev/spark-nested branch from 35e3fd0 to 9fd330b Compare October 9, 2024 00:40

1

7f1294d

JingsongLi approved these changes Oct 9, 2024

JingsongLi merged commit 20a3967 into apache:master Oct 9, 2024
10 checks passed

Zouxxyy deleted the dev/spark-nested branch October 9, 2024 07:50

guluo2016 mentioned this pull request Dec 11, 2024

[Bug] select * T$binlog would throw ClassCastException on Spark #4686

Closed

2 tasks
