
[Bug] Spark integration with Paimon: an empty array is written as an array containing null #4790

Closed
1 of 2 tasks
xyk0930 opened this issue Dec 27, 2024 · 10 comments
Labels
bug Something isn't working

Comments

xyk0930 commented Dec 27, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Paimon version

0.9

Compute Engine

Spark 3.5.1

Minimal reproduce step

  1. Spark generates a dataset where the chronic_list column is of type List<String>.
  2. Calling dataset.show() displays the empty array as [].
  3. But after calling dataset.write().mode(SaveMode.Append).format("paimon").save(path) to write to a Paimon table, the query shows [null].

What doesn't meet your expectations?

The empty array should be [] rather than [null].

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
xyk0930 added the bug label Dec 27, 2024
Aiden-Dong (Contributor) commented:

Could you provide me with some sample test code? I'd like to reproduce this situation.

Aiden-Dong (Contributor) commented Dec 27, 2024

Is your table storage format ORC or Parquet?

xyk0930 (Author) commented Dec 27, 2024

Parquet, with zstd file compression.

Aiden-Dong (Contributor) commented:

Could you provide a sample test?

xyk0930 (Author) commented Dec 27, 2024

1. Create a Paimon table:

CREATE TABLE paimon_default.default_array_test
(
    id   INT COMMENT 'unique identifier',
    name STRING COMMENT 'name',
    tags ARRAY<STRING> COMMENT 'tags'
)
USING paimon
COMMENT 'array test'
TBLPROPERTIES (
    'bucket' = '-1',
    'changelog-producer' = 'none',
    'deletion-vectors.enabled' = 'false',
    'dynamic-bucket.initial-buckets' = '10',
    'dynamic-bucket.target-row-num' = '2000000',
    'file.compression' = 'zstd',
    'file.compression.zstd-level' = '1',
    'file.format' = 'parquet',
    'full-compaction.delta-commits' = '1',
    'ignore-delete' = 'false',
    'merge-engine' = 'deduplicate',
    'path' = 'hdfs://hadoop105:8020/paimon/warehouse/paimon_default.db/default_array_test',
    'primary-key' = 'id',
    'snapshot.expire.limit' = '10',
    'snapshot.num-retained.max' = '10',
    'snapshot.num-retained.min' = '3',
    'snapshot.time-retained' = '1 h',
    'tag.num-retained-max' = '7'
);
2. Use Spark to write data into the Paimon table:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class PaimonArrayTest {

    public static void main(String[] args) {
        // Initialize the SparkSession with Hive support enabled
        SparkSession spark = SparkSession.builder()
                .appName("Spark Paimon Example")
                .config("spark.sql.catalog.spark_catalog", "org.apache.paimon.spark.SparkGenericCatalog")
                .config("spark.sql.extensions", "org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions")
                .enableHiveSupport()
                .getOrCreate();

        // Define the schema
        StructType schema = new StructType(new StructField[]{
                DataTypes.createStructField("id", DataTypes.IntegerType, false),
                DataTypes.createStructField("name", DataTypes.StringType, false),
                DataTypes.createStructField("tags", DataTypes.createArrayType(DataTypes.StringType), false)
        });

        // Create some data rows; row 3 carries an empty array
        List<Row> data = Arrays.asList(
                RowFactory.create(1, "Alice", Arrays.asList("Java", "Scala")),
                RowFactory.create(2, "Bob", Arrays.asList("Python", "R")),
                RowFactory.create(3, "Charlie", new ArrayList<>())
        );

        // Create the DataFrame
        Dataset<Row> df = spark.createDataFrame(data, schema);

        // Show the DataFrame contents
        df.show();

        // Write the DataFrame into the Paimon table
        df.write().mode(SaveMode.Overwrite).format("paimon").save("hdfs://hadoop105:8020/paimon/warehouse/paimon_default.db/default_array_test"); // replace with the actual database and table path

        spark.table("paimon_default.default_array_test").show();
        // Stop the SparkSession
        spark.stop();
    }
}
3. Driver stdout:

[screenshot of driver stdout]
@Aiden-Dong this is a sample test
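
For completeness, here is a quick read-back check (a sketch; size() is Spark's built-in array-length function, and the table name matches the DDL above) that distinguishes a genuine empty array from the corrupted one:

    // size(tags) is 0 for a true empty array but 1 for [null],
    // so corrupted rows are easy to spot after the write.
    spark.sql("SELECT id, tags, size(tags) AS tag_count "
            + "FROM paimon_default.default_array_test").show();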

xyk0930 (Author) commented Dec 27, 2024

Writing a row using Spark SQL:

insert into table paimon_default.default_array_test select 4, 'Array', array();

The query result is the same:

[screenshot of query result]
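
To confirm the corruption happens on the write path rather than in the array() literal itself, checks along these lines can help (a sketch; size() is Spark's built-in array-length function):

-- The literal is a true empty array before the write:
SELECT size(array());                                                       -- returns 0
-- ...so a non-zero size on the stored row isolates the write path:
SELECT id, size(tags) FROM paimon_default.default_array_test WHERE id = 4;  -- returns 1 if the row holds [null]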

Aiden-Dong (Contributor) commented:

I will try to fix this issue.

Aiden-Dong (Contributor) commented:

I couldn't reproduce this issue on the master branch. You can try a newer version.

xyk0930 (Author) commented Dec 27, 2024

OK, I'll wait for the 1.0 release.

xyk0930 (Author) commented Dec 30, 2024

The 1.0 release fixed the bug.
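
For anyone else hitting this, the confirmed fix is upgrading the Paimon Spark bundle. A sketch of launching a shell with the newer bundle, reusing the session configs from the test program above (coordinates assume the Spark 3.5 bundle; the exact 1.0 version string is illustrative):

spark-shell \
  --packages org.apache.paimon:paimon-spark-3.5:1.0.0 \
  --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.paimon.spark.SparkGenericCatalog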

xyk0930 closed this as completed Dec 30, 2024