[Feature] Support setting write parallelism when writing with Spark #4742

Open
huyuanfeng2018 opened this issue Dec 20, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@huyuanfeng2018
Contributor

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Background

There exists a Paimon table that satisfies the following conditions (see the DDL sketch after the list):

  1. The number of buckets is large

  2. The amount of data is huge

  3. changelog-producer = lookup
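
For illustration, such a table could be declared roughly like this in Spark SQL (table and column names are made up; the option keys are Paimon table options, but the values are placeholders, not the exact table from this issue):

```sql
-- Illustrative DDL only: a primary-key table with many buckets and lookup changelog
CREATE TABLE my_db.my_paimon_table (
    id      BIGINT,
    payload STRING,
    dt      STRING
) TBLPROPERTIES (
    'primary-key'        = 'id',
    'bucket'             = '256',     -- condition 1: large number of buckets
    'changelog-producer' = 'lookup'   -- condition 3: changelog produced via lookup
);
```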

I use Spark SQL (batch) to read a Hive table and write the data into this Paimon table. If the amount of data read is not large, the split phase of the Spark scan generates only a small parallelism p1.
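
For example, a batch insert along these lines (all table names are hypothetical) produces only a few splits when scanning the small Hive source, so the downstream stages run with that same small parallelism p1:

```sql
-- Hypothetical batch write: the small Hive source yields few scan splits,
-- and the rest of the job runs with that same parallelism p1
INSERT INTO my_db.my_paimon_table
SELECT id, payload, dt
FROM hive_db.hive_source_table
WHERE dt = '2024-12-20';
```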

The final write to Paimon then also runs with parallelism p1:

[screenshot omitted]

This causes problems: the parallelism is too small, and during the write, building the lookup cache is blocked on I/O, so the task runs for a long time even though only a little data is written.

Solution

It should be possible to set the write parallelism explicitly, and at the same time to infer the number of buckets automatically, so that the write parallelism defaults to the number of buckets.
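
A rough sketch of what this could look like from the user's side is below. The option name in (a) is invented purely to illustrate the request and does not exist in Paimon today; the REPARTITION hint in (b) is only a manual workaround, and whether it survives Paimon's own write-side distribution is not guaranteed:

```sql
-- (a) Hypothetical option proposed by this issue; the name is made up and does
--     not exist today. If unset, it would default to the table's bucket count.
SET spark.paimon.write.parallelism = 256;

-- (b) Manual workaround sketch: force more tasks before the write with Spark's
--     REPARTITION hint, mirroring the bucket count from the example above.
INSERT INTO my_db.my_paimon_table
SELECT /*+ REPARTITION(256) */ id, payload, dt
FROM hive_db.hive_source_table
WHERE dt = '2024-12-20';
```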

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
huyuanfeng2018 added the enhancement (New feature or request) label Dec 20, 2024