[Feature] Support setting write parallelism when writing with Spark #4742

Open
huyuanfeng2018 opened this issue Dec 20, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@huyuanfeng2018
Contributor

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Background

There exists a Paimon table that satisfies the following conditions (see the DDL sketch after the list):

  1. The number of buckets is large

  2. The amount of data is huge

  3. changelog-producer = lookup
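
For illustration, such a table could be declared roughly like this in Spark SQL (table and column names are made up; the option keys are Paimon table options, but the values are placeholders, not the exact table from this issue):

```sql
-- Illustrative DDL only: a primary-key table with many buckets and lookup changelog
CREATE TABLE my_db.my_paimon_table (
    id      BIGINT,
    payload STRING,
    dt      STRING
) TBLPROPERTIES (
    'primary-key'        = 'id',
    'bucket'             = '256',     -- condition 1: large number of buckets
    'changelog-producer' = 'lookup'   -- condition 3: changelog produced via lookup
);
```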

I use Spark SQL (batch) to read a Hive table and write the data into this Paimon table. If the amount of data read is not large, the split phase of the Spark scan generates only a small parallelism p1.
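
For example, a batch insert along these lines (all table names are hypothetical) produces only a few splits when scanning the small Hive source, so the downstream stages run with that same small parallelism p1:

```sql
-- Hypothetical batch write: the small Hive source yields few scan splits,
-- and the rest of the job runs with that same parallelism p1
INSERT INTO my_db.my_paimon_table
SELECT id, payload, dt
FROM hive_db.hive_source_table
WHERE dt = '2024-12-20';
```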

The final write to Paimon then also runs with parallelism p1:

[screenshot omitted]

This causes problems: the parallelism is too small, and during the write, building the lookup cache is blocked on I/O, so the task runs for a long time even though only a little data is written.

Solution

It should be possible to set the write parallelism explicitly, and at the same time to infer the number of buckets automatically, so that the write parallelism defaults to the number of buckets.
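
A rough sketch of what this could look like from the user's side is below. The option name in (a) is invented purely to illustrate the request and does not exist in Paimon today; the REPARTITION hint in (b) is only a manual workaround, and whether it survives Paimon's own write-side distribution is not guaranteed:

```sql
-- (a) Hypothetical option proposed by this issue; the name is made up and does
--     not exist today. If unset, it would default to the table's bucket count.
SET spark.paimon.write.parallelism = 256;

-- (b) Manual workaround sketch: force more tasks before the write with Spark's
--     REPARTITION hint, mirroring the bucket count from the example above.
INSERT INTO my_db.my_paimon_table
SELECT /*+ REPARTITION(256) */ id, payload, dt
FROM hive_db.hive_source_table
WHERE dt = '2024-12-20';
```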

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
huyuanfeng2018 added the enhancement (New feature or request) label Dec 20, 2024