I searched in the issues and found nothing similar.
Motivation
Background
There exists a table that satisfies the following conditions:
The number of buckets is large
The amount of data is huge
The changelog producer is lookup (changelog-producer = lookup)
I use Spark SQL (batch) to read a Hive table and write the data into this table. If the amount of data read is small, the split phase of the Spark scan generates only a low parallelism, p1.
The final write to Paimon then also uses parallelism p1.
This causes a problem: the parallelism is too low. During the write, building the lookup cache is blocked by I/O, so the task runs for a long time even when only a little data is written.
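To make the imbalance concrete, here is a rough back-of-the-envelope sketch. It assumes buckets are spread evenly across write tasks and that each task has to build lookup state for every bucket it touches; both are simplifying assumptions for illustration, and `buckets_per_task` is a hypothetical helper, not a Paimon or Spark API.

```python
import math


def buckets_per_task(num_buckets: int, parallelism: int) -> int:
    """Upper bound on buckets a single write task handles.

    Assumes an even spread of buckets over tasks (an illustration only,
    not how Paimon actually assigns buckets).
    """
    if num_buckets <= 0 or parallelism <= 0:
        raise ValueError("num_buckets and parallelism must be positive")
    return math.ceil(num_buckets / parallelism)


# Hypothetical numbers: a table with 1024 buckets.
low = buckets_per_task(1024, 4)      # scan produced only p1 = 4 tasks
high = buckets_per_task(1024, 1024)  # one task per bucket
```

Under these assumptions, with p1 = 4 each task would serve 256 buckets, while a parallelism equal to the bucket count leaves one bucket per task.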
Solution
It should be possible to set the write parallelism explicitly, and also to infer it automatically from the number of buckets. By default, the write parallelism would be the number of buckets.
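The proposed rule could be sketched roughly as follows. This is only an illustration of the resolution order described above, not Paimon code: the function name, its parameters, and the fallback to the scan parallelism when the bucket count is not a positive number are all assumptions.

```python
from typing import Optional


def resolve_write_parallelism(
    configured: Optional[int],
    num_buckets: int,
    scan_parallelism: int,
) -> int:
    """Pick the write parallelism (hypothetical helper).

    1. An explicitly configured value wins.
    2. Otherwise default to the number of buckets.
    3. If the bucket count is not positive (assumed to mean "unknown"),
       fall back to the parallelism produced by the scan.
    """
    if configured is not None:
        if configured <= 0:
            raise ValueError("configured parallelism must be positive")
        return configured
    if num_buckets > 0:
        return num_buckets
    return scan_parallelism


# No explicit setting: a 1024-bucket table is written with 1024 tasks,
# regardless of how few splits the scan produced.
default_parallelism = resolve_write_parallelism(None, 1024, scan_parallelism=4)
# An explicit setting overrides the inferred default.
explicit_parallelism = resolve_write_parallelism(200, 1024, scan_parallelism=4)
```

Keeping the explicit setting ahead of the inferred default lets users cap the task count when a table has very many buckets.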
Anything else?
No response
Are you willing to submit a PR?
I'm willing to submit a PR!