
[Bug] HashBucketAssigner load index lead to Too large error #3776

Closed

izhangzhihao opened this issue Jul 18, 2024 · 1 comment
Labels
bug Something isn't working

Comments

izhangzhihao commented Jul 18, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Paimon version

0.9.0

Compute Engine

flink 1.17.2

Minimal reproduce step

CREATE TABLE if not exists a_one_billion_table (
    id STRING,
    ... ...
    PRIMARY KEY (id) NOT ENFORCED
) WITH ('bucket' = '-1');

-- Make sure you have already inserted more than 1 billion rows into table `a_one_billion_table`.
-- Then run the new Flink job below; `source_table` may contain only 10,000 records, yet the checkpoint will fail.

insert into a_one_billion_table
select * from source_table;

What doesn't meet your expectations?

checkpoint failed with error:

2024-07-18 16:09:32,895 WARN  org.apache.flink.runtime.taskmanager.Task [] - dynamic-bucket-assigner (1/1)#0 switched from RUNNING to FAILED with failure cause:
java.lang.IllegalArgumentException: Too large (1466616922 expected elements with load factor 0.75)
	at org.apache.paimon.shade.it.unimi.dsi.fastutil.HashCommon.arraySize(HashCommon.java:208)
	at org.apache.paimon.shade.it.unimi.dsi.fastutil.ints.Int2ShortOpenHashMap.<init>(Int2ShortOpenHashMap.java:103)
	at org.apache.paimon.shade.it.unimi.dsi.fastutil.ints.Int2ShortOpenHashMap.<init>(Int2ShortOpenHashMap.java:116)
	at org.apache.paimon.utils.Int2ShortHashMap.<init>(Int2ShortHashMap.java:35)
	at org.apache.paimon.utils.Int2ShortHashMap$Builder.build(Int2ShortHashMap.java:70)
	at org.apache.paimon.index.PartitionIndex.loadIndex(PartitionIndex.java:138)
	at org.apache.paimon.index.HashBucketAssigner.loadIndex(HashBucketAssigner.java:166)
	at org.apache.paimon.index.HashBucketAssigner.assign(HashBucketAssigner.java:83)
	at org.apache.paimon.flink.sink.HashBucketAssignerOperator.processElement(HashBucketAssignerOperator.java:98)
	at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:246)
	at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:217)
	at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:169)
	at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:68)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:616)
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:1080)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:1029)
	at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:959)
	at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:938)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:751)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:567)
	at java.lang.Thread.run(Thread.java:879) [?:1.8.0_372]
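
For context on the failure: fastutil's `HashCommon.arraySize` rejects any requested capacity whose backing array would exceed 2^30 entries, so an `Int2ShortOpenHashMap` cannot be sized for roughly 1.47 billion expected elements at load factor 0.75. Below is a minimal sketch reproducing the exception, assuming the unshaded `fastutil` dependency is on the classpath (Paimon bundles a shaded copy, but the behavior is the same); the class name is made up for illustration.

import it.unimi.dsi.fastutil.ints.Int2ShortOpenHashMap;

public class TooLargeRepro {
    public static void main(String[] args) {
        // fastutil sizes the backing array to the next power of two of
        // ceil(expected / loadFactor). Here 1,466,616,922 / 0.75 ≈ 1.96 billion,
        // which exceeds the 2^30 array-size cap, so the constructor throws
        // IllegalArgumentException: "Too large (1466616922 expected elements
        // with load factor 0.75)" -- the same error as in the stack trace above.
        new Int2ShortOpenHashMap(1_466_616_922, 0.75f);
    }
}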

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
izhangzhihao (Author) commented

Please reopen this ticket; the issue is still not resolved. Adding parallelism is only a workaround. Ideally, the resource allocation for the task should match the incoming data flow rather than the total amount of data at rest; the current behavior does not meet this ideal. Is there any plan for optimization?
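
As a sketch of the parallelism workaround mentioned above: raising the bucket assigner parallelism spreads the partition index across more tasks, so each task builds a smaller hash map. This assumes the Paimon table option `dynamic-bucket.assigner-parallelism` and a hypothetical warehouse path; verify the option name against the Paimon version in use.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class RaiseAssignerParallelism {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Hypothetical warehouse path; point this at the existing Paimon warehouse.
        tEnv.executeSql(
                "CREATE CATALOG paimon WITH ("
                        + " 'type' = 'paimon',"
                        + " 'warehouse' = 'file:///tmp/paimon' )");
        tEnv.executeSql("USE CATALOG paimon");

        // Option name taken from the Paimon dynamic-bucket documentation; verify it
        // against the version in use. A higher assigner parallelism lets each
        // assigner task load a smaller slice of the partition index.
        tEnv.executeSql(
                "ALTER TABLE a_one_billion_table SET ("
                        + " 'dynamic-bucket.assigner-parallelism' = '8' )");
    }
}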
