[Bug] [improve] Flink TM will shutdown in specific case. #270

Closed
1 of 2 tasks
Vipamp opened this issue Dec 25, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@Vipamp
Contributor

Vipamp commented Dec 25, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Fluss version

0.6-SNAPSHOT

Minimal reproduce step

Dropping a table while a Flink job is writing to it causes a flood of exceptions in the TaskManager. When the user then cancels the Flink job, the job stays in CANCELLING status until the TaskManager shuts down.

  1. Create a Fluss table, then write data into it continuously with a Flink job.
  2. Drop the table.
  3. Every record written to Fluss now logs an exception in the TaskManager log.
  4. Cancel the job on the Flink dashboard. The job stays in CANCELLING status until the TaskManager shuts down.

What doesn't meet your expectations?

No.

Anything else?

  1. With a high volume of real-time data being written, the TM logs grow very large. This needs to be improved.
  2. In some cases the TM shuts down, which is dangerous.
  3. Ideally, the Flink job should fail in this situation.
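
The third point can be sketched as a write loop that retries only a bounded number of times and fails immediately on errors it considers fatal. This is a minimal illustration, not the Fluss client implementation: `BoundedRetry`, `WriteAttempt`, `ErrorCode`, `REQUEST_TIMED_OUT`, and the choice of which errors are retriable are all hypothetical. Only the name `OUT_OF_ORDER_SEQUENCE_EXCEPTION` comes from the logs in this issue, and treating it as fatal is an assumption made for the example.

```java
import java.util.Set;

public class BoundedRetry {
    /** Hypothetical error codes; only OUT_OF_ORDER_SEQUENCE_EXCEPTION appears in this issue's logs. */
    public enum ErrorCode { NONE, REQUEST_TIMED_OUT, OUT_OF_ORDER_SEQUENCE_EXCEPTION }

    /** Hypothetical stand-in for one attempt to send a batch. */
    public interface WriteAttempt {
        ErrorCode send();
    }

    // Assumption for illustration: only timeouts are worth retrying.
    static final Set<ErrorCode> RETRIABLE = Set.of(ErrorCode.REQUEST_TIMED_OUT);

    /**
     * Sends until success, a non-retriable error, or the retry budget is spent.
     * Returns the number of attempts made; throws so the caller (for example a
     * sink task) can fail instead of retrying forever.
     */
    public static int writeWithRetries(WriteAttempt attempt, int maxRetries) {
        for (int tries = 1; ; tries++) {
            ErrorCode error = attempt.send();
            if (error == ErrorCode.NONE) {
                return tries;
            }
            if (!RETRIABLE.contains(error)) {
                throw new IllegalStateException("Non-retriable write error: " + error);
            }
            if (tries > maxRetries) {
                throw new IllegalStateException(
                        "Retries exhausted after " + tries + " attempts: " + error);
            }
        }
    }
}
```

With this shape, a dropped table surfaces as one exception that fails the job rather than one warning per retry.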

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@Vipamp Vipamp added the bug Something isn't working label Dec 25, 2024
@wuchong
Member

wuchong commented Dec 26, 2024

Are there any error logs in the TM or JM?

@xiongmozhou

xiongmozhou commented Dec 26, 2024

2024-12-26 08:46:31,378 WARN com.alibaba.fluss.client.write.Sender [] - Get error write response on table bucket TableBucket{tableId=5, bucket=0}, retrying (2147360609 attempts left). Error: OUT_OF_ORDER_SEQUENCE_EXCEPTION. Error Message: Out of order batch sequence for writer 4 at offset 7050 in table-bucket TableBucket{tableId=5, bucket=0} : 1 (incoming batch seq.), -1 (current batch seq.)
2024-12-26 08:46:31,380 WARN com.alibaba.fluss.client.write.Sender [] - Get error write response on table bucket TableBucket{tableId=5, bucket=0}, retrying (2147360608 attempts left). Error: OUT_OF_ORDER_SEQUENCE_EXCEPTION. Error Message: Out of order batch sequence for writer 4 at offset 7050 in table-bucket TableBucket{tableId=5, bucket=0} : 1 (incoming batch seq.), -1 (current batch seq.)
2024-12-26 08:46:31,381 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Terminating TaskManagerRunner with exit code 1.
org.apache.flink.util.FlinkException: Unexpected failure during runtime of TaskManagerRunner.
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:503) ~[flink-dist-1.20.0.jar:1.20.0]
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerProcessSecurely$5(TaskManagerRunner.java:537) ~[flink-dist-1.20.0.jar:1.20.0]
at java.security.AccessController.doPrivileged(Unknown Source) ~[?:?]
at javax.security.auth.Subject.doAs(Unknown Source) ~[?:?]
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836) ~[flink-shaded-hadoop-2-uber-2.8.3-10.0.jar:2.8.3-10.0]
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) ~[flink-dist-1.20.0.jar:1.20.0]
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:537) [flink-dist-1.20.0.jar:1.20.0]
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:517) [flink-dist-1.20.0.jar:1.20.0]
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:475) [flink-dist-1.20.0.jar:1.20.0]
Caused by: java.util.concurrent.TimeoutException: Waiting for TaskManager shutting down timed out after 10000 ms.
at org.apache.flink.util.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1113) ~[flink-dist-1.20.0.jar:1.20.0]
at org.apache.flink.util.concurrent.Executors$DirectExecutor.execute(Executors.java:60) ~[flink-dist-1.20.0.jar:1.20.0]
at org.apache.flink.util.concurrent.FutureUtils.lambda$orTimeout$12(FutureUtils.java:457) ~[flink-dist-1.20.0.jar:1.20.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
at java.lang.Thread.run(Unknown Source) ~[?:?]
2024-12-26 08:46:31,382 INFO org.apache.flink.runtime.state.TaskExecutorFileMergingManager [] - Shutting down TaskExecutorFileMergingManager.
2024-12-26 08:46:31,383 WARN com.alibaba.fluss.client.write.Sender [] - Get error write response on table bucket TableBucket{tableId=5, bucket=0}, retrying (2147360607 attempts left). Error: OUT_OF_ORDER_SEQUENCE_EXCEPTION. Error Message: Out of order batch sequence for writer 4 at offset 7050 in table-bucket TableBucket{tableId=5, bucket=0} : 1 (incoming batch seq.), -1 (current batch seq.)
2024-12-26 08:46:31,383 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Shutting down TaskExecutorLocalStateStoresManager.
2024-12-26 08:46:31,383 INFO org.apache.flink.runtime.blob.TransientBlobCache [] - Shu
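
The "attempts left" counter in the `Sender` warnings gives a rough idea of how many retries had already happened. The sketch below parses that counter from a log line and subtracts it from `Integer.MAX_VALUE`. That the retry budget starts at `Integer.MAX_VALUE` is an assumption inferred from the size of the numbers, not something confirmed in this thread, and `RetryBudget` is a hypothetical helper name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RetryBudget {
    // Matches the "retrying (N attempts left)" fragment seen in the Sender warnings above.
    private static final Pattern ATTEMPTS_LEFT =
            Pattern.compile("retrying \\((\\d+) attempts left\\)");

    /**
     * Estimates how many attempts were consumed, ASSUMING the budget started
     * at Integer.MAX_VALUE (an inference, not a documented default).
     */
    public static long attemptsUsed(String logLine) {
        Matcher m = ATTEMPTS_LEFT.matcher(logLine);
        if (!m.find()) {
            throw new IllegalArgumentException("No 'attempts left' counter in: " + logLine);
        }
        return Integer.MAX_VALUE - Long.parseLong(m.group(1));
    }

    public static void main(String[] args) {
        String line = "Get error write response on table bucket TableBucket{tableId=5, bucket=0}, "
                + "retrying (2147360609 attempts left). Error: OUT_OF_ORDER_SEQUENCE_EXCEPTION.";
        System.out.println(attemptsUsed(line));
    }
}
```

Under that assumption, the first warning shown corresponds to roughly 123,000 retries already spent, which is consistent with the report of very large TM logs.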

@luoyuxia
Collaborator

The OUT_OF_ORDER_SEQUENCE_EXCEPTION should be fixed in the main branch. Closing this issue. Feel free to reopen it if the problem still happens.
