[BUG] IO Freeze on Leader causes cluster publication to get stuck #1165
Comments
Can you add the steps to reproduce this issue or write a test case for it?
I don't see the FS health check failing as fixing this problem, because once the health check fails, the …
FS health checks do proactive checks to identify a bad node and evict it from the cluster, rather than waiting for a cluster state update to remove the stuck leader. You rightly pointed out that this fix itself is insufficient, since the mutex for …
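For context, a minimal sketch of what such a proactive filesystem health check could look like (hypothetical names and parameters, not the actual FsHealthService implementation): periodically write and fsync a small probe file on a dedicated thread, and mark the node unhealthy if the write does not complete within a timeout, so the cluster can evict it rather than wait on a stuck cluster state update.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.*;

public class FsHealthCheckSketch {
    private final ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
    private volatile boolean healthy = true;

    // Write and fsync a small probe file; if it does not finish within the
    // timeout, mark the node unhealthy so it can be evicted from the cluster.
    void runCheck(Path dataDir, long timeoutMillis) {
        Future<?> probe = ioExecutor.submit(() -> {
            Path tmp = dataDir.resolve(".health-check.tmp");
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
                ch.write(ByteBuffer.wrap("ok".getBytes(StandardCharsets.UTF_8)));
                ch.force(true); // fsync; hangs if the underlying disk is frozen
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        try {
            probe.get(timeoutMillis, TimeUnit.MILLISECONDS);
            healthy = true;
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            healthy = false; // a stuck or failing disk makes the node report unhealthy
        }
    }

    boolean isHealthy() {
        return healthy;
    }
}
```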
Hi Bukhtawar, are you actively working on this issue? I see a pull request #1167 has already been submitted. Is anything else pending before we can close the issue?
We don't seem to have any metrics on how frequently this issue manifests, but it appears to happen when the filesystem in use is not a local disk (e.g. EBS).
@Bukhtawar is there anything else pending on this issue before we close it?
Describe the bug
The publication of cluster state is time-bound to 30s by the `cluster.publish.timeout` setting. If this time is reached before the new cluster state is committed, the cluster state change is rejected and the leader considers itself to have failed; it stands down and starts trying to elect a new master.

There is a bug in the leader: when it tries to publish the new cluster state, it first tries to acquire a lock (`0x0000000097a2f970`) to flush the new state to disk under a mutex. The same lock (`0x0000000097a2f970`) is used to cancel the publication on timeout. Below is the state of the timeout scheduler meant to cancel the publication. So essentially, if the flushing of cluster state is stuck on IO, so is the cancellation of the publication, since both share the same mutex. As a result, the leader will not step down and effectively blocks the cluster from making progress.

FS Health checks at this point

Leader trying to commit the new cluster state to disk, causing other operations to stall on the same mutex.

Note that other processing, like the FollowerChecker removing a node, is stuck on the same mutex.
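A minimal sketch of the deadlock pattern described above, using hypothetical class and method names rather than the actual OpenSearch code: the state flush and the publication cancellation synchronize on the same mutex, so an IO hang in the flush also blocks the timeout handler that is supposed to cancel the publication and let the leader step down.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PublicationDeadlockSketch {
    // Plays the role of the shared lock 0x0000000097a2f970 in the thread dumps.
    private final Object mutex = new Object();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void publish(byte[] clusterState) {
        // Timeout handler: supposed to cancel the publication after cluster.publish.timeout (30s).
        scheduler.schedule(this::cancelPublication, 30, TimeUnit.SECONDS);

        synchronized (mutex) {
            flushToDisk(clusterState); // if this blocks on IO...
        }
    }

    void cancelPublication() {
        synchronized (mutex) {
            // ...the cancellation blocks here too, so the publication is never
            // failed and the leader never steps down.
        }
    }

    private void flushToDisk(byte[] state) {
        // fsync of the cluster state; a frozen disk (e.g. a stuck EBS volume)
        // can make this call hang indefinitely.
    }
}
```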
Proposal
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Plugins
Please list all plugins currently enabled.
Screenshots
If applicable, add screenshots to help explain your problem.
Host/Environment (please complete the following information):
Additional context
Add any other context about the problem here.