Describe the bug
When a cluster has many cluster manager nodes (5 or more) and a large cluster state, the cluster gets stuck in an election loop if the active cluster manager node drops for any reason.
Explanation: With 5 cluster manager nodes, if the active cluster manager drops, the remaining 4 cluster manager nodes each start an election within 100ms (the initial timeout) of one another. When the first election reaches a quorum, that node is set as leader and its election scheduler is cancelled. While the new leader is still computing the cluster state, the next election also succeeds, because the 3 remaining nodes can still vote and form a quorum. That election increments the term, so the previously elected leader steps down and restarts its own election. Because a new election scheduler is created every time, this cycle repeats without any backoff.
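The race described above can be illustrated with a small model. This is only a sketch under simplifying assumptions, not OpenSearch code: the class `ElectionLoopSketch` and its method are hypothetical, each election win is assumed to bump the term immediately, and a leader is assumed to step down as soon as a later win lands before its publication has finished.

```java
// Hypothetical, simplified model of the term race; not OpenSearch code.
final class ElectionLoopSketch {

    private ElectionLoopSketch() {}

    /**
     * Finds the first election win whose leader gets to finish publishing.
     *
     * @param winTimesMs ascending times (ms) at which successive candidates win a quorum;
     *                   every win is assumed to increment the term
     * @param publishMs  time (ms) a new leader needs to compute and publish the cluster state
     * @return index of the first win that is not preempted, or -1 if there are no wins
     */
    static int firstSuccessfulWin(long[] winTimesMs, long publishMs) {
        for (int i = 0; i < winTimesMs.length; i++) {
            boolean isLastWin = i == winTimesMs.length - 1;
            // A later win that arrives before this publication completes raises the term,
            // so this leader steps down and its publication fails.
            if (isLastWin || winTimesMs[i + 1] >= winTimesMs[i] + publishMs) {
                return i;
            }
        }
        return -1;
    }
}
```

With wins at 0, 100, 200 and 300 ms and a publication time of 1000 ms, only the last win (index 3) survives in this model. In the reported scenario the leaders that stepped down immediately start new elections, so the list of wins keeps growing and no win ever becomes the last one.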
Related component
Cluster Manager
To Reproduce
Create a cluster with 5 cluster manager nodes.
Create enough indices that cluster state computation takes around 1 second.
Kill the active cluster manager node.
Check the logs of the remaining cluster manager nodes: the election succeeds, but the cluster state publication fails with the following error:
FailedToCommitClusterStateException[node is no longer cluster-manager for term 673682 while handling publication]
at org.opensearch.cluster.coordination.Coordinator.publish(Coordinator.java:1280)
Expected behavior
When the active cluster manager node drops, the next election should start immediately. If the elected node then fails to publish the cluster state, it should not cancel the election scheduler; instead, the next election should start after a backoff time.
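One way to picture the requested behavior is a failure counter that survives a failed publication, so the delay before the next election keeps growing until a publication actually succeeds. The following is a minimal sketch, assuming a linear backoff with a cap; the class `ElectionBackoffSketch`, its methods, and the timeout values are hypothetical and do not reflect the actual scheduler implementation.

```java
// Hypothetical sketch of a backoff that is preserved across failed publications.
final class ElectionBackoffSketch {
    private final long initialTimeoutMs;
    private final long backoffMs;
    private final long maxTimeoutMs;

    // Consecutive failures; deliberately NOT reset when an election merely wins a quorum.
    private int failedAttempts;

    ElectionBackoffSketch(long initialTimeoutMs, long backoffMs, long maxTimeoutMs) {
        this.initialTimeoutMs = initialTimeoutMs;
        this.backoffMs = backoffMs;
        this.maxTimeoutMs = maxTimeoutMs;
    }

    /** Upper bound (ms) for the delay before the next election; grows linearly up to the cap. */
    long nextDelayUpperBoundMs() {
        long grown = initialTimeoutMs + (long) failedAttempts * backoffMs;
        return Math.min(maxTimeoutMs, grown);
    }

    /** Called when a quorum was won but the cluster state publication failed. */
    void onPublicationFailed() {
        failedAttempts++;
    }

    /** Called only once a cluster state has actually been published. */
    void onPublicationSucceeded() {
        failedAttempts = 0;
    }
}
```

The design point is where the counter is reset: on a successful publication rather than on winning the vote, so repeated step-downs push competing candidates further apart in time instead of restarting from the initial timeout.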
Additional Details
Host/Environment (please complete the following information):
OS 2.11