[BUG] Auto promotion is not triggered when the master leader experiences network failure/degradation. #16848

amberzsy · 2024-12-15T23:40:15Z

Describe the bug

The OpenSearch cluster has 3 master nodes and 50+ data nodes. During a network failure or severe network degradation on the master leader node, many data nodes failed their leader checks and were "disconnected" from the master leader. On the master leader side, those data nodes were removed from the cluster because follower checks and cluster state publication to them failed. (Note: at this point the master leader was still processing, publishing logs, updating cluster state, etc.)
This further leads to massive shard relocation, or to a Red state in some extreme cases (60% of data nodes marked as disconnected and removed by the master).
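For context on the leader and follower checks mentioned above, here is a rough sketch of how long such a check can take to declare a peer failed. It assumes the `cluster.fault_detection.leader_check.*` and `cluster.fault_detection.follower_check.*` settings with their commonly documented defaults (1s interval, 10s timeout, 3 retries). The values used in the affected cluster were not reported, `CheckSettings` and `worst_case_detection_s` are hypothetical names, and the formula is a simplified upper bound rather than the exact scheduling logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckSettings:
    """One family of fault-detection settings (leader_check or follower_check)."""
    interval_s: float = 1.0   # assumed default of ...interval
    timeout_s: float = 10.0   # assumed default of ...timeout
    retry_count: int = 3      # assumed default of ...retry_count


def worst_case_detection_s(s: CheckSettings) -> float:
    """Simplified upper bound on the time to declare a peer failed.

    Assumes every attempt runs to its full timeout and that one interval
    elapses between consecutive attempts.
    """
    return s.retry_count * s.timeout_s + (s.retry_count - 1) * s.interval_s


if __name__ == "__main__":
    print(worst_case_detection_s(CheckSettings()))
```

Under these assumed defaults the window is on the order of tens of seconds, so a degradation lasting more than 5 minutes is long enough for many consecutive checks to fail, which is consistent with the node removals described above.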

Related component

Cluster Manager

To Reproduce

Set up a cluster with 3 master nodes (1 leader and 2 standby) and a couple of data nodes.
Trigger network degradation (or network-layer packet drops, etc.) on the master leader node only, for more than 5 minutes.
Check the master leader and data node logs for follower/leader check failures and for data nodes starting to be removed by the master leader.
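Alongside the log checks above, the cluster can be polled while the degradation is active to record whether the leader ever changes. The sketch below is a minimal, hypothetical helper: `Snapshot`, `fetch_snapshot`, and `summarize` are illustrative names, `base_url` is a placeholder, and authentication and TLS (which a cluster running the security plugin would need) are not handled. It relies only on the `_cat/cluster_manager` and `_cluster/health` REST endpoints.

```python
import json
import urllib.request
from typing import NamedTuple, Optional, Sequence


class Snapshot(NamedTuple):
    leader: Optional[str]  # node name of the elected leader, None if unknown
    data_nodes: int        # number_of_data_nodes from _cluster/health
    status: str            # "green", "yellow" or "red"


def fetch_snapshot(base_url: str, timeout_s: float = 5.0) -> Snapshot:
    """Poll one node, e.g. base_url="http://localhost:9200" (placeholder).

    Network errors and timeouts are propagated to the caller.
    """
    def get(path: str):
        with urllib.request.urlopen(base_url + path, timeout=timeout_s) as resp:
            return json.load(resp)

    leaders = get("/_cat/cluster_manager?format=json")
    health = get("/_cluster/health")
    leader = leaders[0].get("node") if leaders else None
    if leader == "-":  # assumed placeholder value when no leader is known
        leader = None
    return Snapshot(leader, health["number_of_data_nodes"], health["status"])


def summarize(snapshots: Sequence[Snapshot]) -> dict:
    """Reduce a series of snapshots to the facts relevant to this report."""
    leaders = {s.leader for s in snapshots if s.leader is not None}
    counts = [s.data_nodes for s in snapshots]
    return {
        "leader_changed": len(leaders) > 1,
        "max_data_nodes_lost": counts[0] - min(counts) if counts else 0,
        "went_red": any(s.status == "red" for s in snapshots),
    }
```

Calling `fetch_snapshot` every few seconds against a standby master and a data node, then passing the collected list to `summarize`, gives a compact record of the run. The behavior reported here would show up as `leader_changed` being false while `max_data_nodes_lost` is greater than zero.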

Expected behavior

Ideally, during network degradation/failure on the master leader, one of the two standby nodes would automatically be promoted or elected as leader. However, that did not happen.

We tried the other scenarios below, and auto promotion works properly in both.

Trigger a graceful shutdown of the master leader. The standby master-eligible node is promoted.
Trigger an ungraceful shutdown of the leader (e.g. kill -9 the master leader process while it is running). The standby master-eligible node is promoted and keeps the cluster running. Data nodes update their leader info and reconnect to the newly elected leader.

Additional Details

Plugins
opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-custom-codecs
opensearch-flow-framework
opensearch-geospatial
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-ml
opensearch-neural-search
opensearch-notifications
opensearch-notifications-core
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-skills
opensearch-sql
prometheus-exporter
repository-gcs
repository-s3

Screenshots
N/A

Host/Environment (please complete the following information):

OS: Linux
Version 2.16.1

Additional context
N/A

amberzsy added the bug and untriaged labels Dec 15, 2024

github-actions bot added the Cluster Manager label Dec 15, 2024

github-project-automation bot added this to Cluster Manager Project Board Dec 15, 2024

github-project-automation bot moved this to 🆕 New in Cluster Manager Project Board Dec 15, 2024
