[BUG] Auto promotion is not triggered when the master leader experiences network failure/degradation. #16848

Open
amberzsy opened this issue Dec 15, 2024 · 0 comments
Labels
bug Something isn't working Cluster Manager untriaged

Describe the bug

The OpenSearch cluster has 3 master nodes and 50+ data nodes. During a network failure or severe network degradation on the master leader node, many data nodes failed the leader check and were "disconnected" from the master leader. On the master leader side, those data nodes were removed from the cluster because of follower check failures and failures in the cluster state publish process. (Note: at this point the master leader was still running, writing logs, updating the cluster state, etc.)
This in turn led to massive shard relocation or, in some extreme cases, a Red state (60% of data nodes marked as disconnected and removed by the master).
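
To give a rough sense of how quickly nodes can be removed, here is a minimal sketch that estimates the delay before a node is declared faulty when every check times out. It assumes the `cluster.fault_detection.follower_check.*` and `leader_check.*` settings are at what we believe are the defaults (interval 1s, timeout 10s, retry count 3); these values were not verified on the affected cluster, and the formula is an approximation, not the actual coordination logic. `FaultDetection`, `approx_removal_delay_s` and `outage_exceeds_budget` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultDetection:
    """Assumed fault-detection settings (not read from the real cluster)."""
    interval_s: float = 1.0   # assumed ...check.interval
    timeout_s: float = 10.0   # assumed ...check.timeout
    retry_count: int = 3      # assumed ...check.retry_count


def approx_removal_delay_s(fd: FaultDetection) -> float:
    """Approximate time until retry_count consecutive checks have timed out.

    Each failed check costs one timeout, and consecutive checks are
    separated by one interval.
    """
    return fd.retry_count * fd.timeout_s + (fd.retry_count - 1) * fd.interval_s


def outage_exceeds_budget(outage_s: float, fd: FaultDetection) -> bool:
    """True if a degradation of outage_s is long enough to trip the checks."""
    return outage_s > approx_removal_delay_s(fd)


if __name__ == "__main__":
    fd = FaultDetection()
    print(approx_removal_delay_s(fd))         # 32.0 with the assumed values
    print(outage_exceeds_budget(5 * 60, fd))  # True for a 5-minute degradation
```

Under these assumptions, a degradation lasting several minutes is far longer than the time needed for follower checks to fail, which is consistent with the node removals described above.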

Related component

Cluster Manager

To Reproduce

  1. Set up a cluster with 3 master nodes (1 leader and 2 standby) and a couple of data nodes.
  2. Trigger network degradation (or network-layer packet drops, etc.) only on the master leader node for more than 5 minutes.
  3. Check the master leader and data node logs for follower/leader check failures and for data nodes being removed by the master leader.
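
As a complement to reading logs in step 3, the sketch below polls the cluster from a node that is not degraded and reports leader changes and lost data nodes. It assumes the REST API is reachable at `http://localhost:9200` without authentication or TLS (a cluster with the security plugin will need credentials and HTTPS), and it uses the `_cluster/health` and `_cat/cluster_manager` endpoints. `take_snapshot`, `describe_changes` and `watch` are hypothetical helper names.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: unauthenticated local endpoint


def fetch_json(path, timeout=5.0):
    """GET a REST path and decode the JSON body."""
    with urllib.request.urlopen(BASE_URL + path, timeout=timeout) as resp:
        return json.load(resp)


def take_snapshot():
    """Collect the elected leader and basic health counters."""
    health = fetch_json("/_cluster/health")
    managers = fetch_json("/_cat/cluster_manager?format=json")
    return {
        "leader": managers[0]["node"] if managers else None,
        "status": health["status"],
        "data_nodes": health["number_of_data_nodes"],
        "relocating_shards": health["relocating_shards"],
    }


def describe_changes(prev, curr):
    """Return human-readable differences between two snapshots."""
    changes = []
    if prev["leader"] != curr["leader"]:
        changes.append(f"leader changed: {prev['leader']} -> {curr['leader']}")
    lost = prev["data_nodes"] - curr["data_nodes"]
    if lost > 0:
        changes.append(f"{lost} data node(s) left the cluster")
    if prev["status"] != curr["status"]:
        changes.append(f"status changed: {prev['status']} -> {curr['status']}")
    return changes


def watch(period_s=5.0):
    """Poll forever and print changes; stop with Ctrl-C."""
    prev = take_snapshot()
    while True:
        time.sleep(period_s)
        try:
            curr = take_snapshot()
        except OSError as exc:  # covers URLError and socket timeouts
            print(f"poll failed: {exc}")
            continue
        for line in describe_changes(prev, curr):
            print(line)
        prev = curr
```

Calling `watch()` during the degradation window should show whether the elected leader ever changes while data nodes are dropping out.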

Expected behavior

Ideally, during network degradation or failure on the master leader, the cluster would automatically promote (elect) one of the two standby nodes to leader. However, this did not happen.

We also tried the scenarios below, and auto promotion works properly in both.

  1. Trigger a graceful shutdown of the master leader. A standby master-eligible node is promoted.
  2. Trigger an ungraceful shutdown of the leader (e.g. kill -9 the master leader process while it is running). A standby master-eligible node is promoted and keeps the cluster running. Data nodes update their leader info and reconnect to the newly elected leader.

Additional Details

Plugins
opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-custom-codecs
opensearch-flow-framework
opensearch-geospatial
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-ml
opensearch-neural-search
opensearch-notifications
opensearch-notifications-core
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-skills
opensearch-sql
prometheus-exporter
repository-gcs
repository-s3

Screenshots
N/A

Host/Environment (please complete the following information):

  • OS: Linux
  • Version: 2.16.1

Additional context
N/A
