Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Flaky Test SpecificClusterManagerNodesIT.testElectOnlyBetweenClusterManagerNodes #16021

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kkewwei
Copy link
Contributor

@kkewwei kkewwei commented Sep 20, 2024

Description

The case is as follows:

  1. When the node_t1 is excluded from the vote config, and the cluster starts a new leader election, but the the node node_t2 hasn't been elected as the new leader.

  2. At the moment, we send request to get the ClusterManager, we first get ClusterManager name, and leads to the NullPointerException.

internalCluster().nonClusterManagerClient()-> ......->getClusterManagerName()

return client.admin().cluster().prepareState().get().getState().nodes().getClusterManagerNode().getName();

Related Issues

Resolves #15944 #16015

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Contributor

❕ Gradle check result for 1b3920b: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@msfroh
Copy link
Collaborator

msfroh commented Sep 20, 2024

Checking the history of this test, has this been flaky for more than a year?

The only code related to cluster manager election that I've been able to find changed more recently than ~5 years ago (besides renaming) is the introduction of DecommisionService in 2023.

@kkewwei kkewwei force-pushed the fix_16015 branch 2 times, most recently from 1338c91 to 3823b1c Compare September 21, 2024 05:30
@kkewwei kkewwei changed the title Fix Flaky Test SpecificClusterManagerNodesIT.testElectOnlyBetweenClus… Fix Flaky Test SpecificClusterManagerNodesIT.testElectOnlyBetweenClusterManagerNodes Sep 21, 2024
@kkewwei
Copy link
Contributor Author

kkewwei commented Sep 21, 2024

Checking the history of this test, has this been flaky for more than a year?

The only code related to cluster manager election that I've been able to find changed more recently than ~5 years ago (besides renaming) is the introduction of DecommisionService in 2023.

@msfroh I'm not clearly why it wasn’t reported before, maybe the frequency of occurrence is very low and we ignored it? This issue #15944 collects some of the recent case.

Logically speaking, when the cluster has no cluster manager, It's right to return null from getClusterManagerNode, which will hit the case.

Copy link
Contributor

✅ Gradle check result for 4d812e9: SUCCESS

Copy link

codecov bot commented Sep 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.26%. Comparing base (f9c239d) to head (40548ec).

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16021      +/-   ##
============================================
+ Coverage     72.11%   72.26%   +0.14%     
- Complexity    65260    65320      +60     
============================================
  Files          5301     5301              
  Lines        303801   303801              
  Branches      44029    44029              
============================================
+ Hits         219098   219539     +441     
+ Misses        66686    66223     -463     
- Partials      18017    18039      +22     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@kkewwei
Copy link
Contributor Author

kkewwei commented Sep 24, 2024

@msfroh @gaobinlong please help confirm it when you are free.

@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Nov 6, 2024
return null;
}
};
assertThat(clusterManagerName.get(), equalTo(nextClusterManagerEligableNodeName));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we swallow the NPE with the try/catch above, won't this test still fail on the assertion here since clusterManagerName.get() would be null and nextClusterManagerEligableNodeName shouldn't be null?

Copy link
Contributor Author

@kkewwei kkewwei Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jed326 assertBusy allows the assert failed in 10s, so it seems ok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add an additional assert before this to evaluate it is not null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertBusy can retry in 10s, it seems that we shouldn't add a new assertBusy before it.

@opensearch-trigger-bot opensearch-trigger-bot bot removed the stalled Issues that have stalled label Nov 7, 2024
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run labels Dec 13, 2024
Copy link
Contributor

@rajiv-kv rajiv-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kkewwei for raising a fix

.nodes()
.getClusterManagerNode()
.getName();
} catch (Exception e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of catching a generic exception, could we catch a specific exception related to ClusterManager Not found.

return null;
}
};
assertThat(clusterManagerName.get(), equalTo(nextClusterManagerEligableNodeName));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add an additional assert before this to evaluate it is not null.

@kkewwei kkewwei force-pushed the fix_16015 branch 2 times, most recently from d2c4d60 to 4e3feaa Compare January 12, 2025 05:51
Copy link
Contributor

❕ Gradle check result for 4e3feaa: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Copy link
Contributor

✅ Gradle check result for b1ba756: SUCCESS

@kkewwei
Copy link
Contributor Author

kkewwei commented Jan 15, 2025

@rajiv-kv, please have a look when you are free.

@@ -2171,7 +2172,7 @@ public String getClusterManagerName(@Nullable String viaNode) {
return client.admin().cluster().prepareState().get().getState().nodes().getClusterManagerNode().getName();
} catch (Exception e) {
Copy link
Contributor

@rajiv-kv rajiv-kv Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a seperate catch block for ClusterManagerNotDiscoveredException and not change the generic exception handler ?

Alternatively we can avoid this change and check for e.getCause at the caller as the exception is wrapped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rajiv-kv I think when null is detected, we should take the initiative to handle it, how about this fix?

Copy link
Contributor

❕ Gradle check result for 40548ec: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.test.rest.ClientYamlTestSuiteIT.test {p0=search.aggregation/20_terms/string profiler via global ordinals}

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
autocut backport 2.x Backport to 2.x branch bug Something isn't working Cluster Manager flaky-test Random test failure that succeeds on second run skip-changelog >test-failure Test failure from CI, local build, etc.
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for SpecificClusterManagerNodesIT
5 participants