-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Flaky Test SpecificClusterManagerNodesIT.testElectOnlyBetweenClusterManagerNodes #16021
base: main
Are you sure you want to change the base?
Conversation
❕ Gradle check result for 1b3920b: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
Checking the history of this test, has this been flaky for more than a year? The only code related to cluster manager election that I've been able to find changed more recently than ~5 years ago (besides renaming) is the introduction of DecommisionService in 2023. |
1338c91
to
3823b1c
Compare
@msfroh I'm not clearly why it wasn’t reported before, maybe the frequency of occurrence is very low and we ignored it? This issue #15944 collects some of the recent case. Logically speaking, when the cluster has no cluster manager, It's right to return null from |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #16021 +/- ##
============================================
+ Coverage 72.11% 72.26% +0.14%
- Complexity 65260 65320 +60
============================================
Files 5301 5301
Lines 303801 303801
Branches 44029 44029
============================================
+ Hits 219098 219539 +441
+ Misses 66686 66223 -463
- Partials 18017 18039 +22 ☔ View full report in Codecov by Sentry. |
@msfroh @gaobinlong please help confirm it when you are free. |
This PR is stalled because it has been open for 30 days with no activity. |
return null; | ||
} | ||
}; | ||
assertThat(clusterManagerName.get(), equalTo(nextClusterManagerEligableNodeName)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if we swallow the NPE with the try/catch above, won't this test still fail on the assertion here since clusterManagerName.get()
would be null
and nextClusterManagerEligableNodeName
shouldn't be null
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jed326 assertBusy
allows the assert failed in 10s, so it seems ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add an additional assert before this to evaluate it is not null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertBusy
can retry in 10s, it seems that we shouldn't add a new assertBusy
before it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kkewwei for raising a fix
server/src/internalClusterTest/java/org/opensearch/cluster/SpecificClusterManagerNodesIT.java
Outdated
Show resolved
Hide resolved
.nodes() | ||
.getClusterManagerNode() | ||
.getName(); | ||
} catch (Exception e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of catching a generic exception, could we catch a specific exception related to ClusterManager Not found.
return null; | ||
} | ||
}; | ||
assertThat(clusterManagerName.get(), equalTo(nextClusterManagerEligableNodeName)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add an additional assert before this to evaluate it is not null.
d2c4d60
to
4e3feaa
Compare
❕ Gradle check result for 4e3feaa: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
@rajiv-kv, please have a look when you are free. |
@@ -2171,7 +2172,7 @@ public String getClusterManagerName(@Nullable String viaNode) { | |||
return client.admin().cluster().prepareState().get().getState().nodes().getClusterManagerNode().getName(); | |||
} catch (Exception e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a seperate catch block for ClusterManagerNotDiscoveredException
and not change the generic exception handler ?
Alternatively we can avoid this change and check for e.getCause
at the caller as the exception is wrapped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rajiv-kv I think when null is detected, we should take the initiative to handle it, how about this fix?
…terManagerNodes Signed-off-by: kkewwei <[email protected]> Signed-off-by: kkewwei <[email protected]>
❕ Gradle check result for 40548ec: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
Description
The case is as follows:
When the
node_t1
is excluded from the vote config, and the cluster starts a new leader election, but the the nodenode_t2
hasn't been elected as the new leader.At the moment, we send request to get the ClusterManager, we first get ClusterManager name, and leads to the NullPointerException.
internalCluster().nonClusterManagerClient()
-> ......->getClusterManagerName()
OpenSearch/test/framework/src/main/java/org/opensearch/test/InternalTestCluster.java
Line 2171 in f0ea056
Related Issues
Resolves #15944 #16015
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.