-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow AsyncTasks to indicate if they should not be scheduled during a shutdown #10860
Conversation
Fixes opensearch-project#10827 Signed-off-by: Peter Nied <[email protected]>
Signed-off-by: Peter Nied <[email protected]>
Signed-off-by: Peter Nied <[email protected]>
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
@@ -64,6 +68,10 @@ public void rejectedExecution(Runnable r, ThreadPoolExecutor executor) { | |||
} | |||
} | |||
rejected.inc(); | |||
if (executor.isTerminating() || executor.isTerminated() || executor.isShutdown()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@peternied we should not do that: the listeners will be called on rejection to notify the caller that the submission failed, with this change it won't happen anymore, leaving some callbacks in dangling state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reta Thanks for the feedback, I picked an ugly issue to attempt to root cause and fix in the flaky test space - I might be making assumptions that are not well founded.
I understand from the programming purity - this is against the conventions of executors; however, aren't the only cases where we are shutting down executors when we are shutting down nodes? Is there a better way to determine that the cluster is going down so these sources of failures can be ignored?
My goal with this change is to stabilize the test infrastructure assuming this change passes CI, don't we have sufficient coverage to accept this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand from the programming purity - this is against the conventions of executors; however, aren't the only cases where we are shutting down executors when we are shutting down nodes?
It may be the case, but even than we need a clean shutdown, tracing comes immediately as an example here - shutting down the node should gracefully flush the data to collector but with this change, the trace spans won't be flushed since they will be left in dangling state.
My goal with this change is to stabilize the test infrastructure assuming this change passes CI, don't we have sufficient coverage to accept this change?
We should have coverage but we may end up well in a flaky state. But this is not really about coverage - we break the contract by not calling the callbacks when we should.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To revisit;
Is there a better way to determine that the cluster is going down so these sources of failures can be ignored?
I’ve got two ways forward that I’ll pursue:
-
The stale state detection task in seg-rep is not-useful while a node is shutting down maybe I can dequeue or swallow the scheduling from that system.
-
Failing that, I’m going to see if there is a way to keep honoring the contract - if there are callbacks - but not in other scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stale state detection task in seg-rep is not-useful while a node is shutting down maybe I can dequeue or swallow the scheduling from that system.
In some places, there are lifecycle check, when node goes down, the lifecycle check may not even proceed with the tasks (I would strongly advocate to not alter the pool behaviour - this is low level tool).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe I found a way forward using the first path which is much cleaner in my mind, not 100% sure how I can validate other than get a number of test runs in. Let me know what you think of this.
Compatibility status:Checks if related components are compatible with change 7f2b997 Incompatible componentsIncompatible components: [https://github.com/opensearch-project/cross-cluster-replication.git] Skipped componentsCompatible componentsCompatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/reporting.git] |
Signed-off-by: Peter Nied <[email protected]>
Gradle Check (Jenkins) Run Completed with:
|
This reverts commit 1b3df2d. Signed-off-by: Peter Nied <[email protected]>
This reverts commit 1613000. Signed-off-by: Peter Nied <[email protected]>
This reverts commit 805a98f. Signed-off-by: Peter Nied <[email protected]>
This reverts commit 3d9ea0b. Signed-off-by: Peter Nied <[email protected]>
… shutdown Signed-off-by: Peter Nied <[email protected]>
Gradle Check (Jenkins) Run Completed with:
|
Signed-off-by: Peter Nied <[email protected]>
Gradle Check (Jenkins) Run Completed with:
|
Signed-off-by: Peter Nied <[email protected]>
/** | ||
* If the node is shutting down do not schedule this task. | ||
*/ | ||
protected boolean doNotRunWhenShuttingDown() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be fair, I have difficulties to understand the complete picture here (my apologies for that), the mental model I have in mind is this:
- all components should be closed on node shutdown
- the async tasks that components maintain should also be closed on shutdown
- the async task is not reschedule when closed
Taking SegmentReplicationPressureService
, I would expect to observe:
- Node::close()
- SegmentReplicationPressureService::close()
- AsyncFailStaleReplicaTask::close()
(same for PersistentTasksClusterService), what I am missing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for asking for clarity - I've been making some assumptions and attempting to address them - thus those previous changes are not 'provable'. I'm going to take a step back and add some thread.sleeps
until I can manifest the issue locally.
I'll come back with some findings and address your workflow question. It is a complex space.
Gradle Check (Jenkins) Run Completed with:
|
I've been doing experiments for the past 2 days, adding
|
Description
Allow AsyncTasks to indicate if they should not be scheduled during a shutdown
Related Issues
Check List
New functionality has javadoc addedBy submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.