
Allow AsyncTasks to indicate if they should not be scheduled during a shutdown #10860

Closed

peternied wants to merge 11 commits

Conversation

peternied (Member) commented Oct 23, 2023

Description

Allow AsyncTasks to indicate if they should not be scheduled during a shutdown

Related Issues

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Peter Nied <[email protected]>

```diff
@@ -64,6 +68,10 @@ public void rejectedExecution(Runnable r, ThreadPoolExecutor executor) {
             }
         }
         rejected.inc();
+        if (executor.isTerminating() || executor.isTerminated() || executor.isShutdown()) {
```
reta (Collaborator) commented Oct 23, 2023:

@peternied we should not do that: the listeners will be called on rejection to notify the caller that the submission failed; with this change that won't happen anymore, leaving some callbacks in a dangling state.
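For illustration, a minimal, self-contained sketch of the contract being described (the `Listener` type and all names here are hypothetical stand-ins, not OpenSearch's actual listener API): a caller that waits on its callback only makes progress if the rejection is propagated rather than swallowed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class DanglingCallbackDemo {
    // Hypothetical stand-in for a result callback such as an ActionListener.
    interface Listener {
        void onResponse();
        void onFailure(Exception e);
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        executor.shutdown(); // simulate node shutdown: every new submission is rejected

        CountDownLatch done = new CountDownLatch(1);
        Listener listener = new Listener() {
            public void onResponse() { done.countDown(); }
            public void onFailure(Exception e) { done.countDown(); }
        };

        try {
            executor.execute(listener::onResponse);
        } catch (RejectedExecutionException e) {
            // Propagating the rejection lets the caller complete its callback.
            // If the handler swallowed the rejection instead, onFailure would
            // never run and the await() below would time out: a dangling callback.
            listener.onFailure(e);
        }
        System.out.println("callback completed: " + done.await(1, TimeUnit.SECONDS));
    }
}
```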

peternied (Member, Author) commented:

@reta Thanks for the feedback - I picked an ugly issue to attempt to root-cause and fix in the flaky-test space, so I might be making assumptions that are not well founded.

I understand that from a programming-purity standpoint this is against the conventions of executors; however, aren't the only cases where we shut down executors the ones where we are shutting down nodes? Is there a better way to determine that the cluster is going down so these sources of failure can be ignored?

My goal with this change is to stabilize the test infrastructure. Assuming this change passes CI, don't we have sufficient coverage to accept it?

reta (Collaborator) commented Oct 23, 2023:

> I understand from the programming purity - this is against the conventions of executors; however, aren't the only cases where we are shutting down executors when we are shutting down nodes?

It may be the case, but even then we need a clean shutdown. Tracing comes to mind immediately as an example here: shutting down the node should gracefully flush the data to the collector, but with this change the trace spans won't be flushed since they will be left in a dangling state.

> My goal with this change is to stabilize the test infrastructure assuming this change passes CI, don't we have sufficient coverage to accept this change?

We should have coverage, but we may well end up in a flaky state. This is not really about coverage, though - we break the contract by not calling the callbacks when we should.

peternied (Member, Author) commented:

To revisit:

> Is there a better way to determine that the cluster is going down so these sources of failures can be ignored?

I've got two ways forward that I'll pursue:

  1. The stale-state detection task in seg-rep is not useful while a node is shutting down; maybe I can dequeue or swallow the scheduling from that system.

  2. Failing that, I'm going to see if there is a way to keep honoring the contract where there are callbacks, but not in other scenarios (see the sketch after this list).
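As a sketch of the second idea (assumed names throughout, not the PR's code): a periodic task that stops requeueing itself once it is closed or the pool is shutting down never submits work to a terminating executor, so the rejection/callback contract is never in play.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical base class for a self-rescheduling periodic task.
abstract class PeriodicTask implements Runnable {
    private final ScheduledExecutorService scheduler;
    private final long intervalMillis;
    private volatile boolean closed;

    PeriodicTask(ScheduledExecutorService scheduler, long intervalMillis) {
        this.scheduler = scheduler;
        this.intervalMillis = intervalMillis;
    }

    /** Subclasses do their periodic work here. */
    protected abstract void runInternal();

    @Override
    public void run() {
        try {
            runInternal();
        } finally {
            // Skip the *rescheduling*, not the rejection: once the task is
            // closed or the pool is going down, simply stop requeueing.
            if (!closed && !scheduler.isShutdown()) {
                scheduler.schedule(this, intervalMillis, TimeUnit.MILLISECONDS);
            }
        }
    }

    void close() {
        closed = true;
    }
}
```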

reta (Collaborator) commented Oct 23, 2023:

> The stale state detection task in seg-rep is not-useful while a node is shutting down maybe I can dequeue or swallow the scheduling from that system.

In some places there are lifecycle checks; when the node goes down, the lifecycle check may not even proceed with the tasks (I would strongly advocate not altering the pool behaviour - this is a low-level tool).
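A compact sketch of that lifecycle-check pattern, with hypothetical names: the component tracks its own state and skips the work once it has been stopped, leaving the thread pool untouched.

```java
// Hypothetical component guarding its periodic work with a lifecycle state,
// instead of altering the low-level pool behaviour.
enum LifecycleState { STARTED, STOPPED, CLOSED }

class StaleReplicaDetector {
    private volatile LifecycleState state = LifecycleState.STARTED;

    void runDetection() {
        if (state != LifecycleState.STARTED) {
            return; // node is going down: skip the task rather than touch the executor
        }
        // ... detect and fail stale replicas ...
    }

    void stop() {
        state = LifecycleState.STOPPED;
    }

    void close() {
        state = LifecycleState.CLOSED;
    }
}
```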

peternied (Member, Author) commented:

I believe I found a way forward using the first path, which is much cleaner in my mind; I'm not 100% sure how I can validate it other than getting a number of test runs in. Let me know what you think of this.

github-actions bot commented Oct 23, 2023:

Compatibility status:

Checks if related components are compatible with change 7f2b997

Incompatible components

Incompatible components: [https://github.com/opensearch-project/cross-cluster-replication.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/reporting.git]


This reverts commit 1b3df2d.

Signed-off-by: Peter Nied <[email protected]>
This reverts commit 1613000.

Signed-off-by: Peter Nied <[email protected]>
This reverts commit 805a98f.

Signed-off-by: Peter Nied <[email protected]>
github-actions bot added the flaky-test (Random test failure that succeeds on second run) label Oct 23, 2023
peternied changed the title from "Silently ignore rejections when threadpools are terminating" to "Allow AsyncTasks to indicate if they should not be scheduled during a shutdown" Oct 23, 2023

```java
/**
 * If the node is shutting down, do not schedule this task.
 */
protected boolean doNotRunWhenShuttingDown() {
```
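To show how such a hook might be consumed (the wiring below is an assumption sketching the idea, not the PR's exact code): the scheduling side consults the override before (re)scheduling, and a task whose work is pointless on a dying node opts out.

```java
// Hypothetical wiring for the opt-out hook; class names echo the PR but the
// details are assumptions.
abstract class AbstractAsyncTask implements Runnable {
    private final boolean nodeShuttingDown;

    AbstractAsyncTask(boolean nodeShuttingDown) {
        this.nodeShuttingDown = nodeShuttingDown;
    }

    /** If the node is shutting down, do not schedule this task. */
    protected boolean doNotRunWhenShuttingDown() {
        return false; // default keeps existing behaviour for all other tasks
    }

    final boolean shouldSchedule() {
        return !(nodeShuttingDown && doNotRunWhenShuttingDown());
    }
}

// A task like seg-rep's stale-replica check could then opt out:
class FailStaleReplicaTask extends AbstractAsyncTask {
    FailStaleReplicaTask(boolean nodeShuttingDown) {
        super(nodeShuttingDown);
    }

    @Override
    public void run() {
        // ... fail stale replicas ...
    }

    @Override
    protected boolean doNotRunWhenShuttingDown() {
        return true; // this work is useless on a node that is going down
    }
}
```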
reta (Collaborator) commented:

To be fair, I have difficulty understanding the complete picture here (my apologies for that). The mental model I have in mind is this:

  • all components should be closed on node shutdown
  • the async tasks that components maintain should also be closed on shutdown
  • the async task is not rescheduled when closed

Taking SegmentReplicationPressureService, I would expect to observe:

  • Node::close()
  • SegmentReplicationPressureService::close()
  • AsyncFailStaleReplicaTask::close()

(same for PersistentTasksClusterService) - what am I missing here?
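For reference, a compact model of that expected shutdown ordering, using plain Closeable stand-ins (the real classes have much richer lifecycles; this is only an illustration):

```java
import java.io.Closeable;
import java.io.IOException;

class AsyncFailStaleReplicaTask implements Closeable {
    private volatile boolean closed;

    void reschedule() {
        if (closed) {
            return; // a closed task is never rescheduled
        }
        // ... enqueue the next run ...
    }

    @Override
    public void close() {
        closed = true;
    }
}

class SegmentReplicationPressureService implements Closeable {
    private final AsyncFailStaleReplicaTask task = new AsyncFailStaleReplicaTask();

    @Override
    public void close() throws IOException {
        task.close(); // closing the service closes the tasks it maintains
    }
}

class Node implements Closeable {
    private final SegmentReplicationPressureService pressureService =
            new SegmentReplicationPressureService();

    @Override
    public void close() throws IOException {
        pressureService.close(); // Node::close -> Service::close -> Task::close
    }
}
```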

peternied (Member, Author) commented:

Thanks for asking for clarity - I've been making some assumptions and attempting to address them, so those previous changes are not 'provable'. I'm going to take a step back and add some Thread.sleeps until I can manifest the issue locally.

I'll come back with some findings and address your workflow question. It is a complex space.


peternied marked this pull request as draft October 25, 2023 12:26

peternied (Member, Author) commented:

I've been doing experiments for the past two days, adding Thread.sleep(20) statements all over, and I've been unable to find any reproduction. I'm going to close this out, as I'd rather find a strong reproduction in another test case than hold this up any longer. Thanks for looking into this with me @reta

```sh
./gradlew ':modules:repository-url:internalClusterTest' -Dtests.iters=100 --tests "org.opensearch.repositories.url.URLSnapshotRestoreIT.testUrlRepository" -Dtests.seed=274F20795072D986 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=mt-MT -Dtests.timezone=Canada/Saskatchewan
```

peternied closed this Nov 2, 2023
peternied deleted the flaky-1 branch November 2, 2023 17:00
Labels

bug (Something isn't working), flaky-test (Random test failure that succeeds on second run)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Flaky test in 2.x - org.opensearch.repositories.url.URLSnapshotRestoreIT.testUrlRepository
2 participants