
Allow AsyncTasks to indicate if they should not be scheduled during a shutdown #10860

Closed

peternied wants to merge 11 commits

Conversation

peternied (Member) commented Oct 23, 2023

Description

Allow AsyncTasks to indicate if they should not be scheduled during a shutdown

Related Issues

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Peter Nied <[email protected]>

```diff
@@ -64,6 +68,10 @@ public void rejectedExecution(Runnable r, ThreadPoolExecutor executor) {
             }
         }
         rejected.inc();
+        if (executor.isTerminating() || executor.isTerminated() || executor.isShutdown()) {
```
reta (Collaborator) commented Oct 23, 2023:

@peternied we should not do that: the listeners will be called on rejection to notify the caller that the submission failed; with this change that won't happen anymore, leaving some callbacks in a dangling state.
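For illustration, a minimal, self-contained sketch of the contract being described (the `Listener` type and all names here are hypothetical stand-ins, not OpenSearch's actual listener API): a caller that waits on its callback only makes progress if the rejection is propagated rather than swallowed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class DanglingCallbackDemo {
    // Hypothetical stand-in for a result callback such as an ActionListener.
    interface Listener {
        void onResponse();
        void onFailure(Exception e);
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        executor.shutdown(); // simulate node shutdown: every new submission is rejected

        CountDownLatch done = new CountDownLatch(1);
        Listener listener = new Listener() {
            public void onResponse() { done.countDown(); }
            public void onFailure(Exception e) { done.countDown(); }
        };

        try {
            executor.execute(listener::onResponse);
        } catch (RejectedExecutionException e) {
            // Propagating the rejection lets the caller complete its callback.
            // If the handler swallowed the rejection instead, onFailure would
            // never run and the await() below would time out: a dangling callback.
            listener.onFailure(e);
        }
        System.out.println("callback completed: " + done.await(1, TimeUnit.SECONDS));
    }
}
```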

peternied (Member, Author) commented:

@reta Thanks for the feedback - I picked an ugly issue to attempt to root-cause and fix in the flaky-test space, so I might be making assumptions that are not well founded.

I understand that from a programming-purity standpoint this is against the conventions of executors; however, aren't the only cases where we shut down executors the ones where we are shutting down nodes? Is there a better way to determine that the cluster is going down so these sources of failure can be ignored?

My goal with this change is to stabilize the test infrastructure. Assuming this change passes CI, don't we have sufficient coverage to accept it?

reta (Collaborator) commented Oct 23, 2023:

> I understand from the programming purity - this is against the conventions of executors; however, aren't the only cases where we are shutting down executors when we are shutting down nodes?

It may be the case, but even then we need a clean shutdown. Tracing comes to mind immediately as an example here: shutting down the node should gracefully flush the data to the collector, but with this change the trace spans won't be flushed since they will be left in a dangling state.

> My goal with this change is to stabilize the test infrastructure assuming this change passes CI, don't we have sufficient coverage to accept this change?

We should have coverage, but we may well end up in a flaky state. This is not really about coverage, though - we break the contract by not calling the callbacks when we should.

peternied (Member, Author) commented:

To revisit:

> Is there a better way to determine that the cluster is going down so these sources of failures can be ignored?

I've got two ways forward that I'll pursue:

  1. The stale-state detection task in seg-rep is not useful while a node is shutting down; maybe I can dequeue or swallow the scheduling from that system.

  2. Failing that, I'm going to see if there is a way to keep honoring the contract where there are callbacks, but not in other scenarios (see the sketch after this list).
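As a sketch of the second idea (assumed names throughout, not the PR's code): a periodic task that stops requeueing itself once it is closed or the pool is shutting down never submits work to a terminating executor, so the rejection/callback contract is never in play.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical base class for a self-rescheduling periodic task.
abstract class PeriodicTask implements Runnable {
    private final ScheduledExecutorService scheduler;
    private final long intervalMillis;
    private volatile boolean closed;

    PeriodicTask(ScheduledExecutorService scheduler, long intervalMillis) {
        this.scheduler = scheduler;
        this.intervalMillis = intervalMillis;
    }

    /** Subclasses do their periodic work here. */
    protected abstract void runInternal();

    @Override
    public void run() {
        try {
            runInternal();
        } finally {
            // Skip the *rescheduling*, not the rejection: once the task is
            // closed or the pool is going down, simply stop requeueing.
            if (!closed && !scheduler.isShutdown()) {
                scheduler.schedule(this, intervalMillis, TimeUnit.MILLISECONDS);
            }
        }
    }

    void close() {
        closed = true;
    }
}
```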

reta (Collaborator) commented Oct 23, 2023:

> The stale state detection task in seg-rep is not-useful while a node is shutting down maybe I can dequeue or swallow the scheduling from that system.

In some places there are lifecycle checks; when the node goes down, the lifecycle check may not even proceed with the tasks (I would strongly advocate not altering the pool behaviour - this is a low-level tool).
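A compact sketch of that lifecycle-check pattern, with hypothetical names: the component tracks its own state and skips the work once it has been stopped, leaving the thread pool untouched.

```java
// Hypothetical component guarding its periodic work with a lifecycle state,
// instead of altering the low-level pool behaviour.
enum LifecycleState { STARTED, STOPPED, CLOSED }

class StaleReplicaDetector {
    private volatile LifecycleState state = LifecycleState.STARTED;

    void runDetection() {
        if (state != LifecycleState.STARTED) {
            return; // node is going down: skip the task rather than touch the executor
        }
        // ... detect and fail stale replicas ...
    }

    void stop() {
        state = LifecycleState.STOPPED;
    }

    void close() {
        state = LifecycleState.CLOSED;
    }
}
```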

peternied (Member, Author) commented:

I believe I found a way forward using the first path, which is much cleaner in my mind; I'm not 100% sure how I can validate it other than getting a number of test runs in. Let me know what you think of this.

github-actions bot commented Oct 23, 2023:

Compatibility status:

Checks if related components are compatible with change 7f2b997

Incompatible components

Incompatible components: [https://github.com/opensearch-project/cross-cluster-replication.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/reporting.git]


This reverts commit 1b3df2d.

Signed-off-by: Peter Nied <[email protected]>
This reverts commit 1613000.

Signed-off-by: Peter Nied <[email protected]>
This reverts commit 805a98f.

Signed-off-by: Peter Nied <[email protected]>
github-actions bot added the flaky-test (Random test failure that succeeds on second run) label Oct 23, 2023
peternied changed the title from "Silently ignore rejections when threadpools are terminating" to "Allow AsyncTasks to indicate if they should not be scheduled during a shutdown" Oct 23, 2023

```java
/**
 * If the node is shutting down, do not schedule this task.
 */
protected boolean doNotRunWhenShuttingDown() {
```
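To show how such a hook might be consumed (the wiring below is an assumption sketching the idea, not the PR's exact code): the scheduling side consults the override before (re)scheduling, and a task whose work is pointless on a dying node opts out.

```java
// Hypothetical wiring for the opt-out hook; class names echo the PR but the
// details are assumptions.
abstract class AbstractAsyncTask implements Runnable {
    private final boolean nodeShuttingDown;

    AbstractAsyncTask(boolean nodeShuttingDown) {
        this.nodeShuttingDown = nodeShuttingDown;
    }

    /** If the node is shutting down, do not schedule this task. */
    protected boolean doNotRunWhenShuttingDown() {
        return false; // default keeps existing behaviour for all other tasks
    }

    final boolean shouldSchedule() {
        return !(nodeShuttingDown && doNotRunWhenShuttingDown());
    }
}

// A task like seg-rep's stale-replica check could then opt out:
class FailStaleReplicaTask extends AbstractAsyncTask {
    FailStaleReplicaTask(boolean nodeShuttingDown) {
        super(nodeShuttingDown);
    }

    @Override
    public void run() {
        // ... fail stale replicas ...
    }

    @Override
    protected boolean doNotRunWhenShuttingDown() {
        return true; // this work is useless on a node that is going down
    }
}
```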
reta (Collaborator) commented:

To be fair, I have difficulty understanding the complete picture here (my apologies for that). The mental model I have in mind is this:

  • all components should be closed on node shutdown
  • the async tasks that components maintain should also be closed on shutdown
  • the async task is not rescheduled when closed

Taking SegmentReplicationPressureService, I would expect to observe:

  • Node::close()
  • SegmentReplicationPressureService::close()
  • AsyncFailStaleReplicaTask::close()

(same for PersistentTasksClusterService) - what am I missing here?
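For reference, a compact model of that expected shutdown ordering, using plain Closeable stand-ins (the real classes have much richer lifecycles; this is only an illustration):

```java
import java.io.Closeable;
import java.io.IOException;

class AsyncFailStaleReplicaTask implements Closeable {
    private volatile boolean closed;

    void reschedule() {
        if (closed) {
            return; // a closed task is never rescheduled
        }
        // ... enqueue the next run ...
    }

    @Override
    public void close() {
        closed = true;
    }
}

class SegmentReplicationPressureService implements Closeable {
    private final AsyncFailStaleReplicaTask task = new AsyncFailStaleReplicaTask();

    @Override
    public void close() throws IOException {
        task.close(); // closing the service closes the tasks it maintains
    }
}

class Node implements Closeable {
    private final SegmentReplicationPressureService pressureService =
            new SegmentReplicationPressureService();

    @Override
    public void close() throws IOException {
        pressureService.close(); // Node::close -> Service::close -> Task::close
    }
}
```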

peternied (Member, Author) commented:

Thanks for asking for clarity - I've been making some assumptions and attempting to address them, so those previous changes are not 'provable'. I'm going to take a step back and add some Thread.sleeps until I can manifest the issue locally.

I'll come back with some findings and address your workflow question. It is a complex space.


peternied marked this pull request as draft October 25, 2023 12:26

peternied (Member, Author) commented:

I've been doing experiments for the past two days, adding Thread.sleep(20) statements all over, and I've been unable to find any reproduction. I'm going to close this out, as I'd rather find a strong reproduction in another test case than hold this up any longer. Thanks for looking into this with me @reta

```sh
./gradlew ':modules:repository-url:internalClusterTest' -Dtests.iters=100 --tests "org.opensearch.repositories.url.URLSnapshotRestoreIT.testUrlRepository" -Dtests.seed=274F20795072D986 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=mt-MT -Dtests.timezone=Canada/Saskatchewan
```

peternied closed this Nov 2, 2023
peternied deleted the flaky-1 branch November 2, 2023 17:00
Labels

bug (Something isn't working), flaky-test (Random test failure that succeeds on second run)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Flaky test in 2.x - org.opensearch.repositories.url.URLSnapshotRestoreIT.testUrlRepository
2 participants