
[BUG] Pending tasks not finished on node shutdown causing flaky tests #12114

Open
mch2 opened this issue Jan 31, 2024 · 4 comments
Assignees: mch2
Labels: bug (Something isn't working) · flaky-test (Random test failure that succeeds on second run) · Indexing:Replication (Issues and PRs related to core replication framework eg segrep)

Comments

mch2 (Member) commented Jan 31, 2024

Describe the bug

This surfaced in the form of a test failure on https://build.ci.opensearch.org/job/gradle-check/32988/ with RankEvalRequestIT. I don't think this failure is specific to this test.

From the logs, the test fails with: java.lang.AssertionError: All incoming requests on node [node_s2] should have finished. Expected 0 but got 439. The pending tasks include open recoveries and retention_lease_syncs.

This test is suite-scoped and does not have segment replication (SegRep) enabled.
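For context, here is a minimal, hypothetical sketch (not the actual OpenSearch test framework code) of the kind of post-suite check that is failing: after the tests finish, every node should report zero in-flight incoming requests, so recoveries or retention_lease_syncs that never complete show up as leftover pending tasks.

```java
// Hypothetical sketch of the assertion that trips in this failure; the interface and
// method names are illustrative stand-ins, not real framework APIs.
public final class PendingTasksCheck {

    /** Minimal stand-in for the per-node task accounting the framework inspects. */
    interface NodeTasks {
        String nodeName();
        /** Incoming requests not yet completed, e.g. recoveries or retention_lease_syncs. */
        int pendingTaskCount();
    }

    static void assertAllRequestsFinished(Iterable<NodeTasks> nodes) {
        for (NodeTasks node : nodes) {
            int pending = node.pendingTaskCount();
            if (pending != 0) {
                throw new AssertionError(
                    "All incoming requests on node [" + node.nodeName() + "] should have finished. "
                        + "Expected 0 but got " + pending);
            }
        }
    }
}
```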

Related component

Other

To Reproduce

The failure is rare and the seeds are not reproducible. I am attempting to reproduce it with the same test but have yet to catch it after ~3k iterations.

Expected behavior

Tests should pass; all requests should be cleaned up between test iterations.

Additional Details

Additional trace output hints that a shard lock could not be obtained while attempting to create a shard, meaning a store reference is likely still held open from a previous shutdown.


[2024-01-31T23:55:09,064][WARN ][o.o.i.c.IndicesClusterStateService] [node_s2] [test2][9] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
	at org.opensearch.index.IndexService.createShard(IndexService.java:539) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.IndicesService.createShard(IndicesService.java:1021) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.IndicesService.createShard(IndicesService.java:211) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:672) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:649) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:294) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:608) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:595) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:563) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:486) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:188) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: org.opensearch.env.ShardLockObtainFailedException: [test2][9]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [5452ms]
	at org.opensearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:851) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:748) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.IndexService.createShard(IndexService.java:476) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	... 16 more
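For illustration, a hypothetical sketch of the locking pattern behind the ShardLockObtainFailedException above (the names are illustrative, not the actual NodeEnvironment implementation): each shard has a single in-memory lock, the close path holds it while the store is released, and the create path must acquire it within a timeout.

```java
// Hypothetical sketch of a per-shard in-memory lock with a bounded acquire, mirroring the
// "starting shard" vs. "closing shard" contention in the trace above.
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

final class InMemoryShardLock {
    private final Semaphore mutex = new Semaphore(1);
    private volatile String heldFor = "none";

    void acquire(String reason, long timeoutMillis) throws Exception {
        if (!mutex.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
            // Mirrors the failure in the trace: the "closing shard" holder never released the
            // lock (e.g. a store reference is still open), so "starting shard" times out.
            throw new IllegalStateException(
                "obtaining shard lock for [" + reason + "] timed out after [" + timeoutMillis
                    + "ms], lock already held for [" + heldFor + "]");
        }
        heldFor = reason;
    }

    void release() {
        heldFor = "none";
        mutex.release();
    }
}
```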

With start_recovery among the pending tasks, my suspicion is that there is a case where a RecoveryTarget is not closed before shard/node shutdown.

mch2 added the bug (Something isn't working), untriaged, and flaky-test (Random test failure that succeeds on second run) labels on Jan 31, 2024
The github-actions bot added the Other label on Jan 31, 2024
mch2 (Member, Author) commented Feb 3, 2024

I have been poking at this and am able to reproduce it, but only very rarely, roughly 1 in 5k iterations.

From debugging, there is an open retention lease sync request to the primary node, which in turn prevents the recovery that invoked the clone-retention-lease step from completing. I will spend some time adding additional trace logging to see why the sync request is not being returned to the caller.
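To make the suspected hang concrete, here is a hypothetical sketch (illustrative names only, not the actual recovery code): the clone-retention-lease step waits on the retention lease sync to the primary, and if that response is never delivered back to the listener, the recovery never completes and the request stays pending past shutdown.

```java
// Hypothetical sketch of a recovery step blocking on a retention lease sync that never
// returns; RetentionLeaseSyncer here is an illustrative stand-in, not a real API.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public final class CloneRetentionLeaseStep {

    /** Stand-in for the async retention_lease_sync call to the primary. */
    interface RetentionLeaseSyncer {
        CompletableFuture<Void> sync(String shardId);
    }

    static void runCloneStep(RetentionLeaseSyncer syncer, String shardId) throws Exception {
        CompletableFuture<Void> syncDone = syncer.sync(shardId);
        try {
            // If the primary never sends the response back to the caller, this wait never
            // finishes and everything downstream (the recovery itself) is stuck.
            syncDone.get(30, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException("retention lease sync never returned for " + shardId, e);
        }
        // ... continue with the remaining recovery steps once the leases are cloned ...
    }
}
```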

mch2 self-assigned this on Feb 6, 2024
andrross added the Indexing:Replication (Issues and PRs related to core replication framework eg segrep) label and removed the Other label on Feb 21, 2024
peternied (Member) commented

[Triage - attendees 1 2 3 4 5]
@mch2 Thanks for filing, looking forward to seeing this issue resolved.

shwetathareja (Member) commented Mar 27, 2024

obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [5452ms]

I have observed this before in an actual cluster, when a background thread is not able to close the shards successfully and the same shard gets assigned to that node by the cluster manager, essentially a race condition. This seems like an actual code issue rather than test flakiness. Need to confirm, though.
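To illustrate the suspected race (again a hypothetical sketch reusing the InMemoryShardLock class from the sketch above, not actual OpenSearch code): a slow background close still holds the shard lock when the cluster state applier tries to create the same shard on the node, so the create path times out after 5s.

```java
// Hypothetical sketch of the close-vs-create race: the close path outlives the create path's
// willingness to wait, producing the "failed to obtain in-memory shard lock" outcome.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public final class CloseVsCreateRace {
    public static void main(String[] args) throws Exception {
        InMemoryShardLock shardLock = new InMemoryShardLock(); // from the sketch above
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Background close: holds the lock longer than the create path is willing to wait.
        pool.submit(() -> {
            try {
                shardLock.acquire("closing shard", 1_000);
                Thread.sleep(6_000); // slow close, e.g. a store reference still being released
                shardLock.release();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });

        Thread.sleep(100); // let the close start first

        // Cluster state applier: tries to create the same shard on this node and times out,
        // which is the ShardLockObtainFailedException path from the stack trace above.
        pool.submit(() -> {
            try {
                shardLock.acquire("starting shard", 5_000);
                shardLock.release();
            } catch (Exception e) {
                System.out.println("marking shard failed: " + e.getMessage());
            }
        });

        pool.shutdown();
        pool.awaitTermination(15, TimeUnit.SECONDS);
    }
}
```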

mch2 (Member, Author) commented Apr 2, 2024

Thanks @shwetathareja. I still have a tough time reproducing this, but I think this unclosed listener is a contributor. I will test that out again and re-open.
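For reference, a hypothetical sketch of the kind of shutdown hygiene the unclosed-listener theory points at (illustrative only, not the actual fix): a component that tracks outstanding listeners should fail them all on close, otherwise callers' requests remain pending forever and trip the assertion above.

```java
// Hypothetical sketch: fail every outstanding listener when the component closes so that
// no incoming request is left pending across node shutdown. Names are illustrative.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public final class OutstandingListeners implements AutoCloseable {
    private final List<Consumer<Exception>> onFailure = new ArrayList<>();
    private boolean closed;

    public synchronized void register(Consumer<Exception> failureCallback) {
        if (closed) {
            failureCallback.accept(new IllegalStateException("node is closing"));
            return;
        }
        onFailure.add(failureCallback);
    }

    public synchronized void complete(Consumer<Exception> failureCallback) {
        onFailure.remove(failureCallback);
    }

    @Override
    public synchronized void close() {
        closed = true;
        // Fail anything still outstanding so callers are unblocked before shutdown finishes.
        for (Consumer<Exception> callback : onFailure) {
            callback.accept(new IllegalStateException("node closed before request completed"));
        }
        onFailure.clear();
    }
}
```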
