
[BUG] Pending tasks not finished on node shutdown causing flaky tests #12114

Open
mch2 opened this issue Jan 31, 2024 · 4 comments
Assignees: mch2
Labels: bug (Something isn't working) · flaky-test (Random test failure that succeeds on second run) · Indexing:Replication (Issues and PRs related to core replication framework eg segrep)

Comments

mch2 (Member) commented Jan 31, 2024

Describe the bug

This surfaced in the form of a test failure on https://build.ci.opensearch.org/job/gradle-check/32988/ with RankEvalRequestIT. I don't think this failure is specific to this test.

From the logs, the test fails with: java.lang.AssertionError: All incoming requests on node [node_s2] should have finished. Expected 0 but got 439. The pending tasks include open recoveries and retention_lease_syncs.

This test is suite-scoped and does not have segment replication (SegRep) enabled.
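For context, here is a minimal, hypothetical sketch (not the actual OpenSearch test framework code) of the kind of post-suite check that is failing: after the tests finish, every node should report zero in-flight incoming requests, so recoveries or retention_lease_syncs that never complete show up as leftover pending tasks.

```java
// Hypothetical sketch of the assertion that trips in this failure; the interface and
// method names are illustrative stand-ins, not real framework APIs.
public final class PendingTasksCheck {

    /** Minimal stand-in for the per-node task accounting the framework inspects. */
    interface NodeTasks {
        String nodeName();
        /** Incoming requests not yet completed, e.g. recoveries or retention_lease_syncs. */
        int pendingTaskCount();
    }

    static void assertAllRequestsFinished(Iterable<NodeTasks> nodes) {
        for (NodeTasks node : nodes) {
            int pending = node.pendingTaskCount();
            if (pending != 0) {
                throw new AssertionError(
                    "All incoming requests on node [" + node.nodeName() + "] should have finished. "
                        + "Expected 0 but got " + pending);
            }
        }
    }
}
```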

Related component

Other

To Reproduce

The failure is rare and the seeds are not reproducible. I am attempting to reproduce it with the same test but have yet to catch it after ~3k iterations.

Expected behavior

Tests should pass; all requests should be cleaned up between test iterations.

Additional Details

Additional trace output hints that a shard lock could not be obtained while attempting to create a shard, meaning a store reference is likely still held open from a previous shutdown.


[2024-01-31T23:55:09,064][WARN ][o.o.i.c.IndicesClusterStateService] [node_s2] [test2][9] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
	at org.opensearch.index.IndexService.createShard(IndexService.java:539) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.IndicesService.createShard(IndicesService.java:1021) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.IndicesService.createShard(IndicesService.java:211) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:672) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:649) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:294) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:608) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:595) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:563) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:486) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:188) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: org.opensearch.env.ShardLockObtainFailedException: [test2][9]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [5452ms]
	at org.opensearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:851) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:748) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.IndexService.createShard(IndexService.java:476) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	... 16 more
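For illustration, a hypothetical sketch of the locking pattern behind the ShardLockObtainFailedException above (the names are illustrative, not the actual NodeEnvironment implementation): each shard has a single in-memory lock, the close path holds it while the store is released, and the create path must acquire it within a timeout.

```java
// Hypothetical sketch of a per-shard in-memory lock with a bounded acquire, mirroring the
// "starting shard" vs. "closing shard" contention in the trace above.
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

final class InMemoryShardLock {
    private final Semaphore mutex = new Semaphore(1);
    private volatile String heldFor = "none";

    void acquire(String reason, long timeoutMillis) throws Exception {
        if (!mutex.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
            // Mirrors the failure in the trace: the "closing shard" holder never released the
            // lock (e.g. a store reference is still open), so "starting shard" times out.
            throw new IllegalStateException(
                "obtaining shard lock for [" + reason + "] timed out after [" + timeoutMillis
                    + "ms], lock already held for [" + heldFor + "]");
        }
        heldFor = reason;
    }

    void release() {
        heldFor = "none";
        mutex.release();
    }
}
```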

With start_recovery among the pending tasks, my suspicion is that there is a case where a RecoveryTarget is not closed before shard/node shutdown.

mch2 added the bug (Something isn't working), untriaged, and flaky-test (Random test failure that succeeds on second run) labels on Jan 31, 2024
The github-actions bot added the Other label on Jan 31, 2024
mch2 (Member, Author) commented Feb 3, 2024

I have been poking at this and am able to reproduce it, but only very rarely, roughly 1 in 5k iterations.

From debugging, there is an open retention lease sync request to the primary node, which in turn prevents the recovery that invoked the clone-retention-lease step from completing. I will spend some time adding additional trace logging to see why the sync request is not being returned to the caller.
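To make the suspected hang concrete, here is a hypothetical sketch (illustrative names only, not the actual recovery code): the clone-retention-lease step waits on the retention lease sync to the primary, and if that response is never delivered back to the listener, the recovery never completes and the request stays pending past shutdown.

```java
// Hypothetical sketch of a recovery step blocking on a retention lease sync that never
// returns; RetentionLeaseSyncer here is an illustrative stand-in, not a real API.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public final class CloneRetentionLeaseStep {

    /** Stand-in for the async retention_lease_sync call to the primary. */
    interface RetentionLeaseSyncer {
        CompletableFuture<Void> sync(String shardId);
    }

    static void runCloneStep(RetentionLeaseSyncer syncer, String shardId) throws Exception {
        CompletableFuture<Void> syncDone = syncer.sync(shardId);
        try {
            // If the primary never sends the response back to the caller, this wait never
            // finishes and everything downstream (the recovery itself) is stuck.
            syncDone.get(30, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException("retention lease sync never returned for " + shardId, e);
        }
        // ... continue with the remaining recovery steps once the leases are cloned ...
    }
}
```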

mch2 self-assigned this on Feb 6, 2024
andrross added the Indexing:Replication (Issues and PRs related to core replication framework eg segrep) label and removed the Other label on Feb 21, 2024
peternied (Member) commented

[Triage - attendees 1 2 3 4 5]
@mch2 Thanks for filing, looking forward to seeing this issue resolved.

shwetathareja (Member) commented Mar 27, 2024

obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [5452ms]

I have observed this before in an actual cluster, when a background thread is not able to close the shards successfully and the same shard gets assigned to that node by the cluster manager, essentially a race condition. This seems like an actual code issue rather than test flakiness. Need to confirm, though.
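To illustrate the suspected race (again a hypothetical sketch reusing the InMemoryShardLock class from the sketch above, not actual OpenSearch code): a slow background close still holds the shard lock when the cluster state applier tries to create the same shard on the node, so the create path times out after 5s.

```java
// Hypothetical sketch of the close-vs-create race: the close path outlives the create path's
// willingness to wait, producing the "failed to obtain in-memory shard lock" outcome.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public final class CloseVsCreateRace {
    public static void main(String[] args) throws Exception {
        InMemoryShardLock shardLock = new InMemoryShardLock(); // from the sketch above
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Background close: holds the lock longer than the create path is willing to wait.
        pool.submit(() -> {
            try {
                shardLock.acquire("closing shard", 1_000);
                Thread.sleep(6_000); // slow close, e.g. a store reference still being released
                shardLock.release();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });

        Thread.sleep(100); // let the close start first

        // Cluster state applier: tries to create the same shard on this node and times out,
        // which is the ShardLockObtainFailedException path from the stack trace above.
        pool.submit(() -> {
            try {
                shardLock.acquire("starting shard", 5_000);
                shardLock.release();
            } catch (Exception e) {
                System.out.println("marking shard failed: " + e.getMessage());
            }
        });

        pool.shutdown();
        pool.awaitTermination(15, TimeUnit.SECONDS);
    }
}
```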

mch2 (Member, Author) commented Apr 2, 2024

Thanks @shwetathareja. I still have a tough time reproducing this, but I think this unclosed listener is a contributor. I will test that out again and re-open.
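For reference, a hypothetical sketch of the kind of shutdown hygiene the unclosed-listener theory points at (illustrative only, not the actual fix): a component that tracks outstanding listeners should fail them all on close, otherwise callers' requests remain pending forever and trip the assertion above.

```java
// Hypothetical sketch: fail every outstanding listener when the component closes so that
// no incoming request is left pending across node shutdown. Names are illustrative.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public final class OutstandingListeners implements AutoCloseable {
    private final List<Consumer<Exception>> onFailure = new ArrayList<>();
    private boolean closed;

    public synchronized void register(Consumer<Exception> failureCallback) {
        if (closed) {
            failureCallback.accept(new IllegalStateException("node is closing"));
            return;
        }
        onFailure.add(failureCallback);
    }

    public synchronized void complete(Consumer<Exception> failureCallback) {
        onFailure.remove(failureCallback);
    }

    @Override
    public synchronized void close() {
        closed = true;
        // Fail anything still outstanding so callers are unblocked before shutdown finishes.
        for (Consumer<Exception> callback : onFailure) {
            callback.accept(new IllegalStateException("node closed before request completed"));
        }
        onFailure.clear();
    }
}
```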
