[BUG] Pending tasks not finished on node shutdown causing flaky tests #12114
Labels
bug
Something isn't working
flaky-test
Random test failure that succeeds on second run
Indexing:Replication
Issues and PRs related to core replication framework eg segrep
Describe the bug
This surfaced in the form of a test failure on https://build.ci.opensearch.org/job/gradle-check/32988/ with RankEvalRequestIT. I don't think this failure is specific to this test.
From the logs, the test fails with:
java.lang.AssertionError: All incoming requests on node [node_s2] should have finished. Expected 0 but got 439; pending tasks
The pending tasks include open recoveries and retention_lease_syncs. This test is suite-scoped and does not have SegRep enabled.
Related component
Other
To Reproduce
The seeds do not reproduce the failure, and it is rare. I am attempting to reproduce it with the same test but have yet to catch it after ~3k iterations.
Expected behavior
Tests should pass, and all requests should be cleaned up between test iterations.
Additional Details
Additional trace output hints that a shard lock could not be obtained while attempting to create a shard, meaning a store reference is likely still open from a previous shutdown.
With start_recovery as one of the pending tasks, my suspicion is that there is a case where a RecoveryTarget is not closed before shard/node shutdown.
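To make the suspected failure mode concrete, here is a minimal, self-contained sketch of how an unclosed recovery target could keep a shard lock held across a shutdown. The `ShardLock`, `Store`, and `RecoveryTarget` classes below are simplified, hypothetical stand-ins written for illustration only. They are not the OpenSearch classes, and the ref-counting shown is an assumption about the general mechanism rather than a confirmed root cause.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class ShardLockLeakSketch {

    /** Hypothetical per-shard lock; NOT the OpenSearch class. */
    static final class ShardLock {
        private final AtomicBoolean held = new AtomicBoolean(false);

        boolean tryAcquire() {
            return held.compareAndSet(false, true);
        }

        void release() {
            held.set(false);
        }
    }

    /** Hypothetical ref-counted store: releases the shard lock when the last reference is dropped. */
    static final class Store {
        private final ShardLock lock;
        private final AtomicInteger refCount = new AtomicInteger(1); // the owning shard's reference

        Store(ShardLock lock) {
            if (!lock.tryAcquire()) {
                throw new IllegalStateException("shard lock is still held");
            }
            this.lock = lock;
        }

        void incRef() {
            refCount.incrementAndGet();
        }

        void decRef() {
            if (refCount.decrementAndGet() == 0) {
                lock.release();
            }
        }
    }

    /** Hypothetical recovery target: holds a store reference until it is closed. */
    static final class RecoveryTarget implements AutoCloseable {
        private final Store store;
        private final AtomicBoolean closed = new AtomicBoolean(false);

        RecoveryTarget(Store store) {
            store.incRef();
            this.store = store;
        }

        @Override
        public void close() {
            // Idempotent: only the first close drops the store reference.
            if (closed.compareAndSet(false, true)) {
                store.decRef();
            }
        }
    }

    /** Simulates a shard shutdown followed by an attempt to re-create the shard. */
    static boolean canRecreateShardAfterShutdown(boolean closeRecoveryTarget) {
        ShardLock lock = new ShardLock();
        Store store = new Store(lock);
        RecoveryTarget target = new RecoveryTarget(store);

        if (closeRecoveryTarget) {
            target.close();
        }
        store.decRef(); // shard shutdown drops the shard's own reference

        try {
            new Store(lock); // re-creating the shard needs the lock again
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("closed before shutdown: " + canRecreateShardAfterShutdown(true));
        System.out.println("left open at shutdown:  " + canRecreateShardAfterShutdown(false));
    }
}
```

In this sketch, re-creating the shard succeeds only when the recovery target is closed before shutdown. If it is left open, the store's reference count never reaches zero and the lock is never released, which would match the "shard lock not obtained" symptom described above.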