[BUG] Closing an index with ongoing primary failover can lead to deadlocks #11869
Comments
Found another instance of this deadlock while running RemoteStoreRestoreIT.testRTSRestoreWithRefreshedDataPrimaryReplicaDown.
The test failed with the following error.
@linuxpi I have seen this as well but have not been able to get a repeatable test. In any case, I don't think we need the engineMutex there; will put up a PR to fix.
Thanks @mch2. I was able to repro this for multiple flaky tests I was trying to fix. It usually shows up when you run with more than 500 or even 1000 iterations. For the fix, removing the lock would certainly help, and we should do it if the lock is unnecessary here. But I was also wondering whether we should revisit the locking order in IndexShard across the various flows to ensure it is correct.
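For illustration only, here is a minimal sketch of the kind of consistent acquisition order that would prevent this inversion (hypothetical names, not OpenSearch code): every flow takes the shard-level mutex before any engine-level lock, so two flows can never hold the locks in opposite orders.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: both flows follow the same order
// (shard mutex first, then engine lock), so no lock cycle is possible.
class ShardLockOrderSketch {
    private final Object engineMutex = new Object();
    private final ReentrantReadWriteLock engineLock = new ReentrantReadWriteLock();

    // Failover-like flow: mutex -> engine read lock
    void failoverFlow() {
        synchronized (engineMutex) {
            engineLock.readLock().lock();
            try {
                // replay translog, swap engine reference, etc.
            } finally {
                engineLock.readLock().unlock();
            }
        }
    }

    // Close-like flow: mutex -> engine write lock (same order)
    void closeFlow() {
        synchronized (engineMutex) {
            engineLock.writeLock().lock();
            try {
                // flush and close the engine
            } finally {
                engineLock.writeLock().unlock();
            }
        }
    }
}
```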
Describe the bug
While a replica is being promoted to primary during a primary failover, we create a new engine pointing to the last successful commit and replay the translog up to the global checkpoint. During this processing we acquire multiple locks.
Around the same time, if a CLOSE is triggered on the index, the close flow and the failover flow end up competing for the same locks, which under certain race conditions can lead to a deadlock.
Failover Thread
The failover thread gets BLOCKED while acquiring the engineMutex lock at the following point:
OpenSearch/server/src/main/java/org/opensearch/index/shard/IndexShard.java, lines 4752 to 4763 in 5c82ab8
It is already holding a readLock on the new engine being created as part of the failover. That lock is acquired here:
OpenSearch/server/src/main/java/org/opensearch/index/translog/InternalTranslogManager.java, lines 114 to 145 in 5c82ab8
IndexShard Close Thread
The close thread acquires the engineMutex lock at the following point:
OpenSearch/server/src/main/java/org/opensearch/index/shard/IndexShard.java, lines 1985 to 2013 in 5c82ab8
and then tries to acquire the writeLock on the new engine further along in the close flow:
OpenSearch/server/src/main/java/org/opensearch/index/engine/Engine.java, lines 2013 to 2025 in 5c82ab8
At this point, both threads are in a deadlock state.
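For clarity, here is a minimal, self-contained reduction of the interleaving described above (hypothetical names, not the actual IndexShard/Engine code): one thread holds the engine read lock and then blocks on the mutex, while the other holds the mutex and then blocks on the engine write lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical reduction of the reported interleaving; not OpenSearch code.
public class DeadlockSketch {
    private static final Object engineMutex = new Object();
    private static final ReentrantReadWriteLock engineLock = new ReentrantReadWriteLock();

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch bothLocksHeld = new CountDownLatch(2);

        // "Failover" thread: engine read lock first, then engineMutex.
        Thread failover = new Thread(() -> {
            engineLock.readLock().lock();
            try {
                bothLocksHeld.countDown();
                await(bothLocksHeld);
                synchronized (engineMutex) {   // blocks: the close thread holds it
                    // would reset the engine here
                }
            } finally {
                engineLock.readLock().unlock();
            }
        }, "failover");

        // "Close" thread: engineMutex first, then engine write lock.
        Thread close = new Thread(() -> {
            synchronized (engineMutex) {
                bothLocksHeld.countDown();
                await(bothLocksHeld);
                engineLock.writeLock().lock(); // blocks: the failover thread holds the read lock
                try {
                    // would close the engine here
                } finally {
                    engineLock.writeLock().unlock();
                }
            }
        }, "close");

        failover.start();
        close.start();
        failover.join(); // never returns: the two threads are deadlocked
        close.join();
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```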
Thread Dumps
Found this issue in the integ test during build - https://build.ci.opensearch.org/job/gradle-check/31461/testReport/junit/org.opensearch.remotestore/RemoteStoreRestoreIT/testRTSRestoreWithNoDataPostCommitPrimaryReplicaDown/
Related component
Other
To Reproduce
Still working on repro steps, as this appears to be a race condition. It might be reproducible by slowing down translog recovery and closing the index while the recovery is in progress.
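One way a test could fail fast on this hang, rather than timing out, is to poll the JVM for deadlocked threads while the close runs concurrently with a slowed-down recovery. A sketch using the standard ThreadMXBean API (this helper is hypothetical and assumes nothing about the existing test harness):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical test helper: fails if any monitor/ownable-synchronizer
// deadlock appears while the scenario under test is running.
final class DeadlockWatchdog {
    static void assertNoDeadlock(long timeoutMillis) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            long[] deadlocked = threads.findDeadlockedThreads();
            if (deadlocked != null) {
                StringBuilder sb = new StringBuilder("Deadlocked threads detected:\n");
                for (ThreadInfo info : threads.getThreadInfo(deadlocked, true, true)) {
                    sb.append(info);
                }
                throw new AssertionError(sb.toString());
            }
            Thread.sleep(100);
        }
    }
}
```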
Expected behavior
Closing the index while a primary failover is in progress should never lead to a deadlock.
Additional Details
No response