Fix flakiness with SegmentReplicationSuiteIT #11977
Conversation
❕ Gradle check result for 8687594: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
With these changes I have run the entire suite ~2500 times without failure.
@mch2 This is good, and it's a shame we haven't gotten this change reviewed.
I'll try to prioritize this in my review queue. I'm not familiar with the replication space; is there anything that warrants extra attention that you'd like the review focused on?
@peternied Thanks for taking a look here. @andrross offered to review this last week and I had asked him to pause until I revisit it as it's been a while, but I think this is good to go. I believe this will fix more flakiness with #12408 that popped up after the introduction of more tests executing with SegRep, as the root cause there looks the same. The key components of this PR that may warrant extra attention are, first, moving the cancellation to after the shard is closed (beforeIndexShardClosed -> afterIndexShardClosed) to ensure open file refs are closed only after the shard is, preventing the possibility of a race between cancel & shard closure. Second is the removal of CopyState caching inside OngoingSegmentReplications.
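The ordering change described above (cancelling only after the shard is closed) can be sketched as follows. This is a hypothetical, heavily simplified illustration; the class names `ShardCloseOrdering` and `OngoingReplication` and the map-based tracking are illustrative stand-ins, not OpenSearch's actual implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ShardCloseOrdering {
    /** Stand-in for an in-flight replication holding open file refs. */
    public static final class OngoingReplication {
        public volatile boolean cancelled = false;
        public void cancel() { cancelled = true; /* release file refs here */ }
    }

    public static final Map<String, OngoingReplication> ongoing = new ConcurrentHashMap<>();

    public static void beforeIndexShardClosed(String shardId) {
        // After the fix this hook no longer cancels: cancelling here raced
        // with new replications started while the shard was still open.
    }

    public static void afterIndexShardClosed(String shardId) {
        // The shard is fully closed, so no new replication can start for it
        // and file refs held by in-flight replications can be safely released.
        OngoingReplication r = ongoing.remove(shardId);
        if (r != null) {
            r.cancel();
        }
    }

    public static void main(String[] args) {
        ongoing.put("[test-index][0]", new OngoingReplication());
        afterIndexShardClosed("[test-index][0]");
        System.out.println("ongoing after close: " + ongoing.size()); // prints 0
    }
}
```

The point of the sketch is purely the ordering: once cancellation happens in the after-close hook, there is no window in which a new replication can register for a shard that is being torn down.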
This test fails because of a race during shard/node shutdown with node-node replication. Fixed by properly synchronizing creation of new replication events with cancellation and cancelling after shards are closed. Signed-off-by: Marc Handalian <[email protected]>
This change removes the responsibility of caching CopyState inside of OngoingSegmentReplications.
1. CopyState was originally cached to prevent frequent disk reads while building segment metadata. This is now cached lower down in IndexShard and is not required here.
2. Change prepareForReplication method to return SegmentReplicationSourceHandler directly.
3. Move responsibility of creating and clearing CopyState to the handler.
Signed-off-by: Marc Handalian <[email protected]>
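Moving creation and cleanup of CopyState into the handler gives it a single owner with a clear lifecycle. A minimal sketch of that ownership pattern, assuming hypothetical simplified types (`CopyState` here is a toy ref-counted object, not OpenSearch's class):

```java
public class HandlerOwnedCopyState {
    /** Toy ref-counted resource, created holding one reference. */
    public static final class CopyState {
        private int refCount = 1;
        public synchronized void decRef() { refCount--; }
        public synchronized boolean isClosed() { return refCount == 0; }
    }

    /** The handler creates its own CopyState and releases it when it
     *  completes or is cancelled, instead of borrowing one from a cache. */
    public static final class SourceHandler implements AutoCloseable {
        public final CopyState copyState;
        public SourceHandler() { this.copyState = new CopyState(); }
        @Override public void close() { copyState.decRef(); }
    }

    public static void main(String[] args) {
        try (SourceHandler handler = new SourceHandler()) {
            // ... send segment files to the replica ...
        }
        // try-with-resources released the handler's CopyState here
    }
}
```

Because no shared cache hands out references, there is no object whose lifetime spans multiple handlers and therefore nothing to synchronize between create and cancel paths.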
Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
@andrross Apologies for stalling here. After running both this suite and IndexActionIT I do not get any failures and think this is ready.
❕ Gradle check result for 13522a0: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
I don't know much about replication, but the code looks good. I left one comment/question around the fetching of the metadata map; assuming that is all good, this should be ready to merge!
server/src/main/java/org/opensearch/indices/replication/common/CopyState.java
* Fix SegmentReplicationSuiteIT
  This test fails because of a race during shard/node shutdown with node-node replication. Fixed by properly synchronizing creation of new replication events with cancellation and cancelling after shards are closed.
  Signed-off-by: Marc Handalian <[email protected]>
* Remove CopyState caching from OngoingSegmentReplications.
  This change removes the responsibility of caching CopyState inside of OngoingSegmentReplications.
  1. CopyState was originally cached to prevent frequent disk reads while building segment metadata. This is now cached lower down in IndexShard and is not required here.
  2. Change prepareForReplication method to return SegmentReplicationSourceHandler directly.
  3. Move responsibility of creating and clearing CopyState to the handler.
  Signed-off-by: Marc Handalian <[email protected]>
* Fix comment for afterIndexShardClosed method.
* Fix comment on beforeIndexShardClosed
* Remove unnecessary method from OngoingSegmentReplications

Signed-off-by: Marc Handalian <[email protected]>
(cherry picked from commit e828c18)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Description
This PR fixes race conditions during shutdown of SegmentReplicationSourceService (used for node-node segment replication) that cause store refs not to be closed before checks are made. This is the cause of flakiness in SegmentReplicationSuiteIT and a few other tests.
The issue here is that we are creating new replications on the primary after we have cancelled replications but before the shard is closed. OngoingSegmentReplications has a prepareForReplication method that previously would invoke getCachedCopyState. In that method we would first fetch the CopyState from the map and incref it if it exists. This left us open to increffing the CopyState even though the shard is shutting down and no longer in IndexService.

To fix this, this change removes the responsibility of caching CopyState inside of OngoingSegmentReplications. CopyState is now cached lower down in IndexShard and is not required here. This eliminates the need to synchronize on the object for create/cancel methods.

This change also updates testDropRandomNodeDuringReplication to wait until shards have recovered before deleting the index. The intent of the test was to ensure we can recover after the node drop; the subsequent test testDeleteIndexWhileReplicating covers paths for index deletion while replication is running.

With these changes I have run the entire suite ~2500 times without failure.
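The bug class described above (increffing an object fetched from a cache can "resurrect" a reference whose owner is shutting down) can be illustrated with a toy ref-counted object. This is a hypothetical sketch, not OpenSearch's RefCounted implementation; it shows why an unconditional incRef on a cached object is unsafe, and why a conditional tryIncRef (or, as this PR does, removing the cache entirely) avoids the race.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountRace {
    public static final class RefCounted {
        private final AtomicInteger refs = new AtomicInteger(1);

        /** Unconditional incRef: unsafe if the count may already be 0. */
        public void incRef() { refs.incrementAndGet(); }

        /** Conditional incRef: fails once the object has been fully released. */
        public boolean tryIncRef() {
            while (true) {
                int n = refs.get();
                if (n == 0) return false;                // already closed
                if (refs.compareAndSet(n, n + 1)) return true;
            }
        }

        public void decRef() { refs.decrementAndGet(); }
        public int refCount() { return refs.get(); }
    }

    public static void main(String[] args) {
        RefCounted cached = new RefCounted();
        cached.decRef(); // owner releases during shutdown: count is now 0
        // Buggy path: incRef() would bump the dead object back to 1.
        // Safe path: tryIncRef() observes the closed state and refuses.
        System.out.println("tryIncRef on closed object: " + cached.tryIncRef()); // prints false
    }
}
```

Even with tryIncRef, a cache shared between create and cancel paths still needs external synchronization to avoid handing out references during teardown, which is why dropping the cache is the simpler fix.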
Related Issues
Resolves #9499
Check List
- New functionality has been documented.
- New functionality has javadoc added.
- Commit changes are listed out in CHANGELOG.md file (See: Changelog).
- Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.