
Remove compounding retries within PrimaryShardReplicationSource #12043

Merged
merged 1 commit into opensearch-project:main on Jan 30, 2024

Conversation

@mch2 (Member) commented Jan 26, 2024

Description

This change fixes and unmutes the segment replication (segrep) bwc test testIndexingWithSegRep.

This change removes the retries within PrimaryShardReplicationSource and instead relies on retries in a single place at the start of replication: SegmentReplicationTargetService's processLatestReceivedCheckpoint, which runs after each replication failure or success. The timeout on the compounding retries is the cause of the flaky failures in SegmentReplication's bwc test within IndexingIT that can occur on node disconnect: the retries persist for over ~1 minute against the same primary node that has been relocated or shut down, causing the test to time out.
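
To make the retry flow concrete, here is a minimal sketch of the idea, with hypothetical names rather than the actual OpenSearch classes: a replication round makes exactly one attempt, and the only retry decision happens afterwards, in one place, against whichever checkpoint (and primary) is current.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch (hypothetical names, not the actual OpenSearch code) of the
// "retry in one place" idea: each replication round makes a single attempt,
// and the only retry decision happens afterwards, by re-processing the latest
// received checkpoint against whoever the current primary is.
public class CheckpointDrivenReplicationSketch {

    private final Map<String, Long> latestReceived = new ConcurrentHashMap<>();
    private final Map<String, Long> lastApplied = new ConcurrentHashMap<>();

    /** Invoked when a primary publishes a new checkpoint for a shard. */
    public void onNewCheckpoint(String shardId, long checkpoint) {
        latestReceived.merge(shardId, checkpoint, Math::max);
        startReplication(shardId);
    }

    private void startReplication(String shardId) {
        long target = latestReceived.getOrDefault(shardId, 0L);
        try {
            copySegmentsOnce(shardId, target); // single attempt, no inner retry loop
            lastApplied.put(shardId, target);
        } catch (RuntimeException e) {
            // No retry here: fall through so the retry happens in exactly one
            // place, and against the current primary rather than a node that
            // may have been relocated or shut down.
        }
        processLatestReceivedCheckpoint(shardId);
    }

    // The single retry point: if the shard is still behind the newest
    // checkpoint after a round completes (success or failure), start another
    // round. The real service would dispatch this asynchronously.
    private void processLatestReceivedCheckpoint(String shardId) {
        if (latestReceived.getOrDefault(shardId, 0L) > lastApplied.getOrDefault(shardId, 0L)) {
            startReplication(shardId);
        }
    }

    /** Stand-in for the transport call that copies segment files; may throw. */
    private void copySegmentsOnce(String shardId, long checkpoint) {
        // network copy elided in this sketch
    }
}
```

Because the retry re-enters through the latest received checkpoint, it naturally targets the current primary after a relocation, instead of repeatedly timing out against the old node.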

This change also simplifies the cancellation flow on the target service before the shard is closed. Previously we "requested" a cancel that did not remove the target from the ongoing replications collection until a cancellation failure was thrown. The transport calls from PrimaryShardReplicationSource are no longer wrapped in CancellableThreads by the client, so a call to "cancel" will not throw; instead, we now immediately remove the target and decRef/close it.
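
Sketched in the same hypothetical style (these are not the actual classes), the simplified cancel removes the target from the ongoing collection up front and releases the reference directly; since the calls are no longer wrapped in CancellableThreads, there is no exception to wait for.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch (hypothetical names) of the simplified cancellation flow: the target
// is removed from the ongoing-replications collection immediately and its
// reference released, rather than waiting for a cancellation exception from
// CancellableThreads to trigger the removal.
class ReplicationTargetSketch {
    private final AtomicInteger refCount = new AtomicInteger(1);

    void decRef() {
        if (refCount.decrementAndGet() == 0) {
            close(); // last reference released: free files, notify listeners
        }
    }

    private void close() {
        // resource cleanup elided in this sketch
    }
}

class OngoingReplicationsSketch {
    private final Map<Long, ReplicationTargetSketch> ongoing = new ConcurrentHashMap<>();

    /** Cancel before shard close: remove first, then decRef/close directly. */
    void cancel(long replicationId) {
        ReplicationTargetSketch target = ongoing.remove(replicationId);
        if (target != null) {
            // The transport calls are no longer wrapped in CancellableThreads,
            // so nothing throws here; the target closes as soon as the last
            // reference (ours, plus any in-flight request) is released.
            target.decRef();
        }
    }
}
```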

Related Issues

Resolves #7679

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot (Contributor) commented Jan 26, 2024

Compatibility status:

Checks if related components are compatible with change 44e03b5

Incompatible components

Incompatible components: [https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/alerting.git]


✅ Gradle check result for c1108e3: SUCCESS

codecov bot commented Jan 26, 2024

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (6012504) 71.28% compared to head (44e03b5) 71.47%.
Report is 2 commits behind head on main.

Files                                                   Patch %   Lines
...s/replication/SegmentReplicationTargetService.java   66.66%    1 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #12043      +/-   ##
============================================
+ Coverage     71.28%   71.47%   +0.19%     
- Complexity    59414    59542     +128     
============================================
  Files          4925     4925              
  Lines        279479   279472       -7     
  Branches      40635    40636       +1     
============================================
+ Hits         199226   199759     +533     
+ Misses        63731    63119     -612     
- Partials      16522    16594      +72     


@github-actions github-actions bot added labels bug (Something isn't working), distributed framework, and flaky-test (Random test failure that succeeds on second run) on Jan 29, 2024
@mch2 mch2 marked this pull request as ready for review January 29, 2024 18:31
❌ Gradle check result for 085d9c6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@mch2 (Member, Author) commented Jan 29, 2024

❌ Gradle check result for 085d9c6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

#11974


❌ Gradle check result for f64e85f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@dreamer-89 (Member) commented Jan 29, 2024

❌ Gradle check result for f64e85f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Gradle build failure is due to a single test. Maybe a missing rebase against main?
https://build.ci.opensearch.org/job/gradle-check/32810/testReport/org.opensearch.qa.verify_version_constants/VerifyVersionConstantsIT/testLuceneVersionConstant/

java.lang.AssertionError: 
Expected: <9.9.2>
     but: was <9.9.1>

❕ Gradle check result for 094139d: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@mch2 (Member, Author) commented Jan 29, 2024

testPrimaryRelocationWhileIndexing

#9191

@mch2 (Member, Author) commented Jan 29, 2024

Apologies for losing the history here; I force-pushed to rebase & fix the DCO.

✅ Gradle check result for 44e03b5: SUCCESS

@mch2 mch2 merged commit 11644d5 into opensearch-project:main Jan 30, 2024
30 checks passed
peteralfonsi pushed a commit to peteralfonsi/OpenSearch that referenced this pull request Mar 1, 2024
Remove compounding retries within PrimaryShardReplicationSource (opensearch-project#12043)

Signed-off-by: Marc Handalian <[email protected]>
rayshrey pushed a commit to rayshrey/OpenSearch that referenced this pull request Mar 18, 2024
Remove compounding retries within PrimaryShardReplicationSource (opensearch-project#12043)

Signed-off-by: Marc Handalian <[email protected]>
@mch2 mch2 added the backport 2.x (Backport to 2.x branch) label Mar 20, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Mar 20, 2024
Remove compounding retries within PrimaryShardReplicationSource (#12043)

(cherry picked from commit 11644d5)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
dblock pushed a commit that referenced this pull request Mar 20, 2024
Remove compounding retries within PrimaryShardReplicationSource (#12043) (#12800)

(cherry picked from commit 11644d5)

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
Remove compounding retries within PrimaryShardReplicationSource (opensearch-project#12043)

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Shivansh Arora <[email protected]>
Labels
backport 2.x (Backport to 2.x branch), bug (Something isn't working), distributed framework, flaky-test (Random test failure that succeeds on second run), skip-changelog
Development

Successfully merging this pull request may close these issues.

[BUG] org.opensearch.upgrades.IndexingIT.testIndexingWithSegRep test failure
3 participants