Retry download of RemoteFSTranslog to fix transient race conditions #9565
Gradle Check (Jenkins) Run Completed with:
server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java
In primary-to-primary relocation, there can be concurrent upload and download of the translog.
When you say concurrent upload and download, are we referring to concurrency due to the old primary uploading and the new primary downloading?
While translog files are being downloaded by the new primary, they might be deleted by the old primary.
Hence we retry if tlog/ckp files are not found.
I did not follow this. Can the older primary delete the translog while the new primary is trying to download it? If so, is there a way to disable translog deletion and disallow the erroneous state?
51ab504 to 3cae883
3cae883 to 65e1f9e
Compatibility status: checks if related components are compatible with change 3cae883
Incompatible components: [https://github.com/opensearch-project/asynchronous-search.git]
Skipped components:
Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]
Codecov Report
@@             Coverage Diff              @@
##               main    #9565      +/-   ##
============================================
  Coverage     71.04%   71.05%
+ Complexity    57821    57508     -313
============================================
  Files          4818     4781      -37
  Lines        273093   271191    -1902
  Branches      39811    39593     -218
============================================
- Hits         194028   192700    -1328
+ Misses        62762    62257     -505
+ Partials      16303    16234      -69
Thanks for fixing this 🙌
Let's close this once we have good confidence in #9603.
Signed-off-by: Gaurav Bafna <[email protected]>
65e1f9e to 93357ce
Compatibility status: checks if related components are compatible with change 93357ce
Incompatible components: [https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/neural-search.git]
Skipped components:
Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]
Signed-off-by: Gaurav Bafna <[email protected]>
Compatibility status: checks if related components are compatible with change 7045e94
Incompatible components: [https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/neural-search.git]
Skipped components:
Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]
Flaky test failing: #9688
…9565) Signed-off-by: Gaurav Bafna <[email protected]> (cherry picked from commit ff2b127) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…9565) (#9798) (cherry picked from commit ff2b127) Signed-off-by: Gaurav Bafna <[email protected]> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…pensearch-project#9565) Signed-off-by: Gaurav Bafna <[email protected]> Signed-off-by: Kaushal Kumar <[email protected]>
…pensearch-project#9565) Signed-off-by: Gaurav Bafna <[email protected]> Signed-off-by: Ivan Brusic <[email protected]>
…pensearch-project#9565) Signed-off-by: Gaurav Bafna <[email protected]> Signed-off-by: Shivansh Arora <[email protected]>
Description
Retry the translog download for a finite time if it fails because a file is not found.
In primary-to-primary relocation, there can be concurrent upload and download of the translog.
While translog files are being downloaded by the new primary, they might be deleted by the old primary.
Hence we retry if tlog/ckp files are not found.
This does not happen in the last download, where it is ensured that the older primary has stopped modifying tlog data.
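The retry behaviour described above can be illustrated with a minimal sketch. This is not the code from this change: the class `TranslogDownloadRetry`, the `Download` interface, and the `maxAttempts` and `backoffMillis` parameters are hypothetical names chosen for illustration, and the sketch assumes a missing tlog/ckp file surfaces as `java.nio.file.NoSuchFileException`.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.nio.file.NoSuchFileException;

// Hypothetical helper, not part of RemoteFsTranslog.
class TranslogDownloadRetry {

    /** One download attempt; assumed to throw NoSuchFileException when a tlog/ckp file is missing. */
    @FunctionalInterface
    interface Download {
        void run() throws IOException;
    }

    /**
     * Runs the download, retrying only on NoSuchFileException, at most maxAttempts times.
     * Returns the number of attempts used. Any other IOException propagates immediately.
     */
    static int downloadWithRetry(Download download, int maxAttempts, long backoffMillis) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        NoSuchFileException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                download.run();
                return attempt;
            } catch (NoSuchFileException e) {
                // The file may have been deleted by the old primary mid-download; try again.
                last = e;
                if (attempt < maxAttempts) {
                    sleep(backoffMillis);
                }
            }
        }
        // Retries are finite: give up and surface the last failure.
        throw last;
    }

    private static void sleep(long millis) throws IOException {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new InterruptedIOException("interrupted while waiting to retry download");
        }
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        // Simulated download that fails twice with a missing file, then succeeds.
        int attempts = downloadWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new NoSuchFileException("translog-1.tlog");
            }
        }, 5, 10L);
        System.out.println("succeeded after " + attempts + " attempts");
    }
}
```

Only the file-not-found case is retried here, since that is the transient failure the description attributes to concurrent deletion; other I/O errors are left to fail fast.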
Related Issues
Resolves #9191
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.