Fix flaky test RemoteIndexPrimaryRelocationIT #11614

@sachinpkale sachinpkale commented Dec 14, 2023

Description

  • During primary relocation, once the new primary has been added to the replication group, any indexing requests that reach the old primary are also forwarded to the new primary.
  • After processing these indexing requests, the new primary does not update its local checkpoint, so it sends the older local checkpoint in the replica response.
  • Code reference where RemoteFsTranslog does not update the local checkpoint in ensureSynced: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java#L259
  • The primaryContext sent from the old primary to the new primary during handoff contains a map of allocation ID to local checkpoint info. In this map, the entry for the new primary holds the older checkpoint.
  • The local checkpoint of the new primary is updated during the performSegRep step of the relocation.
  • So, when the new primary receives the primaryContext, the local checkpoint recorded in the context differs from the actual local checkpoint, and this assertion fails.
  • In this bugfix, for remote-backed indexes, we compare the local checkpoint of the new primary with that of the old primary.
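
The mismatch described above can be illustrated with a small, self-contained sketch. It is not the code from this PR: `HandoffCheckpointSketch`, `PrimaryContextSketch`, `checkpointMatches`, the allocation IDs, and the checkpoint values 7 and 10 are all hypothetical, and the remote-store branch reflects one reading of the last bullet (compare against the old primary's entry in the context).

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: every class, field, and method name here is hypothetical
// and does not mirror the real IndexShard or ReplicationTracker code.
class HandoffCheckpointSketch {

    /** Simplified stand-in for the primary context sent during handoff. */
    static final class PrimaryContextSketch {
        // allocation ID -> local checkpoint as tracked by the old primary
        final Map<String, Long> localCheckpoints = new HashMap<>();
    }

    /**
     * Returns true when the new primary's actual local checkpoint matches the
     * context entry chosen for comparison: its own entry by default, or the old
     * primary's entry when the index is remote-backed (assumed reading of the fix).
     */
    static boolean checkpointMatches(
        boolean remoteStoreEnabled,
        long actualLocalCheckpoint,
        String newPrimaryId,
        String oldPrimaryId,
        PrimaryContextSketch context
    ) {
        String idToCompare = remoteStoreEnabled ? oldPrimaryId : newPrimaryId;
        Long recorded = context.localCheckpoints.get(idToCompare);
        return recorded != null && recorded.longValue() == actualLocalCheckpoint;
    }

    /** Builds the scenario from the description: a stale entry for the new primary. */
    static PrimaryContextSketch staleContext() {
        PrimaryContextSketch context = new PrimaryContextSketch();
        context.localCheckpoints.put("old-primary", 10L);
        // Stale value: replica responses from the new primary carried the older checkpoint.
        context.localCheckpoints.put("new-primary", 7L);
        return context;
    }

    public static void main(String[] args) {
        PrimaryContextSketch context = staleContext();
        long actual = 10L; // new primary's checkpoint after the segment replication step
        System.out.println("strict check passes:  "
            + checkpointMatches(false, actual, "new-primary", "old-primary", context));
        System.out.println("relaxed check passes: "
            + checkpointMatches(true, actual, "new-primary", "old-primary", context));
    }
}
```

In this scenario the strict comparison fails (7 vs. 10), which corresponds to the flaky assertion, while the remote-store comparison passes because both primaries are at checkpoint 10.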

Related Issues

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

❌ Gradle check result for 833c1b6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot commented Dec 14, 2023

Compatibility status:

Checks if related components are compatible with change 9362e4f

Incompatible components

Incompatible components: [https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/performance-analyzer-rca.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/k-nn.git]

✅ Gradle check result for 303eff0: SUCCESS

codecov bot commented Dec 14, 2023

Codecov Report

Attention: Patch coverage is 25.00%, with 3 lines in your changes missing coverage. Please review.

Project coverage is 71.38%. Comparing base (904c9a9) to head (9362e4f).
Report is 155 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| ...in/java/org/opensearch/index/shard/IndexShard.java | 25.00% | 1 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #11614      +/-   ##
============================================
- Coverage     71.44%   71.38%   -0.07%     
+ Complexity    59397    59362      -35     
============================================
  Files          4923     4923              
  Lines        279178   279181       +3     
  Branches      40581    40582       +1     
============================================
- Hits         199470   199283     -187     
- Misses        63064    63315     +251     
+ Partials      16644    16583      -61     

❕ Gradle check result for dfd8f7a: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.remotestore.multipart.RemoteStoreMultipartIT.testNoSearchIdleForAnyReplicaCount

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

+ "] does not match checkpoint from primary context ["
+ primaryContext
+ "]";
if (isRemoteStoreEnabled()) {

Member

Let's add some test cases to confirm this behavior.

Member Author

A test already exists: org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing, but it is flaky without this change.

Member

According to the annotation below, this code path isn't hit. Is the coverage tool incorrect?

Member Author

The problem is that the coverage tool does not consider code covered by integration tests:

if (System.getProperty("tests.coverage")) {
  reporting {
    reports {
      testCodeCoverageReport(JacocoCoverageReport) {
        // Only unit test suites feed this aggregated JaCoCo report
        testType = TestSuiteType.UNIT_TEST
      }
    }
  }
}

Member Author

Let me add a unit test around this change.

Member

Thanks @sachinpkale !

@peternied peternied self-assigned this Dec 14, 2023
@opensearch-trigger-bot

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Jan 17, 2024
@sachinpkale sachinpkale force-pushed the handoff-failure-flaky-fix branch from dfd8f7a to 9362e4f Compare January 18, 2024 11:48
@sachinpkale sachinpkale requested a review from peternied January 18, 2024 11:50

❌ Gradle check result for 9362e4f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@opensearch-trigger-bot opensearch-trigger-bot bot removed the stalled Issues that have stalled label Jan 18, 2024

✅ Gradle check result for 9362e4f: SUCCESS

@opensearch-trigger-bot

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Mar 6, 2024
@sachinpkale

The test is fixed as part of #12494. Closing this PR.
