
Fix SimpleNestedExplainIT.testExplainMultipleDocs flakiness #12776

Merged
merged 1 commit into from
Mar 22, 2024

Conversation

neetikasinghal
Contributor

@neetikasinghal commented Mar 19, 2024

Description

The explain itself is coming from Lucene: https://github.com/apache/lucene/blob/a6f70ad2bb0b682eb65feb522358ee6d16cad766/lucene/join/src/java/org/apache/lucene/search/join/ToParentBlockJoinQuery.java#L432-L440

Here's some info on the docIds in lucene:

  • A docId in Lucene is not unique to the index; it is only unique to a segment. Lucene does this mainly to optimize writing and compression. Since a docId is only unique within a segment, how can a doc be uniquely identified at the index level? The solution is simple: the segments are ordered. As a simple example, suppose an index has two segments with 100 docs each. The docIds within each segment are 0-99, but at the index level the docIds of the second segment map to 100-199.

  • DocIds are unique within a segment, assigned sequentially from zero, but that does not mean they are contiguous: when a doc is deleted, a gap remains.

  • The docId corresponding to a document can change, typically when segments are merged.
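As a rough illustration of the first point (a toy model, not Lucene's actual API — `toIndexLevel` and its parameters are hypothetical), converting a segment-local docId to an index-level docId just adds the total doc count of the preceding segments as a base offset:

```java
public class DocIdMapping {
    // Hypothetical helper: maps a segment-local docId to an index-level docId
    // by adding the cumulative doc count of all preceding (ordered) segments.
    static int toIndexLevel(int segmentOrdinal, int localDocId, int[] segmentDocCounts) {
        int base = 0;
        for (int i = 0; i < segmentOrdinal; i++) {
            base += segmentDocCounts[i];
        }
        return base + localDocId;
    }

    public static void main(String[] args) {
        int[] counts = {100, 100};                        // two segments, 100 docs each
        System.out.println(toIndexLevel(0, 0, counts));   // doc 0 of segment 0 -> 0
        System.out.println(toIndexLevel(1, 0, counts));   // doc 0 of segment 1 -> 100
        System.out.println(toIndexLevel(1, 99, counts));  // doc 99 of segment 1 -> 199
    }
}
```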

The test fails on the assertion that validates the range of the docIds. The range changes because indexRandomForMultipleSlices ingests and then deletes several bogus documents, which can trigger background merges and change the docIds matched by the search query.
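To see why a merge can shift the asserted range (again a toy model, not Lucene code — `merge` is a hypothetical stand-in): after deletions leave gaps, a merge compacts the surviving docs and renumbers them consecutively from zero, so a doc's id, and therefore any id range a test asserts on, can change underneath the test:

```java
import java.util.ArrayList;
import java.util.List;

public class MergeRenumbering {
    // Toy model of a merge: deleted slots are dropped and the surviving
    // docs are renumbered consecutively from zero.
    static List<Integer> merge(List<Integer> liveDocIds) {
        List<Integer> renumbered = new ArrayList<>();
        for (int i = 0; i < liveDocIds.size(); i++) {
            renumbered.add(i); // survivors get fresh, consecutive ids
        }
        return renumbered;
    }

    public static void main(String[] args) {
        // Segment held docs 0..4; doc 2 was deleted, leaving a gap.
        List<Integer> live = new ArrayList<>(List.of(0, 1, 3, 4));
        System.out.println(merge(live)); // prints [0, 1, 2, 3] — old doc 3 is now doc 2
    }
}
```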

One solution is to not delete the bogus documents as part of indexRandomForMultipleSlices, perform the search for both concurrent and non-concurrent search, and delete the bogus docs at the end. This is done as part of this PR.

Validated that the test passes across 500 runs.

Relates: #11681

Related Issues

Resolves #12318

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@neetikasinghal
Contributor Author

@reta @jed326 please review

Contributor

❌ Gradle check result for f9cb4f5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Mar 19, 2024

Compatibility status:

Checks if related components are compatible with change 8e3688e

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/performance-analyzer.git]

@jed326
Collaborator

jed326 commented Mar 20, 2024

FYI I think the "resolves" issue is wrong:

Collaborator

@jed326 jed326 left a comment


Perhaps I do not understand the problem correctly, but if the issue here is that background merges are happening between the concurrent and non-concurrent search queries, would a simpler solution be to add a small sleep after indexing is completed so we wait for background merges to complete before we perform the search request?
Or we could even force merge to 2 segments ourselves?

It seems like making deletion of bogus docs optional is a pretty heavy duty solution to what seems like a pretty simple problem.

Contributor

✅ Gradle check result for fce9d57: SUCCESS


codecov bot commented Mar 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.38%. Comparing base (b15cb0c) to head (d346c9f).
Report is 71 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #12776      +/-   ##
============================================
- Coverage     71.42%   71.38%   -0.04%     
- Complexity    59978    60282     +304     
============================================
  Files          4985     5011      +26     
  Lines        282275   283659    +1384     
  Branches      40946    41117     +171     
============================================
+ Hits         201603   202501     +898     
- Misses        63999    64425     +426     
- Partials      16673    16733      +60     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@github-actions github-actions bot added bug Something isn't working flaky-test Random test failure that succeeds on second run Search Search query, autocomplete ...etc labels Mar 20, 2024
@neetikasinghal
Contributor Author

add a small sleep after indexing is completed so we wait for background merges to complete before we perform the search request?
Or we could even force merge to 2 segments ourselves?

@jed326 I thought the same, but it looks like even a sleep of 5 seconds or a force merge (which defeats the purpose of indexing multiple docs) doesn't guarantee successful runs of the tests.

Hence, I came up with this approach of not deleting the bogus docs before the search.

@jed326
Collaborator

jed326 commented Mar 20, 2024

Hence, I came up with this approach of not deleting the bogus docs before the search.

Instead of adding bogus docs could we just add more "real" docs instead?

@neetikasinghal
Contributor Author

Hence, I came up with this approach of not deleting the bogus docs before the search.

Instead of adding bogus docs could we just add more "real" docs instead?

I think it would be the same as calling the indexRandomForSlices function with deletion of bogus docs set to false, which is what currently happens. This would cause code duplication.

@jed326
Collaborator

jed326 commented Mar 20, 2024

I think it would be the same as calling the indexRandomForSlices function with deletion of bogus docs set to false,

This would ingest bogus docs, while what I was suggesting is to modify the test to ingest more (looks like only 1 more would be needed?) non-bogus docs.

Overall it feels like not deleting bogus docs is a sledgehammer approach here that I don't think is generically useful for other tests, so in my opinion a better solution is to modify the singular test that is flaky.

@neetikasinghal
Contributor Author

I think it would be the same as calling the indexRandomForSlices function with deletion of bogus docs set to false,

This would ingest bogus docs, while what I was suggesting is to modify the test to ingest more (looks like only 1 more would be needed?) non-bogus docs.

Overall it feels like not deleting bogus docs is a sledgehammer approach here that I don't think is generically useful for other tests, so in my opinion a better solution is to modify the singular test that is flaky.

ok, so I modified the change to ingest one more doc and use just one shard for this test class, so that multiple slices are created and there is no need to call indexRandomForMultipleSlices for this test. As you said, this also lets us avoid modifying that function. @jed326 thanks for your feedback.

Contributor

❌ Gradle check result for 8e3688e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Collaborator

@jed326 jed326 left a comment


Awesome, I think this is a much more straightforward solution. Thanks!

Contributor

✅ Gradle check result for d346c9f: SUCCESS

@neetikasinghal
Contributor Author

@reta could you also take a look and merge if it looks good to you?

@dblock dblock merged commit 4010ff1 into opensearch-project:main Mar 22, 2024
33 of 40 checks passed
@dblock dblock added the backport 2.x Backport to 2.x branch label Mar 22, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Mar 22, 2024
Signed-off-by: Neetika Singhal <[email protected]>
(cherry picked from commit 4010ff1)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
reta pushed a commit that referenced this pull request Mar 22, 2024
…12859)

(cherry picked from commit 4010ff1)

Signed-off-by: Neetika Singhal <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@neetikasinghal neetikasinghal self-assigned this Mar 28, 2024
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
Labels
backport 2.x Backport to 2.x branch bug Something isn't working flaky-test Random test failure that succeeds on second run Search Search query, autocomplete ...etc skip-changelog
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[BUG] org.opensearch.search.nested.SimpleNestedExplainIT.testExplainMultipleDocs is flaky
4 participants