Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Meta] [Segment Replication] Run all integration tests with segment replication enabled #6761

Closed
9 of 12 tasks
dreamer-89 opened this issue Mar 20, 2023 · 14 comments · Fixed by #11773
Closed
9 of 12 tasks
Assignees
Labels
Indexing:Replication Issues and PRs related to core replication framework eg segrep >test-failure Test failure from CI, local build, etc. v2.11.0 Issues and PRs related to version 2.11.0

Comments

@dreamer-89
Copy link
Member

dreamer-89 commented Mar 20, 2023

Leading up to GA #5147 of segment replication, we need to do a round of sanity testing with existing integration tests. This is to ensure segment replication is compatible with existing features.

This issue is to track effort on running these tests, identifying root cause of failures. The fix for individual failures can be tracked in separate issues. These tests should be RUN against 2.x branch as we targetting this exercise for SegRep GA (going in 2.7). We can start with server module and run all integration tests internalClusterTest. Once these failures are resolved, we can run remaining integration tests.

General steps:

When running these tests, we need to turn Segment Replication as default replication strategy (this is to overcome fixing replication type for each integration test). Below are the changes needed for this:
1 - update INDEX_REPLICATION_TYPE_SETTING in IndexMetadata to return ReplicationType.SEGMENT.
2 - Update ReplicaitonType.java to return SEGMENT if a NPE is caught.
3 - Update FeatureFlags to set REPLICATION_TYPE_SETTING’s default to true.
4 - Some tests use mockEngineFactory, change OpenSearchIntegTestCase#addMockInternalEngine to return false from this method by default.
5 - Some methods use MockEngineFactory as a node level plugin - override this by also updating the class

@Override
    public Engine newReadWriteEngine(EngineConfig config) {
        if (config.isReadOnlyReplica()) {
            return new NRTReplicationEngine(config);
        }
        return new MockInternalEngine(config, wrapper);
    }
@dreamer-89 dreamer-89 changed the title [Segment Replication] Run all integration tests with segment replication enabled [Meta] [Segment Replication] Run all integration tests with segment replication enabled Mar 20, 2023
@dreamer-89 dreamer-89 added >test-failure Test failure from CI, local build, etc. v2.7.0 labels Mar 21, 2023
@anasalkouz anasalkouz moved this from Todo to In Progress in Segment Replication Apr 13, 2023
@dreamer-89 dreamer-89 added the v2.8.0 'Issues and PRs related to version v2.8.0' label Apr 18, 2023
@mch2 mch2 removed the v2.7.0 label May 2, 2023
@mch2
Copy link
Member

mch2 commented Jun 9, 2023

We are still working on enabling SR tests with randomization across our entire suite.

The issue with randomly running SR for all tests is that assertions do not wait for replication to complete. So we consistently see doc count assertions break.

To fix this we would need to update every test and wrap these assertions in assertBusy. Most tests are performing a query and then invoking assertHitCount that takes the response, so we can't wrap this at a higher level.

An alternative here, is we could the SearchType to use of Preference.PRIMARY_FIRST inside of RandomizingClient when segrep is enabled on the index. This would mean assertions are made against a primary unless a replica node is specified. To ensure there was no issue during replication, we can assert replica segment state before/during teardown.

@Bukhtawar
Copy link
Collaborator

+1 to the overall issue
Primary node however might not be a good test for "replication" and my concern is we might a false sense of victory. However we can atleast start from here and work on truly building "replication" tests

@mch2
Copy link
Member

mch2 commented Jun 16, 2023

Primary node however might not be a good test for "replication" and my concern is we might a false sense of victory.

Agree we aren't loving the solution of forcing primary first. Also not loving some alternatives we've come up with which are.

  1. Wrapping all tests in assertbusy where required (oof)
  2. adding a new base method for assertHitCount that passes the client and req so it can be repeated. Essentially making it assertEventualCount... This isn't terrible but still requires a lot of tests to be updated. It will also not work for tests concurrently indexing. There are also other assertions made other than hit count and still leave us with flakiness.
  3. Hack the client to refresh & assert replicas are caught up before performing a search / after a refresh. This introduces side effects to the tests that could be dangerous/misleading. Also doesn't work for concurrent cases.

I think the safest approach here is to outline the most critical subset of tests and create separate SR versions of them. We've essentially done that with SegmentReplication based ITs, but we will need to audit that list for coverage.

@Rishikesh1159
Copy link
Member

As @mch2 mentioned in this issue #8109 we have outlined the most critical test packages that we need to run with Segment Replication in server module:

cluster
gateway
index
indexing
indices
ingest
recovery
remotestore
update

@Rishikesh1159
Copy link
Member

Few options for us to enable segment replication with these critical test packages:

Previously pointed out options:
-> Wrapping all tests in assertbusy where required.
-> adding a new base method for assertHitCount that passes the client and req so it can be repeated. Essentially making it assertEventualCount... This isn't terrible but still requires a lot of tests to be updated. It will also not work for tests concurrently indexing. There are also other assertions made other than hit count and still leave us with flakiness.

Other options:
-> In every test after every index operation, we can call a method to verify if the store content of both primary and replica are same. As we do it here. - Problem here will be, we will need to modify every test and it might not work well for test using background indexing

-> Duplicating all test packages by creating separate Segment Replication versions of those test packages. - Problem here would be duplication lot of tests with very minor changes.

@dreamer-89 dreamer-89 added the v2.9.0 'Issues and PRs related to version v2.9.0' label Jul 5, 2023
@dreamer-89 dreamer-89 added v2.10.0 and removed v2.9.0 'Issues and PRs related to version v2.9.0' v2.8.0 'Issues and PRs related to version v2.8.0' labels Jul 19, 2023
@Rishikesh1159
Copy link
Member

Rishikesh1159 commented Jul 26, 2023

First we need to come up with a detailed design of the problem and explore few options/solutions. Once we have detailed plan and a working solution we need to come up with a plan of action to verify all integ tests passes.

Detailed Design of problem and proposed solution:

Background:

→ First let’s do a quick recap of how segment replication works.

  • When a document is indexed/write into opensearch we first index that document into primary shard (store them as segments in primary shards). Then depending on refresh interval, refresh happens on primary which makes document ready to be searchable by user/client. On a refresh, primary also publishes checkpoint to all replica shards and then the replica copies segments from primary. Until this copy event is completed (which usually takes some time to finish) the document is not searchable on replica shard.

Problem:

The main problem with running all integ tests with segrep enabled is tests failing with assertion on replica shard before it has caught up with primary.

Brute Force Solution:

→ As the problem above states we need to figure out a way to wait until replica shard has caught up with primary and then make assertions on tests.
→ Brute force approach for this would be to go to every integration test and add manual step to wait until replica has caught up with primary before every assertion in an integ test.
→ The problem with brute force approach is there are few thousands of integ tests in our current codebase and all of them need to be manually updated with wait until before assertions in tests. Also every new integ test added into opensearch codebase should have these wait until checks before assertions to be compatible with segment replication.

Proposed Solution:

→ Usually before assertions on replica shard, the client performs a refresh operation or search operation in integ test.
→ Approach is we create a new SegRep client similar to Randomized Client, and then we override the search or refresh methods in client to wait until replica shard has caught up (using _cat/segment_replication API) before returning back to integ test.
→ All the integ tests in opensearch codebase in one way or another inherit from single parent class OpenSearchIntegTestCase. So we can use this new SegRep client in OpenSearchIntegTestCase class so that all integ tests inheriting this class can be ran against segment replication (with waiting until replica catching up behaviour.)
→ I have created a branch: https://github.com/Rishikesh1159/OpenSearch/tree/test-all-integtests-segrep as POC to test this solution. I have modified that client’s search request method to wait until replica has caught up before getting back to integ Test.

Downside with proposed solution:

→ Not every integ test uses the client’s search requests for searching docs. Some integ tests use GET’s and MGET’s to search from translog. So integ tests using these GET’s and MGET’s requests will still fail with above proposed solution when segment replication is enabled. We already have an issue cut for GET’s and MGET’s here: #8536, this is being worked on independently.

→ When there is continuous/concurrent indexing with searching (assertions) in an integ tests, these tests might fail on few occasions as replica can be behind primary because of continuous ingestion and search. Our proposed solution might fail in this case. The best way to handle these kind of integ tests is to handle them individually test by test by adding wait until behaviour manually. As there are only very few tests of this kind we can update them manually if needed.

Plan of Action:

-> We need to verify all integration tests in opensearch repo pass with both segment replication and remote store. So, we can divide our testing plan into two phases:

  • Phase-1 : Segment Replication Only [Node to Node case].
  • Phase-2: Remote Store Using Segment Replication.

-> All Tests in opensearch repo do inherit from one of the base classes here. Here we can ignore all tests inheriting from both OpenSearchTestCase (unit tests) and OpenSearchSingleNodeTestCase (no replica case) for segment replication and remote store features. Most important base testing class for us would be OpenSearchIntegTestCase as all integ tests inherit from this base class. Finally we should also make sure that segment replication and remote store haven't broken any modules by verifying base classes OpenSearchRestTestCase and OpenSearchClientYamlSuiteTestCase. Our target for plan of action would be verifying all test classes inheriting following base classes pass:

  • OpenSearchIntegTestCase
  • OpenSearchRestTestCase
  • OpenSearchClientYamlSuiteTestCase

-> Initially we will execute our plan of action first on feature branch segment-replication. Once we have enough confidence we can merge this feature branch to main branch.

### Phase - 1 (Segment Replication - Node to Node):

### Phase - 2 (Remote Store With Segment Replication):

@Rishikesh1159
Copy link
Member

Each point in Plan of Action of above comment can be a separate sub task.

@dblock
Copy link
Member

dblock commented Jul 27, 2023

Each point in Plan of Action of above comment can be a separate sub task.

You can use - [ ] in the issue summary and GitHub will tell you what's left :)

@anasalkouz
Copy link
Member

Thanks @Rishikesh1159 for the detailed plan. Is the proposed solution will work for other modules as well (not server module)? are you going to cover this as part of your POC?

@Rishikesh1159
Copy link
Member

Rishikesh1159 commented Jan 8, 2024

Plan For running integration tests with segment replication.

Run all the existing Integ Tests in opensearch with segment replication enabled. Initially this was the end goal but soon we realized this is not good idea as there are many Integ Tests which will obviously fail with segrep and there are other set of tests which are completely unrelated to segment replication. Running these sort of tests with segment replication enabled wouldn’t add any value. Instead it might lead to adding more flaky tests. So we are targeting only specific modules to run with segrep initially and then later we can extend this to other modules if needed.

Goal:

We are targeting only specific modules that are related to indexing/replication to run with segrep initially and then later we can extend this to other modules if needed.

Following are 4 steps necessary :

Step 1 (waiting until replica):

→ Coming up with logic of waiting until replica caught up. This logic can be found here

Step 2 (Mechanism to implement waiting until replica):

→ Next we need to come up with an mechanism/approach to use logic in step 1 , so that it can be plugged in at one place and used by multiple tests.
→ Possible Approches:

  • Approach 1: Modify the existing client to override search requests. Similar to what we discussed before here about segrep client.
  • Problem with Approach 1: In this approach we block search requests on client side until replica caught up. We are basically modifying existing client behaviour based on segrep settings. Tests/users writing integ tests usually don’t expect the client behaviour to wait. So we will be adding unintended behaviour on client for all segrep tests.
  • Approach 2: Modify Indexing Helper methods to add step 1 logic. After finishing an indexing request and refresh if segrep is enabled on index then we add waiting until replica caught up (step 1) logic. Usually in all integ tests to perform index operation we use one of the following way to index docs:
    -> using helper method indexRandom(), example. Many IntegTests use indexRandom() to index docs, so just modifying the indexRandom() in parent class OpensearchIntegTestClass.java would cover lot of tests.
    -> using any of the index() helper method, example.
    -> directly using client, client().prepareIndex("test"), example.
  • Problem with Approach 2: Although many tests use indexRandom() but there are still some tests using client().prepareIndex("test"), index(). So for tests not using indexrandom() we either change the test to use indexRandom or we add waiting until (step 1) logic in test itself after a refresh.

Step 3 (Framework/Mechanism for running a test with both default (docrep) and segrep enabled) :

→ Next we need to come up with an approach to run existing tests with segrep enabled and segrep disabled.
→ Possible Approach:

  • Approach 1: Similar to existing concurrent search using parameterization of test class. We need to modify the base parameterization class to support final index settings, as segment replication setting is final.
  • Problem with Approach 1: Tests with default scope/SCOPE.SUITE using same Index across all the tests in suite will fail to run with both segrep enabled and disabled. As segrep setting is final it can be set only once while index being created. We also need to go through every test suite and make some changes to support parameterization for those classes.
  • Approach 2: Create a new segrep version of every test by extending the original test and modifying settings.
  • Problem with Approach 2: Lots of extra new classes will be added to the existing test modules as we are basically creating another class for every test to run with segrep.

Step 4:

→ Identify and list out all the tests/modules that definitely need to be run with segment replication.

→ Implementation of all above 4 steps.

@andrross
Copy link
Member

@reta @sohami I would appreciate your feedback on this as I know you were involved in the parameterization of the concurrent search integration tests. Specifically, I'd like you're feedback on "Step 3" of the previous comment, which is the mechanism for parameterizing the test cases. You can see the implementation in #11773. Basically, because segment replication is a non-dynamic setting for an index, we're going with the inheritance approach in order to accommodate suite scope tests.

@reta
Copy link
Collaborator

reta commented Jan 10, 2024

Basically, because segment replication is a non-dynamic setting for an index, we're going with the inheritance approach in order to accommodate suite scope tests.

@andrross I think we could make parameterized tests work with ClusterScope annotation (you are very right that it does not work right now as one would expect) but it as of today, going with inheritance seems to be the most straightforward way to start with (I will try to look into parameterized tests later this week and will update you).

@reta
Copy link
Collaborator

reta commented Jan 12, 2024

@andrross have this draft opened #11877, at a glance seems to be feasible but the pull request still needs some work.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Segment Replication Feb 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing:Replication Issues and PRs related to core replication framework eg segrep >test-failure Test failure from CI, local build, etc. v2.11.0 Issues and PRs related to version 2.11.0
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

9 participants