[Meta] [Segment Replication] Run all integration tests with segment replication enabled #6761
We are still working on enabling SR tests with randomization across our entire suite. The issue with randomly running SR for all tests is that assertions do not wait for replication to complete, so we consistently see doc count assertions break. To fix this we would need to update every test and wrap these assertions in assertBusy. Most tests perform a query and then invoke assertHitCount, which takes the response, so we can't wrap this at a higher level. An alternative here, is we could the SearchType to use of |
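To make the wrapping idea in the comment above concrete, here is a minimal, self-contained sketch. It does not use the OpenSearch test framework: the `assertBusy` and `assertHitCount` methods below are simplified, hypothetical stand-ins written for this illustration, and the replica is modelled as a counter that a background thread updates after a short delay.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class AssertBusySketch {

    /** Hypothetical stand-in: re-runs the assertion with backoff until it passes or the timeout elapses. */
    public static void assertBusy(Runnable assertion, long maxWait, TimeUnit unit) throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        while (true) {
            try {
                assertion.run();
                return; // assertion passed
            } catch (AssertionError e) {
                if (System.nanoTime() >= deadline) {
                    throw e; // give up and surface the last failure
                }
            }
            Thread.sleep(sleepMillis);
            sleepMillis = Math.min(sleepMillis * 2, 100);
        }
    }

    /** Hypothetical stand-in for a hit-count assertion on a search response. */
    public static void assertHitCount(long actual, long expected) {
        if (actual != expected) {
            throw new AssertionError("expected " + expected + " hits but got " + actual);
        }
    }

    public static void main(String[] args) throws Exception {
        // The replica starts with no docs and "catches up" after a short delay.
        AtomicLong replicaDocCount = new AtomicLong(0);
        Thread replication = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            replicaDocCount.set(10);
        });
        replication.start();

        // The query is re-issued inside the retried block, so each attempt sees fresh state.
        assertBusy(() -> assertHitCount(replicaDocCount.get(), 10), 5, TimeUnit.SECONDS);
        replication.join();
        System.out.println("replica hit count matched after waiting");
    }
}
```

Note that the query has to be re-run inside the retried block. This is why, as the comment says, an assertion that receives an already-computed response cannot simply be wrapped at a higher level.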
+1 to the overall issue |
Agree we aren't loving the solution of forcing primary first. Also not loving some alternatives we've come up with which are.
I think the safest approach here is to outline the most critical subset of tests and create separate SR versions of them. We've essentially done that with SegmentReplication based ITs, but we will need to audit that list for coverage. |
A few options for enabling segment replication with these critical test packages:
Previously pointed out options:
Other options:
-> Duplicate all test packages by creating separate Segment Replication versions of them.
- The problem here would be duplicating a lot of tests with very minor changes. |
First we need to come up with a detailed design of the problem and explore a few options/solutions. Once we have a detailed plan and a working solution, we need a plan of action to verify that all integ tests pass.
Detailed Design of problem and proposed solution:
Background:
→ First let’s do a quick recap of how segment replication works.
Problem:
The main problem with running all integ tests with segrep enabled is that tests fail on assertions against a replica shard before it has caught up with the primary.
Brute Force Solution:
→ As the problem above states, we need to figure out a way to wait until the replica shard has caught up with the primary and only then make assertions in tests.
Proposed Solution:
→ Usually, before assertions on a replica shard, the client performs a refresh operation or search operation in an integ test.
Downside with proposed solution:
→ Not every integ test uses the client’s search requests for searching docs. Some integ tests use GETs and MGETs to read from the translog, so these tests will still fail with the proposed solution when segment replication is enabled. We already have an issue cut for GETs and MGETs here: #8536; this is being worked on independently.
→ When there is continuous/concurrent indexing alongside searching (assertions) in an integ test, the test might occasionally fail because the replica can be behind the primary due to the continuous ingestion. Our proposed solution might fail in this case. The best way to handle these integ tests is individually, test by test, by adding wait-until behaviour manually. As there are only very few tests of this kind, we can update them manually if needed.
Plan of Action:
-> We need to verify all integration tests in opensearch repo pass with both segment replication and remote store. So, we can divide our testing plan into two phases:
-> All tests in the opensearch repo inherit from one of the base classes here. Here we can ignore all tests inheriting from both
-> Initially we will execute our plan of action on the feature branch segment-replication. Once we have enough confidence we can merge this feature branch to the main branch.
### Phase - 1 (Segment Replication - Node to Node):
### Phase - 2 (Remote Store With Segment Replication):
|
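The "wait until the replica has caught up with the primary" step from the design above can be sketched as follows. This is only an illustration under stated assumptions: `Shard`, `checkpointVersion`, `caughtUp`, and `waitForReplicas` are hypothetical names invented here, not OpenSearch classes or methods, and a shard copy is modelled as nothing more than a version counter.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class ReplicaCatchUpSketch {

    /** Hypothetical stand-in for a shard copy: tracks the version of the latest checkpoint it holds. */
    public static final class Shard {
        private final AtomicLong checkpointVersion = new AtomicLong(0);

        public long checkpointVersion() {
            return checkpointVersion.get();
        }

        public void advanceTo(long version) {
            checkpointVersion.set(version);
        }
    }

    /** True when every replica holds a checkpoint at least as new as the primary's current one. */
    public static boolean caughtUp(Shard primary, List<Shard> replicas) {
        long target = primary.checkpointVersion();
        return replicas.stream().allMatch(r -> r.checkpointVersion() >= target);
    }

    /** Polls until all replicas have caught up; returns false if the timeout elapses first. */
    public static boolean waitForReplicas(Shard primary, List<Shard> replicas, long timeoutMillis)
        throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!caughtUp(primary, replicas)) {
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            Thread.sleep(5);
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        Shard primary = new Shard();
        Shard replica = new Shard();
        primary.advanceTo(3); // primary has refreshed three times

        // Simulated replication: the replica copies the primary's checkpoint after a delay.
        Thread copier = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            replica.advanceTo(primary.checkpointVersion());
        });
        copier.start();

        boolean ok = waitForReplicas(primary, List.of(replica), 5000);
        copier.join();
        System.out.println("replicas caught up: " + ok);
    }
}
```

Because `caughtUp` re-reads the primary's version on every poll, the target keeps moving while indexing continues. That matches the downside noted in the design: under continuous ingestion a wait like this may never settle, so those tests need individual handling.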
Each point in Plan of Action of above comment can be a separate sub task. |
You can use |
Thanks @Rishikesh1159 for the detailed plan. Will the proposed solution work for other modules as well (not just the server module)? Are you going to cover this as part of your POC? |
Plan for running integration tests with segment replication.
Run all the existing integ tests in opensearch with segment replication enabled. Initially this was the end goal, but we soon realized it is not a good idea: many integ tests will obviously fail with segrep, and another set of tests is completely unrelated to segment replication. Running these tests with segment replication enabled wouldn’t add any value; instead it might add more flaky tests. So we are targeting only specific modules to run with segrep initially, and later we can extend this to other modules if needed.
Goal:
We are targeting only specific modules that are related to indexing/replication to run with segrep initially and then later we can extend this to other modules if needed. The following 4 steps are necessary:
Step 1 (waiting until replica):
→ Come up with the logic for waiting until the replica has caught up. This logic can be found here
Step 2 (Mechanism to implement waiting until replica):
→ Next we need a mechanism/approach to use the logic in step 1, so that it can be plugged in at one place and used by multiple tests.
Step 3 (Framework/Mechanism for running a test with both default (docrep) and segrep enabled):
→ Next we need an approach to run existing tests with segrep enabled and with segrep disabled.
Step 4:
→ Identify and list out all the tests/modules that definitely need to be run with segment replication.
→ Implementation of all above 4 steps. |
@reta @sohami I would appreciate your feedback on this as I know you were involved in the parameterization of the concurrent search integration tests. Specifically, I'd like your feedback on "Step 3" of the previous comment, which is the mechanism for parameterizing the test cases. You can see the implementation in #11773. Basically, because segment replication is a non-dynamic setting for an index, we're going with the inheritance approach in order to accommodate suite scope tests. |
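As a rough illustration of the inheritance approach mentioned above (this is not the code in #11773), the sketch below has a base test class build its index settings from an overridable hook, with a subclass re-running the same test body under segment replication. The class and method names are hypothetical. Settings are modelled as a plain map, and the key `index.replication.type` with values `DOCUMENT`/`SEGMENT` is assumed here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class InheritanceSketch {

    /** Hypothetical base test: runs with the default (document) replication type. */
    public static class IndexingIT {

        /** Hook that subclasses override to change the replication type for the whole suite. */
        protected String replicationType() {
            return "DOCUMENT";
        }

        /** Settings are fixed at index creation, so the type is chosen once per class. */
        public final Map<String, String> indexSettings() {
            Map<String, String> settings = new LinkedHashMap<>();
            settings.put("index.number_of_replicas", "1");
            settings.put("index.replication.type", replicationType()); // assumed setting key
            return settings;
        }

        /** Placeholder for a real test body; it only reports which type it ran with. */
        public String testIndexAndSearch() {
            return "indexed and searched with " + indexSettings().get("index.replication.type");
        }
    }

    /** Same test bodies, inherited unchanged, but run with segment replication. */
    public static class SegmentReplicationIndexingIT extends IndexingIT {
        @Override
        protected String replicationType() {
            return "SEGMENT";
        }
    }

    public static void main(String[] args) {
        System.out.println(new IndexingIT().testIndexAndSearch());
        System.out.println(new SegmentReplicationIndexingIT().testIndexAndSearch());
    }
}
```

The reason for choosing a subclass over a per-test parameter is the one given in the comment. Because the setting cannot be changed on an existing index, a suite-scoped cluster whose indices are shared across test methods needs the replication type decided once per class.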
@andrross I think we could make parameterized tests work with |
Leading up to GA #5147 of segment replication, we need to do a round of sanity testing with existing integration tests. This is to ensure segment replication is compatible with existing features.
This issue is to track effort on running these tests and identifying the root cause of failures. The fix for individual failures can be tracked in separate issues. These tests should be run against the 2.x branch as we are targeting this exercise for SegRep GA (going in 2.7). We can start with the server module and run all integration tests (internalClusterTest). Once these failures are resolved, we can run the remaining integration tests.
General steps:
When running these tests, we need to make Segment Replication the default replication strategy (this avoids having to set the replication type for each integration test). Below are the changes needed for this:
1 - Update INDEX_REPLICATION_TYPE_SETTING in IndexMetadata to return ReplicationType.SEGMENT.
2 - Update ReplicationType.java to return SEGMENT if an NPE is caught.
3 - Update FeatureFlags to set REPLICATION_TYPE_SETTING’s default to true.
4 - Some tests use mockEngineFactory, change OpenSearchIntegTestCase#addMockInternalEngine to return false from this method by default.
5 - Some methods use MockEngineFactory as a node level plugin - override this by also updating the class