Concurrent Search Operational Readiness #12118

andrross · 2024-02-01T01:26:10Z

Note: This issue has two goals: to collect and document in a single place all the operational readiness work that went into the concurrent search feature, in order to demonstrate that it is ready for general availability, and to dry-run a more generic checklist that can be extracted and used as a process for releasing large features. In the future, this checklist/process would be referenced and incrementally completed throughout development; in this case, concurrent search serves as the trial run. I'm looking for feedback on the process itself, in addition to the specific concurrent search content.

Dependencies

Enumerate all your dependencies, highlighting any new dependencies.

No new dependencies are added by concurrent search.

Are any plugins impacted by your change?

Concurrent search is backward compatible with plugins. However, plugins that implement an aggregator must make changes in order to support concurrent search. If an aggregator plugin does not support concurrent search, the system will fall back to the non-concurrent behavior to preserve compatibility. See the documentation for more details.
Additionally, plugins that query indexes may utilize concurrent search if the indexes have been configured to use concurrent search, but search behavior should be identical to the non-concurrent case. System indexes (i.e. indexes created by plugins for internal system usage) will not have concurrent search enabled by default.

Can your feature independently be disabled?

Yes. This feature is enabled by either a per-index dynamic setting or a cluster-wide dynamic setting.
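
As a rough illustration of what toggling the feature involves, the sketch below builds the two settings-update requests without sending them. The setting keys `search.concurrent_segment_search.enabled` (cluster-wide) and `index.search.concurrent_segment_search.enabled` (per-index) are assumptions based on the linked user documentation and should be verified against your OpenSearch version. The helper names and the index name `my-index` are hypothetical.

```python
import json

# Assumed setting keys; verify against the documentation for your version.
CLUSTER_SETTING = "search.concurrent_segment_search.enabled"
INDEX_SETTING = "index.search.concurrent_segment_search.enabled"


def cluster_toggle_request(enabled):
    """Build (method, path, body) for a cluster-wide dynamic settings update."""
    body = {"persistent": {CLUSTER_SETTING: enabled}}
    return "PUT", "/_cluster/settings", json.dumps(body)


def index_toggle_request(index, enabled):
    """Build (method, path, body) for a per-index dynamic settings update."""
    body = {INDEX_SETTING: enabled}
    return "PUT", f"/{index}/_settings", json.dumps(body)


if __name__ == "__main__":
    # Print the requests that would disable concurrent search everywhere,
    # and for a single (hypothetical) index.
    for method, path, body in (
        cluster_toggle_request(False),
        index_toggle_request("my-index", False),
    ):
        print(method, path, body)
```

Because both settings are dynamic, no node restart is implied by either request; the same requests with `True` would re-enable the feature.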

Have you added comprehensive user documentation on the documentation website?

Yes: https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/

Have you documented any expert-level settings that can be exercised by an operator?

Yes: https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/#slicing-mechanisms

Failure Modes

Enumerate the list of failure modes or threats for the feature. Consider thinking of threats as unknown failures, e.g. anything that could possibly go wrong and lead to availability or durability loss. For each failure mode, list all available mitigations.

Concurrent search does not introduce any new dependencies or inter-node interactions. However, it is a significant change to a mission-critical code path (search). High-level failure modes fall into two camps: performance and correctness.

Mitigations for Performance Issues
- Concurrent search is dynamically configurable. We have documentation detailing how to configure concurrent search. If a critical performance regression is identified in a live production cluster, then concurrent search can be disabled for a given index or all indexes.

Mitigations for Correctness Issues
- As above, concurrent search can be disabled if a correctness issue is discovered in production.
- Extensive integration testing has been enabled. New tests were written for concurrent-search specific code. Existing tests were parameterized to run both concurrent and non-concurrent cases in order to prove that concurrent search provides the same behavior as non-concurrent search.
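
The idea behind the parameterized tests can be sketched with a toy model: run the same top-k query once sequentially and once per slice count, and assert that the results match. This is a simplified illustration, not the actual OpenSearch test code. The in-memory "segments", the round-robin slicing, and all function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def top_k(hits, k):
    # Order by score descending, then doc_id ascending so ties are deterministic.
    return heapq.nsmallest(k, hits, key=lambda h: (-h[1], h[0]))


def search_sequential(segments, k):
    # Non-concurrent path: scan every segment in one pass.
    hits = [h for seg in segments for h in seg]
    return top_k(hits, k)


def search_concurrent(segments, k, slice_count):
    # Concurrent path: group segments into slices (round-robin here, as an
    # illustrative choice), search each slice on its own thread, then merge.
    slices = [segments[i::slice_count] for i in range(slice_count)]
    with ThreadPoolExecutor(max_workers=slice_count) as pool:
        partials = pool.map(lambda s: search_sequential(s, k), slices)
        merged = [h for part in partials for h in part]
    return top_k(merged, k)


def check_equivalence(segments, k):
    # "Parameterized" check: every slice count must match the sequential result.
    expected = search_sequential(segments, k)
    for slice_count in (1, 2, 4):
        assert search_concurrent(segments, k, slice_count) == expected
    return expected


if __name__ == "__main__":
    # Each segment is a list of (doc_id, score) pairs.
    segments = [[(1, 0.9), (2, 0.4)], [(3, 0.9), (4, 0.7)], [(5, 0.1)]]
    print(check_equivalence(segments, 3))
```

The deterministic tie-break matters: without a total order on hits, the concurrent and sequential paths could legitimately return equal-scoring documents in different orders, and an equality assertion would fail spuriously.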

Testing

Integration Tests

Have you added comprehensive integration tests that are run by default as a part of the `check` gradle task?

Yes: #7440

Do you have any tests that are currently labeled as flaky?

Yes, these are tracked on the Flaky Test Project Board and will be resolved prior to release (one outstanding issue remains as of this writing).

Do you have any tests that rely on the test-retry plugin to retry on flaky failures?

No.

Do you have any tests that are disabled with the `@AwaitsFix` annotation?

No.

Scaling Tests

Have you tested with large clusters (100+ nodes)?

No. Concurrent search is a shard-level feature, so the scaling properties of a multi-node cluster do not fundamentally change with this feature.

Have you tested with variable shard numbers and sizes?

Results to be published here

Chaos Tests

Have you considered simulating faults to mimic hardware failures and network failures (packet loss) in each of your critical request paths?

Concurrent search does not introduce any new dependencies or failure points. Node behavior is not expected to be any different in the case of hardware failures. Many existing integration tests that focus on edge cases have been parameterized to run with concurrent search.

Performance Tests

Have you enabled your feature to be tested in OpenSearch Benchmark?

Yes. OpenSearch Benchmark has the ability to provide index settings, which is how concurrent search is enabled.

Have you enabled your feature to be a part of nightly benchmarking runs?

Yes

Share all performance data

Overall performance meta issue.

Regression/Rollback

Can your feature cause a functionality or feature regression if the feature is disabled?

No. When the setting is disabled, the existing non-concurrent search path is exercised. Concurrent search has been present but disabled via the feature flag mechanism for multiple minor releases and has been shown not to cause a regression.

How have you validated that the feature rollback works?

Integration tests swap between concurrent and non-concurrent search cases by using the dynamic setting. A "rollback" in this context is to simply change the index setting to disable concurrent search.

Diagnostics

When a failure (known or unknown) happens on the system, do you have sufficient instrumentation to debug?

Yes. Logging exists at various levels and has been utilized during development to find bugs.

User Facing API/settings

What are the new REST APIs to be added by this feature?

https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/#api-changes

What are the new settings to be added by this feature?

https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/#disabling-concurrent-search-at-the-index-or-cluster-level

Metrics, Notifications & Visibility

What are the user facing metrics?

What actions need to be taken by the customer if those metrics exceed a threshold? Are there recommended alarms users need to set up?

Guidance around general search performance monitoring remains unchanged and there are no new recommended alarms. The new stats provide concurrent search-specific insights into query performance and will be useful when doing intensive search performance tuning and analysis.

What is the metrics granularity (node level, index level, cluster level, etc.)?

There are index level and node level stats for concurrent search, which is appropriate for this feature.

Related component

Search:Performance

Pallavi-AWS · 2024-02-06T21:50:44Z

Thanks @andrross, this is great. Can we generalize this as an issue template and start enforcing operational readiness early in the feature delivery cycle?

andrross · 2024-02-09T18:43:53Z

@Pallavi-AWS

> Can we generalize this as an issue template

Yes, absolutely! I hope this can serve as an example of a generalized template.

> start enforcing operational readiness early in the feature delivery cycle?

This point is what I would love to start a conversation on with the broader community. I definitely don't want to impose more processes that are difficult for a newbie to discover and follow. On the other hand, OpenSearch is a complex distributed system and I think something like this can help folks build safe, correct, and performant features.

sohami · 2024-02-13T01:51:04Z

Thanks @andrross for creating this template. A few suggestions that I can think of:

- Add an Overview section that summarizes the feature, with references to the meta issue and design for the feature.
- Add a section on security considerations for the feature, such as adding support in the security plugin for new APIs, any default security roles that can be added, or any change in behavior of existing roles.
- In Failure Modes, add an explicit section for dependency-related failures and the resulting behavior of the system or feature.
- Probably also add a section for Data Correctness and Durability considerations if the feature/change directly deals with writing or modifying data.
- In Scaling Tests, probably also add multi-client scenarios for the search/indexing path, depending on the feature.
- Add a section for version upgrade tests. This could be part of the Regression/Rollback section.
- Should we have a section for any callouts that the feature developer or owner would like to bring to the attention of the community? For example, for concurrent segment search we are adding an "Other considerations" section that covers some of these callouts.
- We use labels to triage issues, and for new functionality the existing labels may not be relevant. A section to decide whether new triage labels need to be added, or which existing ones should be used for the feature, could be useful.

andrross · 2024-05-31T20:46:31Z

Closing this issue as concurrent search has been released.

andrross added bug Something isn't working untriaged labels Feb 1, 2024

github-actions bot added the Search:Performance label Feb 1, 2024

github-project-automation bot added this to Search Project Board Feb 1, 2024

github-project-automation bot moved this to 🆕 New in Search Project Board Feb 1, 2024

andrross mentioned this issue Feb 1, 2024

[Concurrent Segment Search][Meta] GA readiness items #9100

Closed

jed326 added this to Concurrent Search Feb 1, 2024

github-project-automation bot moved this to Todo in Concurrent Search Feb 1, 2024

anasalkouz removed the untriaged label Feb 2, 2024

andrross added discuss Issues intended to help drive brainstorming and decision making and removed bug Something isn't working labels Feb 8, 2024

andrross closed this as completed May 31, 2024

github-project-automation bot moved this from Todo to Done in Concurrent Search May 31, 2024

github-project-automation bot moved this from 🆕 New to ✅ Done in Search Project Board May 31, 2024
