diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 815687fa17..61a535fe91 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -1 +1 @@ -* @kolchfa-aws @Naarcha-AWS @vagimeli @AMoo-Miki @natebower @dlvenable @stephen-crawford @epugh +* @kolchfa-aws @Naarcha-AWS @AMoo-Miki @natebower @dlvenable @epugh diff --git a/.github/workflows/pr_checklist.yml b/.github/workflows/pr_checklist.yml index b56174793e..e34d0cecb2 100644 --- a/.github/workflows/pr_checklist.yml +++ b/.github/workflows/pr_checklist.yml @@ -29,7 +29,7 @@ jobs: with: script: | let assignee = context.payload.pull_request.user.login; - const prOwners = ['Naarcha-AWS', 'kolchfa-aws', 'vagimeli', 'natebower']; + const prOwners = ['Naarcha-AWS', 'kolchfa-aws', 'natebower']; if (!prOwners.includes(assignee)) { assignee = 'kolchfa-aws' @@ -40,4 +40,4 @@ jobs: owner: context.repo.owner, repo: context.repo.repo, assignees: [assignee] - }); \ No newline at end of file + }); diff --git a/.ruby-version b/.ruby-version deleted file mode 100644 index 4772543317..0000000000 --- a/.ruby-version +++ /dev/null @@ -1 +0,0 @@ -3.3.2 diff --git a/MAINTAINERS.md b/MAINTAINERS.md index 55b908e027..b06d367e21 100644 --- a/MAINTAINERS.md +++ b/MAINTAINERS.md @@ -9,14 +9,14 @@ This document lists the maintainers in this repo. See [opensearch-project/.githu | Fanit Kolchina | [kolchfa-aws](https://github.com/kolchfa-aws) | Amazon | | Nate Archer | [Naarcha-AWS](https://github.com/Naarcha-AWS) | Amazon | | Nathan Bower | [natebower](https://github.com/natebower) | Amazon | -| Melissa Vagi | [vagimeli](https://github.com/vagimeli) | Amazon | | Miki Barahmand | [AMoo-Miki](https://github.com/AMoo-Miki) | Amazon | | David Venable | [dlvenable](https://github.com/dlvenable) | Amazon | -| Stephen Crawford | [stephen-crawford](https://github.com/stephen-crawford) | Amazon | | Eric Pugh | [epugh](https://github.com/epugh) | OpenSource Connections | ## Emeritus -| Maintainer | GitHub ID | Affiliation | -| ---------------- | ----------------------------------------------- | ----------- | -| Heather Halter | [hdhalter](https://github.com/hdhalter) | Amazon | +| Maintainer | GitHub ID | Affiliation | +| ---------------- | ------------------------------------------------------- | ----------- | +| Heather Halter | [hdhalter](https://github.com/hdhalter) | Amazon | +| Melissa Vagi | [vagimeli](https://github.com/vagimeli) | Amazon | +| Stephen Crawford | [stephen-crawford](https://github.com/stephen-crawford) | Amazon | \ No newline at end of file diff --git a/README.md b/README.md index 52321335c7..807e106309 100644 --- a/README.md +++ b/README.md @@ -24,7 +24,6 @@ If you encounter problems or have questions when contributing to the documentati - [kolchfa-aws](https://github.com/kolchfa-aws) - [Naarcha-AWS](https://github.com/Naarcha-AWS) -- [vagimeli](https://github.com/vagimeli) ## Code of conduct diff --git a/_about/version-history.md b/_about/version-history.md index bfbd8e9f55..d1cf98c178 100644 --- a/_about/version-history.md +++ b/_about/version-history.md @@ -34,6 +34,7 @@ OpenSearch version | Release highlights | Release date [2.0.1](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.0.1.md) | Includes bug fixes and maintenance updates for Alerting and Anomaly Detection. 
| 16 June 2022 [2.0.0](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.0.0.md) | Includes document-level monitors for alerting, OpenSearch Notifications plugins, and Geo Map Tiles in OpenSearch Dashboards. Also adds support for Lucene 9 and bug fixes for all OpenSearch plugins. For a full list of release highlights, see the Release Notes. | 26 May 2022 [2.0.0-rc1](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.0.0-rc1.md) | The Release Candidate for 2.0.0. This version allows you to preview the upcoming 2.0.0 release before the GA release. The preview release adds document-level alerting, support for Lucene 9, and the ability to use term lookup queries in document level security. | 03 May 2022 +[1.3.20](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.20.md) | Includes enhancements to Anomaly Detection Dashboards, bug fixes for Alerting and Dashboards Reports, and maintenance updates for several OpenSearch components. | 11 December 2024 [1.3.19](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.19.md) | Includes bug fixes and maintenance updates for OpenSearch security, OpenSearch security Dashboards, and anomaly detection. | 27 August 2024 [1.3.18](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.18.md) | Includes maintenance updates for OpenSearch security. | 16 July 2024 [1.3.17](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.17.md) | Includes maintenance updates for OpenSearch security and OpenSearch Dashboards security. | 06 June 2024 diff --git a/_analyzers/custom-analyzer.md b/_analyzers/custom-analyzer.md new file mode 100644 index 0000000000..c456f3d826 --- /dev/null +++ b/_analyzers/custom-analyzer.md @@ -0,0 +1,312 @@ +--- +layout: default +title: Creating a custom analyzer +nav_order: 40 +parent: Analyzers +--- + +# Creating a custom analyzer + +To create a custom analyzer, specify a combination of the following components: + +- Character filters (zero or more) + +- Tokenizer (one) + +- Token filters (zero or more) + +## Configuration + +The following parameters can be used to configure a custom analyzer. + +| Parameter | Required/Optional | Description | +|:--- | :--- | :--- | +| `type` | Optional | The analyzer type. Default is `custom`. You can also specify a prebuilt analyzer using this parameter. | +| `tokenizer` | Required | A tokenizer to be included in the analyzer. | +| `char_filter` | Optional | A list of character filters to be included in the analyzer. | +| `filter` | Optional | A list of token filters to be included in the analyzer. | +| `position_increment_gap` | Optional | The extra spacing applied between values when indexing text fields that have multiple values. For more information, see [Position increment gap](#position-increment-gap). Default is `100`. | + +## Examples + +The following examples demonstrate various custom analyzer configurations. 
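
Before the specific use cases below, the following minimal sketch shows how these parameters fit together in a single request. The index, analyzer, and field names are illustrative, and `position_increment_gap` is shown at its default value:

```json
PUT my_custom_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"],
          "position_increment_gap": 100
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
```
{% include copy-curl.html %}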
+ +### Custom analyzer with a character filter for HTML stripping + +The following example analyzer removes HTML tags from text before tokenization: + +```json +PUT simple_html_strip_analyzer_index +{ + "settings": { + "analysis": { + "analyzer": { + "html_strip_analyzer": { + "type": "custom", + "char_filter": ["html_strip"], + "tokenizer": "whitespace", + "filter": ["lowercase"] + } + } + } + } +} +``` +{% include copy-curl.html %} + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET simple_html_strip_analyzer_index/_analyze +{ + "analyzer": "html_strip_analyzer", + "text": "
<p>OpenSearch is <strong>awesome</strong>!</p>
" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "opensearch", + "start_offset": 3, + "end_offset": 13, + "type": "word", + "position": 0 + }, + { + "token": "is", + "start_offset": 14, + "end_offset": 16, + "type": "word", + "position": 1 + }, + { + "token": "awesome!", + "start_offset": 25, + "end_offset": 42, + "type": "word", + "position": 2 + } + ] +} +``` + +### Custom analyzer with a mapping character filter for synonym replacement + +The following example analyzer replaces specific characters and patterns before applying the synonym filter: + +```json +PUT mapping_analyzer_index +{ + "settings": { + "analysis": { + "analyzer": { + "synonym_mapping_analyzer": { + "type": "custom", + "char_filter": ["underscore_to_space"], + "tokenizer": "standard", + "filter": ["lowercase", "stop", "synonym_filter"] + } + }, + "char_filter": { + "underscore_to_space": { + "type": "mapping", + "mappings": ["_ => ' '"] + } + }, + "filter": { + "synonym_filter": { + "type": "synonym", + "synonyms": [ + "quick, fast, speedy", + "big, large, huge" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET mapping_analyzer_index/_analyze +{ + "analyzer": "synonym_mapping_analyzer", + "text": "The slow_green_turtle is very large" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + {"token": "slow","start_offset": 4,"end_offset": 8,"type": "","position": 1}, + {"token": "green","start_offset": 9,"end_offset": 14,"type": "","position": 2}, + {"token": "turtle","start_offset": 15,"end_offset": 21,"type": "","position": 3}, + {"token": "very","start_offset": 25,"end_offset": 29,"type": "","position": 5}, + {"token": "large","start_offset": 30,"end_offset": 35,"type": "","position": 6}, + {"token": "big","start_offset": 30,"end_offset": 35,"type": "SYNONYM","position": 6}, + {"token": "huge","start_offset": 30,"end_offset": 35,"type": "SYNONYM","position": 6} + ] +} +``` + +### Custom analyzer with a custom pattern-based character filter for number normalization + +The following example analyzer normalizes phone numbers by removing dashes and spaces and applies edge n-grams to the normalized text to support partial matches: + +```json +PUT advanced_pattern_replace_analyzer_index +{ + "settings": { + "analysis": { + "analyzer": { + "phone_number_analyzer": { + "type": "custom", + "char_filter": ["phone_normalization"], + "tokenizer": "standard", + "filter": ["lowercase", "edge_ngram"] + } + }, + "char_filter": { + "phone_normalization": { + "type": "pattern_replace", + "pattern": "[-\\s]", + "replacement": "" + } + }, + "filter": { + "edge_ngram": { + "type": "edge_ngram", + "min_gram": 3, + "max_gram": 10 + } + } + } + } +} +``` +{% include copy-curl.html %} + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET advanced_pattern_replace_analyzer_index/_analyze +{ + "analyzer": "phone_number_analyzer", + "text": "123-456 7890" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + {"token": "123","start_offset": 0,"end_offset": 12,"type": "","position": 0}, + {"token": "1234","start_offset": 0,"end_offset": 12,"type": "","position": 0}, + {"token": "12345","start_offset": 0,"end_offset": 12,"type": "","position": 0}, + {"token": "123456","start_offset": 0,"end_offset": 
12,"type": "","position": 0}, + {"token": "1234567","start_offset": 0,"end_offset": 12,"type": "","position": 0}, + {"token": "12345678","start_offset": 0,"end_offset": 12,"type": "","position": 0}, + {"token": "123456789","start_offset": 0,"end_offset": 12,"type": "","position": 0}, + {"token": "1234567890","start_offset": 0,"end_offset": 12,"type": "","position": 0} + ] +} +``` + +## Position increment gap + +The `position_increment_gap` parameter sets a positional gap between terms when indexing multi-valued fields, such as arrays. This gap ensures that phrase queries don't match terms across separate values unless explicitly allowed. For example, a default gap of 100 specifies that terms in different array entries are 100 positions apart, preventing unintended matches in phrase searches. You can adjust this value or set it to `0` in order to allow phrases to span across array values. + +The following example demonstrates the effect of `position_increment_gap` using a `match_phrase` query. + +1. Index a document in a `test-index`: + + ```json + PUT test-index/_doc/1 + { + "names": [ "Slow green", "turtle swims"] + } + ``` + {% include copy-curl.html %} + +1. Query the document using a `match_phrase` query: + + ```json + GET test-index/_search + { + "query": { + "match_phrase": { + "names": { + "query": "green turtle" + } + } + } + } + ``` + {% include copy-curl.html %} + + The response returns no hits because the distance between the terms `green` and `turtle` is `100` (the default `position_increment_gap`). + +1. Now query the document using a `match_phrase` query with a `slop` parameter that is higher than the `position_increment_gap`: + + ```json + GET test-index/_search + { + "query": { + "match_phrase": { + "names": { + "query": "green turtle", + "slop": 101 + } + } + } + } + ``` + {% include copy-curl.html %} + + The response contains the matching document: + + ```json + { + "took": 4, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 1, + "relation": "eq" + }, + "max_score": 0.010358453, + "hits": [ + { + "_index": "test-index", + "_id": "1", + "_score": 0.010358453, + "_source": { + "names": [ + "Slow green", + "turtle swims" + ] + } + } + ] + } + } + ``` diff --git a/_analyzers/index.md b/_analyzers/index.md index def6563f3e..1dc38b2cd4 100644 --- a/_analyzers/index.md +++ b/_analyzers/index.md @@ -51,7 +51,7 @@ For a list of supported analyzers, see [Analyzers]({{site.url}}{{site.baseurl}}/ ## Custom analyzers -If needed, you can combine tokenizers, token filters, and character filters to create a custom analyzer. +If needed, you can combine tokenizers, token filters, and character filters to create a custom analyzer. For more information, see [Creating a custom analyzer]({{site.url}}{{site.baseurl}}/analyzers/custom-analyzer/). 
## Text analysis at indexing time and query time diff --git a/_analyzers/language-analyzers/index.md b/_analyzers/language-analyzers/index.md index 89a4a42254..cc53c1cdac 100644 --- a/_analyzers/language-analyzers/index.md +++ b/_analyzers/language-analyzers/index.md @@ -1,7 +1,7 @@ --- layout: default title: Language analyzers -nav_order: 100 +nav_order: 140 parent: Analyzers has_children: true has_toc: true diff --git a/_analyzers/normalizers.md b/_analyzers/normalizers.md index b89659f814..52841d2571 100644 --- a/_analyzers/normalizers.md +++ b/_analyzers/normalizers.md @@ -1,7 +1,7 @@ --- layout: default title: Normalizers -nav_order: 100 +nav_order: 110 --- # Normalizers diff --git a/_analyzers/supported-analyzers/fingerprint.md b/_analyzers/supported-analyzers/fingerprint.md new file mode 100644 index 0000000000..267e16c039 --- /dev/null +++ b/_analyzers/supported-analyzers/fingerprint.md @@ -0,0 +1,115 @@ +--- +layout: default +title: Fingerprint analyzer +parent: Analyzers +nav_order: 60 +--- + +# Fingerprint analyzer + +The `fingerprint` analyzer creates a text fingerprint. The analyzer sorts and deduplicates the terms (tokens) generated from the input and then concatenates them using a separator. It is commonly used for data deduplication because it produces the same output for similar inputs containing the same words, regardless of word order. + +The `fingerprint` analyzer comprises the following components: + +- Standard tokenizer +- Lowercase token filter +- ASCII folding token filter +- Stop token filter +- Fingerprint token filter + +## Parameters + +The `fingerprint` analyzer can be configured with the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`separator` | Optional | String | Specifies the character used to concatenate the terms after they have been tokenized, sorted, and deduplicated. Default is an empty space (` `). +`max_output_size` | Optional | Integer | Defines the maximum size of the output token. If the concatenated fingerprint exceeds this size, it will be truncated. Default is `255`. +`stopwords` | Optional | String or list of strings | A custom or predefined list of stopwords. Default is `_none_`. +`stopwords_path` | Optional | String | The path (absolute or relative to the config directory) to the file containing a list of stopwords. 
+ + +## Example + +Use the following command to create an index named `my_custom_fingerprint_index` with a `fingerprint` analyzer: + +```json +PUT /my_custom_fingerprint_index +{ + "settings": { + "analysis": { + "analyzer": { + "my_custom_fingerprint_analyzer": { + "type": "fingerprint", + "separator": "-", + "max_output_size": 50, + "stopwords": ["to", "the", "over", "and"] + } + } + } + }, + "mappings": { + "properties": { + "my_field": { + "type": "text", + "analyzer": "my_custom_fingerprint_analyzer" + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_custom_fingerprint_index/_analyze +{ + "analyzer": "my_custom_fingerprint_analyzer", + "text": "The slow turtle swims over to the dog" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "dog-slow-swims-turtle", + "start_offset": 0, + "end_offset": 37, + "type": "fingerprint", + "position": 0 + } + ] +} +``` + +## Further customization + +If further customization is needed, you can define an analyzer with additional `fingerprint` analyzer components: + +```json +PUT /custom_fingerprint_analyzer +{ + "settings": { + "analysis": { + "analyzer": { + "custom_fingerprint": { + "tokenizer": "standard", + "filter": [ + "lowercase", + "asciifolding", + "fingerprint" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} diff --git a/_analyzers/supported-analyzers/index.md b/_analyzers/supported-analyzers/index.md index 43e41b8d6a..b54660478f 100644 --- a/_analyzers/supported-analyzers/index.md +++ b/_analyzers/supported-analyzers/index.md @@ -18,14 +18,14 @@ The following table lists the built-in analyzers that OpenSearch provides. The l Analyzer | Analysis performed | Analyzer output :--- | :--- | :--- -**Standard** (default) | - Parses strings into tokens at word boundaries
<br> - Removes most punctuation <br> - Converts tokens to lowercase | [`it’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
-**Simple** | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Converts tokens to lowercase | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `to`, `opensearch`]
-**Whitespace** | - Parses strings into tokens on white space | [`It’s`, `fun`, `to`, `contribute`, `a`,`brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`]
-**Stop** | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Removes stop words <br> - Converts tokens to lowercase | [`s`, `fun`, `contribute`, `brand`, `new`, `pr`, `opensearch`]
-**Keyword** (no-op) | - Outputs the entire string unchanged | [`It’s fun to contribute a brand-new PR or 2 to OpenSearch!`]
-**Pattern** | - Parses strings into tokens using regular expressions <br> - Supports converting strings to lowercase <br> - Supports removing stop words | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
+[**Standard**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/standard/) (default) | - Parses strings into tokens at word boundaries <br> - Removes most punctuation <br> - Converts tokens to lowercase | [`it’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
+[**Simple**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/simple/) | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Converts tokens to lowercase | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `to`, `opensearch`]
+[**Whitespace**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/whitespace/) | - Parses strings into tokens on white space | [`It’s`, `fun`, `to`, `contribute`, `a`,`brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`]
+[**Stop**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/stop/) | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Removes stop words <br> - Converts tokens to lowercase | [`s`, `fun`, `contribute`, `brand`, `new`, `pr`, `opensearch`]
+[**Keyword**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/keyword/) (no-op) | - Outputs the entire string unchanged | [`It’s fun to contribute a brand-new PR or 2 to OpenSearch!`]
+[**Pattern**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/pattern/)| - Parses strings into tokens using regular expressions <br> - Supports converting strings to lowercase <br> - Supports removing stop words | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
 [**Language**]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/index/) | Performs analysis specific to a certain language (for example, `english`). | [`fun`, `contribut`, `brand`, `new`, `pr`, `2`, `opensearch`]
-**Fingerprint** | - Parses strings on any non-letter character <br> - Normalizes characters by converting them to ASCII <br> - Converts tokens to lowercase <br> - Sorts, deduplicates, and concatenates tokens into a single token <br> - Supports removing stop words | [`2 a brand contribute fun it's new opensearch or pr to`] <br> Note that the apostrophe was converted to its ASCII counterpart.
+[**Fingerprint**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/fingerprint/) | - Parses strings on any non-letter character <br> - Normalizes characters by converting them to ASCII <br> - Converts tokens to lowercase <br> - Sorts, deduplicates, and concatenates tokens into a single token <br> - Supports removing stop words | [`2 a brand contribute fun it's new opensearch or pr to`] <br>
Note that the apostrophe was converted to its ASCII counterpart. ## Language analyzers @@ -37,5 +37,5 @@ The following table lists the additional analyzers that OpenSearch supports. | Analyzer | Analysis performed | |:---------------|:---------------------------------------------------------------------------------------------------------| -| `phone` | An [index analyzer]({{site.url}}{{site.baseurl}}/analyzers/index-analyzers/) for parsing phone numbers. | -| `phone-search` | A [search analyzer]({{site.url}}{{site.baseurl}}/analyzers/search-analyzers/) for parsing phone numbers. | +| [`phone`]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/phone-analyzers/#the-phone-analyzer) | An [index analyzer]({{site.url}}{{site.baseurl}}/analyzers/index-analyzers/) for parsing phone numbers. | +| [`phone-search`]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/phone-analyzers/#the-phone-search-analyzer) | A [search analyzer]({{site.url}}{{site.baseurl}}/analyzers/search-analyzers/) for parsing phone numbers. | diff --git a/_analyzers/supported-analyzers/keyword.md b/_analyzers/supported-analyzers/keyword.md new file mode 100644 index 0000000000..00c314d0c4 --- /dev/null +++ b/_analyzers/supported-analyzers/keyword.md @@ -0,0 +1,78 @@ +--- +layout: default +title: Keyword analyzer +parent: Analyzers +nav_order: 80 +--- + +# Keyword analyzer + +The `keyword` analyzer doesn't tokenize text at all. Instead, it treats the entire input as a single token and does not break it into individual tokens. The `keyword` analyzer is often used for fields containing email addresses, URLs, or product IDs and in other cases where tokenization is not desirable. + +## Example + +Use the following command to create an index named `my_keyword_index` with a `keyword` analyzer: + +```json +PUT /my_keyword_index +{ + "mappings": { + "properties": { + "my_field": { + "type": "text", + "analyzer": "keyword" + } + } + } +} +``` +{% include copy-curl.html %} + +## Configuring a custom analyzer + +Use the following command to configure an index with a custom analyzer that is equivalent to the `keyword` analyzer: + +```json +PUT /my_custom_keyword_index +{ + "settings": { + "analysis": { + "analyzer": { + "my_keyword_analyzer": { + "tokenizer": "keyword" + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_custom_keyword_index/_analyze +{ + "analyzer": "my_keyword_analyzer", + "text": "Just one token" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "Just one token", + "start_offset": 0, + "end_offset": 14, + "type": "word", + "position": 0 + } + ] +} +``` diff --git a/_analyzers/supported-analyzers/pattern.md b/_analyzers/supported-analyzers/pattern.md new file mode 100644 index 0000000000..bc3cb9a306 --- /dev/null +++ b/_analyzers/supported-analyzers/pattern.md @@ -0,0 +1,97 @@ +--- +layout: default +title: Pattern analyzer +parent: Analyzers +nav_order: 90 +--- + +# Pattern analyzer + +The `pattern` analyzer allows you to define a custom analyzer that uses a regular expression (regex) to split input text into tokens. It also provides options for applying regex flags, converting tokens to lowercase, and filtering out stopwords. + +## Parameters + +The `pattern` analyzer can be configured with the following parameters. 
+ +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`pattern` | Optional | String | A [Java regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) used to tokenize the input. Default is `\W+`. +`flags` | Optional | String | A string containing pipe-separated [Java regex flags](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary) that modify the behavior of the regular expression. +`lowercase` | Optional | Boolean | Whether to convert tokens to lowercase. Default is `true`. +`stopwords` | Optional | String or list of strings | A string specifying a predefined list of stopwords (such as `_english_`) or an array specifying a custom list of stopwords. Default is `_none_`. +`stopwords_path` | Optional | String | The path (absolute or relative to the config directory) to the file containing a list of stopwords. + + +## Example + +Use the following command to create an index named `my_pattern_index` with a `pattern` analyzer: + +```json +PUT /my_pattern_index +{ + "settings": { + "analysis": { + "analyzer": { + "my_pattern_analyzer": { + "type": "pattern", + "pattern": "\\W+", + "lowercase": true, + "stopwords": ["and", "is"] + } + } + } + }, + "mappings": { + "properties": { + "my_field": { + "type": "text", + "analyzer": "my_pattern_analyzer" + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_pattern_index/_analyze +{ + "analyzer": "my_pattern_analyzer", + "text": "OpenSearch is fast and scalable" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "opensearch", + "start_offset": 0, + "end_offset": 10, + "type": "word", + "position": 0 + }, + { + "token": "fast", + "start_offset": 14, + "end_offset": 18, + "type": "word", + "position": 2 + }, + { + "token": "scalable", + "start_offset": 23, + "end_offset": 31, + "type": "word", + "position": 4 + } + ] +} +``` diff --git a/_analyzers/supported-analyzers/phone-analyzers.md b/_analyzers/supported-analyzers/phone-analyzers.md index f24b7cf328..d94bfe192f 100644 --- a/_analyzers/supported-analyzers/phone-analyzers.md +++ b/_analyzers/supported-analyzers/phone-analyzers.md @@ -1,6 +1,6 @@ --- layout: default -title: Phone number +title: Phone number analyzers parent: Analyzers nav_order: 140 --- diff --git a/_analyzers/supported-analyzers/simple.md b/_analyzers/supported-analyzers/simple.md new file mode 100644 index 0000000000..29f8f9a533 --- /dev/null +++ b/_analyzers/supported-analyzers/simple.md @@ -0,0 +1,99 @@ +--- +layout: default +title: Simple analyzer +parent: Analyzers +nav_order: 100 +--- + +# Simple analyzer + +The `simple` analyzer is a very basic analyzer that breaks text into terms at non-letter characters and lowercases the terms. Unlike the `standard` analyzer, the `simple` analyzer treats everything except for alphabetic characters as delimiters, meaning that it does not recognize numbers, punctuation, or special characters as part of the tokens. 
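
As a quick way to see this, you can run the built-in `simple` analyzer directly through the `_analyze` API (the sample text below is illustrative) and confirm that digits and punctuation are discarded:

```json
POST /_analyze
{
  "analyzer": "simple",
  "text": "Version 2.0 is time-saving!"
}
```
{% include copy-curl.html %}

This request should return only the lowercased alphabetic terms, such as `version`, `is`, `time`, and `saving`, with `2.0` and the punctuation removed.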
+ +## Example + +Use the following command to create an index named `my_simple_index` with a `simple` analyzer: + +```json +PUT /my_simple_index +{ + "mappings": { + "properties": { + "my_field": { + "type": "text", + "analyzer": "simple" + } + } + } +} +``` +{% include copy-curl.html %} + +## Configuring a custom analyzer + +Use the following command to configure an index with a custom analyzer that is equivalent to a `simple` analyzer with an added `html_strip` character filter: + +```json +PUT /my_custom_simple_index +{ + "settings": { + "analysis": { + "char_filter": { + "html_strip": { + "type": "html_strip" + } + }, + "tokenizer": { + "my_lowercase_tokenizer": { + "type": "lowercase" + } + }, + "analyzer": { + "my_custom_simple_analyzer": { + "type": "custom", + "char_filter": ["html_strip"], + "tokenizer": "my_lowercase_tokenizer", + "filter": ["lowercase"] + } + } + } + }, + "mappings": { + "properties": { + "my_field": { + "type": "text", + "analyzer": "my_custom_simple_analyzer" + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_custom_simple_index/_analyze +{ + "analyzer": "my_custom_simple_analyzer", + "text": "
<p>The slow turtle swims over to dogs © 2024!</p>
" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + {"token": "the","start_offset": 3,"end_offset": 6,"type": "word","position": 0}, + {"token": "slow","start_offset": 7,"end_offset": 11,"type": "word","position": 1}, + {"token": "turtle","start_offset": 12,"end_offset": 18,"type": "word","position": 2}, + {"token": "swims","start_offset": 19,"end_offset": 24,"type": "word","position": 3}, + {"token": "over","start_offset": 25,"end_offset": 29,"type": "word","position": 4}, + {"token": "to","start_offset": 30,"end_offset": 32,"type": "word","position": 5}, + {"token": "dogs","start_offset": 33,"end_offset": 37,"type": "word","position": 6} + ] +} +``` diff --git a/_analyzers/supported-analyzers/standard.md b/_analyzers/supported-analyzers/standard.md new file mode 100644 index 0000000000..d5c3650d5d --- /dev/null +++ b/_analyzers/supported-analyzers/standard.md @@ -0,0 +1,97 @@ +--- +layout: default +title: Standard analyzer +parent: Analyzers +nav_order: 50 +--- + +# Standard analyzer + +The `standard` analyzer is the default analyzer used when no other analyzer is specified. It is designed to provide a basic and efficient approach to generic text processing. + +This analyzer consists of the following tokenizers and token filters: + +- `standard` tokenizer: Removes most punctuation and splits text on spaces and other common delimiters. +- `lowercase` token filter: Converts all tokens to lowercase, ensuring case-insensitive matching. +- `stop` token filter: Removes common stopwords, such as "the", "is", and "and", from the tokenized output. + +## Example + +Use the following command to create an index named `my_standard_index` with a `standard` analyzer: + +```json +PUT /my_standard_index +{ + "mappings": { + "properties": { + "my_field": { + "type": "text", + "analyzer": "standard" + } + } + } +} +``` +{% include copy-curl.html %} + +## Parameters + +You can configure a `standard` analyzer with the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`max_token_length` | Optional | Integer | Sets the maximum length of the produced token. If this length is exceeded, the token is split into multiple tokens at the length configured in `max_token_length`. Default is `255`. +`stopwords` | Optional | String or list of strings | A string specifying a predefined list of stopwords (such as `_english_`) or an array specifying a custom list of stopwords. Default is `_none_`. +`stopwords_path` | Optional | String | The path (absolute or relative to the config directory) to the file containing a list of stop words. 
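
For example, the following sketch (index and analyzer names are illustrative) defines a configured `standard` analyzer that splits tokens longer than five characters and removes English stopwords:

```json
PUT /my_configured_standard_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_standard_analyzer"
      }
    }
  }
}
```
{% include copy-curl.html %}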
+ + +## Configuring a custom analyzer + +Use the following command to configure an index with a custom analyzer that is equivalent to the `standard` analyzer: + +```json +PUT /my_custom_index +{ + "settings": { + "analysis": { + "analyzer": { + "my_custom_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "stop" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_custom_index/_analyze +{ + "analyzer": "my_custom_analyzer", + "text": "The slow turtle swims away" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + {"token": "slow","start_offset": 4,"end_offset": 8,"type": "","position": 1}, + {"token": "turtle","start_offset": 9,"end_offset": 15,"type": "","position": 2}, + {"token": "swims","start_offset": 16,"end_offset": 21,"type": "","position": 3}, + {"token": "away","start_offset": 22,"end_offset": 26,"type": "","position": 4} + ] +} +``` diff --git a/_analyzers/supported-analyzers/stop.md b/_analyzers/supported-analyzers/stop.md new file mode 100644 index 0000000000..df62c7fe58 --- /dev/null +++ b/_analyzers/supported-analyzers/stop.md @@ -0,0 +1,177 @@ +--- +layout: default +title: Stop analyzer +parent: Analyzers +nav_order: 110 +--- + +# Stop analyzer + +The `stop` analyzer removes a predefined list of stopwords. This analyzer consists of a `lowercase` tokenizer and a `stop` token filter. + +## Parameters + +You can configure a `stop` analyzer with the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`stopwords` | Optional | String or list of strings | A string specifying a predefined list of stopwords (such as `_english_`) or an array specifying a custom list of stopwords. Default is `_english_`. +`stopwords_path` | Optional | String | The path (absolute or relative to the config directory) to the file containing a list of stopwords. 
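
For instance, the following sketch (index and analyzer names are illustrative) shows the string form of the `stopwords` parameter, which explicitly selects the predefined `_english_` list rather than supplying a custom array:

```json
PUT /my_english_stop_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_stop_analyzer": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_english_stop_analyzer"
      }
    }
  }
}
```
{% include copy-curl.html %}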
+ +## Example + +Use the following command to create an index named `my_stop_index` with a `stop` analyzer: + +```json +PUT /my_stop_index +{ + "mappings": { + "properties": { + "my_field": { + "type": "text", + "analyzer": "stop" + } + } + } +} +``` +{% include copy-curl.html %} + +## Configuring a custom analyzer + +Use the following command to configure an index with a custom analyzer that is equivalent to a `stop` analyzer: + +```json +PUT /my_custom_stop_analyzer_index +{ + "settings": { + "analysis": { + "analyzer": { + "my_custom_stop_analyzer": { + "tokenizer": "lowercase", + "filter": [ + "stop" + ] + } + } + } + }, + "mappings": { + "properties": { + "my_field": { + "type": "text", + "analyzer": "my_custom_stop_analyzer" + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_custom_stop_analyzer_index/_analyze +{ + "analyzer": "my_custom_stop_analyzer", + "text": "The large turtle is green and brown" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "large", + "start_offset": 4, + "end_offset": 9, + "type": "word", + "position": 1 + }, + { + "token": "turtle", + "start_offset": 10, + "end_offset": 16, + "type": "word", + "position": 2 + }, + { + "token": "green", + "start_offset": 20, + "end_offset": 25, + "type": "word", + "position": 4 + }, + { + "token": "brown", + "start_offset": 30, + "end_offset": 35, + "type": "word", + "position": 6 + } + ] +} +``` + +# Specifying stopwords + +The following example request specifies a custom list of stopwords: + +```json +PUT /my_new_custom_stop_index +{ + "settings": { + "analysis": { + "analyzer": { + "my_custom_stop_analyzer": { + "type": "stop", + "stopwords": ["is", "and", "was"] + } + } + } + }, + "mappings": { + "properties": { + "description": { + "type": "text", + "analyzer": "my_custom_stop_analyzer" + } + } + } +} +``` +{% include copy-curl.html %} + +The following example request specifies a path to the file containing stopwords: + +```json +PUT /my_new_custom_stop_index +{ + "settings": { + "analysis": { + "analyzer": { + "my_custom_stop_analyzer": { + "type": "stop", + "stopwords_path": "stopwords.txt" + } + } + } + }, + "mappings": { + "properties": { + "description": { + "type": "text", + "analyzer": "my_custom_stop_analyzer" + } + } + } +} +``` +{% include copy-curl.html %} + +In this example, the file is located in the config directory. You can also specify a full path to the file. \ No newline at end of file diff --git a/_analyzers/supported-analyzers/whitespace.md b/_analyzers/supported-analyzers/whitespace.md new file mode 100644 index 0000000000..4691b4f733 --- /dev/null +++ b/_analyzers/supported-analyzers/whitespace.md @@ -0,0 +1,87 @@ +--- +layout: default +title: Whitespace analyzer +parent: Analyzers +nav_order: 120 +--- + +# Whitespace analyzer + +The `whitespace` analyzer breaks text into tokens based only on white space characters (for example, spaces and tabs). It does not apply any transformations, such as lowercasing or removing stopwords, so the original case of the text is retained and punctuation is included as part of the tokens. 
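
To confirm this behavior without creating an index, you can run the built-in `whitespace` analyzer through the `_analyze` API (the sample text is illustrative):

```json
POST /_analyze
{
  "analyzer": "whitespace",
  "text": "The SLOW turtle swims away! 123"
}
```
{% include copy-curl.html %}

The output should contain the tokens `The`, `SLOW`, `turtle`, `swims`, `away!`, and `123`, split only on spaces, with case and punctuation preserved.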
+ +## Example + +Use the following command to create an index named `my_whitespace_index` with a `whitespace` analyzer: + +```json +PUT /my_whitespace_index +{ + "mappings": { + "properties": { + "my_field": { + "type": "text", + "analyzer": "whitespace" + } + } + } +} +``` +{% include copy-curl.html %} + +## Configuring a custom analyzer + +Use the following command to configure an index with a custom analyzer that is equivalent to a `whitespace` analyzer with an added `lowercase` character filter: + +```json +PUT /my_custom_whitespace_index +{ + "settings": { + "analysis": { + "analyzer": { + "my_custom_whitespace_analyzer": { + "type": "custom", + "tokenizer": "whitespace", + "filter": ["lowercase"] + } + } + } + }, + "mappings": { + "properties": { + "my_field": { + "type": "text", + "analyzer": "my_custom_whitespace_analyzer" + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_custom_whitespace_index/_analyze +{ + "analyzer": "my_custom_whitespace_analyzer", + "text": "The SLOW turtle swims away! 123" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + {"token": "the","start_offset": 0,"end_offset": 3,"type": "word","position": 0}, + {"token": "slow","start_offset": 4,"end_offset": 8,"type": "word","position": 1}, + {"token": "turtle","start_offset": 9,"end_offset": 15,"type": "word","position": 2}, + {"token": "swims","start_offset": 16,"end_offset": 21,"type": "word","position": 3}, + {"token": "away!","start_offset": 22,"end_offset": 27,"type": "word","position": 4}, + {"token": "123","start_offset": 28,"end_offset": 31,"type": "word","position": 5} + ] +} +``` diff --git a/_analyzers/token-filters/flatten-graph.md b/_analyzers/token-filters/flatten-graph.md new file mode 100644 index 0000000000..8d51c57400 --- /dev/null +++ b/_analyzers/token-filters/flatten-graph.md @@ -0,0 +1,109 @@ +--- +layout: default +title: Flatten graph +parent: Token filters +nav_order: 150 +--- + +# Flatten graph token filter + +The `flatten_graph` token filter is used to handle complex token relationships that occur when multiple tokens are generated at the same position in a graph structure. Some token filters, like `synonym_graph` and `word_delimiter_graph`, generate multi-position tokens---tokens that overlap or span multiple positions. These token graphs are useful for search queries but are not directly supported during indexing. The `flatten_graph` token filter resolves multi-position tokens into a linear sequence of tokens. Flattening the graph ensures compatibility with the indexing process. + +Token graph flattening is a lossy process. Whenever possible, avoid using the `flatten_graph` filter. Instead, apply graph token filters exclusively in search analyzers, removing the need for the `flatten_graph` filter. 
+{: .important} + +## Example + +The following example request creates a new index named `test_index` and configures an analyzer with a `flatten_graph` filter: + +```json +PUT /test_index +{ + "settings": { + "analysis": { + "analyzer": { + "my_index_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "my_custom_filter", + "flatten_graph" + ] + } + }, + "filter": { + "my_custom_filter": { + "type": "word_delimiter_graph", + "catenate_all": true + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /test_index/_analyze +{ + "analyzer": "my_index_analyzer", + "text": "OpenSearch helped many employers" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "OpenSearch", + "start_offset": 0, + "end_offset": 10, + "type": "", + "position": 0, + "positionLength": 2 + }, + { + "token": "Open", + "start_offset": 0, + "end_offset": 4, + "type": "", + "position": 0 + }, + { + "token": "Search", + "start_offset": 4, + "end_offset": 10, + "type": "", + "position": 1 + }, + { + "token": "helped", + "start_offset": 11, + "end_offset": 17, + "type": "", + "position": 2 + }, + { + "token": "many", + "start_offset": 18, + "end_offset": 22, + "type": "", + "position": 3 + }, + { + "token": "employers", + "start_offset": 23, + "end_offset": 32, + "type": "", + "position": 4 + } + ] +} +``` diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index 14abeab567..875e94db5a 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -17,7 +17,7 @@ The following table lists all token filters that OpenSearch supports. Token filter | Underlying Lucene token filter| Description [`apostrophe`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/apostrophe/) | [ApostropheFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token containing an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following it. [`asciifolding`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/asciifolding/) | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters. -`cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens. +[`cjk_bigram`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/cjk-bigram/) | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens. [`cjk_width`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/cjk-width/) | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules:
- Folds full-width ASCII character variants into their equivalent basic Latin characters.
- Folds half-width katakana character variants into their equivalent kana characters. [`classic`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/classic) | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms. [`common_grams`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/common_gram/) | [CommonGramsFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams. @@ -29,18 +29,18 @@ Token filter | Underlying Lucene token filter| Description [`edge_ngram`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/edge-ngram/) | [EdgeNGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html) | Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between `min_gram` and `max_gram`. Optionally, keeps the original token. [`elision`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/elision/) | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane). [`fingerprint`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/fingerprint/) | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token. -`flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing. +[`flatten_graph`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/flatten-graph/) | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing. [`hunspell`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/hunspell/) | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell allows a word to have multiple stems, this filter can emit multiple tokens for each consumed token. Requires the configuration of one or more language-specific Hunspell dictionaries. [`hyphenation_decompounder`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/hyphenation-decompounder/) | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. 
The token output contains only the subwords found in the word list. [`keep_types`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-types/) | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type. [`keep_words`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-words/) | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list. [`keyword_marker`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keyword-marker/) | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed. -`keyword_repeat` | [KeywordRepeatFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html) | Emits each incoming token twice: once as a keyword and once as a non-keyword. -`kstem` | [KStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/KStemFilter.html) | Provides kstem-based stemming for the English language. Combines algorithmic stemming with a built-in dictionary. -`kuromoji_completion` | [JapaneseCompletionFilter](https://lucene.apache.org/core/9_10_0/analysis/kuromoji/org/apache/lucene/analysis/ja/JapaneseCompletionFilter.html) | Adds Japanese romanized terms to the token stream (in addition to the original tokens). Usually used to support autocomplete on Japanese search terms. Note that the filter has a `mode` parameter, which should be set to `index` when used in an index analyzer and `query` when used in a search analyzer. Requires the `analysis-kuromoji` plugin. For information about installing the plugin, see [Additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins). -`length` | [LengthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html) | Removes tokens whose lengths are shorter or longer than the length range specified by `min` and `max`. -`limit` | [LimitTokenCountFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilter.html) | Limits the number of output tokens. A common use case is to limit the size of document field values based on token count. -`lowercase` | [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to lowercase. The default [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) is for the English language. You can set the `language` parameter to `greek` (uses [GreekLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/el/GreekLowerCaseFilter.html)), `irish` (uses [IrishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ga/IrishLowerCaseFilter.html)), or `turkish` (uses [TurkishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/TurkishLowerCaseFilter.html)). 
+[`keyword_repeat`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keyword-repeat/) | [KeywordRepeatFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html) | Emits each incoming token twice: once as a keyword and once as a non-keyword. +[`kstem`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/kstem/) | [KStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/KStemFilter.html) | Provides KStem-based stemming for the English language. Combines algorithmic stemming with a built-in dictionary. +[`kuromoji_completion`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/kuromoji-completion/) | [JapaneseCompletionFilter](https://lucene.apache.org/core/9_10_0/analysis/kuromoji/org/apache/lucene/analysis/ja/JapaneseCompletionFilter.html) | Adds Japanese romanized terms to a token stream (in addition to the original tokens). Usually used to support autocomplete of Japanese search terms. Note that the filter has a `mode` parameter that should be set to `index` when used in an index analyzer and `query` when used in a search analyzer. Requires the `analysis-kuromoji` plugin. For information about installing the plugin, see [Additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins). +[`length`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/length/) | [LengthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html) | Removes tokens that are shorter or longer than the length range specified by `min` and `max`. +[`limit`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/limit/) | [LimitTokenCountFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilter.html) | Limits the number of output tokens. For example, document field value sizes can be limited based on the token count. +[`lowercase`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/lowercase/) | [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to lowercase. The default [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) processes the English language. To process other languages, set the `language` parameter to `greek` (uses [GreekLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/el/GreekLowerCaseFilter.html)), `irish` (uses [IrishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ga/IrishLowerCaseFilter.html)), or `turkish` (uses [TurkishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/TurkishLowerCaseFilter.html)). [`min_hash`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/min-hash/) | [MinHashFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/minhash/MinHashFilter.html) | Uses the [MinHash technique](https://en.wikipedia.org/wiki/MinHash) to estimate document similarity. Performs the following operations on a token stream sequentially:
1. Hashes each token in the stream.
2. Assigns the hashes to buckets, keeping only the smallest hashes of each bucket.
3. Outputs the smallest hash from each bucket as a token stream. [`multiplexer`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/multiplexer/) | N/A | Emits multiple tokens at the same position. Runs each token through each of the specified filter lists separately and outputs the results as separate tokens. [`ngram`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/ngram/) | [NGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html) | Tokenizes the given token into n-grams of lengths between `min_gram` and `max_gram`. @@ -51,17 +51,17 @@ Token filter | Underlying Lucene token filter| Description [`porter_stem`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/porter-stem/) | [PorterStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/PorterStemFilter.html) | Uses the [Porter stemming algorithm](https://tartarus.org/martin/PorterStemmer/) to perform algorithmic stemming for the English language. [`predicate_token_filter`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/predicate-token-filter/) | N/A | Removes tokens that do not match the specified predicate script. Supports only inline Painless scripts. [`remove_duplicates`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/remove-duplicates/) | [RemoveDuplicatesTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilter.html) | Removes duplicate tokens that are in the same position. -`reverse` | [ReverseStringFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html) | Reverses the string corresponding to each token in the token stream. For example, the token `dog` becomes `god`. -`shingle` | [ShingleFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/shingle/ShingleFilter.html) | Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but apply to words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`]. -`snowball` | N/A | Stems words using a [Snowball-generated stemmer](https://snowballstem.org/). You can use the `snowball` token filter with the following languages in the `language` field: `Arabic`, `Armenian`, `Basque`, `Catalan`, `Danish`, `Dutch`, `English`, `Estonian`, `Finnish`, `French`, `German`, `German2`, `Hungarian`, `Irish`, `Italian`, `Kp`, `Lithuanian`, `Lovins`, `Norwegian`, `Porter`, `Portuguese`, `Romanian`, `Russian`, `Spanish`, `Swedish`, `Turkish`. 
-`stemmer` | N/A | Provides algorithmic stemming for the following languages in the `language` field: `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `Lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`, `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`. -`stemmer_override` | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed. -`stop` | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream. +[`reverse`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/reverse/) | [ReverseStringFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html) | Reverses the string corresponding to each token in the token stream. For example, the token `dog` becomes `god`. +[`shingle`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/shingle/) | [ShingleFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/shingle/ShingleFilter.html) | Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but are generated using words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`]. +[`snowball`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/snowball/) | N/A | Stems words using a [Snowball-generated stemmer](https://snowballstem.org/). The `snowball` token filter supports using the following languages in the `language` field: `Arabic`, `Armenian`, `Basque`, `Catalan`, `Danish`, `Dutch`, `English`, `Estonian`, `Finnish`, `French`, `German`, `German2`, `Hungarian`, `Irish`, `Italian`, `Kp`, `Lithuanian`, `Lovins`, `Norwegian`, `Porter`, `Portuguese`, `Romanian`, `Russian`, `Spanish`, `Swedish`, `Turkish`. +[`stemmer`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/stemmer/) | N/A | Provides algorithmic stemming for the following languages used in the `language` field: `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `Lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`, `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`. 
+[`stemmer_override`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/stemmer-override/) | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed. +[`stop`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/stop/) | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream. [`synonym`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/synonym/) | N/A | Supplies a synonym list for the analysis process. The synonym list is provided using a configuration file. [`synonym_graph`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/synonym-graph/) | N/A | Supplies a synonym list, including multiword synonyms, for the analysis process. -`trim` | [TrimFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html) | Trims leading and trailing white space from each token in a stream. -`truncate` | [TruncateTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html) | Truncates tokens whose length exceeds the specified character limit. -`unique` | N/A | Ensures each token is unique by removing duplicate tokens from a stream. -`uppercase` | [UpperCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to uppercase. -`word_delimiter` | [WordDelimiterFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html) | Splits tokens at non-alphanumeric characters and performs normalization based on the specified rules. -`word_delimiter_graph` | [WordDelimiterGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html) | Splits tokens at non-alphanumeric characters and performs normalization based on the specified rules. Assigns multi-position tokens a `positionLength` attribute. +[`trim`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/trim/) | [TrimFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html) | Trims leading and trailing white space characters from each token in a stream. +[`truncate`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/truncate/) | [TruncateTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html) | Truncates tokens with lengths exceeding the specified character limit. +[`unique`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/unique/) | N/A | Ensures that each token is unique by removing duplicate tokens from a stream. +[`uppercase`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/uppercase/) | [UpperCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to uppercase. +[`word_delimiter`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/word-delimiter/) | [WordDelimiterFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html) | Splits tokens on non-alphanumeric characters and performs normalization based on the specified rules. 
+[`word_delimiter_graph`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/word-delimiter-graph/) | [WordDelimiterGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html) | Splits tokens on non-alphanumeric characters and performs normalization based on the specified rules. Assigns a `positionLength` attribute to multi-position tokens. diff --git a/_analyzers/token-filters/keyword-repeat.md b/_analyzers/token-filters/keyword-repeat.md new file mode 100644 index 0000000000..5ba15a037c --- /dev/null +++ b/_analyzers/token-filters/keyword-repeat.md @@ -0,0 +1,160 @@ +--- +layout: default +title: Keyword repeat +parent: Token filters +nav_order: 210 +--- + +# Keyword repeat token filter + +The `keyword_repeat` token filter emits the keyword version of a token into a token stream. This filter is typically used when you want to retain both the original token and its modified version after further token transformations, such as stemming or synonym expansion. The duplicated tokens allow the original, unchanged version of the token to remain in the final analysis alongside the modified versions. + +The `keyword_repeat` token filter should be placed before stemming filters. Stemming is not applied to every token, thus you may have duplicate tokens in the same position after stemming. To remove duplicate tokens, use the `remove_duplicates` token filter after the stemmer. +{: .note} + + +## Example + +The following example request creates a new index named `my_index` and configures an analyzer with a `keyword_repeat` filter: + +```json +PUT /my_index +{ + "settings": { + "analysis": { + "filter": { + "my_kstem": { + "type": "kstem" + }, + "my_lowercase": { + "type": "lowercase" + } + }, + "analyzer": { + "my_custom_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "my_lowercase", + "keyword_repeat", + "my_kstem" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_index/_analyze +{ + "analyzer": "my_custom_analyzer", + "text": "Stopped quickly" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "stopped", + "start_offset": 0, + "end_offset": 7, + "type": "", + "position": 0 + }, + { + "token": "stop", + "start_offset": 0, + "end_offset": 7, + "type": "", + "position": 0 + }, + { + "token": "quickly", + "start_offset": 8, + "end_offset": 15, + "type": "", + "position": 1 + }, + { + "token": "quick", + "start_offset": 8, + "end_offset": 15, + "type": "", + "position": 1 + } + ] +} +``` + +You can further examine the impact of the `keyword_repeat` token filter by adding the following parameters to the `_analyze` query: + +```json +POST /my_index/_analyze +{ + "analyzer": "my_custom_analyzer", + "text": "Stopped quickly", + "explain": true, + "attributes": "keyword" +} +``` +{% include copy-curl.html %} + +The response includes detailed information, such as tokenization, filtering, and the application of specific token filters: + +```json +{ + "detail": { + "custom_analyzer": true, + "charfilters": [], + "tokenizer": { + "name": "standard", + "tokens": [ + {"token": "OpenSearch","start_offset": 0,"end_offset": 10,"type": "","position": 0}, + {"token": "helped","start_offset": 11,"end_offset": 17,"type": "","position": 1}, + {"token": "many","start_offset": 18,"end_offset": 22,"type": 
"","position": 2}, + {"token": "employers","start_offset": 23,"end_offset": 32,"type": "","position": 3} + ] + }, + "tokenfilters": [ + { + "name": "lowercase", + "tokens": [ + {"token": "opensearch","start_offset": 0,"end_offset": 10,"type": "","position": 0}, + {"token": "helped","start_offset": 11,"end_offset": 17,"type": "","position": 1}, + {"token": "many","start_offset": 18,"end_offset": 22,"type": "","position": 2}, + {"token": "employers","start_offset": 23,"end_offset": 32,"type": "","position": 3} + ] + }, + { + "name": "keyword_marker_filter", + "tokens": [ + {"token": "opensearch","start_offset": 0,"end_offset": 10,"type": "","position": 0,"keyword": true}, + {"token": "helped","start_offset": 11,"end_offset": 17,"type": "","position": 1,"keyword": false}, + {"token": "many","start_offset": 18,"end_offset": 22,"type": "","position": 2,"keyword": false}, + {"token": "employers","start_offset": 23,"end_offset": 32,"type": "","position": 3,"keyword": false} + ] + }, + { + "name": "kstem_filter", + "tokens": [ + {"token": "opensearch","start_offset": 0,"end_offset": 10,"type": "","position": 0,"keyword": true}, + {"token": "help","start_offset": 11,"end_offset": 17,"type": "","position": 1,"keyword": false}, + {"token": "many","start_offset": 18,"end_offset": 22,"type": "","position": 2,"keyword": false}, + {"token": "employer","start_offset": 23,"end_offset": 32,"type": "","position": 3,"keyword": false} + ] + } + ] + } +} +``` \ No newline at end of file diff --git a/_analyzers/token-filters/kstem.md b/_analyzers/token-filters/kstem.md new file mode 100644 index 0000000000..d13fd2c675 --- /dev/null +++ b/_analyzers/token-filters/kstem.md @@ -0,0 +1,92 @@ +--- +layout: default +title: KStem +parent: Token filters +nav_order: 220 +--- + +# KStem token filter + +The `kstem` token filter is a stemming filter used to reduce words to their root forms. The filter is a lightweight algorithmic stemmer designed for the English language that performs the following stemming operations: + +- Reduces plurals to their singular form. +- Converts different verb tenses to their base form. +- Removes common derivational endings, such as "-ing" or "-ed". + +The `kstem` token filter is equivalent to the a `stemmer` filter configured with a `light_english` language. It provides a more conservative stemming compared to other stemming filters like `porter_stem`. + +The `kstem` token filter is based on the Lucene KStemFilter. For more information, see the [Lucene documentation](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/KStemFilter.html). 
+ +## Example + +The following example request creates a new index named `my_kstem_index` and configures an analyzer with a `kstem` filter: + +```json +PUT /my_kstem_index +{ + "settings": { + "analysis": { + "filter": { + "kstem_filter": { + "type": "kstem" + } + }, + "analyzer": { + "my_kstem_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "kstem_filter" + ] + } + } + } + }, + "mappings": { + "properties": { + "content": { + "type": "text", + "analyzer": "my_kstem_analyzer" + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_kstem_index/_analyze +{ + "analyzer": "my_kstem_analyzer", + "text": "stops stopped" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "stop", + "start_offset": 0, + "end_offset": 5, + "type": "", + "position": 0 + }, + { + "token": "stop", + "start_offset": 6, + "end_offset": 13, + "type": "", + "position": 1 + } + ] +} +``` \ No newline at end of file diff --git a/_analyzers/token-filters/kuromoji-completion.md b/_analyzers/token-filters/kuromoji-completion.md new file mode 100644 index 0000000000..24833e92e1 --- /dev/null +++ b/_analyzers/token-filters/kuromoji-completion.md @@ -0,0 +1,127 @@ +--- +layout: default +title: Kuromoji completion +parent: Token filters +nav_order: 230 +--- + +# Kuromoji completion token filter + +The `kuromoji_completion` token filter is used to stem Katakana words in Japanese, which are often used to represent foreign words or loanwords. This filter is especially useful for autocompletion or suggest queries, in which partial matches on Katakana words can be expanded to include their full forms. + +To use this token filter, you must first install the `analysis-kuromoji` plugin on all nodes by running `bin/opensearch-plugin install analysis-kuromoji` and then restart the cluster. For more information about installing additional plugins, see [Additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/index/). + +## Example + +The following example request creates a new index named `kuromoji_sample` and configures an analyzer with a `kuromoji_completion` filter: + +```json +PUT kuromoji_sample +{ + "settings": { + "index": { + "analysis": { + "analyzer": { + "my_analyzer": { + "tokenizer": "kuromoji_tokenizer", + "filter": [ + "my_katakana_stemmer" + ] + } + }, + "filter": { + "my_katakana_stemmer": { + "type": "kuromoji_completion" + } + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer with text that translates to "use a computer": + +```json +POST /kuromoji_sample/_analyze +{ + "analyzer": "my_analyzer", + "text": "コンピューターを使う" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "コンピューター", // The original Katakana word "computer". + "start_offset": 0, + "end_offset": 7, + "type": "word", + "position": 0 + }, + { + "token": "konpyuーtaー", // Romanized version (Romaji) of "コンピューター". + "start_offset": 0, + "end_offset": 7, + "type": "word", + "position": 0 + }, + { + "token": "konnpyuーtaー", // Another possible romanized version of "コンピューター" (with a slight variation in the spelling). 
+ "start_offset": 0, + "end_offset": 7, + "type": "word", + "position": 0 + }, + { + "token": "を", // A Japanese particle, "wo" or "o" + "start_offset": 7, + "end_offset": 8, + "type": "word", + "position": 1 + }, + { + "token": "wo", // Romanized form of the particle "を" (often pronounced as "o"). + "start_offset": 7, + "end_offset": 8, + "type": "word", + "position": 1 + }, + { + "token": "o", // Another version of the romanization. + "start_offset": 7, + "end_offset": 8, + "type": "word", + "position": 1 + }, + { + "token": "使う", // The verb "use" in Kanji. + "start_offset": 8, + "end_offset": 10, + "type": "word", + "position": 2 + }, + { + "token": "tukau", // Romanized version of "使う" + "start_offset": 8, + "end_offset": 10, + "type": "word", + "position": 2 + }, + { + "token": "tsukau", // Another romanized version of "使う", where "tsu" is more phonetically correct + "start_offset": 8, + "end_offset": 10, + "type": "word", + "position": 2 + } + ] +} +``` \ No newline at end of file diff --git a/_analyzers/token-filters/length.md b/_analyzers/token-filters/length.md new file mode 100644 index 0000000000..f6c5dcc706 --- /dev/null +++ b/_analyzers/token-filters/length.md @@ -0,0 +1,91 @@ +--- +layout: default +title: Length +parent: Token filters +nav_order: 240 +--- + +# Length token filter + +The `length` token filter is used to remove tokens that don't meet specified length criteria (minimum and maximum values) from the token stream. + +## Parameters + +The `length` token filter can be configured with the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`min` | Optional | Integer | The minimum token length. Default is `0`. +`max` | Optional | Integer | The maximum token length. Default is `Integer.MAX_VALUE` (`2147483647`). + + +## Example + +The following example request creates a new index named `my_index` and configures an analyzer with a `length` filter: + +```json +PUT my_index +{ + "settings": { + "analysis": { + "analyzer": { + "only_keep_4_to_10_characters": { + "tokenizer": "whitespace", + "filter": [ "length_4_to_10" ] + } + }, + "filter": { + "length_4_to_10": { + "type": "length", + "min": 4, + "max": 10 + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /my_index/_analyze +{ + "analyzer": "only_keep_4_to_10_characters", + "text": "OpenSearch is a great tool!" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "OpenSearch", + "start_offset": 0, + "end_offset": 10, + "type": "word", + "position": 0 + }, + { + "token": "great", + "start_offset": 16, + "end_offset": 21, + "type": "word", + "position": 3 + }, + { + "token": "tool!", + "start_offset": 22, + "end_offset": 27, + "type": "word", + "position": 4 + } + ] +} +``` diff --git a/_analyzers/token-filters/limit.md b/_analyzers/token-filters/limit.md new file mode 100644 index 0000000000..a849f5f06b --- /dev/null +++ b/_analyzers/token-filters/limit.md @@ -0,0 +1,89 @@ +--- +layout: default +title: Limit +parent: Token filters +nav_order: 250 +--- + +# Limit token filter + +The `limit` token filter is used to limit the number of tokens passed through the analysis chain. + +## Parameters + +The `limit` token filter can be configured with the following parameters. 
+ +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`max_token_count` | Optional | Integer | The maximum number of tokens to be generated. Default is `1`. +`consume_all_tokens` | Optional | Boolean | (Expert-level setting) Uses all tokens from the tokenizer, even if the result exceeds `max_token_count`. When this parameter is set, the output still only contains the number of tokens specified by `max_token_count`. However, all tokens generated by the tokenizer are processed. Default is `false`. + +## Example + +The following example request creates a new index named `my_index` and configures an analyzer with a `limit` filter: + +```json +PUT my_index +{ + "settings": { + "analysis": { + "analyzer": { + "three_token_limit": { + "tokenizer": "standard", + "filter": [ "custom_token_limit" ] + } + }, + "filter": { + "custom_token_limit": { + "type": "limit", + "max_token_count": 3 + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /my_index/_analyze +{ + "analyzer": "three_token_limit", + "text": "OpenSearch is a powerful and flexible search engine." +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "OpenSearch", + "start_offset": 0, + "end_offset": 10, + "type": "", + "position": 0 + }, + { + "token": "is", + "start_offset": 11, + "end_offset": 13, + "type": "", + "position": 1 + }, + { + "token": "a", + "start_offset": 14, + "end_offset": 15, + "type": "", + "position": 2 + } + ] +} +``` diff --git a/_analyzers/token-filters/lowercase.md b/_analyzers/token-filters/lowercase.md new file mode 100644 index 0000000000..89f0f219fa --- /dev/null +++ b/_analyzers/token-filters/lowercase.md @@ -0,0 +1,82 @@ +--- +layout: default +title: Lowercase +parent: Token filters +nav_order: 260 +--- + +# Lowercase token filter + +The `lowercase` token filter is used to convert all characters in the token stream to lowercase, making searches case insensitive. + +## Parameters + +The `lowercase` token filter can be configured with the following parameter. + +Parameter | Required/Optional | Description +:--- | :--- | :--- + `language` | Optional | Specifies a language-specific token filter. Valid values are:
- [`greek`](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/el/GreekLowerCaseFilter.html)
- [`irish`](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/ga/IrishLowerCaseFilter.html)
- [`turkish`](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/tr/TurkishLowerCaseFilter.html).
Default is the [Lucene LowerCaseFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html). + +## Example + +The following example request creates a new index named `custom_lowercase_example`. It configures an analyzer with a `lowercase` filter and specifies `greek` as the `language`: + +```json +PUT /custom_lowercase_example +{ + "settings": { + "analysis": { + "analyzer": { + "greek_lowercase_example": { + "type": "custom", + "tokenizer": "standard", + "filter": ["greek_lowercase"] + } + }, + "filter": { + "greek_lowercase": { + "type": "lowercase", + "language": "greek" + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /custom_lowercase_example/_analyze +{ + "analyzer": "greek_lowercase_example", + "text": "Αθήνα ΕΛΛΑΔΑ" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "αθηνα", + "start_offset": 0, + "end_offset": 5, + "type": "", + "position": 0 + }, + { + "token": "ελλαδα", + "start_offset": 6, + "end_offset": 12, + "type": "", + "position": 1 + } + ] +} +``` diff --git a/_analyzers/token-filters/reverse.md b/_analyzers/token-filters/reverse.md new file mode 100644 index 0000000000..dc48f07e77 --- /dev/null +++ b/_analyzers/token-filters/reverse.md @@ -0,0 +1,86 @@ +--- +layout: default +title: Reverse +parent: Token filters +nav_order: 360 +--- + +# Reverse token filter + +The `reverse` token filter reverses the order of the characters in each token, making suffix information accessible at the beginning of the reversed tokens during analysis. + +This is useful for suffix-based searches: + +The `reverse` token filter is useful when you need to perform suffix-based searches, such as in the following scenarios: + +- **Suffix matching**: Searching for words based on their suffixes, such as identifying words with a specific ending (for example, `-tion` or `-ing`). +- **File extension searches**: Searching for files by their extensions, such as `.txt` or `.jpg`. +- **Custom sorting or ranking**: By reversing tokens, you can implement unique sorting or ranking logic based on suffixes. +- **Autocomplete for suffixes**: Implementing autocomplete suggestions that use suffixes rather than prefixes. 
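+For instance, suffix matching can be approximated by indexing reversed tokens and then issuing a prefix query using the reversed form of the suffix. The following is a minimal sketch of that pattern; the index name `suffix-demo`, the `title` field, and the sample query value are illustrative rather than part of a prescribed configuration:
+
+```json
+PUT /suffix-demo
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "reversed_analyzer": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "reverse"
+          ]
+        }
+      }
+    }
+  },
+  "mappings": {
+    "properties": {
+      "title": {
+        "type": "text",
+        "analyzer": "reversed_analyzer"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Because the indexed tokens are reversed, a prefix query for `gni` (the reverse of `ing`) matches documents containing words that end in `-ing`:
+
+```json
+GET /suffix-demo/_search
+{
+  "query": {
+    "prefix": {
+      "title": "gni"
+    }
+  }
+}
+```
+{% include copy-curl.html %}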
+ + +## Example + +The following example request creates a new index named `my-reverse-index` and configures an analyzer with a `reverse` filter: + +```json +PUT /my-reverse-index +{ + "settings": { + "analysis": { + "filter": { + "reverse_filter": { + "type": "reverse" + } + }, + "analyzer": { + "my_reverse_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "reverse_filter" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /my-reverse-index/_analyze +{ + "analyzer": "my_reverse_analyzer", + "text": "hello world" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "olleh", + "start_offset": 0, + "end_offset": 5, + "type": "", + "position": 0 + }, + { + "token": "dlrow", + "start_offset": 6, + "end_offset": 11, + "type": "", + "position": 1 + } + ] +} +``` \ No newline at end of file diff --git a/_analyzers/token-filters/shingle.md b/_analyzers/token-filters/shingle.md new file mode 100644 index 0000000000..ea961bf3e0 --- /dev/null +++ b/_analyzers/token-filters/shingle.md @@ -0,0 +1,120 @@ +--- +layout: default +title: Shingle +parent: Token filters +nav_order: 370 +--- + +# Shingle token filter + +The `shingle` token filter is used to generate word n-grams, or _shingles_, from input text. For example, for the string `slow green turtle`, the `shingle` filter creates the following one- and two-word shingles: `slow`, `slow green`, `green`, `green turtle`, and `turtle`. + +This token filter is often used in conjunction with other filters to enhance search accuracy by indexing phrases rather than individual tokens. For more information, see [Phrase suggester]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/did-you-mean/#phrase-suggester). + +## Parameters + +The `shingle` token filter can be configured with the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`min_shingle_size` | Optional | Integer | The minimum number of tokens to concatenate. Default is `2`. +`max_shingle_size` | Optional | Integer | The maximum number of tokens to concatenate. Default is `2`. +`output_unigrams` | Optional | Boolean | Whether to include unigrams (individual tokens) as output. Default is `true`. +`output_unigrams_if_no_shingles` | Optional | Boolean | Whether to output unigrams if no shingles are generated. Default is `false`. +`token_separator` | Optional | String | A separator used to concatenate tokens into a shingle. Default is a space (`" "`). +`filler_token` | Optional | String | A token inserted into empty positions or gaps between tokens. Default is an underscore (`_`). + +If `output_unigrams` and `output_unigrams_if_no_shingles` are both set to `true`, `output_unigrams_if_no_shingles` is ignored. 
+{: .note} + +## Example + +The following example request creates a new index named `my-shingle-index` and configures an analyzer with a `shingle` filter: + +```json +PUT /my-shingle-index +{ + "settings": { + "analysis": { + "filter": { + "my_shingle_filter": { + "type": "shingle", + "min_shingle_size": 2, + "max_shingle_size": 2, + "output_unigrams": true + } + }, + "analyzer": { + "my_shingle_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "my_shingle_filter" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /my-shingle-index/_analyze +{ + "analyzer": "my_shingle_analyzer", + "text": "slow green turtle" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "slow", + "start_offset": 0, + "end_offset": 4, + "type": "", + "position": 0 + }, + { + "token": "slow green", + "start_offset": 0, + "end_offset": 10, + "type": "shingle", + "position": 0, + "positionLength": 2 + }, + { + "token": "green", + "start_offset": 5, + "end_offset": 10, + "type": "", + "position": 1 + }, + { + "token": "green turtle", + "start_offset": 5, + "end_offset": 17, + "type": "shingle", + "position": 1, + "positionLength": 2 + }, + { + "token": "turtle", + "start_offset": 11, + "end_offset": 17, + "type": "", + "position": 2 + } + ] +} +``` \ No newline at end of file diff --git a/_analyzers/token-filters/snowball.md b/_analyzers/token-filters/snowball.md new file mode 100644 index 0000000000..149486e727 --- /dev/null +++ b/_analyzers/token-filters/snowball.md @@ -0,0 +1,108 @@ +--- +layout: default +title: Snowball +parent: Token filters +nav_order: 380 +--- + +# Snowball token filter + +The `snowball` token filter is a stemming filter based on the [Snowball](https://snowballstem.org/) algorithm. It supports many languages and is more efficient and accurate than the Porter stemming algorithm. 
+ +## Parameters + +The `snowball` token filter can be configured with a `language` parameter that accepts the following values: + +- `Arabic` +- `Armenian` +- `Basque` +- `Catalan` +- `Danish` +- `Dutch` +- `English` (default) +- `Estonian` +- `Finnish` +- `French` +- `German` +- `German2` +- `Hungarian` +- `Italian` +- `Irish` +- `Kp` +- `Lithuanian` +- `Lovins` +- `Norwegian` +- `Porter` +- `Portuguese` +- `Romanian` +- `Russian` +- `Spanish` +- `Swedish` +- `Turkish` + +## Example + +The following example request creates a new index named `my-snowball-index` and configures an analyzer with a `snowball` filter: + +```json +PUT /my-snowball-index +{ + "settings": { + "analysis": { + "filter": { + "my_snowball_filter": { + "type": "snowball", + "language": "English" + } + }, + "analyzer": { + "my_snowball_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "my_snowball_filter" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /my-snowball-index/_analyze +{ + "analyzer": "my_snowball_analyzer", + "text": "running runners" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "run", + "start_offset": 0, + "end_offset": 7, + "type": "", + "position": 0 + }, + { + "token": "runner", + "start_offset": 8, + "end_offset": 15, + "type": "", + "position": 1 + } + ] +} +``` \ No newline at end of file diff --git a/_analyzers/token-filters/stemmer-override.md b/_analyzers/token-filters/stemmer-override.md new file mode 100644 index 0000000000..c06f673714 --- /dev/null +++ b/_analyzers/token-filters/stemmer-override.md @@ -0,0 +1,139 @@ +--- +layout: default +title: Stemmer override +parent: Token filters +nav_order: 400 +--- + +# Stemmer override token filter + +The `stemmer_override` token filter allows you to define custom stemming rules that override the behavior of default stemmers like Porter or Snowball. This can be useful when you want to apply specific stemming behavior to certain words that might not be modified correctly by the standard stemming algorithms. + +## Parameters + +The `stemmer_override` token filter must be configured with exactly one of the following parameters. + +Parameter | Data type | Description +:--- | :--- | :--- +`rules` | String | Defines the override rules directly in the settings. +`rules_path` | String | Specifies the path to the file containing custom rules (mappings). The path can be either an absolute path or a path relative to the config directory. 
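+A rules file referenced by `rules_path` contains one mapping per line and uses the same `=>` syntax as the inline `rules` parameter. The following is an illustrative sketch of such a file; the path `analysis/stemmer_override_rules.txt` is a placeholder, not a required location:
+
+```
+running, runner => run
+bought => buy
+best => good
+```
+
+You would then reference the file in the filter definition by replacing the inline `rules` with `"rules_path": "analysis/stemmer_override_rules.txt"`.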
+ +## Example + +The following example request creates a new index named `my-index` and configures an analyzer with a `stemmer_override` filter: + +```json +PUT /my-index +{ + "settings": { + "analysis": { + "filter": { + "my_stemmer_override_filter": { + "type": "stemmer_override", + "rules": [ + "running, runner => run", + "bought => buy", + "best => good" + ] + } + }, + "analyzer": { + "my_custom_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "my_stemmer_override_filter" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /my-index/_analyze +{ + "analyzer": "my_custom_analyzer", + "text": "I am a runner and bought the best shoes" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "i", + "start_offset": 0, + "end_offset": 1, + "type": "", + "position": 0 + }, + { + "token": "am", + "start_offset": 2, + "end_offset": 4, + "type": "", + "position": 1 + }, + { + "token": "a", + "start_offset": 5, + "end_offset": 6, + "type": "", + "position": 2 + }, + { + "token": "run", + "start_offset": 7, + "end_offset": 13, + "type": "", + "position": 3 + }, + { + "token": "and", + "start_offset": 14, + "end_offset": 17, + "type": "", + "position": 4 + }, + { + "token": "buy", + "start_offset": 18, + "end_offset": 24, + "type": "", + "position": 5 + }, + { + "token": "the", + "start_offset": 25, + "end_offset": 28, + "type": "", + "position": 6 + }, + { + "token": "good", + "start_offset": 29, + "end_offset": 33, + "type": "", + "position": 7 + }, + { + "token": "shoes", + "start_offset": 34, + "end_offset": 39, + "type": "", + "position": 8 + } + ] +} +``` \ No newline at end of file diff --git a/_analyzers/token-filters/stemmer.md b/_analyzers/token-filters/stemmer.md new file mode 100644 index 0000000000..dd1344fcbc --- /dev/null +++ b/_analyzers/token-filters/stemmer.md @@ -0,0 +1,118 @@ +--- +layout: default +title: Stemmer +parent: Token filters +nav_order: 390 +--- + +# Stemmer token filter + +The `stemmer` token filter reduces words to their root or base form (also known as their _stem_). 
+ +## Parameters + +The `stemmer` token filter can be configured with a `language` parameter that accepts the following values: + +- Arabic: `arabic` +- Armenian: `armenian` +- Basque: `basque` +- Bengali: `bengali` +- Brazilian Portuguese: `brazilian` +- Bulgarian: `bulgarian` +- Catalan: `catalan` +- Czech: `czech` +- Danish: `danish` +- Dutch: `dutch, dutch_kp` +- English: `english` (default), `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english` +- Estonian: `estonian` +- Finnish: `finnish`, `light_finnish` +- French: `light_french`, `french`, `minimal_french` +- Galician: `galician`, `minimal_galician` (plural step only) +- German: `light_german`, `german`, `german2`, `minimal_german` +- Greek: `greek` +- Hindi: `hindi` +- Hungarian: `hungarian, light_hungarian` +- Indonesian: `indonesian` +- Irish: `irish` +- Italian: `light_italian, italian` +- Kurdish (Sorani): `sorani` +- Latvian: `latvian` +- Lithuanian: `lithuanian` +- Norwegian (Bokmål): `norwegian`, `light_norwegian`, `minimal_norwegian` +- Norwegian (Nynorsk): `light_nynorsk`, `minimal_nynorsk` +- Portuguese: `light_portuguese`, `minimal_portuguese`, `portuguese`, `portuguese_rslp` +- Romanian: `romanian` +- Russian: `russian`, `light_russian` +- Spanish: `light_spanish`, `spanish` +- Swedish: `swedish`, `light_swedish` +- Turkish: `turkish` + +You can also use the `name` parameter as an alias for the `language` parameter. If both are set, the `name` parameter is ignored. +{: .note} + +## Example + +The following example request creates a new index named `my-stemmer-index` and configures an analyzer with a `stemmer` filter: + +```json +PUT /my-stemmer-index +{ + "settings": { + "analysis": { + "filter": { + "my_english_stemmer": { + "type": "stemmer", + "language": "english" + } + }, + "analyzer": { + "my_stemmer_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "my_english_stemmer" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /my-stemmer-index/_analyze +{ + "analyzer": "my_stemmer_analyzer", + "text": "running runs" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "run", + "start_offset": 0, + "end_offset": 7, + "type": "", + "position": 0 + }, + { + "token": "run", + "start_offset": 8, + "end_offset": 12, + "type": "", + "position": 1 + } + ] +} +``` \ No newline at end of file diff --git a/_analyzers/token-filters/stop.md b/_analyzers/token-filters/stop.md new file mode 100644 index 0000000000..8f3e01b72d --- /dev/null +++ b/_analyzers/token-filters/stop.md @@ -0,0 +1,111 @@ +--- +layout: default +title: Stop +parent: Token filters +nav_order: 410 +--- + +# Stop token filter + +The `stop` token filter is used to remove common words (also known as _stopwords_) from a token stream during analysis. Stopwords are typically articles and prepositions, such as `a` or `for`. These words are not significantly meaningful in search queries and are often excluded to improve search efficiency and relevance. + +The default list of English stopwords includes the following words: `a`, `an`, `and`, `are`, `as`, `at`, `be`, `but`, `by`, `for`, `if`, `in`, `into`, `is`, `it`, `no`, `not`, `of`, `on`, `or`, `such`, `that`, `the`, `their`, `then`, `there`, `these`, `they`, `this`, `to`, `was`, `will`, and `with`. 
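+Instead of relying on the default list, you can pass a custom array of stopwords (or a predefined language list) through the `stopwords` parameter, which is described in the next section. The following is an illustrative sketch; the index, filter, and analyzer names are placeholders:
+
+```json
+PUT /my-custom-stopword-index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "my_custom_stop_filter": {
+          "type": "stop",
+          "stopwords": [ "a", "an", "the", "of" ]
+        }
+      },
+      "analyzer": {
+        "my_custom_stop_analyzer": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "my_custom_stop_filter"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}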
+ +## Parameters + +The `stop` token filter can be configured with the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`stopwords` | Optional | String | Specifies either a custom array of stopwords or a language for which to fetch the predefined Lucene stopword list:

- [`_arabic_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/ar/stopwords.txt)
- [`_armenian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/hy/stopwords.txt)
- [`_basque_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/eu/stopwords.txt)
- [`_bengali_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/bn/stopwords.txt)
- [`_brazilian_` (Brazilian Portuguese)](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/br/stopwords.txt)
- [`_bulgarian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/bg/stopwords.txt)
- [`_catalan_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/ca/stopwords.txt)
- [`_cjk_` (Chinese, Japanese, and Korean)](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/cjk/stopwords.txt)
- [`_czech_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/cz/stopwords.txt)
- [`_danish_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/danish_stop.txt)
- [`_dutch_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/dutch_stop.txt)
- [`_english_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L48) (Default)
- [`_estonian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/et/stopwords.txt)
- [`_finnish_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/finnish_stop.txt)
- [`_french_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/french_stop.txt)
- [`_galician_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/gl/stopwords.txt)
- [`_german_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/german_stop.txt)
- [`_greek_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/el/stopwords.txt)
- [`_hindi_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/hi/stopwords.txt)
- [`_hungarian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/hungarian_stop.txt)
- [`_indonesian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/id/stopwords.txt)
- [`_irish_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/ga/stopwords.txt)
- [`_italian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/italian_stop.txt)
- [`_latvian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/lv/stopwords.txt)
- [`_lithuanian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/lt/stopwords.txt)
- [`_norwegian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/norwegian_stop.txt)
- [`_persian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/fa/stopwords.txt)
- [`_portuguese_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/portuguese_stop.txt)
- [`_romanian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt)
- [`_russian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/russian_stop.txt)
- [`_sorani_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/ckb/stopwords.txt)

- [`_spanish_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/spanish_stop.txt)

- [`_swedish_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/swedish_stop.txt)
- [`_thai_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/th/stopwords.txt)
- [`_turkish_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/tr/stopwords.txt) +`stopwords_path` | Optional | String | Specifies the file path (absolute or relative to the config directory) of the file containing custom stopwords. +`ignore_case` | Optional | Boolean | If `true`, stopwords will be matched regardless of their case. Default is `false`. +`remove_trailing` | Optional | Boolean | If `true`, trailing stopwords will be removed during analysis. Default is `true`. + +## Example + +The following example request creates a new index named `my-stopword-index` and configures an analyzer with a `stop` filter that uses the predefined stopword list for the English language: + +```json +PUT /my-stopword-index +{ + "settings": { + "analysis": { + "filter": { + "my_stop_filter": { + "type": "stop", + "stopwords": "_english_" + } + }, + "analyzer": { + "my_stop_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "my_stop_filter" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /my-stopword-index/_analyze +{ + "analyzer": "my_stop_analyzer", + "text": "A quick dog jumps over the turtle" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "quick", + "start_offset": 2, + "end_offset": 7, + "type": "", + "position": 1 + }, + { + "token": "dog", + "start_offset": 8, + "end_offset": 11, + "type": "", + "position": 2 + }, + { + "token": "jumps", + "start_offset": 12, + "end_offset": 17, + "type": "", + "position": 3 + }, + { + "token": "over", + "start_offset": 18, + "end_offset": 22, + "type": "", + "position": 4 + }, + { + "token": "turtle", + "start_offset": 27, + "end_offset": 33, + "type": "", + "position": 6 + } + ] +} +``` \ No newline at end of file diff --git a/_analyzers/token-filters/synonym-graph.md b/_analyzers/token-filters/synonym-graph.md index 75c7c79151..d8e763d1fc 100644 --- a/_analyzers/token-filters/synonym-graph.md +++ b/_analyzers/token-filters/synonym-graph.md @@ -19,7 +19,7 @@ Parameter | Required/Optional | Data type | Description `synonyms_path` | Either `synonyms` or `synonyms_path` must be specified | String | The file path to a file containing synonym rules (either an absolute path or a path relative to the config directory). `lenient` | Optional | Boolean | Whether to ignore exceptions when loading the rule configurations. Default is `false`. `format` | Optional | String | Specifies the format used to determine how OpenSearch defines and interprets synonyms. Valid values are:
- `solr`
- [`wordnet`](https://wordnet.princeton.edu/).
Default is `solr`. -`expand` | Optional | Boolean | Whether to expand equivalent synonym rules. Default is `false`.

For example:
If `synonyms` are defined as `"quick, fast"` and `expand` is set to `true`, then the synonym rules are configured as follows:
- `quick => quick`
- `quick => fast`
- `fast => quick`
- `fast => fast`

If `expand` is set to `false`, the synonym rules are configured as follows:
- `quick => quick`
- `fast => quick` +`expand` | Optional | Boolean | Whether to expand equivalent synonym rules. Default is `true`.

For example:
If `synonyms` are defined as `"quick, fast"` and `expand` is set to `true`, then the synonym rules are configured as follows:
- `quick => quick`
- `quick => fast`
- `fast => quick`
- `fast => fast`

If `expand` is set to `false`, the synonym rules are configured as follows:
- `quick => quick`
- `fast => quick` ## Example: Solr format diff --git a/_analyzers/token-filters/synonym.md b/_analyzers/token-filters/synonym.md index a6865b14d7..a1dfff845d 100644 --- a/_analyzers/token-filters/synonym.md +++ b/_analyzers/token-filters/synonym.md @@ -2,7 +2,7 @@ layout: default title: Synonym parent: Token filters -nav_order: 420 +nav_order: 415 --- # Synonym token filter @@ -19,7 +19,7 @@ Parameter | Required/Optional | Data type | Description `synonyms_path` | Either `synonyms` or `synonyms_path` must be specified | String | The file path to a file containing synonym rules (either an absolute path or a path relative to the config directory). `lenient` | Optional | Boolean | Whether to ignore exceptions when loading the rule configurations. Default is `false`. `format` | Optional | String | Specifies the format used to determine how OpenSearch defines and interprets synonyms. Valid values are:
- `solr`
- [`wordnet`](https://wordnet.princeton.edu/).
Default is `solr`. -`expand` | Optional | Boolean | Whether to expand equivalent synonym rules. Default is `false`.

For example:
If `synonyms` are defined as `"quick, fast"` and `expand` is set to `true`, then the synonym rules are configured as follows:
- `quick => quick`
- `quick => fast`
- `fast => quick`
- `fast => fast`

If `expand` is set to `false`, the synonym rules are configured as follows:
- `quick => quick`
- `fast => quick` +`expand` | Optional | Boolean | Whether to expand equivalent synonym rules. Default is `true`.

For example:
If `synonyms` are defined as `"quick, fast"` and `expand` is set to `true`, then the synonym rules are configured as follows:
- `quick => quick`
- `quick => fast`
- `fast => quick`
- `fast => fast`

If `expand` is set to `false`, the synonym rules are configured as follows:
- `quick => quick`
- `fast => quick` ## Example: Solr format diff --git a/_analyzers/token-filters/trim.md b/_analyzers/token-filters/trim.md new file mode 100644 index 0000000000..cdfebed52f --- /dev/null +++ b/_analyzers/token-filters/trim.md @@ -0,0 +1,93 @@ +--- +layout: default +title: Trim +parent: Token filters +nav_order: 430 +--- + +# Trim token filter + +The `trim` token filter removes leading and trailing white space characters from tokens. + +Many popular tokenizers, such as `standard`, `keyword`, and `whitespace` tokenizers, automatically strip leading and trailing white space characters during tokenization. When using these tokenizers, there is no need to configure an additional `trim` token filter. +{: .note} + + +## Example + +The following example request creates a new index named `my_pattern_trim_index` and configures an analyzer with a `trim` filter and a `pattern` tokenizer, which does not remove leading and trailing white space characters: + +```json +PUT /my_pattern_trim_index +{ + "settings": { + "analysis": { + "filter": { + "my_trim_filter": { + "type": "trim" + } + }, + "tokenizer": { + "my_pattern_tokenizer": { + "type": "pattern", + "pattern": "," + } + }, + "analyzer": { + "my_pattern_trim_analyzer": { + "type": "custom", + "tokenizer": "my_pattern_tokenizer", + "filter": [ + "lowercase", + "my_trim_filter" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /my_pattern_trim_index/_analyze +{ + "analyzer": "my_pattern_trim_analyzer", + "text": " OpenSearch , is , powerful " +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "opensearch", + "start_offset": 0, + "end_offset": 12, + "type": "word", + "position": 0 + }, + { + "token": "is", + "start_offset": 13, + "end_offset": 18, + "type": "word", + "position": 1 + }, + { + "token": "powerful", + "start_offset": 19, + "end_offset": 32, + "type": "word", + "position": 2 + } + ] +} +``` diff --git a/_analyzers/token-filters/truncate.md b/_analyzers/token-filters/truncate.md new file mode 100644 index 0000000000..16d1452901 --- /dev/null +++ b/_analyzers/token-filters/truncate.md @@ -0,0 +1,107 @@ +--- +layout: default +title: Truncate +parent: Token filters +nav_order: 440 +--- + +# Truncate token filter + +The `truncate` token filter is used to shorten tokens exceeding a specified length. It trims tokens to a maximum number of characters, ensuring that tokens exceeding this limit are truncated. + +## Parameters + +The `truncate` token filter can be configured with the following parameter. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`length` | Optional | Integer | Specifies the maximum length of the generated token. Default is `10`. 
+ +## Example + +The following example request creates a new index named `truncate_example` and configures an analyzer with a `truncate` filter: + +```json +PUT /truncate_example +{ + "settings": { + "analysis": { + "filter": { + "truncate_filter": { + "type": "truncate", + "length": 5 + } + }, + "analyzer": { + "truncate_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "truncate_filter" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /truncate_example/_analyze +{ + "analyzer": "truncate_analyzer", + "text": "OpenSearch is powerful and scalable" +} + +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "opens", + "start_offset": 0, + "end_offset": 10, + "type": "", + "position": 0 + }, + { + "token": "is", + "start_offset": 11, + "end_offset": 13, + "type": "", + "position": 1 + }, + { + "token": "power", + "start_offset": 14, + "end_offset": 22, + "type": "", + "position": 2 + }, + { + "token": "and", + "start_offset": 23, + "end_offset": 26, + "type": "", + "position": 3 + }, + { + "token": "scala", + "start_offset": 27, + "end_offset": 35, + "type": "", + "position": 4 + } + ] +} +``` diff --git a/_analyzers/token-filters/unique.md b/_analyzers/token-filters/unique.md new file mode 100644 index 0000000000..c4dfcbab16 --- /dev/null +++ b/_analyzers/token-filters/unique.md @@ -0,0 +1,106 @@ +--- +layout: default +title: Unique +parent: Token filters +nav_order: 450 +--- + +# Unique token filter + +The `unique` token filter ensures that only unique tokens are kept during the analysis process, removing duplicate tokens that appear within a single field or text block. + +## Parameters + +The `unique` token filter can be configured with the following parameter. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`only_on_same_position` | Optional | Boolean | If `true`, the token filter acts as a `remove_duplicates` token filter and only removes tokens that are in the same position. Default is `false`. 
+ +## Example + +The following example request creates a new index named `unique_example` and configures an analyzer with a `unique` filter: + +```json +PUT /unique_example +{ + "settings": { + "analysis": { + "filter": { + "unique_filter": { + "type": "unique", + "only_on_same_position": false + } + }, + "analyzer": { + "unique_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "unique_filter" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /unique_example/_analyze +{ + "analyzer": "unique_analyzer", + "text": "OpenSearch OpenSearch is powerful powerful and scalable" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "opensearch", + "start_offset": 0, + "end_offset": 10, + "type": "", + "position": 0 + }, + { + "token": "is", + "start_offset": 22, + "end_offset": 24, + "type": "", + "position": 1 + }, + { + "token": "powerful", + "start_offset": 25, + "end_offset": 33, + "type": "", + "position": 2 + }, + { + "token": "and", + "start_offset": 43, + "end_offset": 46, + "type": "", + "position": 3 + }, + { + "token": "scalable", + "start_offset": 47, + "end_offset": 55, + "type": "", + "position": 4 + } + ] +} +``` diff --git a/_analyzers/token-filters/uppercase.md b/_analyzers/token-filters/uppercase.md new file mode 100644 index 0000000000..5026892400 --- /dev/null +++ b/_analyzers/token-filters/uppercase.md @@ -0,0 +1,83 @@ +--- +layout: default +title: Uppercase +parent: Token filters +nav_order: 460 +--- + +# Uppercase token filter + +The `uppercase` token filter is used to convert all tokens (words) to uppercase during analysis. + +## Example + +The following example request creates a new index named `uppercase_example` and configures an analyzer with an `uppercase` filter: + +```json +PUT /uppercase_example +{ + "settings": { + "analysis": { + "filter": { + "uppercase_filter": { + "type": "uppercase" + } + }, + "analyzer": { + "uppercase_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "uppercase_filter" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /uppercase_example/_analyze +{ + "analyzer": "uppercase_analyzer", + "text": "OpenSearch is powerful" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "OPENSEARCH", + "start_offset": 0, + "end_offset": 10, + "type": "", + "position": 0 + }, + { + "token": "IS", + "start_offset": 11, + "end_offset": 13, + "type": "", + "position": 1 + }, + { + "token": "POWERFUL", + "start_offset": 14, + "end_offset": 22, + "type": "", + "position": 2 + } + ] +} +``` diff --git a/_analyzers/token-filters/word-delimiter-graph.md b/_analyzers/token-filters/word-delimiter-graph.md new file mode 100644 index 0000000000..b901f5a0e5 --- /dev/null +++ b/_analyzers/token-filters/word-delimiter-graph.md @@ -0,0 +1,164 @@ +--- +layout: default +title: Word delimiter graph +parent: Token filters +nav_order: 480 +--- + +# Word delimiter graph token filter + +The `word_delimiter_graph` token filter is used to splits token on predefined characters and also offers optional token normalization based on customizable rules. 
+ +The `word_delimiter_graph` filter is used to remove punctuation from complex identifiers like part numbers or product IDs. In such cases, it is best used with the `keyword` tokenizer. For hyphenated words, use the `synonym_graph` token filter instead of the `word_delimiter_graph` filter because users frequently search for these terms both with and without hyphens. +{: .note} + +By default, the filter applies the following rules. + +| Description | Input | Output | +|:---|:---|:---| +| Treats non-alphanumeric characters as delimiters. | `ultra-fast` | `ultra`, `fast` | +| Removes delimiters at the beginning or end of tokens. | `Z99++'Decoder'`| `Z99`, `Decoder` | +| Splits tokens when there is a transition between uppercase and lowercase letters. | `OpenSearch` | `Open`, `Search` | +| Splits tokens when there is a transition between letters and numbers. | `T1000` | `T`, `1000` | +| Removes the possessive ('s) from the end of tokens. | `John's` | `John` | + +It's important **not** to use tokenizers that strip punctuation, like the `standard` tokenizer, with this filter. Doing so may prevent proper token splitting and interfere with options like `catenate_all` or `preserve_original`. We recommend using this filter with a `keyword` or `whitespace` tokenizer. +{: .important} + +## Parameters + +You can configure the `word_delimiter_graph` token filter using the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`adjust_offsets` | Optional | Boolean | Determines whether the token offsets should be recalculated for split or concatenated tokens. When `true`, the filter adjusts the token offsets to accurately represent the token's position within the token stream. This adjustment ensures that the token's location in the text aligns with its modified form after processing, which is particularly useful for applications like highlighting or phrase queries. When `false`, the offsets remain unchanged, which may result in misalignment when the processed tokens are mapped back to their positions in the original text. If your analyzer uses filters like `trim` that change the token lengths without changing their offsets, we recommend setting this parameter to `false`. Default is `true`. +`catenate_all` | Optional | Boolean | Produces concatenated tokens from a sequence of alphanumeric parts. For example, `"quick-fast-200"` becomes `[ quickfast200, quick, fast, 200 ]`. Default is `false`. +`catenate_numbers` | Optional | Boolean | Concatenates numerical sequences. For example, `"10-20-30"` becomes `[ 102030, 10, 20, 30 ]`. Default is `false`. +`catenate_words` | Optional | Boolean | Concatenates alphabetic words. For example, `"high-speed-level"` becomes `[ highspeedlevel, high, speed, level ]`. Default is `false`. +`generate_number_parts` | Optional | Boolean | If `true`, numeric tokens (tokens consisting of numbers only) are included in the output. Default is `true`. +`generate_word_parts` | Optional | Boolean | If `true`, alphabetical tokens (tokens consisting of alphabetic characters only) are included in the output. Default is `true`. +`ignore_keywords` | Optional | Boolean | Whether to process tokens marked as keywords. Default is `false`. +`preserve_original` | Optional | Boolean | Keeps the original token (which may include non-alphanumeric delimiters) alongside the generated tokens in the output. For example, `"auto-drive-300"` becomes `[ auto-drive-300, auto, drive, 300 ]`. 
If `true`, the filter generates multi-position tokens, which are not supported for indexing; either avoid using this filter in an index analyzer or apply the `flatten_graph` filter after it. Default is `false`. +`protected_words` | Optional | Array of strings | Specifies tokens that should not be split. +`protected_words_path` | Optional | String | Specifies a path (absolute or relative to the config directory) to a file containing a list of tokens, one per line, that should not be split. +`split_on_case_change` | Optional | Boolean | Splits tokens where consecutive letters have different cases (one is lowercase and the other is uppercase). For example, `"OpenSearch"` becomes `[ Open, Search ]`. Default is `true`. +`split_on_numerics` | Optional | Boolean | Splits tokens where letters and numbers appear consecutively. For example, `"v8engine"` becomes `[ v, 8, engine ]`. Default is `true`. +`stem_english_possessive` | Optional | Boolean | Removes English possessive endings, such as `'s`. Default is `true`. +`type_table` | Optional | Array of strings | A custom map that specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For example, to treat a hyphen (`-`) as an alphanumeric character, specify `["- => ALPHA"]` so that words are not split on hyphens. Valid types are:
- `ALPHA`: alphabetical
- `ALPHANUM`: alphanumeric
- `DIGIT`: numeric
- `LOWER`: lowercase alphabetical
- `SUBWORD_DELIM`: non-alphanumeric delimiter
- `UPPER`: uppercase alphabetical +`type_table_path` | Optional | String | Specifies a path (absolute or relative to the config directory) to a file containing a custom character map. The map specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For valid types, see `type_table`. + +## Example + +The following example request creates a new index named `my-custom-index` and configures an analyzer with a `word_delimiter_graph` filter: + +```json +PUT /my-custom-index +{ + "settings": { + "analysis": { + "analyzer": { + "custom_analyzer": { + "tokenizer": "keyword", + "filter": [ "custom_word_delimiter_filter" ] + } + }, + "filter": { + "custom_word_delimiter_filter": { + "type": "word_delimiter_graph", + "split_on_case_change": true, + "split_on_numerics": true, + "stem_english_possessive": true + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /my-custom-index/_analyze +{ + "analyzer": "custom_analyzer", + "text": "FastCar's Model2023" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "Fast", + "start_offset": 0, + "end_offset": 4, + "type": "word", + "position": 0 + }, + { + "token": "Car", + "start_offset": 4, + "end_offset": 7, + "type": "word", + "position": 1 + }, + { + "token": "Model", + "start_offset": 10, + "end_offset": 15, + "type": "word", + "position": 2 + }, + { + "token": "2023", + "start_offset": 15, + "end_offset": 19, + "type": "word", + "position": 3 + } + ] +} +``` + + +## Differences between the word_delimiter_graph and word_delimiter filters + + +Both the `word_delimiter_graph` and `word_delimiter` token filters generate tokens spanning multiple positions when any of the following parameters are set to `true`: + +- `catenate_all` +- `catenate_numbers` +- `catenate_words` +- `preserve_original` + +To illustrate the differences between these filters, consider the input text `Pro-XT500`. + + +### word_delimiter_graph + + +The `word_delimiter_graph` filter assigns a `positionLength` attribute to multi-position tokens, indicating how many positions a token spans. This ensures that the filter always generates valid token graphs, making it suitable for use in advanced token graph scenarios. Although token graphs with multi-position tokens are not supported for indexing, they can still be useful in search scenarios. For example, queries like `match_phrase` can use these graphs to generate multiple subqueries from a single input string. For the example input text, the `word_delimiter_graph` filter generates the following tokens: + +- `Pro` (position 1) +- `XT500` (position 2) +- `ProXT500` (position 1, `positionLength`: 2) + +The `positionLength` attribute the production of a valid graph to be used in advanced queries. + + +### word_delimiter + + +In contrast, the `word_delimiter` filter does not assign a `positionLength` attribute to multi-position tokens, leading to invalid graphs when these tokens are present. For the example input text, the `word_delimiter` filter generates the following tokens: + +- `Pro` (position 1) +- `XT500` (position 2) +- `ProXT500` (position 1, no `positionLength`) + +The lack of a `positionLength` attribute results in a token graph that is invalid for token streams containing multi-position tokens. 
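
To observe this difference directly, you can call the Analyze API with `explain` enabled and inspect the `positionLength` attribute of each token. The following request is a rough sketch (an ad hoc request rather than part of the preceding examples) that analyzes `Pro-XT500` using an inline `word_delimiter_graph` filter with `catenate_words` enabled; swapping the filter `type` to `word_delimiter` should show that the concatenated token lacks a `positionLength` value:

```json
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "catenate_words": true
    }
  ],
  "text": "Pro-XT500",
  "explain": true,
  "attributes": ["positionLength"]
}
```
{% include copy-curl.html %}
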
\ No newline at end of file diff --git a/_analyzers/token-filters/word-delimiter.md b/_analyzers/token-filters/word-delimiter.md new file mode 100644 index 0000000000..77a71f28fb --- /dev/null +++ b/_analyzers/token-filters/word-delimiter.md @@ -0,0 +1,128 @@ +--- +layout: default +title: Word delimiter +parent: Token filters +nav_order: 470 +--- + +# Word delimiter token filter + +The `word_delimiter` token filter is used to splits token on predefined characters and also offers optional token normalization based on customizable rules. + +We recommend using the `word_delimiter_graph` filter instead of the `word_delimiter` filter whenever possible because the `word_delimiter` filter sometimes produces invalid token graphs. For more information about the differences between the two filters, see [Differences between the `word_delimiter_graph` and `word_delimiter` filters]({{site.url}}{{site.baseurl}}/analyzers/token-filters/word-delimiter-graph/#differences-between-the-word_delimiter_graph-and-word_delimiter-filters). +{: .important} + +The `word_delimiter` filter is used to remove punctuation from complex identifiers like part numbers or product IDs. In such cases, it is best used with the `keyword` tokenizer. For hyphenated words, use the `synonym_graph` token filter instead of the `word_delimiter` filter because users frequently search for these terms both with and without hyphens. +{: .note} + +By default, the filter applies the following rules. + +| Description | Input | Output | +|:---|:---|:---| +| Treats non-alphanumeric characters as delimiters. | `ultra-fast` | `ultra`, `fast` | +| Removes delimiters at the beginning or end of tokens. | `Z99++'Decoder'`| `Z99`, `Decoder` | +| Splits tokens when there is a transition between uppercase and lowercase letters. | `OpenSearch` | `Open`, `Search` | +| Splits tokens when there is a transition between letters and numbers. | `T1000` | `T`, `1000` | +| Removes the possessive ('s) from the end of tokens. | `John's` | `John` | + +It's important **not** to use tokenizers that strip punctuation, like the `standard` tokenizer, with this filter. Doing so may prevent proper token splitting and interfere with options like `catenate_all` or `preserve_original`. We recommend using this filter with a `keyword` or `whitespace` tokenizer. +{: .important} + +## Parameters + +You can configure the `word_delimiter` token filter using the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`catenate_all` | Optional | Boolean | Produces concatenated tokens from a sequence of alphanumeric parts. For example, `"quick-fast-200"` becomes `[ quickfast200, quick, fast, 200 ]`. Default is `false`. +`catenate_numbers` | Optional | Boolean | Concatenates numerical sequences. For example, `"10-20-30"` becomes `[ 102030, 10, 20, 30 ]`. Default is `false`. +`catenate_words` | Optional | Boolean | Concatenates alphabetic words. For example, `"high-speed-level"` becomes `[ highspeedlevel, high, speed, level ]`. Default is `false`. +`generate_number_parts` | Optional | Boolean | If `true`, numeric tokens (tokens consisting of numbers only) are included in the output. Default is `true`. +`generate_word_parts` | Optional | Boolean | If `true`, alphabetical tokens (tokens consisting of alphabetic characters only) are included in the output. Default is `true`. +`preserve_original` | Optional | Boolean | Keeps the original token (which may include non-alphanumeric delimiters) alongside the generated tokens in the output. 
For example, `"auto-drive-300"` becomes `[ auto-drive-300, auto, drive, 300 ]`. If `true`, the filter generates multi-position tokens, which are not supported for indexing; either avoid using this filter in an index analyzer or apply the `flatten_graph` filter after it. Default is `false`. +`protected_words` | Optional | Array of strings | Specifies tokens that should not be split. +`protected_words_path` | Optional | String | Specifies a path (absolute or relative to the config directory) to a file containing a list of tokens, one per line, that should not be split. +`split_on_case_change` | Optional | Boolean | Splits tokens where consecutive letters have different cases (one is lowercase and the other is uppercase). For example, `"OpenSearch"` becomes `[ Open, Search ]`. Default is `true`. +`split_on_numerics` | Optional | Boolean | Splits tokens where letters and numbers appear consecutively. For example, `"v8engine"` becomes `[ v, 8, engine ]`. Default is `true`. +`stem_english_possessive` | Optional | Boolean | Removes English possessive endings, such as `'s`. Default is `true`. +`type_table` | Optional | Array of strings | A custom map that specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For example, to treat a hyphen (`-`) as an alphanumeric character, specify `["- => ALPHA"]` so that words are not split on hyphens. Valid types are:
- `ALPHA`: alphabetical
- `ALPHANUM`: alphanumeric
- `DIGIT`: numeric
- `LOWER`: lowercase alphabetical
- `SUBWORD_DELIM`: non-alphanumeric delimiter
- `UPPER`: uppercase alphabetical +`type_table_path` | Optional | String | Specifies a path (absolute or relative to the config directory) to a file containing a custom character map. The map specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For valid types, see `type_table`. + +## Example + +The following example request creates a new index named `my-custom-index` and configures an analyzer with a `word_delimiter` filter: + +```json +PUT /my-custom-index +{ + "settings": { + "analysis": { + "analyzer": { + "custom_analyzer": { + "tokenizer": "keyword", + "filter": [ "custom_word_delimiter_filter" ] + } + }, + "filter": { + "custom_word_delimiter_filter": { + "type": "word_delimiter", + "split_on_case_change": true, + "split_on_numerics": true, + "stem_english_possessive": true + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /my-custom-index/_analyze +{ + "analyzer": "custom_analyzer", + "text": "FastCar's Model2023" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "Fast", + "start_offset": 0, + "end_offset": 4, + "type": "word", + "position": 0 + }, + { + "token": "Car", + "start_offset": 4, + "end_offset": 7, + "type": "word", + "position": 1 + }, + { + "token": "Model", + "start_offset": 10, + "end_offset": 15, + "type": "word", + "position": 2 + }, + { + "token": "2023", + "start_offset": 15, + "end_offset": 19, + "type": "word", + "position": 3 + } + ] +} +``` diff --git a/_analyzers/tokenizers/index.md b/_analyzers/tokenizers/index.md index e5ac796c12..f5b5ff0f25 100644 --- a/_analyzers/tokenizers/index.md +++ b/_analyzers/tokenizers/index.md @@ -2,7 +2,7 @@ layout: default title: Tokenizers nav_order: 60 -has_children: false +has_children: true has_toc: false redirect_from: - /analyzers/tokenizers/index/ @@ -56,7 +56,7 @@ Tokenizer | Description | Example `keyword` | - No-op tokenizer
- Outputs the entire string unchanged
- Can be combined with token filters, like lowercase, to normalize terms | `My repo`
becomes
`My repo` `pattern` | - Uses a regular expression pattern to parse text into terms on a word separator or to capture matching text as terms
- Uses [Java regular expressions](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) | `https://opensearch.org/forum`
becomes
[`https`, `opensearch`, `org`, `forum`] because by default the tokenizer splits terms at word boundaries (`\W+`)
Can be configured with a regex pattern `simple_pattern` | - Uses a regular expression pattern to return matching text as terms
- Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html)
- Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | Returns an empty array by default
Must be configured with a pattern because the pattern defaults to an empty string -`simple_pattern_split` | - Uses a regular expression pattern to split the text at matches rather than returning the matches as terms
- Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html)
- Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | No-op by default
Must be configured with a pattern +`simple_pattern_split` | - Uses a regular expression pattern to split the text on matches rather than returning the matches as terms
- Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html)
- Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | No-op by default
Must be configured with a pattern `char_group` | - Parses on a set of configurable characters
- Faster than tokenizers that run regular expressions | No-op by default
Must be configured with a list of characters `path_hierarchy` | - Parses text on the path separator (by default, `/`) and returns a full path to each component in the tree hierarchy | `one/two/three`
becomes
[`one`, `one/two`, `one/two/three`] diff --git a/_analyzers/tokenizers/lowercase.md b/_analyzers/tokenizers/lowercase.md new file mode 100644 index 0000000000..5542ecbf50 --- /dev/null +++ b/_analyzers/tokenizers/lowercase.md @@ -0,0 +1,93 @@ +--- +layout: default +title: Lowercase +parent: Tokenizers +nav_order: 70 +--- + +# Lowercase tokenizer + +The `lowercase` tokenizer breaks text into terms at white space and then lowercases all the terms. Functionally, this is identical to configuring a `letter` tokenizer with a `lowercase` token filter. However, using a `lowercase` tokenizer is more efficient because the tokenizer actions are performed in a single step. + +## Example usage + +The following example request creates a new index named `my-lowercase-index` and configures an analyzer with a `lowercase` tokenizer: + +```json +PUT /my-lowercase-index +{ + "settings": { + "analysis": { + "tokenizer": { + "my_lowercase_tokenizer": { + "type": "lowercase" + } + }, + "analyzer": { + "my_lowercase_analyzer": { + "type": "custom", + "tokenizer": "my_lowercase_tokenizer" + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my-lowercase-index/_analyze +{ + "analyzer": "my_lowercase_analyzer", + "text": "This is a Test. OpenSearch 123!" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "this", + "start_offset": 0, + "end_offset": 4, + "type": "word", + "position": 0 + }, + { + "token": "is", + "start_offset": 5, + "end_offset": 7, + "type": "word", + "position": 1 + }, + { + "token": "a", + "start_offset": 8, + "end_offset": 9, + "type": "word", + "position": 2 + }, + { + "token": "test", + "start_offset": 10, + "end_offset": 14, + "type": "word", + "position": 3 + }, + { + "token": "opensearch", + "start_offset": 16, + "end_offset": 26, + "type": "word", + "position": 4 + } + ] +} +``` diff --git a/_analyzers/tokenizers/ngram.md b/_analyzers/tokenizers/ngram.md new file mode 100644 index 0000000000..08ac456267 --- /dev/null +++ b/_analyzers/tokenizers/ngram.md @@ -0,0 +1,111 @@ +--- +layout: default +title: N-gram +parent: Tokenizers +nav_order: 80 +--- + +# N-gram tokenizer + +The `ngram` tokenizer splits text into overlapping n-grams (sequences of characters) of a specified length. This tokenizer is particularly useful when you want to perform partial word matching or autocomplete search functionality because it generates substrings (character n-grams) of the original input text. 
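
In an autocomplete-style setup, an n-gram analyzer of this kind is typically applied only at index time, with a simpler analyzer used at query time so that search terms are not themselves split into n-grams. The following request is a minimal sketch (the index, field, and analyzer names are illustrative) of such a mapping:

```json
PUT /autocomplete_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "autocomplete_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "autocomplete_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete_ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
```
{% include copy-curl.html %}
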
+ +## Example usage + +The following example request creates a new index named `my_index` and configures an analyzer with an `ngram` tokenizer: + +```json +PUT /my_index +{ + "settings": { + "analysis": { + "tokenizer": { + "my_ngram_tokenizer": { + "type": "ngram", + "min_gram": 3, + "max_gram": 4, + "token_chars": ["letter", "digit"] + } + }, + "analyzer": { + "my_ngram_analyzer": { + "type": "custom", + "tokenizer": "my_ngram_tokenizer" + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_index/_analyze +{ + "analyzer": "my_ngram_analyzer", + "text": "OpenSearch" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + {"token": "Sea","start_offset": 0,"end_offset": 3,"type": "word","position": 0}, + {"token": "Sear","start_offset": 0,"end_offset": 4,"type": "word","position": 1}, + {"token": "ear","start_offset": 1,"end_offset": 4,"type": "word","position": 2}, + {"token": "earc","start_offset": 1,"end_offset": 5,"type": "word","position": 3}, + {"token": "arc","start_offset": 2,"end_offset": 5,"type": "word","position": 4}, + {"token": "arch","start_offset": 2,"end_offset": 6,"type": "word","position": 5}, + {"token": "rch","start_offset": 3,"end_offset": 6,"type": "word","position": 6} + ] +} +``` + +## Parameters + +The `ngram` tokenizer can be configured with the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`min_gram` | Optional | Integer | The minimum length of the n-grams. Default is `1`. +`max_gram` | Optional | Integer | The maximum length of the n-grams. Default is `2`. +`token_chars` | Optional | List of strings | The character classes to be included in tokenization. Valid values are:
- `letter`
- `digit`
- `whitespace`
- `punctuation`
- `symbol`
- `custom` (You must also specify the `custom_token_chars` parameter)
Default is an empty list (`[]`), which retains all the characters. +`custom_token_chars` | Optional | String | Custom characters to be included in the tokens. + +### Maximum difference between `min_gram` and `max_gram` + +The maximum difference between `min_gram` and `max_gram` is configured using the index-level `index.max_ngram_diff` setting and defaults to `1`. + +The following example request creates an index with a custom `index.max_ngram_diff` setting: + +```json +PUT /my-index +{ + "settings": { + "index.max_ngram_diff": 2, + "analysis": { + "tokenizer": { + "my_ngram_tokenizer": { + "type": "ngram", + "min_gram": 3, + "max_gram": 5, + "token_chars": ["letter", "digit"] + } + }, + "analyzer": { + "my_ngram_analyzer": { + "type": "custom", + "tokenizer": "my_ngram_tokenizer" + } + } + } + } +} +``` +{% include copy-curl.html %} diff --git a/_analyzers/tokenizers/path-hierarchy.md b/_analyzers/tokenizers/path-hierarchy.md new file mode 100644 index 0000000000..a6609f30cd --- /dev/null +++ b/_analyzers/tokenizers/path-hierarchy.md @@ -0,0 +1,182 @@ +--- +layout: default +title: Path hierarchy +parent: Tokenizers +nav_order: 90 +--- + +# Path hierarchy tokenizer + +The `path_hierarchy` tokenizer tokenizes file-system-like paths (or similar hierarchical structures) by breaking them down into tokens at each hierarchy level. This tokenizer is particularly useful when working with hierarchical data such as file paths, URLs, or any other delimited paths. + +## Example usage + +The following example request creates a new index named `my_index` and configures an analyzer with a `path_hierarchy` tokenizer: + +```json +PUT /my_index +{ + "settings": { + "analysis": { + "tokenizer": { + "my_path_tokenizer": { + "type": "path_hierarchy" + } + }, + "analyzer": { + "my_path_analyzer": { + "type": "custom", + "tokenizer": "my_path_tokenizer" + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_index/_analyze +{ + "analyzer": "my_path_analyzer", + "text": "/users/john/documents/report.txt" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "/users", + "start_offset": 0, + "end_offset": 6, + "type": "word", + "position": 0 + }, + { + "token": "/users/john", + "start_offset": 0, + "end_offset": 11, + "type": "word", + "position": 0 + }, + { + "token": "/users/john/documents", + "start_offset": 0, + "end_offset": 21, + "type": "word", + "position": 0 + }, + { + "token": "/users/john/documents/report.txt", + "start_offset": 0, + "end_offset": 32, + "type": "word", + "position": 0 + } + ] +} +``` + +## Parameters + +The `path_hierarchy` tokenizer can be configured with the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`delimiter` | Optional | String | Specifies the character used to separate path components. Default is `/`. +`replacement` | Optional | String | Configures the character used to replace the delimiter in the tokens. Default is `/`. +`buffer_size` | Optional | Integer | Specifies the buffer size. Default is `1024`. +`reverse` | Optional | Boolean | If `true`, generates tokens in reverse order. Default is `false`. +`skip` | Optional | Integer | Specifies the number of initial tokens (levels) to skip when tokenizing. Default is `0`. 
+ +## Example using delimiter and replacement parameters + +The following example request configures custom `delimiter` and `replacement` parameters: + +```json +PUT /my_index +{ + "settings": { + "analysis": { + "tokenizer": { + "my_path_tokenizer": { + "type": "path_hierarchy", + "delimiter": "\\", + "replacement": "\\" + } + }, + "analyzer": { + "my_path_analyzer": { + "type": "custom", + "tokenizer": "my_path_tokenizer" + } + } + } + } +} +``` +{% include copy-curl.html %} + + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_index/_analyze +{ + "analyzer": "my_path_analyzer", + "text": "C:\\users\\john\\documents\\report.txt" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "C:", + "start_offset": 0, + "end_offset": 2, + "type": "word", + "position": 0 + }, + { + "token": """C:\users""", + "start_offset": 0, + "end_offset": 8, + "type": "word", + "position": 0 + }, + { + "token": """C:\users\john""", + "start_offset": 0, + "end_offset": 13, + "type": "word", + "position": 0 + }, + { + "token": """C:\users\john\documents""", + "start_offset": 0, + "end_offset": 23, + "type": "word", + "position": 0 + }, + { + "token": """C:\users\john\documents\report.txt""", + "start_offset": 0, + "end_offset": 34, + "type": "word", + "position": 0 + } + ] +} +``` \ No newline at end of file diff --git a/_analyzers/tokenizers/pattern.md b/_analyzers/tokenizers/pattern.md new file mode 100644 index 0000000000..036dd9050f --- /dev/null +++ b/_analyzers/tokenizers/pattern.md @@ -0,0 +1,167 @@ +--- +layout: default +title: Pattern +parent: Tokenizers +nav_order: 100 +--- + +# Pattern tokenizer + +The `pattern` tokenizer is a highly flexible tokenizer that allows you to split text into tokens based on a custom Java regular expression. Unlike the `simple_pattern` and `simple_pattern_split` tokenizers, which use Lucene regular expressions, the `pattern` tokenizer can handle more complex and detailed regex patterns, offering greater control over how the text is tokenized. + +## Example usage + +The following example request creates a new index named `my_index` and configures an analyzer with a `pattern` tokenizer. 
The tokenizer splits text on `-`, `_`, or `.` characters: + +```json +PUT /my_index +{ + "settings": { + "analysis": { + "tokenizer": { + "my_pattern_tokenizer": { + "type": "pattern", + "pattern": "[-_.]" + } + }, + "analyzer": { + "my_pattern_analyzer": { + "type": "custom", + "tokenizer": "my_pattern_tokenizer" + } + } + } + }, + "mappings": { + "properties": { + "content": { + "type": "text", + "analyzer": "my_pattern_analyzer" + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_index/_analyze +{ + "analyzer": "my_pattern_analyzer", + "text": "OpenSearch-2024_v1.2" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "OpenSearch", + "start_offset": 0, + "end_offset": 10, + "type": "word", + "position": 0 + }, + { + "token": "2024", + "start_offset": 11, + "end_offset": 15, + "type": "word", + "position": 1 + }, + { + "token": "v1", + "start_offset": 16, + "end_offset": 18, + "type": "word", + "position": 2 + }, + { + "token": "2", + "start_offset": 19, + "end_offset": 20, + "type": "word", + "position": 3 + } + ] +} +``` + +## Parameters + +The `pattern` tokenizer can be configured with the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`pattern` | Optional | String | The pattern used to split text into tokens, specified using a [Java regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). Default is `\W+`. +`flags` | Optional | String | Configures pipe-separated [flags](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary) to apply to the regular expression, for example, `"CASE_INSENSITIVE|MULTILINE|DOTALL"`. +`group` | Optional | Integer | Specifies the capture group to be used as a token. Default is `-1` (split on a match). + +## Example using a group parameter + +The following example request configures a `group` parameter that captures only the second group: + +```json +PUT /my_index_group2 +{ + "settings": { + "analysis": { + "tokenizer": { + "my_pattern_tokenizer": { + "type": "pattern", + "pattern": "([a-zA-Z]+)(\\d+)", + "group": 2 + } + }, + "analyzer": { + "my_pattern_analyzer": { + "type": "custom", + "tokenizer": "my_pattern_tokenizer" + } + } + } + } +} +``` +{% include copy-curl.html %} + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_index_group2/_analyze +{ + "analyzer": "my_pattern_analyzer", + "text": "abc123def456ghi" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "123", + "start_offset": 3, + "end_offset": 6, + "type": "word", + "position": 0 + }, + { + "token": "456", + "start_offset": 9, + "end_offset": 12, + "type": "word", + "position": 1 + } + ] +} +``` \ No newline at end of file diff --git a/_analyzers/tokenizers/simple-pattern-split.md b/_analyzers/tokenizers/simple-pattern-split.md new file mode 100644 index 0000000000..25367f25b5 --- /dev/null +++ b/_analyzers/tokenizers/simple-pattern-split.md @@ -0,0 +1,105 @@ +--- +layout: default +title: Simple pattern split +parent: Tokenizers +nav_order: 120 +--- + +# Simple pattern split tokenizer + +The `simple_pattern_split` tokenizer uses a regular expression to split text into tokens. 
The regular expression defines the pattern used to determine where to split the text. Any matching pattern in the text is used as a delimiter, and the text between delimiters becomes a token. Use this tokenizer when you want to define delimiters and tokenize the rest of the text based on a pattern. + +The tokenizer uses the matched parts of the input text (based on the regular expression) only as delimiters or boundaries to split the text into terms. The matched portions are not included in the resulting terms. For example, if the tokenizer is configured to split text at dot characters (`.`) and the input text is `one.two.three`, then the generated terms are `one`, `two`, and `three`. The dot characters themselves are not included in the resulting terms. + +## Example usage + +The following example request creates a new index named `my_index` and configures an analyzer with a `simple_pattern_split` tokenizer. The tokenizer is configured to split text on hyphens: + +```json +PUT /my_index +{ + "settings": { + "analysis": { + "tokenizer": { + "my_pattern_split_tokenizer": { + "type": "simple_pattern_split", + "pattern": "-" + } + }, + "analyzer": { + "my_pattern_split_analyzer": { + "type": "custom", + "tokenizer": "my_pattern_split_tokenizer" + } + } + } + }, + "mappings": { + "properties": { + "content": { + "type": "text", + "analyzer": "my_pattern_split_analyzer" + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_index/_analyze +{ + "analyzer": "my_pattern_split_analyzer", + "text": "OpenSearch-2024-10-09" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "OpenSearch", + "start_offset": 0, + "end_offset": 10, + "type": "word", + "position": 0 + }, + { + "token": "2024", + "start_offset": 11, + "end_offset": 15, + "type": "word", + "position": 1 + }, + { + "token": "10", + "start_offset": 16, + "end_offset": 18, + "type": "word", + "position": 2 + }, + { + "token": "09", + "start_offset": 19, + "end_offset": 21, + "type": "word", + "position": 3 + } + ] +} +``` + +## Parameters + +The `simple_pattern_split` tokenizer can be configured with the following parameter. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`pattern` | Optional | String | The pattern used to split text into tokens, specified using a [Lucene regular expression](https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/util/automaton/RegExp.html). Default is an empty string, which returns the input text as one token. \ No newline at end of file diff --git a/_analyzers/tokenizers/simple-pattern.md b/_analyzers/tokenizers/simple-pattern.md new file mode 100644 index 0000000000..eacddd6992 --- /dev/null +++ b/_analyzers/tokenizers/simple-pattern.md @@ -0,0 +1,89 @@ +--- +layout: default +title: Simple pattern +parent: Tokenizers +nav_order: 110 +--- + +# Simple pattern tokenizer + +The `simple_pattern` tokenizer identifies matching sequences in text based on a regular expression and uses those sequences as tokens. It extracts terms that match the regular expression. Use this tokenizer when you want to directly extract specific patterns as terms. + +## Example usage + +The following example request creates a new index named `my_index` and configures an analyzer with a `simple_pattern` tokenizer. 
The tokenizer extracts numeric terms from text: + +```json +PUT /my_index +{ + "settings": { + "analysis": { + "tokenizer": { + "my_pattern_tokenizer": { + "type": "simple_pattern", + "pattern": "\\d+" + } + }, + "analyzer": { + "my_pattern_analyzer": { + "type": "custom", + "tokenizer": "my_pattern_tokenizer" + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_index/_analyze +{ + "analyzer": "my_pattern_analyzer", + "text": "OpenSearch-2024-10-09" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "2024", + "start_offset": 11, + "end_offset": 15, + "type": "word", + "position": 0 + }, + { + "token": "10", + "start_offset": 16, + "end_offset": 18, + "type": "word", + "position": 1 + }, + { + "token": "09", + "start_offset": 19, + "end_offset": 21, + "type": "word", + "position": 2 + } + ] +} +``` + +## Parameters + +The `simple_pattern` tokenizer can be configured with the following parameter. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`pattern` | Optional | String | The pattern used to split text into tokens, specified using a [Lucene regular expression](https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/util/automaton/RegExp.html). Default is an empty string, which returns the input text as one token. + diff --git a/_analyzers/tokenizers/standard.md b/_analyzers/tokenizers/standard.md new file mode 100644 index 0000000000..c10f25802b --- /dev/null +++ b/_analyzers/tokenizers/standard.md @@ -0,0 +1,111 @@ +--- +layout: default +title: Standard +parent: Tokenizers +nav_order: 130 +--- + +# Standard tokenizer + +The `standard` tokenizer is the default tokenizer in OpenSearch. It tokenizes text based on word boundaries using a grammar-based approach that recognizes letters, digits, and other characters like punctuation. It is highly versatile and suitable for many languages because it uses Unicode text segmentation rules ([UAX#29](https://unicode.org/reports/tr29/)) to break text into tokens. + +## Example usage + +The following example request creates a new index named `my_index` and configures an analyzer with a `standard` tokenizer: + +```json +PUT /my_index +{ + "settings": { + "analysis": { + "analyzer": { + "my_standard_analyzer": { + "type": "standard" + } + } + } + }, + "mappings": { + "properties": { + "content": { + "type": "text", + "analyzer": "my_standard_analyzer" + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_index/_analyze +{ + "analyzer": "my_standard_analyzer", + "text": "OpenSearch is powerful, fast, and scalable." 
+} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "opensearch", + "start_offset": 0, + "end_offset": 10, + "type": "", + "position": 0 + }, + { + "token": "is", + "start_offset": 11, + "end_offset": 13, + "type": "", + "position": 1 + }, + { + "token": "powerful", + "start_offset": 14, + "end_offset": 22, + "type": "", + "position": 2 + }, + { + "token": "fast", + "start_offset": 24, + "end_offset": 28, + "type": "", + "position": 3 + }, + { + "token": "and", + "start_offset": 30, + "end_offset": 33, + "type": "", + "position": 4 + }, + { + "token": "scalable", + "start_offset": 34, + "end_offset": 42, + "type": "", + "position": 5 + } + ] +} +``` + +## Parameters + +The `standard` tokenizer can be configured with the following parameter. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`max_token_length` | Optional | Integer | Sets the maximum length of the produced token. If this length is exceeded, the token is split into multiple tokens at the length configured in `max_token_length`. Default is `255`. + diff --git a/_analyzers/tokenizers/thai.md b/_analyzers/tokenizers/thai.md new file mode 100644 index 0000000000..4afb14a9eb --- /dev/null +++ b/_analyzers/tokenizers/thai.md @@ -0,0 +1,108 @@ +--- +layout: default +title: Thai +parent: Tokenizers +nav_order: 140 +--- + +# Thai tokenizer + +The `thai` tokenizer tokenizes Thai language text. Because words in Thai language are not separated by spaces, the tokenizer must identify word boundaries based on language-specific rules. + +## Example usage + +The following example request creates a new index named `thai_index` and configures an analyzer with a `thai` tokenizer: + +```json +PUT /thai_index +{ + "settings": { + "analysis": { + "tokenizer": { + "thai_tokenizer": { + "type": "thai" + } + }, + "analyzer": { + "thai_analyzer": { + "type": "custom", + "tokenizer": "thai_tokenizer" + } + } + } + }, + "mappings": { + "properties": { + "content": { + "type": "text", + "analyzer": "thai_analyzer" + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /thai_index/_analyze +{ + "analyzer": "thai_analyzer", + "text": "ฉันชอบไปเที่ยวที่เชียงใหม่" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "ฉัน", + "start_offset": 0, + "end_offset": 3, + "type": "word", + "position": 0 + }, + { + "token": "ชอบ", + "start_offset": 3, + "end_offset": 6, + "type": "word", + "position": 1 + }, + { + "token": "ไป", + "start_offset": 6, + "end_offset": 8, + "type": "word", + "position": 2 + }, + { + "token": "เที่ยว", + "start_offset": 8, + "end_offset": 14, + "type": "word", + "position": 3 + }, + { + "token": "ที่", + "start_offset": 14, + "end_offset": 17, + "type": "word", + "position": 4 + }, + { + "token": "เชียงใหม่", + "start_offset": 17, + "end_offset": 26, + "type": "word", + "position": 5 + } + ] +} +``` diff --git a/_analyzers/tokenizers/uax-url-email.md b/_analyzers/tokenizers/uax-url-email.md new file mode 100644 index 0000000000..34336a4f55 --- /dev/null +++ b/_analyzers/tokenizers/uax-url-email.md @@ -0,0 +1,84 @@ +--- +layout: default +title: UAX URL email +parent: Tokenizers +nav_order: 150 +--- + +# UAX URL email tokenizer + +In addition to regular text, the `uax_url_email` tokenizer is designed to handle URLs, email addresses, and domain 
names. It is based on the Unicode Text Segmentation algorithm ([UAX #29](https://www.unicode.org/reports/tr29/)), which allows it to correctly tokenize complex text, including URLs and email addresses. + +## Example usage + +The following example request creates a new index named `my_index` and configures an analyzer with a `uax_url_email` tokenizer: + +```json +PUT /my_index +{ + "settings": { + "analysis": { + "tokenizer": { + "uax_url_email_tokenizer": { + "type": "uax_url_email" + } + }, + "analyzer": { + "my_uax_analyzer": { + "type": "custom", + "tokenizer": "uax_url_email_tokenizer" + } + } + } + }, + "mappings": { + "properties": { + "content": { + "type": "text", + "analyzer": "my_uax_analyzer" + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_index/_analyze +{ + "analyzer": "my_uax_analyzer", + "text": "Contact us at support@example.com or visit https://example.com for details." +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + {"token": "Contact","start_offset": 0,"end_offset": 7,"type": "","position": 0}, + {"token": "us","start_offset": 8,"end_offset": 10,"type": "","position": 1}, + {"token": "at","start_offset": 11,"end_offset": 13,"type": "","position": 2}, + {"token": "support@example.com","start_offset": 14,"end_offset": 33,"type": "","position": 3}, + {"token": "or","start_offset": 34,"end_offset": 36,"type": "","position": 4}, + {"token": "visit","start_offset": 37,"end_offset": 42,"type": "","position": 5}, + {"token": "https://example.com","start_offset": 43,"end_offset": 62,"type": "","position": 6}, + {"token": "for","start_offset": 63,"end_offset": 66,"type": "","position": 7}, + {"token": "details","start_offset": 67,"end_offset": 74,"type": "","position": 8} + ] +} +``` + +## Parameters + +The `uax_url_email` tokenizer can be configured with the following parameter. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`max_token_length` | Optional | Integer | Sets the maximum length of the produced token. If this length is exceeded, the token is split into multiple tokens at the length configured in `max_token_length`. Default is `255`. + diff --git a/_analyzers/tokenizers/whitespace.md b/_analyzers/tokenizers/whitespace.md new file mode 100644 index 0000000000..fb168304a7 --- /dev/null +++ b/_analyzers/tokenizers/whitespace.md @@ -0,0 +1,110 @@ +--- +layout: default +title: Whitespace +parent: Tokenizers +nav_order: 160 +--- + +# Whitespace tokenizer + +The `whitespace` tokenizer splits text on white space characters, such as spaces, tabs, and new lines. It treats each word separated by white space as a token and does not perform any additional analysis or normalization like lowercasing or punctuation removal. 
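
If you do need normalization, you can combine the tokenizer with token filters in a custom analyzer. The following request is a minimal sketch (the index and analyzer names are illustrative) that pairs the `whitespace` tokenizer with a `lowercase` token filter:

```json
PUT /my_whitespace_lowercase_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_whitespace_lowercase_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Note that punctuation, as in the `fast!` token shown in the following example, still remains attached to the tokens because the tokenizer itself does not strip it.
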
+ +## Example usage + +The following example request creates a new index named `my_index` and configures an analyzer with a `whitespace` tokenizer: + +```json +PUT /my_index +{ + "settings": { + "analysis": { + "tokenizer": { + "whitespace_tokenizer": { + "type": "whitespace" + } + }, + "analyzer": { + "my_whitespace_analyzer": { + "type": "custom", + "tokenizer": "whitespace_tokenizer" + } + } + } + }, + "mappings": { + "properties": { + "content": { + "type": "text", + "analyzer": "my_whitespace_analyzer" + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_index/_analyze +{ + "analyzer": "my_whitespace_analyzer", + "text": "OpenSearch is fast! Really fast." +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "OpenSearch", + "start_offset": 0, + "end_offset": 10, + "type": "word", + "position": 0 + }, + { + "token": "is", + "start_offset": 11, + "end_offset": 13, + "type": "word", + "position": 1 + }, + { + "token": "fast!", + "start_offset": 14, + "end_offset": 19, + "type": "word", + "position": 2 + }, + { + "token": "Really", + "start_offset": 20, + "end_offset": 26, + "type": "word", + "position": 3 + }, + { + "token": "fast.", + "start_offset": 27, + "end_offset": 32, + "type": "word", + "position": 4 + } + ] +} +``` + +## Parameters + +The `whitespace` tokenizer can be configured with the following parameter. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`max_token_length` | Optional | Integer | Sets the maximum length of the produced token. If this length is exceeded, the token is split into multiple tokens at the length configured in `max_token_length`. Default is `255`. + diff --git a/_api-reference/common-parameters.md b/_api-reference/common-parameters.md index 5b536ad992..ac3efbf4bf 100644 --- a/_api-reference/common-parameters.md +++ b/_api-reference/common-parameters.md @@ -123,4 +123,17 @@ Kilometers | `km` or `kilometers` Meters | `m` or `meters` Centimeters | `cm` or `centimeters` Millimeters | `mm` or `millimeters` -Nautical miles | `NM`, `nmi`, or `nauticalmiles` \ No newline at end of file +Nautical miles | `NM`, `nmi`, or `nauticalmiles` + +## `X-Opaque-Id` header + +You can specify an opaque identifier for any request using the `X-Opaque-Id` header. This identifier is used to track tasks and deduplicate deprecation warnings in server-side logs. This identifier is used to differentiate between callers sending requests to your OpenSearch cluster. Do not specify a unique value per request. + +#### Example request + +The following request adds an opaque ID to the request: + +```json +curl -H "X-Opaque-Id: my-curl-client-1" -XGET localhost:9200/_tasks +``` +{% include copy.html %} diff --git a/_api-reference/document-apis/update-document.md b/_api-reference/document-apis/update-document.md index ff17940cdb..8500fec101 100644 --- a/_api-reference/document-apis/update-document.md +++ b/_api-reference/document-apis/update-document.md @@ -14,6 +14,14 @@ redirect_from: If you need to update a document's fields in your index, you can use the update document API operation. You can do so by specifying the new data you want to be in your index or by including a script in your request body, which OpenSearch runs to update the document. By default, the update operation only updates a document that exists in the index. 
If a document does not exist, the API returns an error. To _upsert_ a document (update the document that exists or index a new one), use the [upsert](#using-the-upsert-operation) operation.
+You cannot explicitly specify an ingest pipeline when calling the Update Document API. If a `default_pipeline` or `final_pipeline` is defined in your index, the following behavior applies:
+
+- **Upsert operations**: When indexing a new document, the `default_pipeline` and `final_pipeline` defined in the index are executed as specified.
+- **Update operations**: When updating an existing document, ingest pipeline execution is not recommended because it may produce erroneous results. Support for running ingest pipelines during update operations is deprecated and will be removed in version 3.0.0. If your index has a defined ingest pipeline, the update document operation returns the following deprecation warning:
+```
+the index [sample-index1] has a default ingest pipeline or a final ingest pipeline, the support of the ingest pipelines for update operation causes unexpected result and will be removed in 3.0.0
+```
+
+## Path and HTTP methods

```json
diff --git a/_benchmark/reference/commands/aggregate.md index 17612f1164..a891bf3edf 100644 --- a/_benchmark/reference/commands/aggregate.md +++ b/_benchmark/reference/commands/aggregate.md @@ -69,9 +69,30 @@ Aggregate test execution ID: aggregate_results_geonames_9aafcfb8-d3b7-4583-864e ------------------------------- -The results will be aggregated into one test execution and stored under the ID shown in the output: +The results will be aggregated into one test execution and stored under the ID shown in the output.
+### Additional options
 - `--test-execution-id`: Define a unique ID for the aggregated test execution.
 - `--results-file`: Write the aggregated results to the provided file.
 - `--workload-repository`: Define the repository from which OpenSearch Benchmark will load workloads (default is `default`).
+## Aggregated results
+
+Aggregated results include the following information:
+
+- **Relative Standard Deviation (RSD)**: For each metric, an additional `mean_rsd` value shows the spread of results across test executions.
+- **Overall min/max values**: Instead of averaging the minimum and maximum values, the aggregated results include `overall_min` and `overall_max` values, which reflect the true minimum and maximum across all test runs.
+- **Storage**: Aggregated test results are stored in a separate `aggregated_results` folder alongside the `test_executions` folder. 
+ +The following example shows aggregated results: + +```json + "throughput": { + "overall_min": 29056.890292903263, + "mean": 50115.8603858536, + "median": 50099.54349684457, + "overall_max": 72255.15946248993, + "unit": "docs/s", + "mean_rsd": 59.426059705973664 + }, +``` diff --git a/_config.yml b/_config.yml index 0e45176320..3c6f737cc8 100644 --- a/_config.yml +++ b/_config.yml @@ -31,9 +31,6 @@ collections: install-and-configure: permalink: /:collection/:path/ output: true - upgrade-to: - permalink: /:collection/:path/ - output: true im-plugin: permalink: /:collection/:path/ output: true @@ -94,6 +91,9 @@ collections: data-prepper: permalink: /:collection/:path/ output: true + migration-assistant: + permalink: /:collection/:path/ + output: true tools: permalink: /:collection/:path/ output: true @@ -137,11 +137,6 @@ opensearch_collection: install-and-configure: name: Install and upgrade nav_fold: true - upgrade-to: - name: Migrate to OpenSearch - # nav_exclude: true - nav_fold: true - # search_exclude: true im-plugin: name: Managing Indexes nav_fold: true @@ -213,6 +208,12 @@ clients_collection: name: Clients nav_fold: true +migration_assistant_collection: + collections: + migration-assistant: + name: Migration Assistant + nav_fold: true + benchmark_collection: collections: benchmark: @@ -252,6 +253,12 @@ defaults: values: section: "benchmark" section-name: "Benchmark" + - + scope: + path: "_migration-assistant" + values: + section: "migration-assistant" + section-name: "Migration Assistant" # Enable or disable the site search # By default, just-the-docs enables its JSON file-based search. We also have an OpenSearch-driven search functionality. diff --git a/_data-prepper/pipelines/configuration/processors/anomaly-detector.md b/_data-prepper/pipelines/configuration/processors/anomaly-detector.md index 9628bb6caf..ba574bdf7d 100644 --- a/_data-prepper/pipelines/configuration/processors/anomaly-detector.md +++ b/_data-prepper/pipelines/configuration/processors/anomaly-detector.md @@ -53,6 +53,7 @@ You can configure `random_cut_forest` mode with the following options. | `sample_size` | `256` | 100--2500 | The sample size used in the ML algorithm. | | `time_decay` | `0.1` | 0--1.0 | The time decay value used in the ML algorithm. Used as the mathematical expression `timeDecay` divided by `SampleSize` in the ML algorithm. | | `type` | `metrics` | N/A | The type of data sent to the algorithm. | +| `output_after` | 32 | N/A | Specifies the number of events to process before outputting any detected anomalies. | | `version` | `1.0` | N/A | The algorithm version number. | ## Usage diff --git a/_data-prepper/pipelines/configuration/sources/s3.md b/_data-prepper/pipelines/configuration/sources/s3.md index db92718a36..7ca27ee500 100644 --- a/_data-prepper/pipelines/configuration/sources/s3.md +++ b/_data-prepper/pipelines/configuration/sources/s3.md @@ -104,7 +104,7 @@ Option | Required | Type | Description `s3_select` | No | [s3_select](#s3_select) | The Amazon S3 Select configuration. `scan` | No | [scan](#scan) | The S3 scan configuration. `delete_s3_objects_on_read` | No | Boolean | When `true`, the S3 scan attempts to delete S3 objects after all events from the S3 object are successfully acknowledged by all sinks. `acknowledgments` should be enabled when deleting S3 objects. Default is `false`. -`workers` | No | Integer | Configures the number of worker threads that the source uses to read data from S3. Leave this value as the default unless your S3 objects are less than 1 MB in size. 
Performance may decrease for larger S3 objects. This setting affects SQS-based sources and S3-Scan sources. Default is `1`. +`workers` | No | Integer | Configures the number of worker threads (1--10) that the source uses to read data from S3. Leave this value as the default unless your S3 objects are less than 1 MB in size. Performance may decrease for larger S3 objects. This setting affects SQS-based sources and S3-Scan sources. Default is `1`. diff --git a/_data-prepper/pipelines/expression-syntax.md b/_data-prepper/pipelines/expression-syntax.md index 383b54c19b..07f68ee58e 100644 --- a/_data-prepper/pipelines/expression-syntax.md +++ b/_data-prepper/pipelines/expression-syntax.md @@ -30,6 +30,9 @@ The following table lists the supported operators. Operators are listed in order |----------------------|-------------------------------------------------------|---------------| | `()` | Priority expression | Left to right | | `not`
`+`
`-`| Unary logical NOT
Unary positive
Unary negative | Right to left | +| `*`, `/` | Multiplication and division operators | Left to right | +| `+`, `-` | Addition and subtraction operators | Left to right | +| `+` | String concatenation operator | Left to right | | `<`, `<=`, `>`, `>=` | Relational operators | Left to right | | `==`, `!=` | Equality operators | Left to right | | `and`, `or` | Conditional expression | Left to right | @@ -78,7 +81,6 @@ Conditional expressions allow you to combine multiple expressions or values usin or not ``` -{% include copy-curl.html %} The following are some example conditional expressions: @@ -91,9 +93,64 @@ not /status_code in {200, 202} ``` {% include copy-curl.html %} +### Arithmetic expressions + +Arithmetic expressions enable basic mathematical operations like addition, subtraction, multiplication, and division. These expressions can be combined with conditional expressions to create more complex conditional statements. The available arithmetic operators are +, -, *, and /. The syntax for using the arithmetic operators is as follows: + +``` + + + - + * + / +``` + +The following are example arithmetic expressions: + +``` +/value + length(/message) +/bytes / 1024 +/value1 - /value2 +/TimeInSeconds * 1000 +``` +{% include copy-curl.html %} + +The following are some example arithmetic expressions used in conditional expressions : + +``` +/value + length(/message) > 200 +/bytes / 1024 < 10 +/value1 - /value2 != /value3 + /value4 +``` +{% include copy-curl.html %} + +### String concatenation expressions + +String concatenation expressions enable you to combine strings to create new strings. These concatenated strings can also be used within conditional expressions. The syntax for using string concatenation is as follows: + +``` + + +``` + +The following are example string concatenation expressions: + +``` +/name + "suffix" +"prefix" + /name +"time of " + /timeInMs + " ms" +``` +{% include copy-curl.html %} + +The following are example string concatenation expressions that can be used in conditional expressions: + +``` +/service + ".com" == /url +"www." + /service != /url +``` +{% include copy-curl.html %} + ### Reserved symbols -Reserved symbols are symbols that are not currently used in the expression syntax but are reserved for possible future functionality or extensions. Reserved symbols include `^`, `*`, `/`, `%`, `+`, `-`, `xor`, `=`, `+=`, `-=`, `*=`, `/=`, `%=`, `++`, `--`, and `${}`. +Certain symbols, such as ^, %, xor, =, +=, -=, *=, /=, %=, ++, --, and ${}, are reserved for future functionality or extensions. Reserved symbols include `^`, `%`, `xor`, `=`, `+=`, `-=`, `*=`, `/=`, `%=`, `++`, `--`, and `${}`. ## Syntax components @@ -170,6 +227,9 @@ White space is optional around relational operators, regex equality operators, e | `()` | Priority expression | Yes | `/a==(/b==200)`
`/a in ({200})` | `/status in({200})` | | `in`, `not in` | Set operators | Yes | `/a in {200}`
`/a not in {400}` | `/a in{200, 202}`
`/a not in{400}` | | `<`, `<=`, `>`, `>=` | Relational operators | No | `/status < 300`
`/status>=300` | | +| `+` | String concatenation operator | No | `/status_code + /message + "suffix"` | | +| `+`, `-` | Arithmetic addition and subtraction operators | No | `/status_code + length(/message) - 2` | | +| `*`, `/` | Multiplication and division operators | No | `/status_code * length(/message) / 3` | | | `=~`, `!~` | Regex equality operators | No | `/msg =~ "^\w*$"`
`/msg=~"^\w*$"` | | | `==`, `!=` | Equality operators | No | `/status == 200`
`/status_code==200` | | | `and`, `or`, `not` | Conditional operators | Yes | `/a<300 and /b>200` | `/b<300and/b>200` | diff --git a/_data/top_nav.yml b/_data/top_nav.yml index 51d8138680..6552d90359 100644 --- a/_data/top_nav.yml +++ b/_data/top_nav.yml @@ -63,6 +63,8 @@ items: url: /docs/latest/clients/ - label: Benchmark url: /docs/latest/benchmark/ + - label: Migration Assistant + url: /docs/latest/migration-assistant/ - label: Platform url: /platform/index.html children: diff --git a/_field-types/supported-field-types/flat-object.md b/_field-types/supported-field-types/flat-object.md index c9e59710e1..65d7c6dc8e 100644 --- a/_field-types/supported-field-types/flat-object.md +++ b/_field-types/supported-field-types/flat-object.md @@ -56,7 +56,8 @@ The flat object field type supports the following queries: - [Multi-match]({{site.url}}{{site.baseurl}}/query-dsl/full-text/multi-match/) - [Query string]({{site.url}}{{site.baseurl}}/query-dsl/full-text/query-string/) - [Simple query string]({{site.url}}{{site.baseurl}}/query-dsl/full-text/simple-query-string/) -- [Exists]({{site.url}}{{site.baseurl}}/query-dsl/term/exists/) +- [Exists]({{site.url}}{{site.baseurl}}/query-dsl/term/exists/) +- [Wildcard]({{site.url}}{{site.baseurl}}/query-dsl/term/wildcard/) ## Limitations @@ -243,4 +244,4 @@ PUT /test-index/ ``` {% include copy-curl.html %} -Because `issue.number` is not part of the flat object, you can use it to aggregate and sort documents. \ No newline at end of file +Because `issue.number` is not part of the flat object, you can use it to aggregate and sort documents. diff --git a/_im-plugin/index-transforms/transforms-apis.md b/_im-plugin/index-transforms/transforms-apis.md index 37d2c035b5..7e0803c38b 100644 --- a/_im-plugin/index-transforms/transforms-apis.md +++ b/_im-plugin/index-transforms/transforms-apis.md @@ -177,8 +177,8 @@ The update operation supports the following query parameters: Parameter | Description | Required :---| :--- | :--- -`seq_no` | Only perform the transform operation if the last operation that changed the transform job has the specified sequence number. | Yes -`primary_term` | Only perform the transform operation if the last operation that changed the transform job has the specified sequence term. | Yes +`if_seq_no` | Only perform the transform operation if the last operation that changed the transform job has the specified sequence number. | Yes +`if_primary_term` | Only perform the transform operation if the last operation that changed the transform job has the specified sequence term. | Yes ### Request body fields diff --git a/_includes/cards.html b/_includes/cards.html index 6d958e61a5..3fa1809506 100644 --- a/_includes/cards.html +++ b/_includes/cards.html @@ -30,8 +30,14 @@

Measure performance metrics for your OpenSearch cluster

+ +
+ +

Migration Assistant

+

Migrate to OpenSearch

+ +
- diff --git a/_install-and-configure/install-opensearch/index.md b/_install-and-configure/install-opensearch/index.md index 94c259667a..bfaf9897d6 100644 --- a/_install-and-configure/install-opensearch/index.md +++ b/_install-and-configure/install-opensearch/index.md @@ -29,7 +29,7 @@ The OpenSearch distribution for Linux ships with a compatible [Adoptium JDK](htt OpenSearch Version | Compatible Java Versions | Bundled Java Version :---------- | :-------- | :----------- 1.0--1.2.x | 11, 15 | 15.0.1+9 -1.3.x | 8, 11, 14 | 11.0.24+8 +1.3.x | 8, 11, 14 | 11.0.25+9 2.0.0--2.11.x | 11, 17 | 17.0.2+8 2.12.0+ | 11, 17, 21 | 21.0.5+11 diff --git a/_install-and-configure/plugins.md b/_install-and-configure/plugins.md index e96b29e822..055d451081 100644 --- a/_install-and-configure/plugins.md +++ b/_install-and-configure/plugins.md @@ -10,9 +10,9 @@ redirect_from: # Installing plugins -OpenSearch comprises of a number of plugins that add features and capabilities to the core platform. The plugins available to you are dependent on how OpenSearch was installed and which plugins were subsequently added or removed. For example, the minimal distribution of OpenSearch enables only core functionality, such as indexing and search. Using the minimal distribution of OpenSearch is beneficial when you are working in a testing environment, have custom plugins, or are intending to integrate OpenSearch with other services. +OpenSearch includes a number of plugins that add features and capabilities to the core platform. The plugins available to you are dependent on how OpenSearch was installed and which plugins were subsequently added or removed. For example, the minimal distribution of OpenSearch enables only core functionality, such as indexing and search. Using the minimal distribution of OpenSearch is beneficial when you are working in a testing environment, have custom plugins, or are intending to integrate OpenSearch with other services. -The standard distribution of OpenSearch has much more functionality included. You can choose to add additional plugins or remove any of the plugins you don't need. +The standard distribution of OpenSearch includes many more plugins offering much more functionality. You can choose to add additional plugins or remove any of the plugins you don't need. For a list of the available plugins, see [Available plugins](#available-plugins). diff --git a/_layouts/default.html b/_layouts/default.html index d4d40d8cc4..7f2bf0a2a8 100755 --- a/_layouts/default.html +++ b/_layouts/default.html @@ -87,6 +87,8 @@ {% assign section = site.clients_collection.collections %} {% elsif page.section == "benchmark" %} {% assign section = site.benchmark_collection.collections %} + {% elsif page.section == "migration-assistant" %} + {% assign section = site.migration_assistant_collection.collections %} {% endif %} {% if section %} diff --git a/_migration-assistant/deploying-migration-assistant/configuration-options.md b/_migration-assistant/deploying-migration-assistant/configuration-options.md new file mode 100644 index 0000000000..7097d7e90e --- /dev/null +++ b/_migration-assistant/deploying-migration-assistant/configuration-options.md @@ -0,0 +1,175 @@ +--- +layout: default +title: Configuration options +nav_order: 15 +parent: Deploying Migration Assistant +--- + +# Configuration options + +This page outlines the configuration options for three key migrations scenarios: + +1. **Metadata migration** +2. **Backfill migration with `Reindex-from-Snapshot` (RFS)** +3. 
**Live capture migration with Capture and Replay (C&R)** + +Each of these migrations depends on either a snapshot or a capture proxy. The following example `cdk.context.json` configurations are used by AWS Cloud Development Kit (AWS CDK) to deploy and configure Migration Assistant for OpenSearch, shown as separate blocks for each migration type. If you are performing a migration applicable to multiple scenarios, these options can be combined. + + +For a complete list of configuration options, see [opensearch-migrations-options.md](https://github.com/opensearch-project/opensearch-migrations/blob/main/deployment/cdk/opensearch-service-migration/options.md). If you need a configuration option that is not found on this page, create an issue in the [OpenSearch Migrations repository](https://github.com/opensearch-project/opensearch-migrations/issues). +{: .tip } + +Options for the source cluster endpoint, target cluster endpoint, and existing virtual private cloud (VPC) should be configured in order for the migration tools to function effectively. + +## Shared configuration options + +Each migration configuration shares the following options. + + +| Name | Example | Description | +| :--- | :--- | :--- | +| `sourceClusterEndpoint` | `"https://source-cluster.elb.us-east-1.endpoint.com"` | The endpoint for the source cluster. | +| `targetClusterEndpoint` | `"https://vpc-demo-opensearch-cluster-cv6hggdb66ybpk4kxssqt6zdhu.us-west-2.es.amazonaws.com:443"` | The endpoint for the target cluster. Required if using an existing target cluster for the migration instead of creating a new one. | +| `vpcId` | `"vpc-123456789abcdefgh"` | The ID of the existing VPC in which the migration resources will be stored. The VPC must have at least two private subnets that span two Availability Zones. | + + +## Backfill migration using RFS + +The following CDK performs a backfill migrations using RFS: + +```json +{ + "backfill-migration": { + "stage": "dev", + "vpcId": , + "sourceCluster": { + "endpoint": , + "version": "ES 7.10", + "auth": {"type": "none"} + }, + "targetCluster": { + "endpoint": , + "auth": { + "type": "basic", + "username": , + "passwordFromSecretArn": + } + }, + "reindexFromSnapshotServiceEnabled": true, + "reindexFromSnapshotExtraArgs": "", + "artifactBucketRemovalPolicy": "DESTROY" + } +} +``` +{% include copy.html %} + +Performing an RFS backfill migration requires an existing snapshot. + + +The RFS configuration uses the following options. All options are optional. + +| Name | Example | Description | +| :--- | :--- | :--- | +| `reindexFromSnapshotServiceEnabled` | `true` | Enables deployment and configuration of the RFS ECS service. | +| `reindexFromSnapshotExtraArgs` | `"--target-aws-region us-east-1 --target-aws-service-signing-name es"` | Extra arguments for the Document Migration command, with space separation. See [RFS Extra Arguments](https://github.com/opensearch-project/opensearch-migrations/blob/main/DocumentsFromSnapshotMigration/README.md#arguments) for more information. You can pass `--no-insecure` to remove the `--insecure` flag. | + +To view all available arguments for `reindexFromSnapshotExtraArgs`, see [Snapshot migrations README](https://github.com/opensearch-project/opensearch-migrations/blob/main/DocumentsFromSnapshotMigration/README.md#arguments). At a minimum, no extra arguments may be needed. 
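For example, when reusing an existing snapshot rather than having the migration tooling create one, `reindexFromSnapshotExtraArgs` can point RFS at the snapshot's location. The following sketch uses placeholder bucket, repository, Region, and snapshot names; substitute your own values:

```json
"reindexFromSnapshotExtraArgs": "--s3-repo-uri s3://my-snapshot-bucket/my-repo --s3-region us-east-1 --snapshot-name my-snapshot"
```
{% include copy.html %}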
+ +## Live capture migration with C&R + +The following sample CDK performs a live capture migration with C&R: + +```json +{ + "live-capture-migration": { + "stage": "dev", + "vpcId": , + "sourceCluster": { + "endpoint": , + "version": "ES 7.10", + "auth": {"type": "none"} + }, + "targetCluster": { + "endpoint": , + "auth": { + "type": "basic", + "username": , + "passwordFromSecretArn": + } + }, + "captureProxyServiceEnabled": true, + "captureProxyExtraArgs": "", + "trafficReplayerServiceEnabled": true, + "trafficReplayerExtraArgs": "", + "artifactBucketRemovalPolicy": "DESTROY" + } +} +``` +{% include copy.html %} + +Performing a live capture migration requires that a Capture Proxy be configured to capture incoming traffic and send it to the target cluster using the Traffic Replayer service. For arguments available in `captureProxyExtraArgs`, refer to the `@Parameter` fields [here](https://github.com/opensearch-project/opensearch-migrations/blob/main/TrafficCapture/trafficCaptureProxyServer/src/main/java/org/opensearch/migrations/trafficcapture/proxyserver/CaptureProxy.java). For `trafficReplayerExtraArgs`, refer to the `@Parameter` fields [here](https://github.com/opensearch-project/opensearch-migrations/blob/main/TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/TrafficReplayer.java). At a minimum, no extra arguments may be needed. + + +| Name | Example | Description | +| :--- | :--- | :--- | +| `captureProxyServiceEnabled` | `true` | Enables the Capture Proxy service deployment using an AWS CloudFormation stack. | +| `captureProxyExtraArgs` | `"--suppressCaptureForHeaderMatch user-agent .*elastic-java/7.17.0.*"` | Extra arguments for the Capture Proxy command, including options specified by the [Capture Proxy](https://github.com/opensearch-project/opensearch-migrations/blob/main/TrafficCapture/trafficCaptureProxyServer/src/main/java/org/opensearch/migrations/trafficcapture/proxyserver/CaptureProxy.java). | +| `trafficReplayerServiceEnabled` | `true` | Enables the Traffic Replayer service deployment using a CloudFormation stack. | +| `trafficReplayerExtraArgs` | `"--sigv4-auth-header-service-region es,us-east-1 --speedup-factor 5"` | Extra arguments for the Traffic Replayer command, including options for auth headers and other parameters specified by the [Traffic Replayer](https://github.com/opensearch-project/opensearch-migrations/blob/main/TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/TrafficReplayer.java). | + + +For arguments available in `captureProxyExtraArgs`, see the `@Parameter` fields in [`CaptureProxy.java`](https://github.com/opensearch-project/opensearch-migrations/blob/main/TrafficCapture/trafficCaptureProxyServer/src/main/java/org/opensearch/migrations/trafficcapture/proxyserver/CaptureProxy.java). For `trafficReplayerExtraArgs`, see the `@Parameter` fields in [TrafficReplayer.java](https://github.com/opensearch-project/opensearch-migrations/blob/main/TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/TrafficReplayer.java). + + +## Cluster authentication options + +Both the source and target cluster can use no authentication, authentication limited to VPC, basic authentication with a username and password, or AWS Signature Version 4 scoped to a user or role. 
+ +### No authentication + +```json + "sourceCluster": { + "endpoint": , + "version": "ES 7.10", + "auth": {"type": "none"} + } +``` +{% include copy.html %} + +### Basic authentication + +```json + "sourceCluster": { + "endpoint": , + "version": "ES 7.10", + "auth": { + "type": "basic", + "username": , + "passwordFromSecretArn": + } + } +``` +{% include copy.html %} + +### Signature Version 4 authentication + +```json + "sourceCluster": { + "endpoint": , + "version": "ES 7.10", + "auth": { + "type": "sigv4", + "region": "us-east-1", + "serviceSigningName": "es" + } + } +``` +{% include copy.html %} + +The `serviceSigningName` can be `es` for an Elasticsearch or OpenSearch domain, or `aoss` for an OpenSearch Serverless collection. + +All of these authentication options apply to both source and target clusters. + +## Network configuration + +The migration tooling expects the source cluster, target cluster, and migration resources to exist in the same VPC. If this is not the case, manual networking setup outside of this documentation is likely required. diff --git a/_migration-assistant/deploying-migration-assistant/getting-started-data-migration.md b/_migration-assistant/deploying-migration-assistant/getting-started-data-migration.md new file mode 100644 index 0000000000..f260a28701 --- /dev/null +++ b/_migration-assistant/deploying-migration-assistant/getting-started-data-migration.md @@ -0,0 +1,355 @@ +--- +layout: default +title: Getting started with data migration +parent: Deploying Migration Assistant +nav_order: 10 +redirect_from: + - /upgrade-to/upgrade-to/ + - /upgrade-to/snapshot-migrate/ + - /migration-assistant/getting-started-with-data-migration/ +--- + +# Getting started with data migration + +This quickstart outlines how to deploy Migration Assistant for OpenSearch and execute an existing data migration using `Reindex-from-Snapshot` (RFS). It uses AWS for illustrative purposes. However, the steps can be modified for use with other cloud providers. + +## Prerequisites and assumptions + +Before using this quickstart, make sure you fulfill the following prerequisites: + +* Verify that your migration path [is supported]({{site.url}}{{site.baseurl}}/migration-assistant/is-migration-assistant-right-for-you/#migration-paths). Note that we test with the exact versions specified, but you should be able to migrate data on alternative minor versions as long as the major version is supported. +* The source cluster must be deployed Amazon Simple Storage Service (Amazon S3) plugin. +* The target cluster must be deployed. + +The steps in this guide assume the following: + +* In this guide, a snapshot will be taken and stored in Amazon S3; the following assumptions are made about this snapshot: + * The `_source` flag is enabled on all indexes to be migrated. + * The snapshot includes the global cluster state (`include_global_state` is `true`). + * Shard sizes of up to approximately 80 GB are supported. Larger shards cannot be migrated. If this presents challenges for your migration, contact the [migration team](https://opensearch.slack.com/archives/C054JQ6UJFK). +* Migration Assistant will be installed in the same AWS Region and have access to both the source snapshot and target cluster. + +--- + +## Step 1: Install Bootstrap on an Amazon EC2 instance (~10 minutes) + +To begin your migration, use the following steps to install a `bootstrap` box on an Amazon Elastic Compute Cloud (Amazon EC2) instance. The instance uses AWS CloudFormation to create and manage the stack. + +1. 
Log in to the target AWS account in which you want to deploy Migration Assistant. +2. From the browser where you are logged in to your target AWS account, right-click [here](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?templateURL=https://solutions-reference.s3.amazonaws.com/migration-assistant-for-amazon-opensearch-service/latest/migration-assistant-for-amazon-opensearch-service.template&redirectId=SolutionWeb) to load the CloudFormation template from a new browser tab. +3. Follow the CloudFormation stack wizard: + * **Stack Name:** `MigrationBootstrap` + * **Stage Name:** `dev` + * Choose **Next** after each step > **Acknowledge** > **Submit**. +4. Verify that the Bootstrap stack exists and is set to `CREATE_COMPLETE`. This process takes around 10 minutes to complete. + +--- + +## Step 2: Set up Bootstrap instance access (~5 minutes) + +Use the following steps to set up Bootstrap instance access: + +1. After deployment, find the EC2 instance ID for the `bootstrap-dev-instance`. +2. Create an AWS Identity and Access Management (IAM) policy using the following snippet, replacing ``, ``, ``, and `` with your information: + + ```json + { + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": "ssm:StartSession", + "Resource": [ + "arn:aws:ec2:::instance/", + "arn:aws:ssm:::document/BootstrapShellDoc--" + ] + } + ] + } + ``` + {% include copy.html %} + +3. Name the policy, for example, `SSM-OSMigrationBootstrapAccess`, and then create the policy by selecting **Create policy**. + +--- + +## Step 3: Log in to Bootstrap and building Migration Assistant (~15 minutes) + +Next, log in to Bootstrap and build Migration Assistant using the following steps. + +### Prerequisites + +To use these steps, make sure you fulfill the following prerequisites: + +* The AWS Command Line Interface (AWS CLI) and AWS Session Manager plugin are installed on your instance. +* The AWS credentials are configured (`aws configure`) for your instance. + +### Steps + +1. Load AWS credentials into your terminal. +2. Log in to the instance using the following command, replacing `` and `` with your instance ID and Region: + + ```bash + aws ssm start-session --document-name BootstrapShellDoc-- --target --region [--profile ] + ``` + {% include copy.html %} + +3. Once logged in, run the following command from the shell of the Bootstrap instance in the `/opensearch-migrations` directory: + + ```bash + ./initBootstrap.sh && cd deployment/cdk/opensearch-service-migration + ``` + {% include copy.html %} + +4. After a successful build, note the path for infrastructure deployment, which will be used in the next step. + +--- + +## Step 4: Configure and deploy RFS (~20 minutes) + +Use the following steps to configure and deploy RFS: + +1. Add the target cluster password to AWS Secrets Manager as an unstructured string. Be sure to copy the secret Amazon Resource Name (ARN) for use during deployment. +2. 
From the same shell as the Bootstrap instance, modify the `cdk.context.json` file located in the `/opensearch-migrations/deployment/cdk/opensearch-service-migration` directory: + + ```json + { + "migration-assistant": { + "vpcId": "", + "targetCluster": { + "endpoint": "", + "auth": { + "type": "basic", + "username": "", + "passwordFromSecretArn": "" + } + }, + "sourceCluster": { + "endpoint": "", + "auth": { + "type": "basic", + "username": "", + "passwordFromSecretArn": "" + } + }, + "reindexFromSnapshotExtraArgs": "", + "stage": "dev", + "otelCollectorEnabled": true, + "migrationConsoleServiceEnabled": true, + "reindexFromSnapshotServiceEnabled": true, + "migrationAssistanceEnabled": true + } + } + ``` + {% include copy.html %} + + The source and target cluster authorization can be configured to have no authorization, `basic` with a username and password, or `sigv4`. + +3. Bootstrap the account with the following command: + + ```bash + cdk bootstrap --c contextId=migration-assistant --require-approval never + ``` + {% include copy.html %} + +4. Deploy the stacks: + + ```bash + cdk deploy "*" --c contextId=migration-assistant --require-approval never --concurrency 5 + ``` + {% include copy.html %} + +5. Verify that all CloudFormation stacks were installed successfully. + +### RFS parameters + +If you're creating a snapshot using migration tooling, these parameters are automatically configured. If you're using an existing snapshot, modify the `reindexFromSnapshotExtraArgs` setting with the following values: + + ```bash + --s3-repo-uri s3:/// --s3-region --snapshot-name + ``` + +You will also need to give the `migrationconsole` and `reindexFromSnapshot` TaskRoles permissions to the S3 bucket. + +--- + +## Step 5: Deploy Migration Assistant + +To deploy Migration Assistant, use the following steps: + +1. Bootstrap the account: + + ```bash + cdk bootstrap --c contextId=migration-assistant --require-approval never --concurrency 5 + ``` + {% include copy.html %} + +2. Deploy the stacks when `cdk.context.json` is fully configured: + + ```bash + cdk deploy "*" --c contextId=migration-assistant --require-approval never --concurrency 3 + ``` + {% include copy.html %} + +These commands deploy the following stacks: + +* Migration Assistant network stack +* `Reindex-from-snapshot` stack +* Migration console stack + +--- + +## Step 6: Access the migration console + +Run the following command to access the migration console: + +```bash +./accessContainer.sh migration-console dev +``` +{% include copy.html %} + + +`accessContainer.sh` is located in `/opensearch-migrations/deployment/cdk/opensearch-service-migration/` on the Bootstrap instance. To learn more, see [Accessing the migration console]({{site.url}}{{site.baseurl}}/migration-assistant/migration-phases/migrating-metadata/). +{: .note} + +--- + +## Step 7: Verify the connection to the source and target clusters + +To verify the connection to the clusters, run the following command: + +```bash +console clusters connection-check +``` +{% include copy.html %} + +You should receive the following output: + +```bash +* **Source Cluster:** Successfully connected! +* **Target Cluster:** Successfully connected! +``` + +To learn more about migration console commands, see [Migration commands]. + +--- + +## Step 8: Create a snapshot + +Run the following command to initiate snapshot creation from the source cluster: + +```bash +console snapshot create [...] 
+``` +{% include copy.html %} + +To check the snapshot creation status, run the following command: + +```bash +console snapshot status [...] +``` +{% include copy.html %} + +To learn more information about the snapshot, run the following command: + +```bash +console snapshot status --deep-check [...] +``` +{% include copy.html %} + +Wait for snapshot creation to complete before moving to step 9. + +To learn more about snapshot creation, see [Snapshot Creation]. + +--- + +## Step 9: Migrate metadata + +Run the following command to migrate metadata: + +```bash +console metadata migrate [...] +``` +{% include copy.html %} + +For more information, see [Migrating metadata]({{site.url}}{{site.baseurl}}/migration-assistant/migration-phases/migrating-metadata/). + +--- + +## Step 10: Migrate documents with RFS + +You can now use RFS to migrate documents from your original cluster: + +1. To start the migration from RFS, start a `backfill` using the following command: + + ```bash + console backfill start + ``` + {% include copy.html %} + +2. _(Optional)_ To speed up the migration, increase the number of documents processed at a simultaneously by using the following command: + + ```bash + console backfill scale + ``` + {% include copy.html %} + +3. To check the status of the documentation backfill, use the following command: + + ```bash + console backfill status + ``` + {% include copy.html %} + +4. If you need to stop the backfill process, use the following command: + + ```bash + console backfill stop + ``` + {% include copy.html %} + +For more information, see [Backfill]({{site.url}}{{site.baseurl}}/migration-assistant/migration-phases/backfill/). + +--- + +## Step 11: Backfill monitoring + +Use the following command for detailed monitoring of the backfill process: + +```bash +console backfill status --deep-check +``` +{% include copy.html %} + +You should receive the following output: + +```json +BackfillStatus.RUNNING +Running=9 +Pending=1 +Desired=10 +Shards total: 62 +Shards completed: 46 +Shards incomplete: 16 +Shards in progress: 11 +Shards unclaimed: 5 +``` + +Logs and metrics are available in Amazon CloudWatch in the `OpenSearchMigrations` log group. + +--- + +## Step 12: Verify that all documents were migrated + +Use the following query in CloudWatch Logs Insights to identify failed documents: + +```bash +fields @message +| filter @message like "Bulk request succeeded, but some operations failed." +| sort @timestamp desc +| limit 10000 +``` +{% include copy.html %} + +If any failed documents are identified, you can index the failed documents directly as opposed to using RFS. 
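For example, the following minimal sketch reissues a single failed write against the target cluster using the OpenSearch index API. The endpoint, credentials, index name, document ID, and document body shown are placeholders; take the real values from the failed bulk operations reported in the logs:

```bash
curl -X PUT "https://<target-endpoint>/my-index/_doc/1" \
  -u "<username>:<password>" \
  -H "Content-Type: application/json" \
  -d '{"title": "recovered document"}'
```
{% include copy.html %}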
+ diff --git a/_migrations/deploying-migration-assistant/iam-and-security-groups-for-existing-clusters.md b/_migration-assistant/deploying-migration-assistant/iam-and-security-groups-for-existing-clusters.md similarity index 61% rename from _migrations/deploying-migration-assistant/iam-and-security-groups-for-existing-clusters.md rename to _migration-assistant/deploying-migration-assistant/iam-and-security-groups-for-existing-clusters.md index 46f1d7e11e..331b99e1fa 100644 --- a/_migrations/deploying-migration-assistant/iam-and-security-groups-for-existing-clusters.md +++ b/_migration-assistant/deploying-migration-assistant/iam-and-security-groups-for-existing-clusters.md @@ -2,40 +2,42 @@ layout: default title: IAM and security groups for existing clusters nav_order: 20 -parent: Deploying migration assistant +parent: Deploying Migration Assistant --- # IAM and security groups for existing clusters -This page outlines scenarios for using the migration tools with existing clusters, including any necessary configuration changes to ensure proper communication between them. +This page outlines security scenarios for using the migration tools with existing clusters, including any necessary configuration changes to ensure proper communication between them. -## Importing an OpenSearch Service or OpenSearch Serverless Target Cluster +## Importing an Amazon OpenSearch Service or Amazon OpenSearch Serverless target cluster + +Use the following scenarios for Amazon OpenSearch Service or Amazon OpenSearch Serverless target clusters. ### OpenSearch Service For an OpenSearch Domain, two main configurations are typically required to ensure proper functioning of the migration solution: -1. **Security Group Configuration**: - The Domain should have a security group that allows communication from the applicable Migration services (Traffic Replayer, Migration Console, Reindex-from-Snapshot). The CDK will automatically create an `osClusterAccessSG` security group, which is applied to the Migration services. The user should then add this security group to their existing Domain to allow access. +1. **Security Group Configuration** + + The domain should have a security group that allows communication from the applicable migration services (Traffic Replayer, Migration Console, `Reindex-from-Snapshot`). The CDK automatically creates an `osClusterAccessSG` security group, which is applied to the migration services. The user should then add this security group to their existing domain to allow access. -2. **Access Policy Configuration**: - The Domain’s access policy should either: - - Be an open access policy that allows all access, or - - Be configured to allow at least the IAM task roles for the applicable Migration services (Traffic Replayer, Migration Console, Reindex-from-Snapshot) to access the Domain. +2. **Access Policy Configuration** should be one of the following: + - An open access policy that allows all access. + - Configured to allow at least the AWS Identity and Access Management (IAM) task roles for the applicable migration services (Traffic Replayer, Migration Console, `Reindex-from-Snapshot`) to access the domain. ### OpenSearch Serverless -For an OpenSearch Serverless Collection, you will need to configure both Network and Data Access policies: +For an OpenSearch Serverless Collection, you will need to configure both network and data access policies: 1. **Network Policy Configuration**: The Collection should have a network policy that uses the `VPC` access type. 
This requires creating a VPC endpoint on the VPC used for the solution. The VPC endpoint should be configured for the private subnets of the VPC and should attach the `osClusterAccessSG` security group. 2. **Data Access Policy Configuration**: - The data access policy should grant permission to perform all [index operations](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html#serverless-data-supported-permissions) ↗ (`aoss:*`) for all indexes in the Collection. The IAM task roles of the applicable Migration services (Traffic Replayer, Migration Console, Reindex-from-Snapshot) should be used as the principals for this data access policy. + The data access policy should grant permission to perform all [index operations](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html#serverless-data-supported-permissions) (`aoss:*`) for all indexes in the Collection. The IAM task roles of the applicable Migration services (Traffic Replayer, migration console, `Reindex-from-Snapshot`) should be used as the principals for this data access policy. ## Capture Proxy on Coordinator Nodes of Source Cluster -Although the CDK does not automatically set up the Capture Proxy on source cluster nodes (except in the demo solution), the Capture Proxy instances must communicate with the resources deployed by the CDK (e.g., Kafka). This section outlines the necessary steps. +Although the CDK does not automatically set up the Capture Proxy on source cluster nodes (except in the demo solution), the Capture Proxy instances must communicate with the resources deployed by the CDK, such as Kafka. This section outlines the necessary steps to set up communication. Before [setting up Capture Proxy instances](https://github.com/opensearch-project/opensearch-migrations/tree/main/TrafficCapture/trafficCaptureProxyServer#installing-capture-proxy-on-coordinator-nodes) on the source cluster, ensure the following configurations are in place: @@ -66,7 +68,3 @@ Before [setting up Capture Proxy instances](https://github.com/opensearch-projec ] } ``` - -## Related Links - -- [OpenSearch Traffic Capture Setup](https://github.com/opensearch-project/opensearch-migrations/tree/main/TrafficCapture/trafficCaptureProxyServer#installing-capture-proxy-on-coordinator-nodes) ↗ \ No newline at end of file diff --git a/_migration-assistant/deploying-migration-assistant/index.md b/_migration-assistant/deploying-migration-assistant/index.md new file mode 100644 index 0000000000..1c559a81b1 --- /dev/null +++ b/_migration-assistant/deploying-migration-assistant/index.md @@ -0,0 +1,13 @@ +--- +layout: default +title: Deploying Migration Assistant +nav_order: 15 +has_children: true +permalink: /deploying-migration-assistant/ +redirect-from: + - /deploying-migration-assistant/index/ +--- + +# Deploying Migration Assistant + +This section provides information about the available options for deploying Migration Assistant. 
diff --git a/_migrations/index.md b/_migration-assistant/index.md similarity index 93% rename from _migrations/index.md rename to _migration-assistant/index.md index e3b2657e1a..8e8422f0c4 100644 --- a/_migrations/index.md +++ b/_migration-assistant/index.md @@ -5,6 +5,11 @@ nav_order: 1 has_children: false nav_exclude: true has_toc: false +permalink: /migration-assistant/ +redirect_from: + - /migration-assistant/index/ + - /upgrade-to/index/ + - /upgrade-to/ --- # Migration Assistant for OpenSearch @@ -45,7 +50,7 @@ Acting as a traffic simulation tool, the Traffic Replayer replays recorded reque The Metadata migration tool integrated into the Migration CLI can be used independently to migrate cluster metadata, including index mappings, index configuration settings, templates, component templates, and aliases. -### reindex-from-snapshot +### Reindex-from-Snapshot `Reindex-from-Snapshot` (RFS) reindexes data from an existing snapshot. Workers on Amazon Elastic Container Service (Amazon ECS) coordinate the migration of documents from an existing snapshot, reindexing the documents in parallel to a target cluster. @@ -53,18 +58,18 @@ The Metadata migration tool integrated into the Migration CLI can be used indepe The destination cluster for migration or comparison in an A/B test. -### Architecture overview +## Architecture overview The Migration Assistant architecture is based on the use of an AWS Cloud infrastructure, but most tools are designed to be cloud independent. A local containerized version of this solution is also available. The design deployed in AWS is as follows: -![Migration architecture overview]({{site.url}}{{site.baseurl}}/images/migrations/migration-architecture-overview.svg) +![Migration architecture overview]({{site.url}}{{site.baseurl}}/images/migrations/migrations-architecture-overview.png) 1. Client traffic is directed to the existing cluster. 2. An Application Load Balancer with capture proxies relays traffic to a source while replicating data to Amazon Managed Streaming for Apache Kafka (Amazon MSK). 3. Using the migration console, you can initiate metadata migration to establish indexes, templates, component templates, and aliases on the target cluster. 4. With continuous traffic capture in place, you can use a `reindex-from-snapshot` process to capture data from your current index. -4. Once `reindex-from-snapshot` is complete, captured traffic is replayed from Amazon MSK to the target cluster by the traffic replayer. +4. Once `Reindex-from-Snapshot` is complete, captured traffic is replayed from Amazon MSK to the target cluster by the traffic replayer. 5. Performance and behavior of traffic sent to the source and target clusters are compared by reviewing logs and metrics. 6. After confirming that the target cluster's functionality meets expectations, clients are redirected to the new target. 
diff --git a/_migrations/is-migration-assistant-right-for-you.md b/_migration-assistant/is-migration-assistant-right-for-you.md similarity index 99% rename from _migrations/is-migration-assistant-right-for-you.md rename to _migration-assistant/is-migration-assistant-right-for-you.md index 6a09e44206..073c2b6cd7 100644 --- a/_migrations/is-migration-assistant-right-for-you.md +++ b/_migration-assistant/is-migration-assistant-right-for-you.md @@ -30,6 +30,7 @@ There are also tools available for migrating cluster configuration, templates, a {: .note} ### Supported source and target platforms + * Self-managed (hosted by cloud provider or on-premises) * AWS OpenSearch diff --git a/_migration-assistant/migration-console/accessing-the-migration-console.md b/_migration-assistant/migration-console/accessing-the-migration-console.md new file mode 100644 index 0000000000..ea66f5c04c --- /dev/null +++ b/_migration-assistant/migration-console/accessing-the-migration-console.md @@ -0,0 +1,35 @@ +--- +layout: default +title: Accessing the migration console +nav_order: 35 +parent: Migration console +--- + +# Accessing the migration console + +The Bootstrap box deployed through Migration Assistant contains a script that simplifies access to the migration console through that instance. + +To access the migration console, use the following commands: + +```shell +export STAGE=dev +export AWS_REGION=us-west-2 +/opensearch-migrations/deployment/cdk/opensearch-service-migration/accessContainer.sh migration-console ${STAGE} ${AWS_REGION} +``` +{% include copy.html %} + +When opening the console a message will appear above the command prompt, `Welcome to the Migration Assistant Console`. + +On a machine with the [AWS Command Line Interface (AWS CLI)](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) and the [AWS Session Manager plugin](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html), you can directly connect to the migration console. Ensure that you've run `aws configure` with credentials that have access to the environment. + +Use the following commands: + +```shell +export STAGE=dev +export SERVICE_NAME=migration-console +export TASK_ARN=$(aws ecs list-tasks --cluster migration-${STAGE}-ecs-cluster --family "migration-${STAGE}-${SERVICE_NAME}" | jq --raw-output '.taskArns[0]') +aws ecs execute-command --cluster "migration-${STAGE}-ecs-cluster" --task "${TASK_ARN}" --container "${SERVICE_NAME}" --interactive --command "/bin/bash" +``` +{% include copy.html %} + +Typically, `STAGE` is equivalent to a standard `dev` environment, but this may vary based on what the user specified during deployment. \ No newline at end of file diff --git a/_migration-assistant/migration-console/index.md b/_migration-assistant/migration-console/index.md new file mode 100644 index 0000000000..eab3001868 --- /dev/null +++ b/_migration-assistant/migration-console/index.md @@ -0,0 +1,17 @@ +--- +layout: default +title: Migration console +nav_order: 30 +has_children: true +permalink: /migration-console/ +has_toc: false +redirect_from: + - /migration-console/index/ +--- + +# Migration console + +The Migrations Assistant deployment includes an Amazon Elastic Container Service (Amazon ECS) task that hosts tools that run different phases of the migration and check the progress or results of the migration. This ECS task is called the **migration console**. 
The migration console is a command line interface used to interact with the deployed components of the solution. + +This section provides information about how to access the migration console and what commands are supported. + diff --git a/_migration-assistant/migration-console/migration-console-commands-references.md b/_migration-assistant/migration-console/migration-console-commands-references.md new file mode 100644 index 0000000000..21d793b3f3 --- /dev/null +++ b/_migration-assistant/migration-console/migration-console-commands-references.md @@ -0,0 +1,131 @@ +--- +layout: default +title: Command reference +nav_order: 40 +parent: Migration console +--- + +# Migration console command reference + +Migration console commands follow this syntax: `console [component] [action]`. The components include `clusters`, `backfill`, `snapshot`, `metadata`, and `replay`. The console is configured with a registry of the deployed services and the source and target cluster, generated from the `cdk.context.json` values. + +## Commonly used commands + +The exact commands used will depend heavily on use-case and goals, but the following are a series of common commands with a quick description of what they do. + +### Check connection + +Reports whether both the source and target clusters can be reached and provides their versions. + +```sh +console clusters connection-check +``` +{% include copy.html %} + +### Run `cat-indices` + +Runs the `cat-indices` API on the cluster. + +```sh +console clusters cat-indices +``` +{% include copy.html %} + +### Create a snapshot + +Creates a snapshot of the source cluster and stores it in a preconfigured Amazon Simple Storage Service (Amazon S3) bucket. + +```sh +console snapshot create +``` +{% include copy.html %} + +## Check snapshot status + +Runs a detailed check on the snapshot creation status, including estimated completion time: + +```sh +console snapshot status --deep-check +``` +{% include copy.html %} + +## Evaluate metadata + +Performs a dry run of metadata migration, showing which indexes, templates, and other objects will be migrated to the target cluster. + +```sh +console metadata evaluate +``` +{% include copy.html %} + +## Migrate metadata + +Migrates the metadata from the source cluster to the target cluster. + +```sh +console metadata migrate +``` +{% include copy.html %} + +## Start a backfill + +If `Reindex-From-Snapshot` (RFS) is enabled, this command starts an instance of the service to begin moving documents to the target cluster: + +There are similar `scale UNITS` and `stop` commands to change the number of active instances for RFS. + + +```sh +console backfill start +``` +{% include copy.html %} + +## Check backfill status + +Gets the current status of the backfill migration, including the number of operating instances and the progress of the shards. + + +## Start Traffic Replayer + +If Traffic Replayer is enabled, this command starts an instance of Traffic Replayer to begin replaying traffic against the target cluster. +The `stop` command stops all active instances. + +```sh +console replay start +``` +{% include copy.html %} + +## Read logs + +Reads any logs that exist when running Traffic Replayer. Use tab completion on the path to fill in the available `NODE_IDs` and, if applicable, log file names. The tuple logs roll over at a certain size threshold, so there may be many files named with timestamps. The `jq` command pretty-prints each line of the tuple output before writing it to file. 
+ +```sh +console tuples show --in /shared-logs-output/traffic-replayer-default/[NODE_ID]/tuples/console.log | jq > readable_tuples.json +``` +{% include copy.html %} + +## Help option + +All commands and options can be explored within the tool itself by using the `--help` option, either for the entire `console` application or for individual components (for example, `console backfill --help`). For example: + +```bash +$ console --help +Usage: console [OPTIONS] COMMAND [ARGS]... + +Options: + --config-file TEXT Path to config file + --json + -v, --verbose Verbosity level. Default is warn, -v is info, -vv is + debug. + --help Show this message and exit. + +Commands: + backfill Commands related to controlling the configured backfill... + clusters Commands to interact with source and target clusters + completion Generate shell completion script and instructions for setup. + kafka All actions related to Kafka operations + metadata Commands related to migrating metadata to the target cluster. + metrics Commands related to checking metrics emitted by the capture... + replay Commands related to controlling the replayer. + snapshot Commands to create and check status of snapshots of the... + tuples All commands related to tuples. +``` diff --git a/_migration-assistant/migration-phases/backfill.md b/_migration-assistant/migration-phases/backfill.md new file mode 100644 index 0000000000..d2ff7cd873 --- /dev/null +++ b/_migration-assistant/migration-phases/backfill.md @@ -0,0 +1,181 @@ +--- +layout: default +title: Backfill +nav_order: 90 +parent: Migration phases +--- + +# Backfill + +After the [metadata]({{site.url}}{{site.baseurl}}/migration-assistant/migration-phases/migrating-metadata/) for your cluster has been migrated, you can use capture proxy data replication and snapshots to backfill your data into the next cluster. + +## Capture proxy data replication + +If you're interested in capturing live traffic during your migration, Migration Assistant includes an Application Load Balancer for routing traffic to the capture proxy and the target cluster. Upstream client traffic must be routed through the capture proxy in order to replay the requests later. Before using the capture proxy, remember the following: + +* The layer upstream from the Application Load Balancer is compatible with the certificate on the Application Load Balancer listener, whether it's for clients or a Network Load Balancer. The `albAcmCertArn` in the `cdk.context.json` may need to be provided to ensure that clients trust the Application Load Balancer certificate. +* If a Network Load Balancer is used directly upstream of the Application Load Balancer, it must use a TLS listener. +* Upstream resources and security groups must allow network access to the Migration Assistant Application Load Balancer. + +To set up the capture proxy, go to the AWS Management Console and navigate to **EC2 > Load Balancers > Migration Assistant Application Load Balancer**. Copy the Application Load Balancer URL. With the URL copied, you can use one of the following options. + + +### If you are using **Network Load Balancer → Application Load Balancer → Cluster** + +1. Ensure that ingress is provided directly to the Application Load Balancer for the capture proxy. +2. Create a target group for the Migration Assistant Application Load Balancer on port `9200`, and set the health check to `HTTPS`. +3. Associate this target group with your existing Network Load Balancer on a new listener for testing. +4. 
Verify that the health check is successful, and perform smoke testing with some clients through the new listener port. +5. Once you are ready to migrate all clients, detach the Migration Assistant Application Load Balancer target group from the testing Network Load Balancer listener and modify the existing Network Load Balancer listener to direct traffic to this target group. +6. Now client requests will be routed through the proxy (once they establish a new connection). Verify the application metrics. + +### If you are using **Network Load Balancer → Cluster** + +If you do not want to modify application logic, add an Application Load Balancer in front of your cluster and follow the **Network Load Balancer → Application Load Balancer → Cluster** steps. Otherwise: + +1. Create a target group for the Application Load Balancer on port `9200` and set the health check to `HTTPS`. +2. Associate this target group with your existing Network Load Balancer on a new listener. +3. Verify that the health check is successful, and perform smoke testing with some clients through the new listener port. +4. Once you are ready to migrate all clients, deploy a change so that clients hit the new listener. + + +### If you are **not using an Network Load Balancer** + +If you're only using backfill as your migration technique, make a client/DNS change to route clients to the Migration Assistant Application Load Balancer on port `9200`. + + +### Kafka connection + +After you have routed the client based on your use case, test adding records against HTTP requests using the following steps: + +In the migration console, run the following command: + + ```bash + console kafka describe-topic-records + ``` + {% include copy.html %} + + Note the records in the logging topic. + +After a short period, execute the same command again and compare the increased number of records against the expected HTTP requests. + + +## Creating a snapshot + +Create a snapshot for your backfill using the following command: + +```bash +console snapshot create +``` +{% include copy.html %} + +To check the progress of your snapshot, use the following command: + +```bash +console snapshot status --deep-check +``` +{% include copy.html %} + +Depending on the size of the data in the source cluster and the bandwidth allocated for snapshots, the process can take some time. Adjust the maximum rate at which the source cluster's nodes create the snapshot using the `--max-snapshot-rate-mb-per-node` option. Increasing the snapshot rate will consume more node resources, which may affect the cluster's ability to handle normal traffic. + +## Backfilling documents to the source cluster + +From the snapshot you created of your source cluster, you can begin backfilling documents into the target cluster. Once you have started this process, a fleet of workers will spin up to read the snapshot and reindex documents into the target cluster. This fleet of workers can be scaled to increased the speed at which documents are reindexed into the target cluster. + +### Checking the starting state of the clusters + +You can check the indexes and document counts of the source and target clusters by running the `cat-indices` command. This can be used to monitor the difference between the source and target for any migration scenario. 
Check the indexes of both clusters using the following command: + +```shell +console clusters cat-indices +``` +{% include copy.html %} + +You should receive the following response: + +```shell +SOURCE CLUSTER +health status index uuid pri rep docs.count docs.deleted store.size pri.store.size +green open my-index WJPVdHNyQ1KMKol84Cy72Q 1 0 8 0 44.7kb 44.7kb + +TARGET CLUSTER +health status index uuid pri rep docs.count docs.deleted store.size pri.store.size +green open .opendistro_security N3uy88FGT9eAO7FTbLqqqA 1 0 10 0 78.3kb 78.3kb +``` + +### Starting the backfill + +Use the following command to start the backfill and deploy the workers: + +```shell +console backfill start +``` +{% include copy.html %} + +You should receive a response similar to the following: + +```shell +BackfillStatus.RUNNING +Running=1 +Pending=0 +Desired=1 +Shards total: 48 +Shards completed: 48 +Shards incomplete: 0 +Shards in progress: 0 +Shards unclaimed: 0 +``` + +The status will be `Running` even if all the shards have been migrated. + +### Scaling up the fleet + +To speed up the transfer, you can scale the number of workers. It may take a few minutes for these additional workers to come online. The following command will update the worker fleet to a size of 10: + +```shell +console backfill scale 5 +``` +{% include copy.html %} + +We recommend slowly scaling up the fleet while monitoring the health metrics of the target cluster to avoid over-saturating it. [Amazon OpenSearch Service domains](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/monitoring.html) provide a number of metrics and logs that can provide this insight. + +### Stopping the migration + +Backfill requires manually stopping the fleet. Once all the data has been migrated, you can shut down the fleet and all its workers using the following command: +Backfill requires manually stopping the fleet. Once all the data has been migrated, you can shut down the fleet and all its workers using the following command: +```shell +console backfill stop +``` + +### Amazon CloudWatch metrics and dashboard + +Migration Assistant creates an Amazon CloudWatch dashboard that you can use to visualize the health and performance of the backfill process. It combines the metrics for the backfill workers and, for those migrating to Amazon OpenSearch Service, the target cluster. + +You can find the backfill dashboard in the CloudWatch console based on the AWS Region in which you have deployed Migration Assistant. The metric graphs for your target cluster will be blank until you select the OpenSearch domain you're migrating to from the dropdown menu at the top of the dashboard. + +## Validating the backfill + +After the backfill is complete and the workers have stopped, examine the contents of your cluster using the [Refresh API](https://opensearch.org/docs/latest/api-reference/index-apis/refresh/) and the [Flush API](https://opensearch.org/docs/latest/api-reference/index-apis/flush/). 
The following example uses the console CLI with the Refresh API to check the backfill status: + +```shell +console clusters cat-indices --refresh +``` +{% include copy.html %} + +This will display the number of documents in each of the indexes in the target cluster, as shown in the following example response: + +```shell +SOURCE CLUSTER +health status index uuid pri rep docs.count docs.deleted store.size pri.store.size +green open my-index -DqPQDrATw25hhe5Ss34bQ 1 0 3 0 12.7kb 12.7kb + +TARGET CLUSTER +health status index uuid pri rep docs.count docs.deleted store.size pri.store.size +green open .opensearch-observability 8HOComzdSlSWCwqWIOGRbQ 1 1 0 0 416b 208b +green open .plugins-ml-config 9tld-PCJToSUsMiyDhlyhQ 5 1 1 0 9.5kb 4.7kb +green open my-index bGfGtYoeSU6U6p8leR5NAQ 1 0 3 0 5.5kb 5.5kb +green open .migrations_working_state lopd47ReQ9OEhw4ZuJGZOg 1 1 2 0 18.6kb 6.4kb +green open .kibana_1 +``` + +You can run additional queries against the target cluster to mimic your production workflow and closely examine the results. diff --git a/_migration-assistant/migration-phases/index.md b/_migration-assistant/migration-phases/index.md new file mode 100644 index 0000000000..63e4b574e5 --- /dev/null +++ b/_migration-assistant/migration-phases/index.md @@ -0,0 +1,23 @@ +--- +layout: default +title: Migration phases +nav_order: 50 +has_children: true +has_toc: false +permalink: /migration-phases/ +redirect_from: + - /migration-phases/index/ +--- + +# Migration phases + +This page details how to conduct a migration with Migration Assistant. It shows you how to [plan for your migration]({{site.url}}{{site.baseurl}}/migration-assistant/migration-phases/planning-your-migration/index/) and encompasses a variety of migration scenarios, including: + +- [**Metadata migration**]({{site.url}}{{site.baseurl}}/migration-assistant/migration-phases/migrating-metadata/): Migrating cluster metadata, such as index settings, aliases, and templates. +- [**Backfill migration**]({{site.url}}{{site.baseurl}}/migration-assistant/migration-phases/backfill/): Migrating existing or historical data from a source to a target cluster. +- [**Live traffic migration**]({{site.url}}{{site.baseurl}}/migration-assistant/migration-phases/using-traffic-replayer/): Replicating live ongoing traffic from [a source cluster]({{site.url}}{{site.baseurl}}/migration-assistant/migration-phases/switching-traffic-from-the-source-cluster/) to a target cluster. + + + + + diff --git a/_migration-assistant/migration-phases/live-traffic-migration/index.md b/_migration-assistant/migration-phases/live-traffic-migration/index.md new file mode 100644 index 0000000000..f9e3d9f71f --- /dev/null +++ b/_migration-assistant/migration-phases/live-traffic-migration/index.md @@ -0,0 +1,18 @@ +--- +layout: default +title: Live traffic migration +nav_order: 99 +parent: Migration phases +has_toc: false +has_children: true +--- + +# Live traffic migration + +Live traffic migration intercepts HTTP requests to a source cluster and stores them in a durable stream before forwarding them to the source cluster. The stored requests are then duplicated and replayed to the target cluster. This process synchronizes the source and target clusters while highlighting behavioral and performance differences between them. Kafka is used to manage the data flow and reconstruct HTTP requests. 
You can monitor the replication process through CloudWatch metrics and the [migration console]({{site.url}}{{site.baseurl}}/migration-console/), which provides results in JSON format for analysis. + +To start with live traffic migration, use the following steps: + +1. [Using Traffic Replayer]({{site.url}}{{site.baseurl}}/migration-assistant/migration-phases/using-traffic-replayer/) +2. [Switching traffic from the source cluster]({{site.url}}{{site.baseurl}}/migration-assistant/migration-phases/switching-traffic-from-the-source-cluster/) + diff --git a/_migration-assistant/migration-phases/live-traffic-migration/switching-traffic-from-the-source-cluster.md b/_migration-assistant/migration-phases/live-traffic-migration/switching-traffic-from-the-source-cluster.md new file mode 100644 index 0000000000..e9e477654a --- /dev/null +++ b/_migration-assistant/migration-phases/live-traffic-migration/switching-traffic-from-the-source-cluster.md @@ -0,0 +1,55 @@ +--- +layout: default +title: Switching traffic from the source cluster +nav_order: 110 +grand_parent: Migration phases +parent: Live traffic migration +redirect_from: + - /migration-assistant/migration-phases/switching-traffic-from-the-source-cluster/ +--- + +# Switching traffic from the source cluster + +After the source and target clusters are synchronized, traffic needs to be switched to the target cluster so that the source cluster can be taken offline. + +## Assumptions + +This page assumes that the following has occurred before making the switch: + +- All client traffic is being routed through a switchover listener in the [MigrationAssistant Application Load Balancer]({{site.url}}{{site.baseurl}}/migration-assistant/migration-phases/backfill/). +- Client traffic has been verified as compatible with the target cluster. +- The target cluster is in a good state to accept client traffic. +- The target proxy service is deployed. + +## Switching traffic + +Use the following steps to switch traffic from the source cluster to the target cluster: + +1. In the AWS Management Console, navigate to **ECS** > **Migration Assistant Cluster**. Note the desired count of the capture proxy, which should be greater than 1. + +2. Update the **ECS Service** of the target proxy to be at least as large as the traffic capture proxy. Wait for tasks to start up, and verify that all targets are healthy in the target proxy service's **Load balancer health** section. + +3. Navigate to **EC2** > **Load Balancers** > **Migration Assistant ALB**. + +4. Navigate to **ALB Metrics** and examine any useful information, specifically looking at **Active Connection Count** and **New Connection Count**. Note any large discrepancies, which can indicate reused connections affecting traffic switchover. + +5. Navigate to **Capture Proxy Target Group** (`ALBSourceProxy--TG`) > **Monitoring**. + +6. Examine the **Metrics Requests**, **Target (2XX, 3XX, 4XX, 5XX)**, and **Target Response Time** metrics. Verify that this appears as expected and includes all traffic expected to be included in the switchover. Note details that could help identify anomalies during the switchover, including the expected response time and response code rate. + +7. Navigate back to **ALB Metrics** and choose **Target Proxy Target Group** (`ALBTargetProxy--TG`). Verify that all expected targets are healthy and that none are in a draining state. + +8. Navigate back to **ALB Metrics** and to the **Listener** on port `9200`. + +9. Choose the **Default rule** and **Edit**. + +10. 
Modify the weights of the targets to switch the desired traffic to the target proxy. To perform a full switchover, modify the **Target Proxy** weight to `1` and the **Source Proxy** weight to `0`. + +11. Choose **Save Changes**. + +12. Navigate to both **SourceProxy** and **TargetProxy TG Monitoring** metrics and verify that traffic is switching over as expected. If connections are being reused by clients, perform any necessary actions to terminate them. Monitor these metrics until **SourceProxy TG** shows 0 requests when all clients have switched over. + + +## Fallback + +If you need to fall back to the source cluster at any point during the switchover, revert the **Default rule** so that the Application Load Balancer routes to the **SourceProxy Target Group**. \ No newline at end of file diff --git a/_migration-assistant/migration-phases/live-traffic-migration/using-traffic-replayer.md b/_migration-assistant/migration-phases/live-traffic-migration/using-traffic-replayer.md new file mode 100644 index 0000000000..05c734d7be --- /dev/null +++ b/_migration-assistant/migration-phases/live-traffic-migration/using-traffic-replayer.md @@ -0,0 +1,312 @@ +--- +layout: default +title: Using Traffic Replayer +nav_order: 100 +grand_parent: Migration phases +parent: Live traffic migration +redirect_from: + - /migration-assistant/migration-phases/using-traffic-replayer/ +--- + +# Using Traffic Replayer + +This guide covers how to use Traffic Replayer to replay captured traffic from a source cluster to a target cluster during the migration process. Traffic Replayer allows you to verify that the target cluster can handle requests in the same way as the source cluster and catch up to real-time traffic for a smooth migration. + +## When to run Traffic Replayer + +After deploying Migration Assistant, Traffic Replayer does not run by default. It should be started only after all metadata and documents have been migrated to ensure that recent changes to the source cluster are properly reflected in the target cluster. + +For example, if a document was deleted after a snapshot was taken, starting Traffic Replayer before the document migration is complete may cause the deletion request to execute before the document is added to the target. Running Traffic Replayer after all other migration processes ensures that the target cluster will be consistent with the source cluster. + +## Configuration options + +[Traffic Replayer settings]({{site.url}}{{site.baseurl}}/migration-assistant/deploying-migration-assistant/configuration-options/) are configured during the deployment of Migration Assistant. Make sure to set the authentication mode for Traffic Replayer so that it can properly communicate with the target cluster. + +## Using Traffic Replayer + +To manage Traffic Replayer, use the `console replay` command. The following examples show the available commands. + +### Start Traffic Replayer + +The following command starts Traffic Replayer with the options specified at deployment: + +```bash +console replay start +``` + +When starting Traffic Replayer, you should receive an output similar to the following: + +```bash +root@ip-10-0-2-66:~# console replay start +Replayer started successfully. +Service migration-dev-traffic-replayer-default set to 1 desired count. Currently 0 running and 0 pending. 
+
+```
+
+### Check the status of Traffic Replayer
+
+Use the following command to show the status of Traffic Replayer:
+
+```bash
+console replay status
+```
+
+The status command returns the following values:
+
+- `Running` shows how many container instances are actively running.
+- `Pending` indicates how many instances are being provisioned.
+- `Desired` shows the total number of instances that should be running.
+
+You should receive an output similar to the following:
+
+```bash
+root@ip-10-0-2-66:~# console replay status
+(, 'Running=0\nPending=0\nDesired=0')
+```
+
+### Stop Traffic Replayer
+
+The following command stops Traffic Replayer:
+
+```bash
+console replay stop
+```
+
+You should receive an output similar to the following:
+
+```bash
+root@ip-10-0-2-66:~# console replay stop
+Replayer stopped successfully.
+Service migration-dev-traffic-replayer-default set to 0 desired count. Currently 0 running and 0 pending.
+```
+
+### Delivery guarantees
+
+Traffic Replayer retrieves traffic from Kafka and updates its commit cursor after sending requests to the target cluster. This provides an "at least once" delivery guarantee; however, success isn't always guaranteed. Therefore, you should monitor metrics and tuple outputs or perform external validation to ensure that the target cluster is functioning as expected.
+
+## Time scaling
+
+Traffic Replayer sends requests in the same order that they were received from each connection to the source. However, relative timing between different connections is not guaranteed. For example:
+
+- **Scenario**: Two connections exist: one sends a PUT request every minute, and the other sends a GET request every second.
+- **Behavior**: Traffic Replayer will maintain the sequence within each connection, but the relative timing between the connections (PUTs and GETs) is not preserved.
+
+Assume that a source cluster responds to requests (GETs and PUTs) within 100 ms:
+
+- With a **speedup factor of 1**, the target will experience the same request rates and idle periods as the source.
+- With a **speedup factor of 2**, requests will be sent twice as fast, with GETs sent every 500 ms and PUTs every 30 seconds.
+- With a **speedup factor of 10**, requests will be sent 10x faster, and as long as the target responds quickly, Traffic Replayer can maintain the pace.
+
+If the target cannot respond fast enough, Traffic Replayer will wait for the previous request to complete before sending the next one. This may cause delays and affect global relative ordering.
+
+## Transformations
+
+During migrations, some requests may need to be transformed between versions. For example, Elasticsearch previously supported multiple type mappings in indexes, but this is no longer the case in OpenSearch. Clients may need to be adjusted accordingly by splitting documents into multiple indexes or transforming request data.
+
+Traffic Replayer automatically rewrites host and authentication headers, but for more complex transformations, custom transformation rules can be specified using the `--transformer-config` option. For more information, see the [Traffic Replayer README](https://github.com/opensearch-project/opensearch-migrations/blob/c3d25958a44ec2e7505892b4ea30e5fbfad4c71b/TrafficCapture/trafficReplayer/README.md#transformations).
+
+### Example transformation
+
+Suppose that a source request contains a `tagToExcise` element that needs to be removed and its children promoted and that the URI path includes `extraThingToRemove`, which should also be removed.
The following Jolt script handles this transformation: + +```json +[{ "JsonJoltTransformerProvider": +[ + { + "script": { + "operation": "shift", + "spec": { + "payload": { + "inlinedJsonBody": { + "top": { + "tagToExcise": { + "*": "payload.inlinedJsonBody.top.&" + }, + "*": "payload.inlinedJsonBody.top.&" + }, + "*": "payload.inlinedJsonBody.&" + }, + "*": "payload.&" + }, + "*": "&" + } + } + }, + { + "script": { + "operation": "modify-overwrite-beta", + "spec": { + "URI": "=split('/extraThingToRemove',@(1,&))" + } + } + }, + { + "script": { + "operation": "modify-overwrite-beta", + "spec": { + "URI": "=join('',@(1,&))" + } + } + } +] +}] +``` + +The resulting request sent to the target will appear similar to the following: + +```bash +PUT /oldStyleIndex/moreStuff HTTP/1.0 +host: testhostname + +{"top":{"properties":{"field1":{"type":"text"},"field2":{"type":"keyword"}}}} +``` +{% include copy.html %} + +You can pass Base64-encoded transformation scripts using `--transformer-config-base64`. + +## Result logs + +HTTP transactions from the source capture and those resent to the target cluster are logged in files located at `/shared-logs-output/traffic-replayer-default/*/tuples/tuples.log`. The `/shared-logs-output` directory is shared across containers, including the migration console. You can access these files from the migration console using the same path. Previous runs are also available in a `gzipped` format. + +Each log entry is a newline-delimited JSON object, containing information about the source and target requests/responses along with other transaction details, such as response times. + +These logs contain the contents of all requests, including authorization headers and the contents of all HTTP messages. Ensure that access to the migration environment is restricted, as these logs serve as a source of truth for determining what happened in both the source and target clusters. Response times for the source refer to the amount of time between the proxy sending the end of a request and receiving the response. While response times for the target are recorded in the same manner, keep in mind that the locations of the capture proxy, Traffic Replayer, and target may differ and that these logs do not account for the client's location. +{: .note} + + +### Example log entry + +The following example log entry shows a `/_cat/indices?v` request sent to both the source and target clusters: + +```json +{ + "sourceRequest": { + "Request-URI": "/_cat/indices?v", + "Method": "GET", + "HTTP-Version": "HTTP/1.1", + "Host": "capture-proxy:9200", + "Authorization": "Basic YWRtaW46YWRtaW4=", + "User-Agent": "curl/8.5.0", + "Accept": "*/*", + "body": "" + }, + "sourceResponse": { + "HTTP-Version": {"keepAliveDefault": true}, + "Status-Code": 200, + "Reason-Phrase": "OK", + "response_time_ms": 59, + "content-type": "text/plain; charset=UTF-8", + "content-length": "214", + "body": "aGVhbHRoIHN0YXR1cyBpbmRleCAgICAgICB..." + }, + "targetRequest": { + "Request-URI": "/_cat/indices?v", + "Method": "GET", + "HTTP-Version": "HTTP/1.1", + "Host": "opensearchtarget", + "Authorization": "Basic YWRtaW46bXlTdHJvbmdQYXNzd29yZDEyMyE=", + "User-Agent": "curl/8.5.0", + "Accept": "*/*", + "body": "" + }, + "targetResponses": [{ + "HTTP-Version": {"keepAliveDefault": true}, + "Status-Code": 200, + "Reason-Phrase": "OK", + "response_time_ms": 721, + "content-type": "text/plain; charset=UTF-8", + "content-length": "484", + "body": "aGVhbHRoIHN0YXR1cyBpbmRleCAgICAgICB..." 
+ }], + "connectionId": "0242acfffe13000a-0000000a-00000005-1eb087a9beb83f3e-a32794b4.0", + "numRequests": 1, + "numErrors": 0 +} +``` +{% include copy.html %} + + +### Decoding log content + +The contents of HTTP message bodies are Base64 encoded in order to handle various types of traffic, including compressed data. To view the logs in a more human-readable format, use the console library `tuples show`. Running the script as follows will produce a `readable-tuples.log` in the home directory: + +```shell +console tuples show --in /shared-logs-output/traffic-replayer-default/d3a4b31e1af4/tuples/tuples.log > readable-tuples.log +``` + +The `readable-tuples.log` should appear similar to the following: + +```json +{ + "sourceRequest": { + "Request-URI": "/_cat/indices?v", + "Method": "GET", + "HTTP-Version": "HTTP/1.1", + "Host": "capture-proxy:9200", + "Authorization": "Basic YWRtaW46YWRtaW4=", + "User-Agent": "curl/8.5.0", + "Accept": "*/*", + "body": "" + }, + "sourceResponse": { + "HTTP-Version": {"keepAliveDefault": true}, + "Status-Code": 200, + "Reason-Phrase": "OK", + "response_time_ms": 59, + "content-type": "text/plain; charset=UTF-8", + "content-length": "214", + "body": "health status index uuid ..." + }, + "targetRequest": { + "Request-URI": "/_cat/indices?v", + "Method": "GET", + "HTTP-Version": "HTTP/1.1", + "Host": "opensearchtarget", + "Authorization": "Basic YWRtaW46bXlTdHJvbmdQYXNzd29yZDEyMyE=", + "User-Agent": "curl/8.5.0", + "Accept": "*/*", + "body": "" + }, + "targetResponses": [{ + "HTTP-Version": {"keepAliveDefault": true}, + "Status-Code": 200, + "Reason-Phrase": "OK", + "response_time_ms": 721, + "content-type": "text/plain; charset=UTF-8", + "content-length": "484", + "body": "health status index uuid ..." + }], + "connectionId": "0242acfffe13000a-0000000a-00000005-1eb087a9beb83f3e-a32794b4.0", + "numRequests": 1, + "numErrors": 0 +} +``` + + +## Metrics + +Traffic Replayer emits various OpenTelemetry metrics to Amazon CloudWatch, and traces are sent through AWS X-Ray. The following are some useful metrics that can help evaluate cluster performance. + +### `sourceStatusCode` + +This metric tracks the HTTP status codes for both the source and target clusters, with dimensions for the HTTP verb, such as `GET` or `POST`, and the status code families (200--299). These dimensions can help quickly identify discrepancies between the source and target, such as when `DELETE 200s` becomes `4xx` or `GET 4xx` errors turn into `5xx` errors. + +### `lagBetweenSourceAndTargetRequests` + +This metric shows the delay between requests hitting the source and target clusters. With a speedup factor greater than 1 and a target cluster that can handle requests efficiently, this value should decrease as the replay progresses, indicating a reduction in replay lag. + +### Additional metrics + +The following metrics are also reported: + +- **Throughput**: `bytesWrittenToTarget` and `bytesReadFromTarget` indicate the throughput to and from the cluster. +- **Retries**: `numRetriedRequests` tracks the number of requests retried due to status code mismatches between the source and target. +- **Event counts**: Various `(*)Count` metrics track the number of completed events. +- **Durations**: `(*)Duration` metrics measure the duration of each step in the process. +- **Exceptions**: `(*)ExceptionCount` shows the number of exceptions encountered during each processing phase. + + +## CloudWatch considerations + +Metrics pushed to CloudWatch may experience a visibility lag of around 5 minutes. 
CloudWatch also retains higher-resolution data for a shorter period than lower-resolution data. For more information, see [Amazon CloudWatch concepts](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html). \ No newline at end of file diff --git a/_migrations/migration-phases/metadata/Metadata-Migration.md b/_migration-assistant/migration-phases/migrating-metadata.md similarity index 58% rename from _migrations/migration-phases/metadata/Metadata-Migration.md rename to _migration-assistant/migration-phases/migrating-metadata.md index 543a249639..249a2ca4d0 100644 --- a/_migrations/migration-phases/metadata/Metadata-Migration.md +++ b/_migration-assistant/migration-phases/migrating-metadata.md @@ -1,40 +1,78 @@ --- layout: default -title: Metadata migration +title: Migrating metadata nav_order: 85 -parent: Metadata -grand_parent: Migration phases +parent: Migration phases --- -# Metadata migration +# Migrating metadata -Metadata migration is a relatively fast process to execute so we recommend attempting this workflow as early as possible to discover any issues which could impact longer running migration steps. +Metadata migration involves creating a snapshot of your cluster and then migrating the metadata from the snapshot using the migration console. -## Prerequisites -A snapshot of the cluster state will need to be taken, [[guide to create a snapshot|Snapshot Creation]]. +This tool gathers information from a source cluster through a snapshot or through HTTP requests against the source cluster. These snapshots are fully compatible with the backfill process for `Reindex-From-Snapshot` (RFS) scenarios. -## Command Arguments +After collecting information on the source cluster, comparisons are made against the target cluster. If running a migration, any metadata items that do not already exist will be created on the target cluster. + +## Creating the snapshot + +Creating a snapshot of the source cluster captures all the metadata and documents to be migrated to a new target cluster. + +Create the initial snapshot of the source cluster using the following command: + +```shell +console snapshot create +``` +{% include copy.html %} + +To check the progress of the snapshot in real time, use the following command: + +```shell +console snapshot status --deep-check +``` +{% include copy.html %} + +You should receive the following response when the snapshot is created: + +```shell +SUCCESS +Snapshot is SUCCESS. +Percent completed: 100.00% +Data GiB done: 29.211/29.211 +Total shards: 40 +Successful shards: 40 +Failed shards: 0 +Start time: 2024-07-22 18:21:42 +Duration: 0h 13m 4s +Anticipated duration remaining: 0h 0m 0s +Throughput: 38.13 MiB/sec +``` + +### Managing slow snapshot speeds + +Depending on the size of the data in the source cluster and the bandwidth allocated for snapshots, the process can take some time. Adjust the maximum rate at which the source cluster's nodes create the snapshot using the `--max-snapshot-rate-mb-per-node` option. Increasing the snapshot rate will consume more node resources, which may affect the cluster's ability to handle normal traffic. + +## Command arguments For the following commands, to identify all valid arguments, please run with `--help`. ```shell console metadata evaluate --help ``` +{% include copy.html %} ```shell console metadata migrate --help ``` +{% include copy.html %} Based on the migration console deployment options, a number of commands will be pre-populated. 
To view them, run console with verbosity: ```shell console -v metadata migrate --help ``` +{% include copy.html %} -
- -Example "console -v metadata migrate --help" command output - +You should receive a response similar to the following: ```shell (.venv) bash-5.2# console -v metadata migrate --help @@ -47,22 +85,20 @@ INFO:console_link.models.metadata:Migrating metadata with command: /root/metadat . . ``` -
-## Metadata verification with evaluate command
+
+## Using the `evaluate` command

The `evaluate` command scans the contents of the source cluster, applies any filtering and modifications, and produces a list of all items that will be migrated. Any items not included in this output will not be migrated to the target cluster when the `migrate` command is run. This is a safety check to perform before making modifications on the target cluster.

```shell
console metadata evaluate [...]
```
+{% include copy.html %}

-
- -Example evaluate command output - +You should receive a response similar to the following: -``` +```bash Starting Metadata Evaluation Clusters: Source: @@ -89,22 +125,20 @@ Migration Candidates: Results: 0 issue(s) detected ``` -
-## Metadata migration with migrate command
+
+## Using the `migrate` command

The `migrate` command runs through the same data as the `evaluate` command and applies all of the evaluated items to the target cluster. If it is re-run multiple times, items that were previously migrated will not be recreated. If any items do need to be re-migrated, delete them from the target cluster and then rerun the `evaluate` and `migrate` commands to ensure that the desired changes are made (see the re-migration example that follows the output below).

```shell
console metadata migrate [...]
```
+{% include copy.html %}

-
- -Example migrate command output - +You should receive a response similar to the following: -``` +```shell Starting Metadata Migration Clusters: @@ -132,50 +166,54 @@ Migrated Items: Results: 0 issue(s) detected ``` -
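+For example, if a single index was migrated with the wrong settings, a minimal re-migration might look like the following sketch. The index name `my-index` and the `<target-cluster-endpoint>` placeholder are illustrative only, and the `DELETE` request permanently removes that index from the target cluster:
+
+```shell
+# Remove the incorrectly migrated index from the target cluster (destructive).
+curl -X DELETE "http://<target-cluster-endpoint>:9200/my-index"
+
+# Confirm what will be recreated, then reapply the metadata.
+console metadata evaluate
+console metadata migrate
+```
+{% include copy.html %}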
+
## Metadata verification process

Before moving on to additional migration steps, we recommend confirming details about your cluster. Depending on your configuration, this could include checking the sharding strategy or making sure that index mappings are correctly defined by ingesting a test document.

-## Troubleshoot issues
+## Troubleshooting
+
+Use these instructions to help troubleshoot the following issues.
+
+### Accessing detailed logs
-### Access Detailed Logs

Metadata migration creates a detailed log file that includes low-level tracing information for troubleshooting. For each execution of the program, a log file is created inside a shared volume on the migration console named `shared-logs-output`. The following command lists all log files, one for each run of the command.

```shell
ls -al /shared-logs-output/migration-console-default/*/metadata/
```
+{% include copy.html %}

To inspect a file from within the console, use the `cat`, `tail`, and `grep` command line tools. Looking for warnings, errors, and exceptions in this log file can help you identify the source of failures or, at the very least, provide useful details when creating issues in this project.

```shell
tail /shared-logs-output/migration-console-default/*/metadata/*.log
```
+{% include copy.html %}

-### Warnings / Errors inline
-There might be `WARN` or `ERROR` elements inline the output, they will be accompanied by a short message, such as `WARN - my_index already exists`. Full information will be in the detailed logs associated with this warnings or errors.
+### Warnings and errors

-### OpenSearch running in compatibility mode
-There might be an error about being unable to update an ES 7.10.2 cluster, this can occur when compatibility mode has been enabled on an OpenSearch cluster please disable it to continue, see [Enable compatibility mode](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/rename.html#rename-upgrade) ↗.
+`WARN` and `ERROR` elements in the response are accompanied by a short message, such as `WARN - my_index already exists`. More information can be found in the detailed logs associated with the warning or error.

-## How the tool works
+### OpenSearch running in compatibility mode

-This tool gathers information from a source cluster, through a snapshot or through HTTP requests against the source cluster. These snapshots are fully compatible with the backfill process for Reindex-From-Snapshot (RFS) scenarios, [[learn more|Backfill-Execution]].
+You might receive an error about being unable to update an ES 7.10.2 cluster. This can occur when compatibility mode has been enabled on an OpenSearch cluster; disable it to continue. For more information, see [Enable compatibility mode](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/rename.html#rename-upgrade).

-After collecting information on the source cluster comparisons are made on the target cluster. If running a migration, any metadata items do not already exist will be created on the target cluster.

### Breaking change compatibility

-Metadata migration needs to modify data from the source to the target versions to recreate items. Sometimes these features are no longer supported and have been removed from the target version. Sometimes these features are not available on the target version, which is especially true when downgrading. While this tool is meant to make this process easier, it is not exhaustive in its support.
When encountering a compatibility issue or an important feature gap for your migration, please [search the issues](https://github.com/opensearch-project/opensearch-migrations/issues) and comment + upvote or a [create a new](https://github.com/opensearch-project/opensearch-migrations/issues/new/choose) issue if one cannot be found. +Metadata migration requires modifying data from the source to the target versions to recreate items. Sometimes these features are no longer supported and have been removed from the target version. Sometimes these features are not available in the target version, which is especially true when downgrading. While this tool is meant to make this process easier, it is not exhaustive in its support. When encountering a compatibility issue or an important feature gap for your migration, [search the issues and comment on the existing issue](https://github.com/opensearch-project/opensearch-migrations/issues) or [create a new](https://github.com/opensearch-project/opensearch-migrations/issues/new/choose) issue if one cannot be found. #### Deprecation of Mapping Types + In Elasticsearch 6.8 the mapping types feature was discontinued in Elasticsearch 7.0+ which has created complexity in migrating to newer versions of Elasticsearch and OpenSearch, [learn more](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/removal-of-types.html) ↗. As Metadata migration supports migrating from ES 6.8 on to the latest versions of OpenSearch this scenario is handled by removing the type mapping types and restructuring the template or index properties. Note that, at the time of this writing multiple type mappings are not supported, [tracking task](https://opensearch.atlassian.net/browse/MIGRATIONS-1778) ↗. **Example starting state with mapping type foo (ES 6):** + ```json { "mappings": [ @@ -190,8 +228,10 @@ As Metadata migration supports migrating from ES 6.8 on to the latest versions o ] } ``` +{% include copy.html %} **Example ending state with foo removed (ES 7):** + ```json { "mappings": { @@ -202,5 +242,6 @@ As Metadata migration supports migrating from ES 6.8 on to the latest versions o } } ``` +{% include copy.html %} -*Technical details are available, [view source code](https://github.com/opensearch-project/opensearch-migrations/blob/main/transformation/src/main/java/org/opensearch/migrations/transformation/rules/IndexMappingTypeRemoval.java).* \ No newline at end of file +For additional technical details, [view the mapping type removal source code](https://github.com/opensearch-project/opensearch-migrations/blob/main/transformation/src/main/java/org/opensearch/migrations/transformation/rules/IndexMappingTypeRemoval.java). diff --git a/_migration-assistant/migration-phases/planning-your-migration/assessing-your-cluster-for-migration.md b/_migration-assistant/migration-phases/planning-your-migration/assessing-your-cluster-for-migration.md new file mode 100644 index 0000000000..23bceb7114 --- /dev/null +++ b/_migration-assistant/migration-phases/planning-your-migration/assessing-your-cluster-for-migration.md @@ -0,0 +1,50 @@ +--- +layout: default +title: Assessing your cluster for migration +nav_order: 60 +parent: Planning your migration +grand_parent: Migration phases +redirect_from: + - /migration-assistant/migration-phases/assessing-your-cluster-for-migration/ +--- + +# Assessing your cluster for migration + +The goal of the Migration Assistant is to streamline the process of migrating from one location or version of Elasticsearch/OpenSearch to another. 
However, completing a migration sometimes requires resolving client compatibility issues before they can communicate directly with the target cluster. + +## Understanding breaking changes + +Before performing any upgrade or migration, you should review any documentation of breaking changes. Even if the cluster is migrated there might be changes required for clients to connect to the new cluster + +## Upgrade and breaking changes guides + +For migrations paths between Elasticsearch 6.8 and OpenSearch 2.x users should be familiar with documentation in the links below that apply to their specific case: + +* [Upgrading Amazon Service Domains](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/version-migration.html). + +* [Changes from Elasticsearch to OpenSearch fork](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/rename.html). + +* [OpenSearch Breaking Changes](https://opensearch.org/docs/latest/breaking-changes/). + +The next step is to set up a proper test bed to verify that your applications will work as expected on the target version. + +## Impact of data transformations + +Any time you apply a transformation to your data, such as: + +- Changing index names +- Modifying field names or field mappings +- Splitting indices with type mappings + +These changes might need to be reflected in your client configurations. For example, if your clients are reliant on specific index or field names, you must ensure that their queries are updated accordingly. + +We recommend running production-like queries against the target cluster before switching over actual production traffic. This helps verify that the client can: + +- Communicate with the target cluster +- Locate the necessary indices and fields +- Retrieve the expected results + +For complex migrations involving multiple transformations or breaking changes, we highly recommend performing a trial migration with representative, non-production data (e.g., in a staging environment) to fully test client compatibility with the target cluster. + + + diff --git a/_migration-assistant/migration-phases/planning-your-migration/index.md b/_migration-assistant/migration-phases/planning-your-migration/index.md new file mode 100644 index 0000000000..4f0d264e93 --- /dev/null +++ b/_migration-assistant/migration-phases/planning-your-migration/index.md @@ -0,0 +1,15 @@ +--- +layout: default +title: Planning your migration +nav_order: 59 +parent: Migration phases +has_toc: false +has_children: true +--- + +# Planning your migration + +This section describes how to plan for your migration to OpenSearch by: + +- [Assessing your current cluster for migration]({{site.url}}{{site.baseurl}}/migration-assistant/migration-phases/planning-your-migration/assessing-your-cluster-for-migration/). +- [Verifying that you have the tools for migration]({{site.url}}{{site.baseurl}}/migration-assistant/migration-phases/planning-your-migration/verifying-migration-tools/). 
\ No newline at end of file diff --git a/_migration-assistant/migration-phases/planning-your-migration/verifying-migration-tools.md b/_migration-assistant/migration-phases/planning-your-migration/verifying-migration-tools.md new file mode 100644 index 0000000000..aa2c5466cd --- /dev/null +++ b/_migration-assistant/migration-phases/planning-your-migration/verifying-migration-tools.md @@ -0,0 +1,208 @@ +--- +layout: default +title: Verifying migration tools +nav_order: 70 +parent: Planning your migration +grand_parent: Migration phases +redirect_from: + - /migration-assistant/migration-phases/verifying-migration-tools/ +--- + +# Verifying migration tools + +Before using the Migration Assistant, take the following steps to verify that your cluster is ready for migration. + +## Verifying snapshot creation + +Verify that a snapshot can be created of your source cluster and used for metadata and backfill scenarios. + +### Installing the Elasticsearch S3 Repository plugin + +The snapshot needs to be stored in a location that Migration Assistant can access. This guide uses Amazon Simple Storage Service (Amazon S3). By default, Migration Assistant creates an S3 bucket for storage. Therefore, it is necessary to install the [Elasticsearch S3 repository plugin](https://www.elastic.co/guide/en/elasticsearch/plugins/7.10/repository-s3.html) on your source nodes (https://www.elastic.co/guide/en/elasticsearch/plugins/7.10/repository-s3.html). + +Additionally, make sure that the plugin has been configured with AWS credentials that allow it to read and write to Amazon S3. If your Elasticsearch cluster is running on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon Elastic Container Service (Amazon ECS) instances with an AWS Identity and Access Management (IAM) execution role, include the necessary S3 permissions. Alternatively, you can store the credentials in the [Elasticsearch keystore](https://www.elastic.co/guide/en/elasticsearch/plugins/7.10/repository-s3-client.html). + +### Verifying the S3 repository plugin configuration + +You can verify that the S3 repository plugin is configured correctly by creating a test snapshot. + +Create an S3 bucket for the snapshot using the following AWS Command Line Interface (AWS CLI) command: + +```shell +aws s3api create-bucket --bucket --region +``` +{% include copy.html %} + +Register a new S3 snapshot repository on your source cluster using the following cURL command: + +```shell +curl -X PUT "http://:9200/_snapshot/test_s3_repository" -H "Content-Type: application/json" -d '{ + "type": "s3", + "settings": { + "bucket": "", + "region": "" + } +}' +``` +{% include copy.html %} + +Next, create a test snapshot that captures only the cluster's metadata: + +```shell +curl -X PUT "http://:9200/_snapshot/test_s3_repository/test_snapshot_1" -H "Content-Type: application/json" -d '{ + "indices": "", + "ignore_unavailable": true, + "include_global_state": true +}' +``` +{% include copy.html %} + +Check the AWS Management Console to confirm that your bucket contains the snapshot. 
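+You can also confirm that the snapshot completed by querying the snapshot API on the source cluster. This is an optional check, and the `<source-cluster-endpoint>` placeholder is illustrative:
+
+```shell
+curl -X GET "http://<source-cluster-endpoint>:9200/_snapshot/test_s3_repository/test_snapshot_1?pretty"
+```
+{% include copy.html %}
+
+A successful response includes `test_snapshot_1` with a `state` of `SUCCESS`.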
+ +### Removing test snapshots after verification + +To remove the resources created during verification, you can use the following deletion commands: + +**Test snapshot** + +```shell +curl -X DELETE "http://:9200/_snapshot/test_s3_repository/test_snapshot_1?pretty" +``` +{% include copy.html %} + +**Test snapshot repository** + +```shell +curl -X DELETE "http://:9200/_snapshot/test_s3_repository?pretty" +``` +{% include copy.html %} + +**S3 bucket** + +```shell +aws s3 rm s3:// --recursive +aws s3api delete-bucket --bucket --region +``` +{% include copy.html %} + +### Troubleshooting + +Use this guidance to troubleshoot any of the following snapshot verification issues. + +#### Access denied error (403) + +If you encounter an error like `AccessDenied (Service: Amazon S3; Status Code: 403)`, verify the following: + +- The IAM role assigned to your Elasticsearch cluster has the necessary S3 permissions. +- The bucket name and AWS Region provided in the snapshot configuration match the actual S3 bucket you created. + +#### Older versions of Elasticsearch + +Older versions of the Elasticsearch S3 repository plugin may have trouble reading IAM role credentials embedded in Amazon EC2 and Amazon ECS instances. This is because the copy of the AWS SDK shipped with them is too old to read the new standard way of retrieving those credentials, as shown in [the Instance Metadata Service v2 (IMDSv2) specification](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html). This can result in snapshot creation failures, with an error message similar to the following: + +```json +{"error":{"root_cause":[{"type":"repository_verification_exception","reason":"[migration_assistant_repo] path [rfs-snapshot-repo] is not accessible on master node"}],"type":"repository_verification_exception","reason":"[migration_assistant_repo] path [rfs-snapshot-repo] is not accessible on master node","caused_by":{"type":"i_o_exception","reason":"Unable to upload object [rfs-snapshot-repo/tests-s8TvZ3CcRoO8bvyXcyV2Yg/master.dat] using a single upload","caused_by":{"type":"amazon_service_exception","reason":"Unauthorized (Service: null; Status Code: 401; Error Code: null; Request ID: null)"}}},"status":500} +``` + +If you encounter this issue, you can resolve it by temporarily enabling IMDSv1 on the instances in your source cluster for the duration of the snapshot. There is a toggle for this available in the AWS Management Console as well as in the AWS CLI. Switching this toggle will turn on the older access model and enable the Elasticsearch S3 repository plugin to work as normal. For more information about IMDSv1, see [Modify instance metadata options for existing instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-IMDS-existing-instances.html). + +## Switching over client traffic + +The Migration Assistant Application Load Balancer is deployed with a listener that shifts traffic between the source and target clusters through proxy services. The Application Load Balancer should start in **Source Passthrough** mode. + +### Verifying that the traffic switchover is complete + +Use the following steps to verify that the traffic switchover is complete: + +1. In the AWS Management Console, navigate to **EC2 > Load Balancers**. +2. Select the **MigrationAssistant ALB**. +3. Examine the listener on port `9200` and verify that 100% of the traffic is directed to the **Source Proxy**. +4. Navigate to the **Migration ECS Cluster** in the AWS Management Console. +5. 
Select the **Target Proxy Service**. +6. Verify that the desired count for the service is running: + * If the desired count is not met, update the service to increase it to at least 1 and wait for the service to start. +7. On the **Health and Metrics** tab under **Load balancer health**, verify that all targets are reporting as healthy: + * This confirms that the Application Load Balancer can connect to the target cluster through the target proxy. +8. (Reset) Update the desired count for the **Target Proxy Service** back to its original value in Amazon ECS. + +### Fixing unidentified traffic patterns + +When switching over traffic to the target cluster, you might encounter unidentified traffic patterns. To help identify the cause of these patterns, use the following steps: +* Verify that the target cluster allows traffic ingress from the **Target Proxy Security Group**. +* Navigate to **Target Proxy ECS Tasks** to investigate any failing tasks. +Set the **Filter desired status** to **Any desired status** to view all tasks, then navigate to the logs for any stopped tasks. + + +## Verifying replication + +Use the following steps to verify that replication is working once the traffic capture proxy is deployed: + + +1. Navigate to the **Migration ECS Cluster** in the AWS Management Console. +2. Navigate to **Capture Proxy Service**. +3. Verify that the capture proxy is running with the desired proxy count. If it is not, update the service to increase it to at least 1 and wait for startup. +4. Under **Health and Metrics** > **Load balancer health**, verify that all targets are healthy. This means that the Application Load Balancer is able to connect to the source cluster through the capture proxy. +5. Navigate to the **Migration Console Terminal**. +6. Run `console kafka describe-topic-records`. Wait 30 seconds for another Application Load Balancer health check. +7. Run `console kafka describe-topic-records` again and verify that the number of RECORDS increased between runs. +8. Run `console replay start` to start Traffic Replayer. +9. Run `tail -f /shared-logs-output/traffic-replayer-default/*/tuples/tuples.log | jq '.targetResponses[]."Status-Code"'` to confirm that the Kafka requests were sent to the target and that it responded as expected. If the responses don't appear: + * Check that the migration console can access the target cluster by running `./catIndices.sh`, which should show the indexes in the source and target. + * Confirm that messages are still being recorded to Kafka. + * Check for errors in the Traffic Replayer logs (`/migration/STAGE/default/traffic-replayer-default`) using CloudWatch. +10. (Reset) Update the desired count for the **Capture Proxy Service** back to its original value in Amazon ECS. + +### Troubleshooting + +Use this guidance to troubleshoot any of the following replication verification issues. + +### Health check responses with 401/403 status code + +If the source cluster is configured to require authentication, the capture proxy will not be able to verify replication beyond receiving a 401/403 status code for Application Load Balancer health checks. For more information, see [Failure Modes](https://github.com/opensearch-project/opensearch-migrations/blob/main/TrafficCapture/trafficCaptureProxyServer/README.md#failure-modes). + +### Traffic does not reach the source cluster + +Verify that the source cluster allows traffic ingress from the Capture Proxy Security Group. + +Look for failing tasks by navigating to **Traffic Capture Proxy ECS**. 
Change **Filter desired status** to **Any desired status** in order to see all tasks and navigate to the logs for stopped tasks. + + +## Resetting before migration + +After all verifications are complete, reset all resources before using Migration Assistant for an actual migration. + +The following steps outline how to reset resources with Migration Assistant before executing the actual migration. At this point all verifications are expected to have been completed. These steps can be performed after [Accessing the Migration Console]({{site.url}}{{site.baseurl}}/migration-assistant/migration-console/accessing-the-migration-console/). + +### Traffic Replayer + +To stop running Traffic Replayer, use the following command: + +```bash +console replay stop +``` +{% include copy.html %} + +### Kafka + +To clear all captured traffic from the Kafka topic, you can run the following command. + +This command will result in the loss of any traffic data captured by the capture proxy up to this point and thus should be used with caution. +{: .warning} + +```bash +console kafka delete-topic +``` +{% include copy.html %} + +### Target cluster + +To clear non-system indexes from the target cluster that may have been created as a result of testing, you can run the following command: + +This command will result in the loss of all data in the target cluster and should be used with caution. +{: .warning} + +```bash +console clusters clear-indices --cluster target +``` +{% include copy.html %} + diff --git a/_migrations/migration-phases/post-migration-cleanup/Migration-Infrastructure-Teardown.md b/_migration-assistant/migration-phases/removing-migration-infrastructure.md similarity index 55% rename from _migrations/migration-phases/post-migration-cleanup/Migration-Infrastructure-Teardown.md rename to _migration-assistant/migration-phases/removing-migration-infrastructure.md index 779dda81c5..656a8e1998 100644 --- a/_migrations/migration-phases/post-migration-cleanup/Migration-Infrastructure-Teardown.md +++ b/_migration-assistant/migration-phases/removing-migration-infrastructure.md @@ -1,16 +1,21 @@ +--- +layout: default +title: Removing migration infrastructure +nav_order: 120 +parent: Migration phases +--- +# Removing migration infrastructure +After a migration is complete all resources should be removed except for the target cluster, and optionally your Cloudwatch Logs, and Traffic Replayer logs. -After a migration is complete all resources should be removed except for the target cluster, and optionally your Cloudwatch Logs, and Replayer logs. - -## Remove Migration Assistant Infrastructure To remove all the CDK stack(s) which get created during a deployment you can execute a command similar to below within the CDK directory -``` +```bash cdk destroy "*" --c contextId= ``` +{% include copy.html %} Follow the instructions on the command-line to remove the deployed resources from the AWS account. -> [!Note] -> The AWS Console can also be used to verify, remove, and confirm resources for the Migration Assistant are no longer in the account. \ No newline at end of file +The AWS Management Console can also be used to remove Migration Assistant resources and confirm that they are no longer in the account. 
\ No newline at end of file diff --git a/_migrations/deploying-migration-assistant/configuration-options.md b/_migrations/deploying-migration-assistant/configuration-options.md deleted file mode 100644 index 5a77bd518a..0000000000 --- a/_migrations/deploying-migration-assistant/configuration-options.md +++ /dev/null @@ -1,234 +0,0 @@ ---- -layout: default -title: Configuration options -nav_order: 15 -parent: Deploying migration assistant ---- - -# Configuration options - -This page outlines the configuration options for three key migrations: -1. **Metadata Migration** -2. **Backfill Migration with Reindex-from-Snapshot (RFS)** -3. **Live Capture Migration with Capture and Replay (C&R)** - -Each of these migrations may depend on either a snapshot or a capture proxy. The CDK context blocks below are shown as separate context blocks for each migration type for simplicity. If performing multiple migration types, combine these options, as the actual execution of each migration is controlled from the Migration Console. - -It also has a section describing how to specify the auth details for the source and target cluster (no auth, basic auth with a username and password, or sigv4 auth). - -> [!TIP] -For a complete list of configuration options, please refer to the [opensearch-migrations options.md](https://github.com/opensearch-project/opensearch-migrations/blob/main/deployment/cdk/opensearch-service-migration/options.md) but please open an issue for consultation if changing an option that is not listed on this page. - -Options for the source cluster endpoint, target cluster endpoint, and existing VPC should be configured for the Migration tools to function effectively. - - -## Metadata Migration Options - -## Sample Metadata Migration CDK Options - -```json -{ - "metadata-migration": { - "stage": "dev", - "vpcId": , - "sourceCluster": { - "endpoint": , - "version": "ES 7.10", - "auth": {"type": "none"} - }, - "targetCluster": { - "endpoint": , - "auth": { - "type": "basic", - "username": , - "passwordFromSecretArn": - } - }, - "reindexFromSnapshotServiceEnabled": true, - "artifactBucketRemovalPolicy": "DESTROY" - } -} -``` - -There are currently no CDK options specific to Metadata migrations, which are performed from the Migration Console. This migration requires an existing snapshot, which can be created from the Migration Console. - -
-Shared configuration options table - - -| Name | Example | Description | -|-----------------------|-----------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `sourceClusterEndpoint` | `"https://source-cluster.elb.us-east-1.endpoint.com"` | The endpoint for the source cluster. | -| `targetClusterEndpoint` | `"https://vpc-demo-opensearch-cluster-cv6hggdb66ybpk4kxssqt6zdhu.us-west-2.es.amazonaws.com:443"` | The endpoint for the target cluster. Required if using an existing target cluster for the migration instead of creating a new one. | -| `vpcId` | `"vpc-123456789abcdefgh"` | The ID of the existing VPC where the migration resources will be placed. The VPC must have at least two private subnets that span two availability zones. | - -
- -## Backfill Migration with Reindex-from-Snapshot (RFS) Options - -### Sample Backfill Migration CDK Options - -```json -{ - "backfill-migration": { - "stage": "dev", - "vpcId": , - "sourceCluster": { - "endpoint": , - "version": "ES 7.10", - "auth": {"type": "none"} - }, - "targetCluster": { - "endpoint": , - "auth": { - "type": "basic", - "username": , - "passwordFromSecretArn": - } - }, - "reindexFromSnapshotServiceEnabled": true, - "reindexFromSnapshotExtraArgs": "", - "artifactBucketRemovalPolicy": "DESTROY" - } -} -``` - -Performing a Reindex-from-Snapshot backfill migration requires an existing snapshot. The CDK options specific to backfill migrations are listed below. To view all available arguments for `reindexFromSnapshotExtraArgs`, see [here](https://github.com/opensearch-project/opensearch-migrations/blob/main/DocumentsFromSnapshotMigration/README.md#arguments). At a minimum, no extra arguments may be needed. - -
-Backfill specific configuration options table - - -| Name | Example | Description | -|---------------------------------|-----------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `reindexFromSnapshotServiceEnabled` | `true` | Enables deploying and configuring the RFS ECS service. | -| `reindexFromSnapshotExtraArgs` | `"--target-aws-region us-east-1 --target-aws-service-signing-name es"` | Extra arguments for the Document Migration command, with space separation. See the [RFS Extra Arguments](https://github.com/opensearch-project/opensearch-migrations/blob/main/DocumentsFromSnapshotMigration/README.md#arguments) for more details. You can pass `--no-insecure` to remove the `--insecure` flag. | - -
- -## Live Capture Migration with Capture and Replay (C&R) Options - -### Sample Live Capture Migration CDK Options - -```json -{ - "live-capture-migration": { - "stage": "dev", - "vpcId": , - "sourceCluster": { - "endpoint": , - "version": "ES 7.10", - "auth": {"type": "none"} - }, - "targetCluster": { - "endpoint": , - "auth": { - "type": "basic", - "username": , - "passwordFromSecretArn": - } - }, - "captureProxyServiceEnabled": true, - "captureProxyExtraArgs": "", - "trafficReplayerServiceEnabled": true, - "trafficReplayerExtraArgs": "", - "artifactBucketRemovalPolicy": "DESTROY" - } -} -``` - -Performing a live capture migration requires that a Capture Proxy be configured to capture incoming traffic and send it to the target cluster via the Traffic Replayer service. For arguments available in `captureProxyExtraArgs`, refer to the `@Parameter` fields [here](https://github.com/opensearch-project/opensearch-migrations/blob/main/TrafficCapture/trafficCaptureProxyServer/src/main/java/org/opensearch/migrations/trafficcapture/proxyserver/CaptureProxy.java). For `trafficReplayerExtraArgs`, refer to the `@Parameter` fields [here](https://github.com/opensearch-project/opensearch-migrations/blob/main/TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/TrafficReplayer.java). At a minimum, no extra arguments may be needed. - -
-Capture and Replay specific configuration options table - - -| Name | Example | Description | -|--------------------------------|----------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `captureProxyServiceEnabled` | `true` | Enables the Capture Proxy service deployment via a new CloudFormation stack. | -| `captureProxyExtraArgs` | `"--suppressCaptureForHeaderMatch user-agent .*elastic-java/7.17.0.*"` | Extra arguments for the Capture Proxy command, including options specified by the [Capture Proxy](https://github.com/opensearch-project/opensearch-migrations/blob/main/TrafficCapture/trafficCaptureProxyServer/src/main/java/org/opensearch/migrations/trafficcapture/proxyserver/CaptureProxy.java). | -| `trafficReplayerServiceEnabled` | `true` | Enables the Traffic Replayer service deployment via a new CloudFormation stack. | -| `trafficReplayerExtraArgs` | `"--sigv4-auth-header-service-region es,us-east-1 --speedup-factor 5"` | Extra arguments for the Traffic Replayer command, including options for auth headers and other parameters specified by the [Traffic Replayer](https://github.com/opensearch-project/opensearch-migrations/blob/main/TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/TrafficReplayer.java). | - -
- -## Cluster Authentication Options - -Both the source and target cluster can use no authentication (e.g. limited to the VPC), basic authentication with a username and password, or SigV4 scoped to a user or role. - -Examples of each of these are below. - -No auth: -``` - "sourceCluster": { - "endpoint": , - "version": "ES 7.10", - "auth": {"type": "none"} - } -``` - -Basic auth: -``` - "sourceCluster": { - "endpoint": , - "version": "ES 7.10", - "auth": { - "type": "basic", - "username": , - "passwordFromSecretArn": - } - } -``` - -SigV4 auth: -``` - "sourceCluster": { - "endpoint": , - "version": "ES 7.10", - "auth": { - "type": "sigv4", - "region": "us-east-1", - "serviceSigningName": "es" - } - } -``` - -The `serviceSigningName` can be `es` for an Elasticsearch or OpenSearch domain, or `aoss` for an OpenSearch Serverless collection. - -All of these auth mechanisms apply to both source and target clusters. - -## Troubleshooting - -### Restricted Permissions -When deploying if part of an [AWS Organization](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_introduction.html) ↗ some permissions / resources might not be allowed. The full list can be generated from the synthesized cdk output with the awsFeatureUsage script. - -``` -/opensearch-migrations/deployment/cdk/opensearch-service-migration/awsFeatureUsage.sh [contextId] -``` - -
-Capture and Replay specific configuration options table - - -```shell -$ /opensearch-migrations/deployment/cdk/opensearch-service-migration/awsFeatureUsage.sh default -Synthesizing all stacks... -Synthesizing stack: networkStack-default -Synthesizing stack: migrationInfraStack -Synthesizing stack: reindexFromSnapshotStack -Synthesizing stack: migration-console -Finding resource usage from synthesized stacks... ------------------------------------ -IAM Policy Actions: -cloudwatch:GetMetricData -... ------------------------------------ -Resources Types: -AWS::CDK::Metadata -... -``` -
- - -### Network Configuration -The migration tooling expects the source cluster, target cluster, and migration resources to exist in the same VPC. If this is not the case, manual networking setup outside of this documentation is likely required. diff --git a/_migrations/deploying-migration-assistant/index.md b/_migrations/deploying-migration-assistant/index.md deleted file mode 100644 index cbe721dd12..0000000000 --- a/_migrations/deploying-migration-assistant/index.md +++ /dev/null @@ -1,5 +0,0 @@ ---- -layout: default -title: Deploying migration assistant -nav_order: 10 ---- \ No newline at end of file diff --git a/_migrations/migration-console/accessing-the-migration-console.md b/_migrations/migration-console/accessing-the-migration-console.md deleted file mode 100644 index 9c09515ab6..0000000000 --- a/_migrations/migration-console/accessing-the-migration-console.md +++ /dev/null @@ -1,41 +0,0 @@ - - - -The Migrations Assistant deployment includes an ECS task that hosts tools to run different phases of the migration and check the progress or results of the migration. - -## SSH into the Migration Console -Following the AWS Solutions deployment, the bootstrap box contains a script that simplifies access to the migration console through that instance. - -To access the Migration Console, use the following commands: - -```shell -export STAGE=dev -export AWS_REGION=us-west-2 -/opensearch-migrations/deployment/cdk/opensearch-service-migration/accessContainer.sh migration-console ${STAGE} ${AWS_REGION} -``` - -When opening the console a message will appear above the command prompt, `Welcome to the Migration Assistant Console`. - -
- - -SSH from any machine into Migration Console - - -On a machine with the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) ↗ and the [AWS Session Manager Plugin](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html) ↗, you can directly connect to the migration console. Ensure you've run `aws configure` with credentials that have access to the environment. - -Use the following commands: - -```shell -export STAGE=dev -export SERVICE_NAME=migration-console -export TASK_ARN=$(aws ecs list-tasks --cluster migration-${STAGE}-ecs-cluster --family "migration-${STAGE}-${SERVICE_NAME}" | jq --raw-output '.taskArns[0]') -aws ecs execute-command --cluster "migration-${STAGE}-ecs-cluster" --task "${TASK_ARN}" --container "${SERVICE_NAME}" --interactive --command "/bin/bash" -``` -
- -## Troubleshooting - -### Deployment Stage - -Typically, `STAGE` is `dev`, but this may vary based on what the user specified during deployment. \ No newline at end of file diff --git a/_migrations/migration-console/index.md b/_migrations/migration-console/index.md deleted file mode 100644 index 78a8011b57..0000000000 --- a/_migrations/migration-console/index.md +++ /dev/null @@ -1,6 +0,0 @@ ---- -layout: default -title: Migration console -nav_order: 30 -has_children: true ---- \ No newline at end of file diff --git a/_migrations/migration-console/migration-console-commands-references.md b/_migrations/migration-console/migration-console-commands-references.md deleted file mode 100644 index 8b906ff9b2..0000000000 --- a/_migrations/migration-console/migration-console-commands-references.md +++ /dev/null @@ -1,101 +0,0 @@ - - - -The Migration Assistant Console is a command line interface to interact with the deployed components of the solution. - -The commands are in the form of `console [component] [action]`. The components include `clusters`, `backfill` (e.g the Reindex from Snapshot service), `snapshot`, `metadata`, `replay`, etc. The console is configured with a registry of the deployed services and the source and target cluster, generated from the `cdk.context.json` values. - -## Commonly Used Commands - -The exact commands used will depend heavily on use-case and goals, but the following are a series of common commands with a quick description of what they do. - -```sh -console clusters connection-check -``` -Reports whether both the source and target clusters can be reached and their versions. - - -```sh -console clusters cat-indices -``` -Runs the `_cat/indices` command on each cluster and prints the results. - -*** - -```sh -console snapshot create -``` -Initiates creating a snapshot on the source cluster, into a pre-configured S3 bucket. - -```sh -console snapshot status --deep-check -``` -Runs a detailed check on the status of the snapshot creation, including estimated completion time. - -*** - -```sh -console metadata evaluate -``` -Perform a dry run of metadata migration, showing which indices, templates, and other objects will be migrated to the target cluster. - -```sh -console metadata migrate -``` -Perform an actual metadata migration. - -*** - -```sh -console backfill start -``` -If the Reindex From Snapshot service is enabled, start an instance of the service to begin moving documents to the target cluster. - -There are similar `scale UNITS` and `stop` commands to change the number of active instances for RFS. - -```sh -console backfill status --deep-check -``` -See the current status of the backfill migration, with the number of instances operating and the progress of the shards. - -*** - -```sh -console replay start -``` -If the Traffic Replayer service is enabled, start an instance of the service to begin replaying traffic against the target cluster. -The `stop` command stops all active instances. - -*** - -```sh -console tuples show --in /shared-logs-output/traffic-replayer-default/[NODE_ID]/tuples/console.log | jq > readable_tuples.json -``` -Use tab completion on the path to fill in the available node ids and, if applicable, log file names. The tuples logs roll over at a certain size threshold, so there may be many files named with timestamps. The `jq` command pretty-prints each line of the tuple output before writing it to file. 
- - -## Command Reference -All commands and options can be explored within the tool itself by using the `--help` option, either for the entire `console` application or for individual components (e.g. `console backfill --help`). The console also has command autocomplete set up to assist with usage. - -``` -$ console --help -Usage: console [OPTIONS] COMMAND [ARGS]... - -Options: - --config-file TEXT Path to config file - --json - -v, --verbose Verbosity level. Default is warn, -v is info, -vv is - debug. - --help Show this message and exit. - -Commands: - backfill Commands related to controlling the configured backfill... - clusters Commands to interact with source and target clusters - completion Generate shell completion script and instructions for setup. - kafka All actions related to Kafka operations - metadata Commands related to migrating metadata to the target cluster. - metrics Commands related to checking metrics emitted by the capture... - replay Commands related to controlling the replayer. - snapshot Commands to create and check status of snapshots of the... - tuples All commands related to tuples. -``` \ No newline at end of file diff --git a/_migrations/migration-phases/assessment/index.md b/_migrations/migration-phases/assessment/index.md deleted file mode 100644 index feda45c3f7..0000000000 --- a/_migrations/migration-phases/assessment/index.md +++ /dev/null @@ -1,7 +0,0 @@ ---- -layout: default -title: Assessing your cluster -nav_order: 60 -has_children: true -parent: Migration phases ---- \ No newline at end of file diff --git a/_migrations/migration-phases/assessment/required-client-changes.md b/_migrations/migration-phases/assessment/required-client-changes.md deleted file mode 100644 index 0cce4624bf..0000000000 --- a/_migrations/migration-phases/assessment/required-client-changes.md +++ /dev/null @@ -1,41 +0,0 @@ - - - -The goal of the Migration Assistant is to streamline the process of migrating from one location or version of Elasticsearch/OpenSearch to another. However, completing a migration sometimes requires resolving client compatibility issues before they can communicate directly with the target cluster. - -It's crucial to understand and plan for any necessary changes before beginning the migration process. The previous page on [[breaking changes between versions|Understanding-breaking-changes]] is a useful resource for identifying potential issues. - -## Data Transformations and Client Impact - -Any time you apply a transformation to your data, such as: - -- Changing index names -- Modifying field names or field mappings -- Splitting indices with type mappings - -These changes may need to be reflected in your client configurations. For example, if your clients are reliant on specific index or field names, you must ensure that their queries are updated accordingly. - -We recommend running production-like queries against the target cluster before switching over actual production traffic. This helps verify that the client can: - -- Communicate with the target cluster -- Locate the necessary indices and fields -- Retrieve the expected results - -For complex migrations involving multiple transformations or breaking changes, we highly recommend performing a trial migration with representative, non-production data (e.g., in a staging environment) to fully test client compatibility with the target cluster. 
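One lightweight way to run such a check is to issue a representative query directly against the target cluster and compare the results with the source. The endpoint, credentials, index, and field below are placeholders rather than values from this guide.

```shell
# Run a production-like query against the target cluster (placeholders throughout)
curl -s -u "<username>:<password>" \
  -X GET "https://<target-cluster-endpoint>:9200/<your-index>/_search?pretty" \
  -H 'Content-Type: application/json' \
  -d '{
        "size": 5,
        "query": { "match": { "<your-field>": "<representative value>" } }
      }'
```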
- -## Troubleshooting - -### Migrating from Elasticsearch (Post-Fork) to OpenSearch - -Migrating from post-fork Elasticsearch (7.10.2+) to OpenSearch presents additional challenges because some Elasticsearch clients include license or version checks that can artificially break compatibility. - -No post-fork Elasticsearch clients are fully compatible with OpenSearch 2.x. We recommend switching to the latest version of the [OpenSearch Clients](https://opensearch.org/docs/latest/clients/) ↗. - -### Inspecting the tuple output - -The Replayer outputs tuples that show the exact requests being sent to both the source and target clusters. Examining these tuples can help you identify any transformations between requests, allowing you to ensure that these changes are reflected in your client code. See [[In-flight Validation]] for details. - -### Related Links - -For more information about OpenSearch clients, refer to the official documentation: - diff --git a/_migrations/migration-phases/assessment/understanding-breaking-changes.md deleted file mode 100644 index 73eaee6e66..0000000000 --- a/_migrations/migration-phases/assessment/understanding-breaking-changes.md +++ /dev/null @@ -1,16 +0,0 @@ - - - -Before performing any upgrade or migration, you should review any documentation of breaking changes. Even if the cluster is migrated, there might be changes required for clients to connect to the new cluster. - -## Upgrade and breaking changes guides - -For migration paths between Elasticsearch 6.8 and OpenSearch 2.x, users should be familiar with the documentation in the links below that applies to their specific case: - -* [Upgrading Amazon OpenSearch Service Domains](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/version-migration.html) ↗ - -* [Changes from Elasticsearch to OpenSearch fork](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/rename.html) ↗ - -* [OpenSearch Breaking Changes](https://opensearch.org/docs/latest/breaking-changes/) ↗ - -The next step is to set up a proper test bed to verify that your applications will work as expected on the target version. diff --git a/_migrations/migration-phases/backfill/backfill-execution.md deleted file mode 100644 index 828317bada..0000000000 --- a/_migrations/migration-phases/backfill/backfill-execution.md +++ /dev/null @@ -1,94 +0,0 @@ - - - -After the [[Metadata Migration]] has been completed, begin the backfill of documents from the snapshot of the source cluster. - -## Document Migration -Once started, a fleet of workers will spin up to read the snapshot and reindex documents on the target cluster. This fleet of workers can be scaled to increase the speed at which documents are reindexed onto the target cluster. - -### Check the starting state of the clusters - -You can see the indices and rough document counts of the source and target cluster by running the cat-indices command. This can be used to monitor the difference between the source and target for any migration scenario. Check the indices of both clusters with the following command: - -```shell -console clusters cat-indices -``` - -
- -Example cat-indices command output - - -```shell -SOURCE CLUSTER -health status index uuid pri rep docs.count docs.deleted store.size pri.store.size -green open my-index WJPVdHNyQ1KMKol84Cy72Q 1 0 8 0 44.7kb 44.7kb - -TARGET CLUSTER -health status index uuid pri rep docs.count docs.deleted store.size pri.store.size -green open .opendistro_security N3uy88FGT9eAO7FTbLqqqA 1 0 10 0 78.3kb 78.3kb -``` -
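For clusters with many indices, the side-by-side comparison can be summarized instead of read line by line. The following one-liner is only a sketch and assumes the column layout shown in the example output above, with `docs.count` in the seventh column.

```shell
# Total docs.count per cluster section of the cat-indices output
console clusters cat-indices > /tmp/cat-indices.txt
awk '/CLUSTER/ { cluster = $1 }
     NF > 6 && $1 != "health" { total[cluster] += $7 }
     END { for (c in total) print c, "total docs:", total[c] }' /tmp/cat-indices.txt
```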
- -### Start the backfill - -Start the backfill by running the following command, which creates a fleet with a single worker: - -```shell -console backfill start -``` - -### Monitor the status - -You can use the status check command to see more detail about things like the number of shards completed, in progress, remaining, and the overall status of the operation: - -```shell -console backfill status --deep-check -``` - -
- -Example status output - - -``` -BackfillStatus.RUNNING -Running=1 -Pending=0 -Desired=1 -Shards total: 48 -Shards completed: 48 -Shards incomplete: 0 -Shards in progress: 0 -Shards unclaimed: 0 -``` -
- ->[!Note] -> The status will be "RUNNING" even if all the shards have been migrated. - -### Scale up the fleet - -To speed up the transfer, you can scale the number of workers. It may take a few minutes for these additional workers to come online. The following command will update the worker fleet to a size of five: - -```shell -console backfill scale 5 -``` - -### Stopping the migration -Backfill requires manually stopping the fleet. Once you have confirmed by checking the status that all the data has been migrated, you can spin down the fleet and all its workers with the following command: - -```shell -console backfill stop -``` - - -## Troubleshooting - -### How to scale the fleet - -It is recommended to scale up the fleet slowly while monitoring the health metrics of the Target Cluster to avoid over-saturating it. Amazon OpenSearch Service domains provide a number of metrics and logs that can provide this insight; refer to [the official documentation on the subject](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/monitoring.html) ↗. The AWS Console for Amazon OpenSearch Service surfaces details that can be useful for this as well. - -## Related Links - -- [Technical details for RFS](https://github.com/opensearch-project/opensearch-migrations/blob/main/RFS/docs/DESIGN.md) diff --git a/_migrations/migration-phases/backfill/backfill-result-validation.md deleted file mode 100644 index e6cbd97482..0000000000 --- a/_migrations/migration-phases/backfill/backfill-result-validation.md +++ /dev/null @@ -1,37 +0,0 @@ - - - -After the backfill has been completed and the fleet has been stopped, validate the results on the target cluster. - -## Refresh the target cluster - -Before examining the contents of the target cluster, it is recommended to run a `_refresh` and `_flush` on the target cluster. This will help ensure that the report and metrics of the cluster will be accurately portrayed. - -## Validate documents on target cluster -You can check the contents of the Target Cluster after the migration using the Console CLI: - -``` -console clusters cat-indices --refresh -``` -Example cat-indices command output - -```shell -SOURCE CLUSTER -health status index uuid pri rep docs.count docs.deleted store.size pri.store.size -green open my-index -DqPQDrATw25hhe5Ss34bQ 1 0 3 0 12.7kb 12.7kb - -TARGET CLUSTER -health status index uuid pri rep docs.count docs.deleted store.size pri.store.size -green open .opensearch-observability 8HOComzdSlSWCwqWIOGRbQ 1 1 0 0 416b 208b -green open .plugins-ml-config 9tld-PCJToSUsMiyDhlyhQ 5 1 1 0 9.5kb 4.7kb -green open my-index bGfGtYoeSU6U6p8leR5NAQ 1 0 3 0 5.5kb 5.5kb -green open .migrations_working_state lopd47ReQ9OEhw4ZuJGZOg 1 1 2 0 18.6kb 6.4kb -green open .kibana_1 -``` - -This will display the number of documents on each of the indices on the Target Cluster. It is further recommended to run some queries against the Target Cluster that mimic your production workflow and closely examine the results returned.
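The `_refresh` and `_flush` mentioned above can be issued directly against the target cluster using the standard index APIs linked in the next section. The endpoint and credentials below are placeholders.

```shell
# Make recently indexed documents searchable and persist in-memory operations (placeholders throughout)
curl -s -u "<username>:<password>" -X POST "https://<target-cluster-endpoint>:9200/_refresh"
curl -s -u "<username>:<password>" -X POST "https://<target-cluster-endpoint>:9200/_flush"
```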
- -## Related Links - -- [Refresh API](https://opensearch.org/docs/latest/api-reference/index-apis/refresh/) ↗ -- [Flush API](https://opensearch.org/docs/latest/api-reference/index-apis/flush/) ↗ diff --git a/_migrations/migration-phases/backfill/capture-proxy-data-replication.md b/_migrations/migration-phases/backfill/capture-proxy-data-replication.md deleted file mode 100644 index 2ed3b1f4c4..0000000000 --- a/_migrations/migration-phases/backfill/capture-proxy-data-replication.md +++ /dev/null @@ -1,45 +0,0 @@ - - - -The Migration Assistant includes an Application Load Balancer (ALB) for routing traffic to the capture proxy and/or target. Upstream client traffic must be routed through the capture proxy in order to replay the requests later. - -## Assumptions - -* The upstream layer from the ALB is compatible with the certificate on the ALB listener (whether it’s clients or a Network Load Balancer, NLB). - * The `albAcmCertArn` in the `cdk.context.json` may need to be provided to ensure that clients trust the ALB certificate. -* If an NLB is used directly upstream of the ALB, it must use a TLS listener. -* Upstream resources and security groups must allow network access to the Migration Assistant ALB. - -## Steps - -1. In the AWS Console, navigate to **EC2 > Load Balancers > Migration Assistant ALB**. -2. Note down the ALB URL. -3. If you are using **NLB → ALB → Cluster**: - 1. Ensure ingress is provided directly to the ALB for the capture proxy. - 2. Create a target group for the Migration Assistant ALB on port 9200, and set the health check to HTTPS. - 3. Associate this target group with your existing NLB on a new listener (for testing). - 4. Verify the health check is successful, and perform smoke testing with some clients through the new listener port. - 5. Once ready to migrate all clients, detach the Migration Assistant ALB target group from the testing NLB listener and modify the existing NLB listener to direct traffic to this target group. - 6. Now, client requests will be routed through the proxy (once they establish a new connection). Verify application metrics. -4. If you are using **NLB → Cluster**: - 1. If you do not wish to modify application logic, add an ALB in front of your cluster and follow the **NLB → ALB → Cluster** steps. Otherwise: - 2. Create a target group for the ALB on port 9200 and set the health check to HTTPS. - 3. Associate this target group with your existing NLB on a new listener. - 4. Verify the health check is successful, and perform smoke testing with some clients through the new listener port. - 5. Once ready to migrate all clients, deploy a change so that clients hit the new listener. -5. If you are **not using an NLB**: - 1. Make a client/DNS change to route clients to the Migration Assistant ALB on port 9200. -6. In the Migration Console, execute the following command: - ```shell - console kafka describe-topic-records - ``` - Note the records in the logging topic. -7. After a short period, execute the same command again and compare the increase in records against the expected HTTP requests. - -### Troubleshooting - -* Investigate the ALB listener security policy, security groups, ALB certificates, and the proxy's connection to Kafka. 
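When verifying the health checks called out in the steps above, the AWS CLI can be used instead of the console. The target group ARN below is a placeholder.

```shell
# Check target health for the new port 9200 target group (ARN is a placeholder)
aws elbv2 describe-target-health \
  --target-group-arn "arn:aws:elasticloadbalancing:<region>:<account-id>:targetgroup/<name>/<id>" \
  --query 'TargetHealthDescriptions[].{Target:Target.Id,State:TargetHealth.State,Reason:TargetHealth.Reason}' \
  --output table
```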
- -### Related Links - -- [Migration Console ALB Documentation](https://github.com/opensearch-project/opensearch-migrations/blob/main/docs/ClientTrafficSwinging.md) \ No newline at end of file diff --git a/_migrations/migration-phases/client-traffic-switchover/Switching-Traffic-from-Source-to-Target.md deleted file mode 100644 index ea7182989a..0000000000 --- a/_migrations/migration-phases/client-traffic-switchover/Switching-Traffic-from-Source-to-Target.md +++ /dev/null @@ -1,37 +0,0 @@ - - -After the source and target clusters are in sync, traffic needs to be switched to the target cluster so the source cluster can be taken offline. - -## Assumptions -- All client traffic is being routed through the switchover listener on the MigrationAssistant ALB -- Client traffic has been verified to be compatible with Target Cluster -- Target cluster is in a good state to accept client traffic (i.e. backfill/replay is complete as needed) -- Target Proxy Service is deployed - -## Switch Traffic to the Target Cluster -1. Within the AWS Console, navigate to ECS > Migration Assistant Cluster -1. Note down the desired count of the Capture Proxy (it should be > 1) -1. Update the ECS Service of the Target Proxy to be at least as large as the Traffic Capture Proxy -1. Wait for tasks to start up, and verify all targets are healthy within Target Proxy Service "Load balancer health" -1. Within the AWS Console, navigate to EC2 > Load Balancers > Migration Assistant ALB -1. Navigate to ALB Metrics and examine any information which may be useful - 1. Specifically look at Active Connection Count and New Connection Count and note if there's a big discrepancy; this can indicate reused connections, which affect how traffic will switch over. Once an ALB is re-routed, existing connections will still be routed to the capture proxy until the client/source cluster terminates those. -1. Navigate to the Capture Proxy Target Group (ALBSourceProxy--TG) > Monitoring -1. Examine the Requests, Target (2XX, 3XX, 4XX, 5XX), and Target Response Time metrics - 1. Verify that this looks as expected and includes all traffic expected to be included in the switchover - 1. Note details that would help identify anomalies during the switchover, including expected response time and response code rate. -1. Navigate back to ALB and click on Target Proxy Target Group (ALBTargetProxy--TG) -1. Verify all expected targets are healthy and none are in draining state -1. Navigate back to ALB and to the Listener on port 9200 -1. Click on the Default rule and Edit -1. Modify the weights of the targets to shift desired traffic over to the target proxy - 1. To perform a full switchover, modify the weight to 1 on Target Proxy and 0 on Source Proxy -1. Click Save Changes -1. Navigate to both SourceProxy and TargetProxy TG Monitoring metrics and verify traffic is shifting over as expected - 1. If connections are being reused by clients, perform any actions needed to terminate those connections so that the clients shift over. - 1.
Monitor until SourceProxy TG shows 0 requests once it is known that all clients have switched over - -## Troubleshooting - -### Fallback -If you need to fall back, revert the Default rule to have the ALB route to the SourceProxy Target Group \ No newline at end of file diff --git a/_migrations/migration-phases/client-traffic-switchover/index.md deleted file mode 100644 index 1a381a0ef6..0000000000 --- a/_migrations/migration-phases/client-traffic-switchover/index.md +++ /dev/null @@ -1,7 +0,0 @@ ---- -layout: default -title: Client traffic switchover -nav_order: 110 -has_children: true -parent: Migration phases ---- \ No newline at end of file diff --git a/_migrations/migration-phases/index.md deleted file mode 100644 index 95cd93b495..0000000000 --- a/_migrations/migration-phases/index.md +++ /dev/null @@ -1,6 +0,0 @@ ---- -layout: default -title: Migration phases -nav_order: 50 -has_children: true ---- \ No newline at end of file diff --git a/_migrations/migration-phases/metadata/Snapshot-Creation.md deleted file mode 100644 index 683b097d5e..0000000000 --- a/_migrations/migration-phases/metadata/Snapshot-Creation.md +++ /dev/null @@ -1,56 +0,0 @@ ---- -layout: default -title: Snapshot creation -nav_order: 90 -parent: Metadata -grand_parent: Migration phases --- - - -# Snapshot creation - -Creating a snapshot of the source cluster captures all of the metadata and documents to be migrated onto a new target cluster. - -## Limitations - -Incremental or "delta" snapshots are not yet supported. For more information, refer to the [tracking issue MIGRATIONS-1624](https://opensearch.atlassian.net/browse/MIGRATIONS-1624). A single snapshot must be used for a backfill. - -## Snapshot Creation from the Console - -Create the initial snapshot on the source cluster with the following command: - -```shell -console snapshot create -``` - -To check the progress of the snapshot in real-time, use the following command: - -```shell -console snapshot status --deep-check -``` - -
-Example Output When a Snapshot is Completed - -```shell -console snapshot status --deep-check - -SUCCESS -Snapshot is SUCCESS. -Percent completed: 100.00% -Data GiB done: 29.211/29.211 -Total shards: 40 -Successful shards: 40 -Failed shards: 0 -Start time: 2024-07-22 18:21:42 -Duration: 0h 13m 4s -Anticipated duration remaining: 0h 0m 0s -Throughput: 38.13 MiB/sec -``` -
- -## Troubleshooting - -### Slow Snapshot Speed - -Depending on the size of the data on the source cluster and the bandwidth allocated for snapshots, the process can take some time. Adjust the maximum rate at which the source cluster's nodes create the snapshot using the `--max-snapshot-rate-mb-per-node` option. Increasing the snapshot rate will consume more node resources, which may affect the cluster's ability to handle normal traffic. If not specified, the default rate for the source cluster's version will be used. For more details, refer to the [Elasticsearch 7.10 snapshot documentation](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/put-snapshot-repo-api.html#put-snapshot-repo-api-request-body) ↗. \ No newline at end of file diff --git a/_migrations/migration-phases/metadata/index.md b/_migrations/migration-phases/metadata/index.md deleted file mode 100644 index ac5618ffd4..0000000000 --- a/_migrations/migration-phases/metadata/index.md +++ /dev/null @@ -1,7 +0,0 @@ ---- -layout: default -title: Metadata -nav_order: 80 -has_children: true -parent: Migration phases ---- \ No newline at end of file diff --git a/_migrations/migration-phases/post-migration-cleanup/index.md b/_migrations/migration-phases/post-migration-cleanup/index.md deleted file mode 100644 index 3f76d3ed34..0000000000 --- a/_migrations/migration-phases/post-migration-cleanup/index.md +++ /dev/null @@ -1,7 +0,0 @@ ---- -layout: default -title: Migration infrastructure teardown -nav_order: 120 -has_children: true -parent: Migration phases ---- \ No newline at end of file diff --git a/_migrations/migration-phases/replay-activation-and-validation/In-flight-Validation.md b/_migrations/migration-phases/replay-activation-and-validation/In-flight-Validation.md deleted file mode 100644 index b2003700f7..0000000000 --- a/_migrations/migration-phases/replay-activation-and-validation/In-flight-Validation.md +++ /dev/null @@ -1,152 +0,0 @@ - - - -The Replayer is a long-running process that makes requests to a target cluster to maintain synchronization with the source cluster and enable users to compare the performance between the two clusters. There are two primary ways to assess how the target requests are being handled: through logs and metrics. - -## Result Logs - -HTTP transactions from the source capture and those resent to the target cluster are logged in files located at `/shared-logs-output/traffic-replayer-default/*/tuples/tuples.log`. The `/shared-logs-output` directory is shared across containers, including the migration console. Users can access these files from the migration console using the same path. Previous runs are also available in gzipped form. - -Each log entry is a newline-delimited JSON object, which contains details of the source and target requests/responses along with other transaction details, such as response times. - -> **Note:** These logs contain the contents of all requests, including Authorization headers and the contents of all HTTP messages. Ensure that access to the migration environment is restricted as these logs serve as a source of truth for determining what happened on both the source and target clusters. Response times for the source refer to the time between the proxy sending the end of a request and receiving the response. While response times for the target are recorded in the same manner, keep in mind that the locations of the capture proxy, replayer, and target may differ, and these logs do not account for the client's location. - -
- -Example Log Entry - - -Below is an example log entry for a `/_cat/indices?v` request sent to both the source and target clusters: - -```json -{ - "sourceRequest": { - "Request-URI": "/_cat/indices?v", - "Method": "GET", - "HTTP-Version": "HTTP/1.1", - "Host": "capture-proxy:9200", - "Authorization": "Basic YWRtaW46YWRtaW4=", - "User-Agent": "curl/8.5.0", - "Accept": "*/*", - "body": "" - }, - "sourceResponse": { - "HTTP-Version": {"keepAliveDefault": true}, - "Status-Code": 200, - "Reason-Phrase": "OK", - "response_time_ms": 59, - "content-type": "text/plain; charset=UTF-8", - "content-length": "214", - "body": "aGVhbHRoIHN0YXR1cyBpbmRleCAgICAgICB..." - }, - "targetRequest": { - "Request-URI": "/_cat/indices?v", - "Method": "GET", - "HTTP-Version": "HTTP/1.1", - "Host": "opensearchtarget", - "Authorization": "Basic YWRtaW46bXlTdHJvbmdQYXNzd29yZDEyMyE=", - "User-Agent": "curl/8.5.0", - "Accept": "*/*", - "body": "" - }, - "targetResponses": [{ - "HTTP-Version": {"keepAliveDefault": true}, - "Status-Code": 200, - "Reason-Phrase": "OK", - "response_time_ms": 721, - "content-type": "text/plain; charset=UTF-8", - "content-length": "484", - "body": "aGVhbHRoIHN0YXR1cyBpbmRleCAgICAgICB..." - }], - "connectionId": "0242acfffe13000a-0000000a-00000005-1eb087a9beb83f3e-a32794b4.0", - "numRequests": 1, - "numErrors": 0 -} -``` -
- -### Decoding Log Content - -The contents of HTTP message bodies are Base64 encoded to handle various types of traffic, including compressed data. To view the logs in a more human-readable format, use the console library `tuples show`. Running the script as follows will produce a `readable-tuples.log` in the home directory: - -```shell -console tuples show --in /shared-logs-output/traffic-replayer-default/d3a4b31e1af4/tuples/tuples.log > readable-tuples.log -``` - -
- -Example log entry would look after running the script - - -```json -{ - "sourceRequest": { - "Request-URI": "/_cat/indices?v", - "Method": "GET", - "HTTP-Version": "HTTP/1.1", - "Host": "capture-proxy:9200", - "Authorization": "Basic YWRtaW46YWRtaW4=", - "User-Agent": "curl/8.5.0", - "Accept": "*/*", - "body": "" - }, - "sourceResponse": { - "HTTP-Version": {"keepAliveDefault": true}, - "Status-Code": 200, - "Reason-Phrase": "OK", - "response_time_ms": 59, - "content-type": "text/plain; charset=UTF-8", - "content-length": "214", - "body": "health status index uuid ..." - }, - "targetRequest": { - "Request-URI": "/_cat/indices?v", - "Method": "GET", - "HTTP-Version": "HTTP/1.1", - "Host": "opensearchtarget", - "Authorization": "Basic YWRtaW46bXlTdHJvbmdQYXNzd29yZDEyMyE=", - "User-Agent": "curl/8.5.0", - "Accept": "*/*", - "body": "" - }, - "targetResponses": [{ - "HTTP-Version": {"keepAliveDefault": true}, - "Status-Code": 200, - "Reason-Phrase": "OK", - "response_time_ms": 721, - "content-type": "text/plain; charset=UTF-8", - "content-length": "484", - "body": "health status index uuid ..." - }], - "connectionId": "0242acfffe13000a-0000000a-00000005-1eb087a9beb83f3e-a32794b4.0", - "numRequests": 1, - "numErrors": 0 -} -``` -
- -## Metrics - -The Replayer emits various OpenTelemetry metrics to CloudWatch, and traces are sent through AWS X-Ray. Here are some useful metrics that help evaluate cluster performance: - -### `sourceStatusCode` - -This metric tracks the HTTP status codes for both the source and target clusters, with dimensions for the HTTP verb (e.g., GET, POST) and the status code families (e.g., 200-299). These dimensions help quickly identify discrepancies between the source and target, such as when DELETE 200s become 4xx or GET 4xx errors turn into 5xx errors. - -### `lagBetweenSourceAndTargetRequests` - -This metric shows the delay between requests hitting the source and target clusters. With a speedup factor greater than 1 and a target cluster that can handle requests efficiently, this value should decrease as the replay progresses, indicating a reduction in replay lag. - -### Additional Metrics - -- **Throughput**: `bytesWrittenToTarget` and `bytesReadFromTarget` indicate the throughput to and from the cluster. -- **Retries**: `numRetriedRequests` tracks the number of requests retried due to status-code mismatches between the source and target. -- **Event Counts**: Various `(*)Count` metrics track the number of specific events that have completed. -- **Durations**: `(*)Duration` metrics measure the duration of each step in the process. -- **Exceptions**: `(*)ExceptionCount` shows the number of exceptions encountered during each processing phase. - -## Troubleshooting - -### CloudWatch Considerations - -Metrics pushed to CloudWatch may experience around a 5-minute visibility lag. CloudWatch also retains higher-resolution data for a shorter period than lower-resolution data. For more details, see [CloudWatch Metrics Retention Policies](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html) ↗. diff --git a/_migrations/migration-phases/replay-activation-and-validation/Synchronized-Cluster-Validation.md b/_migrations/migration-phases/replay-activation-and-validation/Synchronized-Cluster-Validation.md deleted file mode 100644 index a5c1759ae7..0000000000 --- a/_migrations/migration-phases/replay-activation-and-validation/Synchronized-Cluster-Validation.md +++ /dev/null @@ -1,161 +0,0 @@ - - - -This guide covers how to use the Replayer to replay captured traffic from a source cluster to a target cluster during the migration process. The Replayer allows users to verify that the target cluster can handle requests in the same way as the source cluster and catch up to real-time traffic for a smooth migration. - -## Replayer Configurations - -[Replayer settings](Configuration-Options) are configured during the deployment of the Migration Assistant. Make sure to set the authentication mode for the Replayer so it can properly communicate with the target cluster. Refer to the **Limitations** section below for details on how different traffic types are handled. - -### Speedup Factor - -The `--speedup-factor` option, passed via `trafficReplayerExtraArgs`, adjusts the wait times between requests. For example: -- A speedup factor of `2` sends requests at twice the original speed (e.g., a request originally sent every minute will now be sent every 30 seconds). -- A speedup factor of `0.5` will space requests further apart (e.g., requests every 2 minutes instead of every minute). - -This setting can be used to stress test the target cluster or to catch up to real-time traffic, ensuring the target cluster is ready for production client switchover. 
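As a back-of-the-envelope way to reason about catching up: with a speedup factor greater than 1, the replay closes the gap at roughly (speedup factor minus 1) hours of source traffic per hour of replay, assuming the target cluster can sustain the replayed rate. A small illustrative calculation:

```shell
# Illustrative only: hours of replay needed to catch up on a captured backlog
backlog_hours=6      # hours of traffic captured but not yet replayed
speedup_factor=2
echo "scale=1; $backlog_hours / ($speedup_factor - 1)" | bc
# -> 6.0 hours of replay to reach real time at a 2x speedup
```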
- -## When to Run the Replayer - -After deploying the Migration Assistant, the Replayer is not running by default. It should be started only after all metadata and documents have been migrated to ensure that recent changes to the source cluster are properly reflected in the target cluster. - -For example, if a document was deleted after a snapshot was taken, starting the Replayer before the document migration is complete may cause the deletion request to execute before the document is even added to the target. Running the Replayer after all other migration processes ensures that the target cluster will be consistent with the source cluster. - -## Using the Replayer - -To manage the Replayer, use the `console replay` command: - -- **Start the Replayer**: - ```bash - console replay start - ``` - This starts the Replayer with the options specified at deployment. - -- **Check Replayer Status**: - ```bash - console replay status - ``` - This command shows whether the Replayer is running, pending, or desired. "Running" shows how many container instances are actively running, "Pending" indicates how many are being provisioned, and "Desired" shows the total number of instances that should be running. - -- **Stop the Replayer**: - ```bash - console replay stop - ``` - -
- -Example Interactions - - -Check the status of the Replayer: -```bash -root@ip-10-0-2-66:~# console replay status -(, 'Running=0\nPending=0\nDesired=0') -``` - -Start the Replayer: -```bash -root@ip-10-0-2-66:~# console replay start -Replayer started successfully. -Service migration-dev-traffic-replayer-default set to 1 desired count. Currently 0 running and 0 pending. -``` - -Stop the Replayer: -```bash -root@ip-10-0-2-66:~# console replay stop -Replayer stopped successfully. -Service migration-dev-traffic-replayer-default set to 0 desired count. Currently 0 running and 0 pending. -``` -
- -### Delivery Guarantees - -The Replayer pulls traffic from Kafka and advances its commit cursor after requests have been sent to the target cluster. This provides an "at least once" delivery guarantee—requests will be replayed, but success is not guaranteed. You will need to monitor metrics, tuple outputs, or external validation to ensure the target cluster is performing as expected. - -## Time Scaling - -The Replayer sends requests in the same order they were received on each connection to the source. However, relative timing between different connections is not guaranteed. For example: - -- **Scenario**: Two connections exist—one sends a PUT request every minute, and the other sends a GET request every second. -- **Behavior**: The Replayer will maintain the sequence within each connection, but the relative timing between the connections (PUTs and GETs) is not preserved. - -### Speedup Factor Example - -Assume a source cluster responds to requests (GETs and PUTs) within 100ms: -- With a **speedup factor of 1**, the target will experience the same request rates and idle periods as the source. -- With a **speedup factor of 2**, requests will be sent twice as fast, with GETs sent every 500ms and PUTs every 30 seconds. -- At a **speedup factor of 10**, requests will be sent 10x faster, and as long as the target responds quickly, the Replayer can keep pace. - -If the target cannot respond fast enough, the Replayer will wait for the previous request to complete before sending the next one. This may cause delays and affect global relative ordering. - -## Transformations - -During migrations, some requests may need to be transformed between versions. For example, Elasticsearch supported multiple type mappings in indices, but this is no longer the case in OpenSearch. Clients may need to adjust accordingly by splitting documents into multiple indices or transforming request data. - -The Replayer automatically rewrites host and authentication headers, but for more complex transformations, custom transformation rules can be passed via the `--transformer-config` option (as described in the [Traffic Replayer README](https://github.com/opensearch-project/opensearch-migrations/blob/c3d25958a44ec2e7505892b4ea30e5fbfad4c71b/TrafficCapture/trafficReplayer/README.md#transformations)). - -### Example Transformation - -Suppose a source request includes a "tagToExcise" element that needs to be removed and its children promoted, and the URI path includes "extraThingToRemove" which should also be removed. The following Jolt script handles this transformation: - -```json -[{ "JsonJoltTransformerProvider": -[ - { - "script": { - "operation": "shift", - "spec": { - "payload": { - "inlinedJsonBody": { - "top": { - "tagToExcise": { - "*": "payload.inlinedJsonBody.top.&" - }, - "*": "payload.inlinedJsonBody.top.&" - }, - "*": "payload.inlinedJsonBody.&" - }, - "*": "payload.&" - }, - "*": "&" - } - } - }, - { - "script": { - "operation": "modify-overwrite-beta", - "spec": { - "URI": "=split('/extraThingToRemove',@(1,&))" - } - } - }, - { - "script": { - "operation": "modify-overwrite-beta", - "spec": { - "URI": "=join('',@(1,&))" - } - } - } -] -}] -``` - -The resulting request to the target will look like this: - -```http -PUT /oldStyleIndex/moreStuff HTTP/1.0 -host: testhostname - -{"top":{"properties":{"field1":{"type":"text"},"field2":{"type":"keyword"}}}} -``` - -You can pass Base64-encoded transformation scripts via `--transformer-config-base64` for convenience. 
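A small sketch of preparing such a Base64-encoded configuration on the migration console follows; how the resulting value is wired into the Replayer deployment (for example, through `trafficReplayerExtraArgs`) depends on your setup and is assumed here.

```shell
# Encode a Jolt transformer config for use with --transformer-config-base64
# (-w0 disables line wrapping on GNU base64; on macOS, use `base64 -i transformer.json` instead)
TRANSFORMER_B64=$(base64 -w0 transformer.json)
echo "--transformer-config-base64 ${TRANSFORMER_B64}"
```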
- -## Troubleshooting - -### Client changes -See [[Required Client Changes]] for more information on how clients will need to be updated. - -### Request Delivery -The Replayer provides an "at least once" delivery guarantee but does not ensure request success when a replayed request arrives at the target cluster. diff --git a/_migrations/migration-phases/replay-activation-and-validation/index.md b/_migrations/migration-phases/replay-activation-and-validation/index.md deleted file mode 100644 index ab8aa175cb..0000000000 --- a/_migrations/migration-phases/replay-activation-and-validation/index.md +++ /dev/null @@ -1,7 +0,0 @@ ---- -layout: default -title: Replay activation and validation -nav_order: 100 -has_children: true -parent: Migration phases ---- \ No newline at end of file diff --git a/_migrations/migration-phases/setup-verification/Client-Traffic-Switchover-Verification.md b/_migrations/migration-phases/setup-verification/Client-Traffic-Switchover-Verification.md deleted file mode 100644 index 617e16db41..0000000000 --- a/_migrations/migration-phases/setup-verification/Client-Traffic-Switchover-Verification.md +++ /dev/null @@ -1,23 +0,0 @@ - - - -The Migrations ALB is deployed with a listener that shifts traffic between the source and target clusters through proxy services. The ALB should start in **Source Passthrough** mode. - -## Verify traffic switchover has completed -1. In the AWS Console, navigate to **EC2 > Load Balancers**. -2. Select the **MigrationAssistant ALB**. -3. Examine the listener on port 9200 and verify that 100% of traffic is directed to the **Source Proxy**. -4. Navigate to the **Migration ECS Cluster** in the AWS Console. -5. Select the **Target Proxy Service**. -6. Verify that the desired count for the service is running: - * If the desired count is not met, update the service to increase it to at least 1 and wait for the service to start. -7. In the **Health and Metrics** tab under **Load balancer health**, verify that all targets are reporting as healthy: - * This confirms the ALB can connect to the target cluster through the target proxy. -8. (Reset) Update the desired count for the **Target Proxy Service** back to its original value in ECS. - -## Troubleshooting - -### Unexpected traffic patterns -* Verify that the target cluster allows traffic ingress from the **Target Proxy Security Group**. -* Navigate to the **Target Proxy ECS Tasks** to investigate any failing tasks: - * Set the "Filter desired status" to "Any desired status" to view all tasks, then navigate to the logs for any stopped tasks. \ No newline at end of file diff --git a/_migrations/migration-phases/setup-verification/Snapshot-Creation-Verification.md b/_migrations/migration-phases/setup-verification/Snapshot-Creation-Verification.md deleted file mode 100644 index 41f55b7eaf..0000000000 --- a/_migrations/migration-phases/setup-verification/Snapshot-Creation-Verification.md +++ /dev/null @@ -1,100 +0,0 @@ - - - -Verify that a snapshot can be created and used for metadata and backfill scenarios. - -### Install the Elasticsearch S3 Repository Plugin - -The snapshot needs to be stored in a location that the Migration Assistant can access. We use AWS S3 as that location, and the Migration Assistant creates an S3 bucket for this purpose. Therefore, it is necessary to install the Elasticsearch S3 Repository Plugin on your source nodes [as described here](https://www.elastic.co/guide/en/elasticsearch/plugins/7.10/repository-s3.html) ↗. 
- -Additionally, ensure that the plugin has been configured with AWS credentials that allow it to read and write to AWS S3. If your Elasticsearch cluster is running on EC2 or ECS instances with an execution IAM Role, include the necessary S3 permissions. Alternatively, you can store the credentials in the Elasticsearch Key Store [as described here](https://www.elastic.co/guide/en/elasticsearch/plugins/7.10/repository-s3-client.html) ↗. - -### Verifying S3 Repository Plugin Configuration - -You can verify that the S3 Repository Plugin is configured correctly by creating a test snapshot. - -Create an S3 bucket for the snapshot using the following AWS CLI command: - -```shell -aws s3api create-bucket --bucket --region -``` - -Register a new S3 Snapshot Repository on your source cluster using this curl command: - -```shell -curl -X PUT "http://:9200/_snapshot/test_s3_repository" -H "Content-Type: application/json" -d '{ - "type": "s3", - "settings": { - "bucket": "", - "region": "" - } -}' -``` - -You should receive a response like: `{"acknowledged":true}`. - -Create a test snapshot that captures only the cluster's metadata: - -```shell -curl -X PUT "http://:9200/_snapshot/test_s3_repository/test_snapshot_1" -H "Content-Type: application/json" -d '{ - "indices": "", - "ignore_unavailable": true, - "include_global_state": true -}' -``` - -
-Example Response - -You should receive a response like: `{"accepted":true}`. - -Check the AWS Console to confirm that your bucket contains the snapshot. It will appear similar to this: - -![Screenshot 2024-08-06 at 3 25 25 PM](https://github.com/user-attachments/assets/200818a5-e259-4837-aa2a-44c0bd7b099c) -
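In addition to checking the bucket in the AWS Console, the snapshot itself can be inspected through the snapshot API on the source cluster. The endpoint below is a placeholder.

```shell
# Confirm that the test snapshot exists and completed (endpoint is a placeholder)
curl -X GET "http://<source-cluster-endpoint>:9200/_snapshot/test_s3_repository/test_snapshot_1?pretty"
# The response should report a "state" of "SUCCESS" once the snapshot has finished.
```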
- -### Cleaning Up After Verification - -To remove the resources created during verification: - -Delete the snapshot: - -```shell -curl -X DELETE "http://:9200/_snapshot/test_s3_repository/test_snapshot_1?pretty" -``` - -Delete the snapshot repository: - -```shell -curl -X DELETE "http://:9200/_snapshot/test_s3_repository?pretty" -``` - -Delete the S3 bucket and its contents: - -```shell -aws s3 rm s3:// --recursive -aws s3api delete-bucket --bucket --region -``` - -### Troubleshooting - -#### Access Denied Error (403) - -If you encounter an error like `AccessDenied (Service: Amazon S3; Status Code: 403)`, verify the following: - -- The IAM role assigned to your Elasticsearch cluster has the necessary S3 permissions. -- The bucket name and region provided in the snapshot configuration match the actual S3 bucket you created. - -#### Older versions of Elasticsearch - -Older versions of the Elasticsearch S3 Repository Plugin may have trouble reading IAM Role credentials embedded in EC2 and ECS Instances. This is due to the copy of the AWS SDK shipped in them being being too old to read the new standard way of retrieving those credentials - [the Instance Metadata Service v2 (IMDSv2) specification](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html). This can result in Snapshot creation failures, with an error message like: - -``` -{"error":{"root_cause":[{"type":"repository_verification_exception","reason":"[migration_assistant_repo] path [rfs-snapshot-repo] is not accessible on master node"}],"type":"repository_verification_exception","reason":"[migration_assistant_repo] path [rfs-snapshot-repo] is not accessible on master node","caused_by":{"type":"i_o_exception","reason":"Unable to upload object [rfs-snapshot-repo/tests-s8TvZ3CcRoO8bvyXcyV2Yg/master.dat] using a single upload","caused_by":{"type":"amazon_service_exception","reason":"Unauthorized (Service: null; Status Code: 401; Error Code: null; Request ID: null)"}}},"status":500} -``` - -If you encounter this issue, you can resolve it by temporarily enabling IMDSv1 on the Instances in your source cluster for the duration of the snapshot. There is a toggle for it available in [the AWS Console](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-options.html), as well as in [the AWS CLI](https://docs.aws.amazon.com/cli/latest/reference/ec2/modify-instance-metadata-options.html#options). Flipping this toggle will turn on the older access model and enable the Elasticsearch S3 Repository Plugin to work as normal. - -### Related Links - -- [Elasticsearch S3 Repository Plugin Configuration Guide](https://www.elastic.co/guide/en/elasticsearch/plugins/7.10/repository-s3-client.html) ↗. diff --git a/_migrations/migration-phases/setup-verification/System-Reset-Before-Migration.md b/_migrations/migration-phases/setup-verification/System-Reset-Before-Migration.md deleted file mode 100644 index 708dbe5589..0000000000 --- a/_migrations/migration-phases/setup-verification/System-Reset-Before-Migration.md +++ /dev/null @@ -1,22 +0,0 @@ - - - -The following steps outline how to reset resources with Migration Assistant before executing the actual migration. At this point all verifications are expected to have been completed. 
These steps can be performed after [[Accessing the Migration Console]]. - -## Replayer Stoppage -To stop a running Replayer service, the following command can be executed: -``` -console replay stop -``` - -## Kafka Reset -To clear all captured traffic from the Kafka topic, the following command can be executed. **Note**: This command will result in the loss of any captured traffic data up to this point by the capture proxy and thus should be used with caution. -``` -console kafka delete-topic -``` - -## Target Cluster Reset -To clear non-system indices from the target cluster that may have been created from testing, the following command can be executed. **Note**: This command will result in the loss of all data on the target cluster and should be used with caution. -``` -console clusters clear-indices --cluster target -``` - diff --git a/_migrations/migration-phases/setup-verification/Traffic-Capture-Verification.md deleted file mode 100644 index 8a99a941d3..0000000000 --- a/_migrations/migration-phases/setup-verification/Traffic-Capture-Verification.md +++ /dev/null @@ -1,36 +0,0 @@ - - - -This guide describes how to verify the captured traffic once the traffic capture proxy is deployed. - -## Replication setup and validation -1. Navigate to Migration ECS Cluster in AWS Console -1. Navigate to Capture Proxy Service -1. Verify > 0 desired count and running - * if not, update service to increase to at least 1 and wait for startup -1. Within "Load balancer health" on "Health and Metrics" tab, verify all targets are reporting healthy - * This means the ALB is able to connect to the source cluster through the capture proxy -1. Navigate to the Migration Console Terminal -1. Execute `console kafka describe-topic-records` -1. Wait 30 seconds for another ELB health check to be recorded -1. Execute `console kafka describe-topic-records` again and verify that RECORDS increased between runs -1. Execute `console replay start` to start the replayer -1. Run `tail -f /shared-logs-output/traffic-replayer-default/*/tuples/tuples.log | jq '.targetResponses[]."Status-Code"'` to confirm that the Kafka requests were sent to the target and that it responded as expected... If responses don't appear - * check that the migration-console can access the target cluster by running `./catIndices.sh`, which should show indices on the source and target. - * confirm that messages are still being recorded to Kafka. - * check for errors in the replayer logs ("/migration/STAGE/default/traffic-replayer-default") via CloudWatch -1. (Reset) Update Traffic Capture Proxy service desired count back to original value in ECS - -## Troubleshooting - -### Health checks respond with a 401/403 status code -If the source cluster is configured to require authentication, the capture proxy will not be able to verify beyond receiving a 401/403 status code for ALB health checks - -### Traffic does not reach the source cluster -Verify that the Source Cluster allows traffic ingress from the Capture Proxy Security Group. - -Look for failing tasks by navigating to Traffic Capture Proxy ECS Tasks. Change "Filter desired status" to "Any desired status" in order to see all tasks and navigate to logs for stopped tasks.
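The before-and-after comparison in the steps above can also be captured in one pass. This sketch simply diffs two runs of the same command roughly 30 seconds apart, without assuming anything about its output format.

```shell
# Capture two readings and highlight what changed between them
before=$(console kafka describe-topic-records)
sleep 30
after=$(console kafka describe-topic-records)
diff <(echo "$before") <(echo "$after") \
  && echo "No change detected; check that traffic is flowing through the capture proxy"
```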
- -### Related Links - -- [Traffic Capture Proxy Failure Modes](https://github.com/opensearch-project/opensearch-migrations/blob/main/TrafficCapture/trafficCaptureProxyServer/README.md#failure-modes) diff --git a/_migrations/migration-phases/setup-verification/index.md b/_migrations/migration-phases/setup-verification/index.md deleted file mode 100644 index 5126b7cab6..0000000000 --- a/_migrations/migration-phases/setup-verification/index.md +++ /dev/null @@ -1,7 +0,0 @@ ---- -layout: default -title: Setup verification -nav_order: 70 -has_children: true -parent: Migration phases ---- \ No newline at end of file diff --git a/_migrations/other-helpful-pages/load-sample-data-into-cluster.md b/_migrations/other-helpful-pages/load-sample-data-into-cluster.md deleted file mode 100644 index 6dbd9fdcb2..0000000000 --- a/_migrations/other-helpful-pages/load-sample-data-into-cluster.md +++ /dev/null @@ -1,144 +0,0 @@ - -This guide demonstrates how to quickly load test data into an Elasticsearch or OpenSearch source cluster using AWS Glue and the AWS Open Dataset library. We'll walk through indexing Bitcoin transaction data on the source cluster. For more details, refer to [the official AWS documentation on setting up Glue connections to OpenSearch](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect-opensearch-home.html) ↗. - -## 1. Create Your Source Cluster - -Create your source cluster using the method of your choice, keeping in mind the following requirements: - -* We will use basic authentication (username/password) for access control. In this example, we are using Elasticsearch 7.10, but earlier versions of Elasticsearch and OpenSearch 1.X and 2.X are also supported. -* The source cluster must be in a VPC you control and have access to, enabling AWS Glue to send data to it. - -## 2. Create the Access Secret in Secrets Manager - -Create a secret in AWS Secrets Manager that provides AWS Glue with access to the source cluster’s basic authentication credentials: - -1. Navigate to the AWS Secrets Manager Console. -2. Create a generic secret. Name it as you prefer and configure replication/rotation as needed. - -Key fields: - -* `opensearch.net.http.auth.user`: The username for accessing the source cluster. -* `opensearch.net.http.auth.pass`: The password for accessing the source cluster. - -
- -Example Access Secrets - - -![Screenshot](https://github.com/user-attachments/assets/dde7e343-4a9c-4f0b-af6d-e7048ecd1b14) -
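If you prefer the CLI over the console for this step, a secret with the two expected keys can be created as follows. The secret name, region, and credential values are placeholders.

```shell
# Create the basic auth secret that the Glue connection will reference (values are placeholders)
aws secretsmanager create-secret \
  --name migration-assistant-source-cluster-creds \
  --region <region> \
  --secret-string '{
    "opensearch.net.http.auth.user": "<username>",
    "opensearch.net.http.auth.pass": "<password>"
  }'
```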
- -## 3. Create an AWS Glue IAM Role - -Create an IAM Role that grants AWS Glue the necessary permissions: - -1. Navigate to the IAM Console and create a new IAM Role. -2. Use the following trust policy to allow the AWS Glue service to assume the role: - -```json -{ - "Version": "2012-10-17", - "Statement": [ - { - "Effect": "Allow", - "Principal": { - "Service": "glue.amazonaws.com" - }, - "Action": "sts:AssumeRole" - } - ] -} -``` - -3. Attach the following permission sets: - * `AWSGlueServiceRole` - * `AmazonS3ReadOnlyAccess` - -4. Grant the role access to the secret created in step 2 by adding an inline policy like this: - -```json -{ - "Version": "2012-10-17", - "Statement": [ - { - "Effect": "Allow", - "Action": "secretsmanager:GetSecretValue", - "Resource": "arn:aws:secretsmanager:us-east-1:XXXXXXXXXXXX:secret:migration-assistant-source-cluster-creds-YDtnmx" - } - ] -} -``` - -## 4. Create an AWS Glue Connection - -Create a Glue Connection to provide AWS Glue with access to the source cluster: - -1. Navigate to the AWS Glue Console. -2. Create a new connection of the **Amazon OpenSearch Service** type. -3. Fill in your source cluster’s details (including VPC, subnet, and security group) and use the secret created earlier. - -
- -Example Glue Connection Configuration - - -![Screenshot](https://github.com/user-attachments/assets/b5978b2e-de58-4d46-ad47-ac960e729b89) -
- -## 5. (Optional) Examine the Source Dataset - -The sample dataset we'll use is the [AWS Public Blockchain dataset](https://registry.opendata.aws/aws-public-blockchain/) ↗, which is available for free. More information can be found [in this blog post](https://aws.amazon.com/blogs/database/access-bitcoin-and-ethereum-open-datasets-for-cross-chain-analytics/) ↗, and you can browse its contents in S3 [here](https://us-east-2.console.aws.amazon.com/s3/buckets/aws-public-blockchain) ↗. - -The Bitcoin transaction data we'll load into our source cluster is located at the S3 URI: `s3://aws-public-blockchain/v1.0/btc/transactions/`. - -## 6. Create the AWS Glue Job - -Now, create a Glue Job in the AWS Glue Console using the connection you created earlier. - -### S3 Source - -1. Set the S3 URI to `s3://aws-public-blockchain/v1.0/btc/transactions`. -2. Enable recursive reading of the bucket's contents. -3. The data format is Parquet. - -
- -Example Glue Connection - - -![Screenshot](https://github.com/user-attachments/assets/6fc4c0da-45b9-4c09-ba73-1619f59c9dd3) -
- -### OpenSearch Target - -1. Select the AWS Glue Connection you created. -2. Specify the index name where the Bitcoin transaction data will be stored. - -
- -Example Data Sink - - -![Screenshot](https://github.com/user-attachments/assets/264d0d17-f7f4-4c07-8567-6cae47c3ccd1) -
- -### Pre-Configure the Index Settings - -This is an optional step. By default, the Glue Job creates a single-shard index. Since the dataset is approximately 1 TB in size, it's recommended to pre-create the index with multiple shards. Follow this example to create an index with 40 shards: - -```bash -curl -u : -X PUT "http://:9200/bitcoin-data" -H 'Content-Type: application/json' -d' -{ - "settings": { - "number_of_shards": 40, - "number_of_replicas": 1 - } -} -' -``` - -You can also adjust any additional index settings at this time. - -## 7. Run the Glue Job - -Once the Glue source and target are configured, run the job in the AWS Console by clicking the **Run** button. You can monitor the job’s progress under the **Runs** tab in the console. \ No newline at end of file diff --git a/_migrations/other-helpful-pages/migration-timelines.md b/_migrations/other-helpful-pages/migration-timelines.md deleted file mode 100644 index 51d4302904..0000000000 --- a/_migrations/other-helpful-pages/migration-timelines.md +++ /dev/null @@ -1,105 +0,0 @@ -There is no *one-size-fits-most* migration strategy, this guide seeks to describe possible sample scenario(s) with the goal of helping customers plan their own migration strategy and estimate costs accordingly. - -## 15 Day Historical and Live Migration - -Key phases: - -1. Setup, Planning, and Verification (Days 1-5) -1. Historical backfill, Catchup, and Validation (Days 6-10) -1. Final Validation, Traffic Switchover, and Teardown (Days 11-15) - -### Timeline - -```mermaid -%%{ - init: { - "gantt": { - "fontSize": 20, - "barHeight": 40, - "sectionFontSize": 24, - "leftPadding": 175 - } - } -}%% -gantt - dateFormat D HH - axisFormat Day %d - todayMarker off - tickInterval 1day - - section Steps - Setup and Verification : prep, 1 00, 5d - Clear Test Environment : milestone, clear, after prep, 0d - Traffic Capture : traffic_capture, after clear, 6d - Snapshot : snapshot, after clear, 1d - Scale Up Target Cluster for Backfill : backfill_scale, 6 22, 2h - Metadata Migration : metadata, after snapshot, 1h - Reindex from Snapshot : rfs, after metadata, 71h - Scale Down Target Cluster for Replay : replay_scale, after rfs, 2h - Traffic Replay: replay, after replay_scale, 46h - Traffic Switchover : milestone, switchover, after replay, 0d - Validation : validation, after snapshot, 7d - Scale Down Target Cluster : 11 00, 2h - Teardown : teardown, 14 00, 2d -``` - -#### Explanation of Scaling Operations - -This section assumes a customer chooses to deliberatly scale their target cluster for backfill and/or replay to enable a faster and/or cheaper overall migration. In the absence of this, backfill and replay steps may take much longer (likely increasing overall cost). - -This plan assumes we can replay 6 days of captured data in under 2 days in order for the source and target clusters to be in sync. Take an example of a source cluster operating at avg. 90% CPU utilization to handle reads/writes from application code, it's improbable that a target cluster with the same scale and configuration will be able to support a request throughput of at least 3x in order to catchup in the given time. The same holds for backfill for write-heavy clusters or clusters where data has accumulated for a long time period, to follow this plan, the target cluster should be scaled such that it can ingest/index all the source data in under 3 days. - - -1. **Scale Up Target Cluster for Backfill**: Occurs after metadata migration and before reindexing. 
The target cluster is scaled up to handle the resource-intensive reindexing process faster. - - -2. **Scale Down Target Cluster for Replay**: Once the reindexing is complete, the target cluster is scaled down to a more appropriate size for the traffic replay phase. While still provisioned higher than normal production workloads, given replayer has a >1 speedup factor. - -3. **Scale Down Target Cluster**: After the validation phase, the target cluster is scaled down to its final operational size. This step ensures that the cluster is rightsized for normal production workloads, balancing performance needs with cost-efficiency. - -### Component Durations - -This component duration breakdown is useful for identifying the cost of resources deployed during the migration process. It provides a clear overview of how long each component is active or retained, which directly impacts resource utilization and associated costs. - -Note: Duration excludes weekends. If actual timeline extends over weekends, duration (and potentially costs) will increase. - -```mermaid -%%{ - init: { - "gantt": { - "fontSize": 20, - "barHeight": 40, - "sectionFontSize": 24, - "leftPadding": 175 - } - } -}%% -gantt - dateFormat D HH - axisFormat Day %d - todayMarker off - tickInterval 1day - - section Services - Core Services Runtime (15d) : active, 1 00, 15d - Capture Proxy Runtime (6d) : active, capture_active, 6 00, 6d - Capture Data Retention (4d) : after capture_active, 4d - Snapshot Runtime (1d) : active, snapshot_active, 6 00, 1d - Snapshot Retention (9d) : after snapshot_active, 9d - Reindex from Snapshot Runtime (3d) : active, historic_active, 7 01, 71h - Replayer Runtime (2d) : active, replayer_active, after historic_active, 2d - Replayer Data Retention (4d) : after replayer_active, 4d - Target Proxy Runtime (4d) : active, after replayer_active, 4d -``` - -| Component | Duration | -|-----------------------------------|----------| -| Core Services Runtime | 15d | -| Capture Proxy Runtime | 6d | -| Capture Data Retention | 4d | -| Snapshot Runtime | 1d | -| Snapshot Retention | 9d | -| Reindex from Snapshot Runtime | 3d | -| Replayer Runtime | 2d | -| Replayer Data Retention | 4d | -| Target Proxy Runtime | 4d | diff --git a/_migrations/other-helpful-pages/provisioning-source-cluster-for-testing.md b/_migrations/other-helpful-pages/provisioning-source-cluster-for-testing.md deleted file mode 100644 index 69b4e930e6..0000000000 --- a/_migrations/other-helpful-pages/provisioning-source-cluster-for-testing.md +++ /dev/null @@ -1,91 +0,0 @@ - -This guide walks you through the steps to provision an Elasticsearch cluster on EC2 using AWS CDK. The CDK that provisions this cluster can be found on the `migration-es` branch of the `opensearch-cluster-cdk` GitHub [forked repository](https://github.com/lewijacn/opensearch-cluster-cdk/tree/migration-es). - -TODO ^ lewijacn seems like it should be updated? - -## 1. Clone the Repository for Source Cluster CDK - -```bash -git clone https://github.com/lewijacn/opensearch-cluster-cdk.git -cd opensearch-cluster-cdk -git checkout migration-es -``` - -## 2. Install NPM Dependencies - -```bash -npm install -``` - -## 3. Configure AWS Credentials - -Configure the desired [AWS credentials](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_prerequisites) ↗ for the environment, as these will dictate the region and account used for deployment. - -## 4. 
Configure Cluster Options - -The configuration below sets up a single-node Elasticsearch 7.10.2 cluster on EC2 and a VPC to host the cluster. Alternatively, you can specify an existing VPC by providing the `vpcId` parameter. The setup includes an internal load balancer, which should be used when interacting with the cluster. - -Copy and paste the following configuration into a `cdk.context.json` file at the root of the repository. Replace the `` placeholders with the desired deployment stage, e.g., `dev`. - -```json -{ - "source-single-node-ec2": { - "suffix": "ec2-source-", - "networkStackSuffix": "ec2-source-", - "distVersion": "7.10.2", - "cidr": "12.0.0.0/16", - "distributionUrl": "https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.10.2-linux-x86_64.tar.gz", - "captureProxyEnabled": false, - "securityDisabled": true, - "minDistribution": false, - "cpuArch": "x64", - "isInternal": true, - "singleNodeCluster": true, - "networkAvailabilityZones": 2, - "dataNodeCount": 1, - "managerNodeCount": 0, - "serverAccessType": "ipv4", - "restrictServerAccessTo": "0.0.0.0/0" - } -} -``` - -> **Note:** You can specify other versions of Elasticsearch or OpenSearch by modifying the `distributionUrl` parameter. - -## 5. Bootstrap CDK in the Region (If Needed) - -If this is the first time you're deploying CDK in the region, you'll need to run the following command. **Note:** This only needs to be done once per region. - -```bash -cdk bootstrap --c contextId=source-single-node-ec2 --c contextFile=cdk.context.json -``` - -## 6. Deploy CloudFormation Stacks with CDK - -Deploy the infrastructure using the following command: - -```bash -cdk deploy "*" --c contextId=source-single-node-ec2 --c contextFile=cdk.context.json -``` - -Once the deployment is complete, the CDK will output the internal load balancer endpoint, which can be used within the VPC to interact with the Elasticsearch cluster: - -```bash -# Stack output -opensearch-infra-stack-ec2-source-dev.loadbalancerurl = opense-clust-owiejfo2345-sdfljsd.elb.us-east-1.amazonaws.com - -# Example curl command within the VPC -curl http://opense-clust-owiejfo2345-sdfljsd.elb.us-east-1.amazonaws.com:9200 -``` - -## 7. Clean Up Resources - -When you are done using the provisioned source cluster, you can delete the resources by running the following command: - -```bash -cdk destroy "*" --c contextId=source-single-node-ec2 --c contextFile=cdk.context.json -``` - -For a full list of options, refer to the CDK options in the [repository documentation](https://github.com/lewijacn/opensearch-cluster-cdk/tree/migration-es?tab=readme-ov-file#required-context-parameters). - -^ TODO: Are we advertising a fork? Seems like this should be fixed up \ No newline at end of file diff --git a/_migrations/quick-start-data-migration.md b/_migrations/quick-start-data-migration.md deleted file mode 100644 index 62b13292e7..0000000000 --- a/_migrations/quick-start-data-migration.md +++ /dev/null @@ -1,262 +0,0 @@ ---- -layout: default -title: Quickstart - Data migration -nav_order: 10 ---- - -# Quickstart - Data migration - -This document outlines how to deploy the Migration Assistant and execute an existing data migration using Reindex-from-Snapshot (RFS). Note that this does not include steps for deploying and capturing live traffic, which is necessary for a zero-downtime migration. 
Please refer to the "Phases of a Migration" section in the wiki navigation bar for a complete end-to-end migration process, including metadata migration, live capture, Reindex-from-Snapshot, and replay. - -## Prerequisites and Assumptions -* Verify your migration path [is supported](https://github.com/opensearch-project/opensearch-migrations/wiki/Is-Migration-Assistant-Right-for-You%3F#supported-migration-paths). Note that we test with the exact versions specified, but you should be able to migrate data on alternative minor versions as long as the major version is supported. -* Source cluster must be deployed with the S3 plugin. -* Target cluster must be deployed. -* A snapshot will be taken and stored in S3 in this guide, and the following assumptions are made about this snapshot: - * The `_source` flag is enabled on all indices to be migrated. - * The snapshot includes the global cluster state (`include_global_state` is `true`). - * Shard sizes up to approximately 80GB are supported. Larger shards will not be able to migrate. If this is a blocker, please consult the migrations team. -* Migration Assistant will be installed in the same region and have access to both the source snapshot and target cluster. - ---- - -## Step 1 - Installing Bootstrap EC2 Instance (~10 mins) -1. Log into the target AWS account where you want to deploy the Migration Assistant. -2. From the browser where you are logged into your target AWS account right-click [here](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?templateURL=https://solutions-reference.s3.amazonaws.com/migration-assistant-for-amazon-opensearch-service/latest/migration-assistant-for-amazon-opensearch-service.template&redirectId=SolutionWeb) ↗ to load the CloudFormation (Cfn) template from a new browser tab. -3. Follow the CloudFormation stack wizard: - * **Stack Name:** `MigrationBootstrap` - * **Stage Name:** `dev` - * Hit **Next** on each step, acknowledge on the fourth screen, and hit **Submit**. -4. Verify that the bootstrap stack exists and is set to `CREATE_COMPLETE`. This process takes around 10 minutes. - ---- - -## Step 2 - Setup Bootstrap Instance Access (~5 mins) -1. After deployment, find the EC2 instance ID for the `bootstrap-dev-instance`. -2. Create an IAM policy using the snippet below, replacing ``, ``, ``, and ``: - -```json -{ - "Version": "2012-10-17", - "Statement": [ - { - "Effect": "Allow", - "Action": "ssm:StartSession", - "Resource": [ - "arn:aws:ec2:::instance/", - "arn:aws:ssm:::document/SSM--BootstrapShell" - ] - } - ] -} -``` - -3. Name the policy, e.g., `SSM-OSMigrationBootstrapAccess`, and create the policy. - ---- - -## Step 3 - Login to Bootstrap and Build (~15 mins) -### Prerequisites: -* AWS CLI and AWS Session Manager Plugin installed. -* AWS credentials configured (`aws configure`). - -1. Load AWS credentials into your terminal. -2. Login to the instance using the command below, replacing `` and ``: -```bash -aws ssm start-session --document-name SSM-dev-BootstrapShell --target --region [--profile ] -``` -3. Once logged in, run the following command from the shell of the bootstrap instance (within the /opensearch-migrations directory): -```bash -./initBootstrap.sh && cd deployment/cdk/opensearch-service-migration -``` -4. After a successful build, remember the path for infrastructure deployment in the next step. - ---- - -## Step 4 - Configuring and Deploying for RFS Use Case (~20 mins) -1. Add the target cluster password to AWS Secrets Manager as an unstructured string. 
Be sure to copy the secret ARN for use during deployment. -2. From the same shell on the bootstrap instance, modify the cdk.context.json file located in the `/opensearch-migrations/deployment/cdk/opensearch-service-migration` directory: - -```json -{ - "migration-assistant": { - "vpcId": "", - "targetCluster": { - "endpoint": "", - "auth": { - "type": "basic", - "username": "", - "passwordFromSecretArn": "" - } - }, - "sourceCluster": { - "endpoint": "", - "auth": { - "type": "basic", - "username": "", - "passwordFromSecretArn": "" - } - }, - "reindexFromSnapshotExtraArgs": "", - "stage": "dev", - "otelCollectorEnabled": true, - "migrationConsoleServiceEnabled": true, - "reindexFromSnapshotServiceEnabled": true, - "migrationAssistanceEnabled": true - } -} -``` - -The source and target cluster authorization can be configured to have none, `basic` with a username and password, or `sigv4`. There are examples of each available [here](https://github.com/opensearch-project/opensearch-migrations/wiki/Configuration-Options#cluster-authentication-options). - -3. Bootstrap the account with the following command: -```bash -cdk bootstrap --c contextId=migration-assistant --require-approval never -``` -4. Deploy the stacks: -```bash -cdk deploy "*" --c contextId=migration-assistant --require-approval never --concurrency 5 -``` -5. Verify that all CloudFormation stacks were installed successfully. - -#### ReindexFromSnapshot Parameters -* If you're creating a snapshot using migration tooling, these parameters are auto-configured. If you're using an existing snapshot, modify `reindexFromSnapshotExtraArgs` with the following values: -```bash ---s3-repo-uri s3:/// --s3-region --snapshot-name -``` -Note, you will also need to give access to the migrationconsole and reindexFromSnapshot taskRole permissions to the bucket - ---- - -## Step 5 - Deploying the Migration Assistant -1. Bootstrap the account: -```bash -cdk bootstrap --c contextId=migration-assistant --require-approval never --concurrency 5 -``` -2. Deploy the stacks when `cdk.context.json` is fully configured: -```bash -cdk deploy "*" --c contextId=migration-assistant --require-approval never --concurrency 3 -``` - -### Stacks Deployed: -* Migration Assistant Network stack -* Reindex From Snapshot stack -* Migration Console stack - ---- - -## Step 6 - Accessing the Migration Console -Run the following command to access the migration console: -```bash -./accessContainer.sh migration-console dev -``` ->[!NOTE] ->`accessContainer.sh` is located in `/opensearch-migrations/deployment/cdk/opensearch-service-migration/` on the bootstrap instance. - -_Learn more [[Accessing the Migration Console]]_ - ---- - -## Step 7 - Checking Connection to Source & Target Clusters -To verify the connection to the clusters, run: -```bash -console clusters connection-check -``` - -### Expected Output: -* **Source Cluster:** Successfully connected! -* **Target Cluster:** Successfully connected! - -_Learn more [[Console commands reference|Migration-Console-commands-references]]_ - ---- - -## Step 8 - Snapshot Creation -Run the following to initiate creating a snapshot from the source cluster -``` -console snapshot create [...] -``` - -To check on the progress, -``` -console snapshot status [...] -``` -or, for more detail, -``` -console snapshot status --deep-check [...] -``` - -Wait for the snapshot to complete before moving to the next step. 
- -_Learn more [[Snapshot Creation Verification]] [[Snapshot Creation]]_ - ---- - -## Step 9 - Metadata Migration -Run the following command to migrate metadata: -```bash -console metadata migrate [...] -``` - -_Learn more [[Metadata Migration]]_ - ---- - -## Step 10 - RFS Document Migration -Start the backfill process: -```bash -console backfill start -``` - -Scale up the number of workers: -```bash -console backfill scale -``` - -Check the status: -```bash -console backfill status -``` - -To stop the workers: -```bash -console backfill stop -``` - -_Learn more [[Backfill Execution]]_ - ---- - -## Step 11 - Monitoring -Use the following command for detailed monitoring: -```bash -console backfill status --deep-check -``` - -### Example Output: -```text -BackfillStatus.RUNNING -Running=9 -Pending=1 -Desired=10 -Shards total: 62 -Shards completed: 46 -Shards incomplete: 16 -Shards in progress: 11 -Shards unclaimed: 5 -``` - -Logs and metrics are available in CloudWatch in the OpenSearchMigrations log group. - ---- - -## Step 12 - Verify all documents were migrated -Use the following query in CloudWatch Logs Insights to identify failed documents: -```bash -fields @message -| filter @message like "Bulk request succeeded, but some operations failed." -| sort @timestamp desc -| limit 10000 -``` - -_Learn more [[Backfill Result Validation]]_ \ No newline at end of file diff --git a/_ml-commons-plugin/remote-models/connectors.md b/_ml-commons-plugin/remote-models/connectors.md index 3ec6c73b07..788f1b003d 100644 --- a/_ml-commons-plugin/remote-models/connectors.md +++ b/_ml-commons-plugin/remote-models/connectors.md @@ -294,7 +294,7 @@ In some cases, you may need to update credentials, like `access_key`, that you u ```json PUT /_plugins/_ml/models/ { - "connector": { + "connectors": { "credential": { "openAI_key": "YOUR NEW OPENAI KEY" } diff --git a/_query-dsl/full-text/match-bool-prefix.md b/_query-dsl/full-text/match-bool-prefix.md index 3964dc5ee8..6905d49989 100644 --- a/_query-dsl/full-text/match-bool-prefix.md +++ b/_query-dsl/full-text/match-bool-prefix.md @@ -216,7 +216,7 @@ The `` accepts the following parameters. All parameters except `query` ar Parameter | Data type | Description :--- | :--- | :--- `query` | String | The text, number, Boolean value, or date to use for search. Required. -`analyzer` | String | The [analyzer]({{site.url}}{{site.baseurl}}/analyzers/index/) used to tokenize the query string text. Default is the index-time analyzer specified for the `default_field`. If no analyzer is specified for the `default_field`, the `analyzer` is the default analyzer for the index. +`analyzer` | String | The [analyzer]({{site.url}}{{site.baseurl}}/analyzers/index/) used to tokenize the query string text. Default is the index-time analyzer specified for the `default_field`. If no analyzer is specified for the `default_field`, the `analyzer` is the default analyzer for the index. For more information about `index.query.default_field`, see [Dynamic index-level index settings]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/index-settings/#dynamic-index-level-index-settings). `fuzziness` | `AUTO`, `0`, or a positive integer | The number of character edits (insert, delete, substitute) that it takes to change one word to another when determining whether a term matched a value. For example, the distance between `wined` and `wind` is 1. The default, `AUTO`, chooses a value based on the length of each term and is a good choice for most use cases. 
`fuzzy_rewrite` | String | Determines how OpenSearch rewrites the query. Valid values are `constant_score`, `scoring_boolean`, `constant_score_boolean`, `top_terms_N`, `top_terms_boost_N`, and `top_terms_blended_freqs_N`. If the `fuzziness` parameter is not `0`, the query uses a `fuzzy_rewrite` method of `top_terms_blended_freqs_${max_expansions}` by default. Default is `constant_score`. `fuzzy_transpositions` | Boolean | Setting `fuzzy_transpositions` to `true` (default) adds swaps of adjacent characters to the insert, delete, and substitute operations of the `fuzziness` option. For example, the distance between `wind` and `wnid` is 1 if `fuzzy_transpositions` is true (swap "n" and "i") and 2 if it is false (delete "n", insert "n"). If `fuzzy_transpositions` is false, `rewind` and `wnid` have the same distance (2) from `wind`, despite the more human-centric opinion that `wnid` is an obvious typo. The default is a good choice for most use cases. diff --git a/_query-dsl/full-text/match-phrase.md b/_query-dsl/full-text/match-phrase.md index 747c4814d9..3f36465790 100644 --- a/_query-dsl/full-text/match-phrase.md +++ b/_query-dsl/full-text/match-phrase.md @@ -268,6 +268,6 @@ The `` accepts the following parameters. All parameters except `query` ar Parameter | Data type | Description :--- | :--- | :--- `query` | String | The query string to use for search. Required. -`analyzer` | String | The [analyzer]({{site.url}}{{site.baseurl}}/analyzers/index/) used to tokenize the query string text. Default is the index-time analyzer specified for the `default_field`. If no analyzer is specified for the `default_field`, the `analyzer` is the default analyzer for the index. +`analyzer` | String | The [analyzer]({{site.url}}{{site.baseurl}}/analyzers/index/) used to tokenize the query string text. Default is the index-time analyzer specified for the `default_field`. If no analyzer is specified for the `default_field`, the `analyzer` is the default analyzer for the index. For more information about `index.query.default_field`, see [Dynamic index-level index settings]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/index-settings/#dynamic-index-level-index-settings). `slop` | `0` (default) or a positive integer | Controls the degree to which words in a query can be misordered and still be considered a match. From the [Lucene documentation](https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/search/PhraseQuery.html#getSlop--): "The number of other words permitted between words in query phrase. For example, to switch the order of two words requires two moves (the first move places the words atop one another), so to permit reorderings of phrases, the slop must be at least two. A value of zero requires an exact match." `zero_terms_query` | String | In some cases, the analyzer removes all terms from a query string. For example, the `stop` analyzer removes all terms from the string `an but this`. In those cases, `zero_terms_query` specifies whether to match no documents (`none`) or all documents (`all`). Valid values are `none` and `all`. Default is `none`. \ No newline at end of file diff --git a/_query-dsl/full-text/match.md b/_query-dsl/full-text/match.md index 056ef76890..5ece14e127 100644 --- a/_query-dsl/full-text/match.md +++ b/_query-dsl/full-text/match.md @@ -451,7 +451,7 @@ Parameter | Data type | Description :--- | :--- | :--- `query` | String | The query string to use for search. Required. 
`auto_generate_synonyms_phrase_query` | Boolean | Specifies whether to create a [match phrase query]({{site.url}}{{site.baseurl}}/query-dsl/full-text/match-phrase/) automatically for multi-term synonyms. For example, if you specify `ba,batting average` as synonyms and search for `ba`, OpenSearch searches for `ba OR "batting average"` (if this option is `true`) or `ba OR (batting AND average)` (if this option is `false`). Default is `true`. -`analyzer` | String | The [analyzer]({{site.url}}{{site.baseurl}}/analyzers/index/) used to tokenize the query string text. Default is the index-time analyzer specified for the `default_field`. If no analyzer is specified for the `default_field`, the `analyzer` is the default analyzer for the index. +`analyzer` | String | The [analyzer]({{site.url}}{{site.baseurl}}/analyzers/index/) used to tokenize the query string text. Default is the index-time analyzer specified for the `default_field`. If no analyzer is specified for the `default_field`, the `analyzer` is the default analyzer for the index. For more information about `index.query.default_field`, see [Dynamic index-level index settings]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/index-settings/#dynamic-index-level-index-settings). `boost` | Floating-point | Boosts the clause by the given multiplier. Useful for weighing clauses in compound queries. Values in the [0, 1) range decrease relevance, and values greater than 1 increase relevance. Default is `1`. `enable_position_increments` | Boolean | When `true`, resulting queries are aware of position increments. This setting is useful when the removal of stop words leaves an unwanted "gap" between terms. Default is `true`. `fuzziness` | String | The number of character edits (insertions, deletions, substitutions, or transpositions) that it takes to change one word to another when determining whether a term matched a value. For example, the distance between `wined` and `wind` is 1. Valid values are non-negative integers or `AUTO`. The default, `AUTO`, chooses a value based on the length of each term and is a good choice for most use cases. diff --git a/_query-dsl/full-text/multi-match.md b/_query-dsl/full-text/multi-match.md index ab1496fdd3..a3995df714 100644 --- a/_query-dsl/full-text/multi-match.md +++ b/_query-dsl/full-text/multi-match.md @@ -900,9 +900,9 @@ Parameter | Data type | Description :--- | :--- | :--- `query` | String | The query string to use for search. Required. `auto_generate_synonyms_phrase_query` | Boolean | Specifies whether to create a [match phrase query]({{site.url}}{{site.baseurl}}/query-dsl/full-text/match-phrase/) automatically for multi-term synonyms. For example, if you specify `ba,batting average` as synonyms and search for `ba`, OpenSearch searches for `ba OR "batting average"` (if this option is `true`) or `ba OR (batting AND average)` (if this option is `false`). Default is `true`. -`analyzer` | String | The [analyzer]({{site.url}}{{site.baseurl}}/analyzers/index/) used to tokenize the query string text. Default is the index-time analyzer specified for the `default_field`. If no analyzer is specified for the `default_field`, the `analyzer` is the default analyzer for the index. +`analyzer` | String | The [analyzer]({{site.url}}{{site.baseurl}}/analyzers/index/) used to tokenize the query string text. Default is the index-time analyzer specified for the `default_field`. If no analyzer is specified for the `default_field`, the `analyzer` is the default analyzer for the index. 
For more information about `index.query.default_field`, see [Dynamic index-level index settings]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/index-settings/#dynamic-index-level-index-settings). `boost` | Floating-point | Boosts the clause by the given multiplier. Useful for weighing clauses in compound queries. Values in the [0, 1) range decrease relevance, and values greater than 1 increase relevance. Default is `1`. -`fields` | Array of strings | The list of fields in which to search. If you don't provide the `fields` parameter, `multi_match` query searches the fields specified in the `index.query. Default_field` setting, which defaults to `*`. +`fields` | Array of strings | The list of fields in which to search. If you don't provide the `fields` parameter, `multi_match` query searches the fields specified in the `index.query.default_field` setting, which defaults to `*`. `fuzziness` | String | The number of character edits (insert, delete, substitute) that it takes to change one word to another when determining whether a term matched a value. For example, the distance between `wined` and `wind` is 1. Valid values are non-negative integers or `AUTO`. The default, `AUTO`, chooses a value based on the length of each term and is a good choice for most use cases. Not supported for `phrase`, `phrase_prefix`, and `cross_fields` queries. `fuzzy_rewrite` | String | Determines how OpenSearch rewrites the query. Valid values are `constant_score`, `scoring_boolean`, `constant_score_boolean`, `top_terms_N`, `top_terms_boost_N`, and `top_terms_blended_freqs_N`. If the `fuzziness` parameter is not `0`, the query uses a `fuzzy_rewrite` method of `top_terms_blended_freqs_${max_expansions}` by default. Default is `constant_score`. `fuzzy_transpositions` | Boolean | Setting `fuzzy_transpositions` to `true` (default) adds swaps of adjacent characters to the insert, delete, and substitute operations of the `fuzziness` option. For example, the distance between `wind` and `wnid` is 1 if `fuzzy_transpositions` is true (swap "n" and "i") and 2 if it is false (delete "n", insert "n"). If `fuzzy_transpositions` is false, `rewind` and `wnid` have the same distance (2) from `wind`, despite the more human-centric opinion that `wnid` is an obvious typo. The default is a good choice for most use cases. diff --git a/_query-dsl/full-text/query-string.md b/_query-dsl/full-text/query-string.md index 47180e3f6d..7b3343155d 100644 --- a/_query-dsl/full-text/query-string.md +++ b/_query-dsl/full-text/query-string.md @@ -623,7 +623,7 @@ Parameter | Data type | Description `query` | String | The text that may contain expressions in the [query string syntax](#query-string-syntax) to use for search. Required. `allow_leading_wildcard` | Boolean | Specifies whether `*` and `?` are allowed as first characters of a search term. Default is `true`. `analyze_wildcard` | Boolean | Specifies whether OpenSearch should attempt to analyze wildcard terms. Default is `false`. -`analyzer` | String | The [analyzer]({{site.url}}{{site.baseurl}}/analyzers/index/) used to tokenize the query string text. Default is the index-time analyzer specified for the `default_field`. If no analyzer is specified for the `default_field`, the `analyzer` is the default analyzer for the index. +`analyzer` | String | The [analyzer]({{site.url}}{{site.baseurl}}/analyzers/index/) used to tokenize the query string text. Default is the index-time analyzer specified for the `default_field`. 
If no analyzer is specified for the `default_field`, the `analyzer` is the default analyzer for the index. For more information about `index.query.default_field`, see [Dynamic index-level index settings]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/index-settings/#dynamic-index-level-index-settings). `auto_generate_synonyms_phrase_query` | Boolean | Specifies whether to create a [match phrase query]({{site.url}}{{site.baseurl}}/query-dsl/full-text/match-phrase/) automatically for multi-term synonyms. For example, if you specify `ba, batting average` as synonyms and search for `ba`, OpenSearch searches for `ba OR "batting average"` (if this option is `true`) or `ba OR (batting AND average)` (if this option is `false`). Default is `true`. `boost` | Floating-point | Boosts the clause by the given multiplier. Useful for weighing clauses in compound queries. Values in the [0, 1) range decrease relevance, and values greater than 1 increase relevance. Default is `1`. `default_field` | String | The field in which to search if the field is not specified in the query string. Supports wildcards. Defaults to the value specified in the `index.query. Default_field` index setting. By default, the `index.query. Default_field` is `*`, which means extract all fields eligible for term query and filter the metadata fields. The extracted fields are combined into a query if the `prefix` is not specified. Eligible fields do not include nested documents. Searching all eligible fields could be a resource-intensive operation. The `indices.query.bool.max_clause_count` search setting defines the maximum value for the product of the number of fields and the number of terms that can be queried at one time. The default value for `indices.query.bool.max_clause_count` is 1,024. diff --git a/_query-dsl/full-text/simple-query-string.md b/_query-dsl/full-text/simple-query-string.md index 5dd2462e9a..1624efdaa7 100644 --- a/_query-dsl/full-text/simple-query-string.md +++ b/_query-dsl/full-text/simple-query-string.md @@ -355,7 +355,7 @@ Parameter | Data type | Description :--- | :--- | :--- `query`| String | The text that may contain expressions in the [simple query string syntax](#simple-query-string-syntax) to use for search. Required. `analyze_wildcard` | Boolean | Specifies whether OpenSearch should attempt to analyze wildcard terms. Default is `false`. -`analyzer` | String | The analyzer used to tokenize the query string text. Default is the index-time analyzer specified for the `default_field`. If no analyzer is specified for the `default_field`, the `analyzer` is the default analyzer for the index. +`analyzer` | String | The analyzer used to tokenize the query string text. Default is the index-time analyzer specified for the `default_field`. If no analyzer is specified for the `default_field`, the `analyzer` is the default analyzer for the index. For more information about `index.query.default_field`, see [Dynamic index-level index settings]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/index-settings/#dynamic-index-level-index-settings). `auto_generate_synonyms_phrase_query` | Boolean | Specifies whether to create [match_phrase queries]({{site.url}}{{site.baseurl}}/query-dsl/full-text/match/) automatically for multi-term synonyms. Default is `true`. `default_operator`| String | If the query string contains multiple search terms, whether all terms need to match (`AND`) or only one term needs to match (`OR`) for a document to be considered a match. Valid values are:
- `OR`: The string `to be` is interpreted as `to OR be`
- `AND`: The string `to be` is interpreted as `to AND be`
Default is `OR`. `fields` | String array | The list of fields to search (for example, `"fields": ["title^4", "description"]`). Supports wildcards. If unspecified, defaults to the `index.query.default_field` setting, which defaults to `["*"]`. The maximum number of fields that can be searched at the same time is defined by `indices.query.bool.max_clause_count`, which is 1,024 by default. diff --git a/_search-plugins/knn/painless-functions.md b/_search-plugins/knn/painless-functions.md index 7a8d9fec7b..4b2311ad65 100644 --- a/_search-plugins/knn/painless-functions.md +++ b/_search-plugins/knn/painless-functions.md @@ -51,7 +51,7 @@ The following table describes the available painless functions the k-NN plugin p Function name | Function signature | Description :--- | :--- l2Squared | `float l2Squared (float[] queryVector, doc['vector field'])` | This function calculates the square of the L2 distance (Euclidean distance) between a given query vector and document vectors. The shorter the distance, the more relevant the document is, so this example inverts the return value of the l2Squared function. If the document vector matches the query vector, the result is 0, so this example also adds 1 to the distance to avoid divide by zero errors. -l1Norm | `float l1Norm (float[] queryVector, doc['vector field'])` | This function calculates the square of the L2 distance (Euclidean distance) between a given query vector and document vectors. The shorter the distance, the more relevant the document is, so this example inverts the return value of the l2Squared function. If the document vector matches the query vector, the result is 0, so this example also adds 1 to the distance to avoid divide by zero errors. +l1Norm | `float l1Norm (float[] queryVector, doc['vector field'])` | This function calculates the L1 Norm distance (Manhattan distance) between a given query vector and document vectors. cosineSimilarity | `float cosineSimilarity (float[] queryVector, doc['vector field'])` | Cosine similarity is an inner product of the query vector and document vector normalized to both have a length of 1. If the magnitude of the query vector doesn't change throughout the query, you can pass the magnitude of the query vector to improve performance, instead of calculating the magnitude every time for every filtered document:
`float cosineSimilarity (float[] queryVector, doc['vector field'], float normQueryVector)`
In general, the range of cosine similarity is [-1, 1]. However, in the case of information retrieval, the cosine similarity of two documents ranges from 0 to 1 because the tf-idf statistic can't be negative. Therefore, the k-NN plugin adds 1.0 in order to always yield a positive cosine similarity score. hamming | `float hamming (float[] queryVector, doc['vector field'])` | This function calculates the Hamming distance between a given query vector and document vectors. The Hamming distance is the number of positions at which the corresponding elements are different. The shorter the distance, the more relevant the document is, so this example inverts the return value of the Hamming distance. @@ -73,4 +73,4 @@ The `hamming` space type is supported for binary vectors in OpenSearch version 2 Because scores can only be positive, this script ranks documents with vector fields higher than those without. With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as input. This is because the magnitude of such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests containing the zero vector will be rejected, and a corresponding exception will be thrown. -{: .note } \ No newline at end of file +{: .note } diff --git a/_search-plugins/ltr/training-models.md b/_search-plugins/ltr/training-models.md index 7bee5d10ac..fb068cedd7 100644 --- a/_search-plugins/ltr/training-models.md +++ b/_search-plugins/ltr/training-models.md @@ -16,10 +16,10 @@ The feature logging process generates a RankLib-comsumable judgment file. In the score) and 2 (a description `TF*IDF` score) for a set of documents: ``` -4 qid:1 1:9.8376875 2:12.318446 # 7555 rambo -3 qid:1 1:10.7808075 2:9.510193 # 1370 rambo -3 qid:1 1:10.7808075 2:6.8449354 # 1369 rambo -3 qid:1 1:10.7808075 2:0.0 # 1368 rambo +4 qid:1 1:9.8376875 2:12.318446 # 7555 rambo +3 qid:1 1:10.7808075 2:9.510193 # 1370 rambo +3 qid:1 1:10.7808075 2:6.8449354 # 1369 rambo +3 qid:1 1:10.7808075 2:0.0 # 1368 rambo ``` The RankLib library can be called using the following command: @@ -49,6 +49,29 @@ RankLib outputs the model in its own serialization format. As shown in the follo Within the RankLib model, each tree in the ensemble examines feature values, makes decisions based on these feature values, and outputs the relevance scores. The features are referred to by their ordinal position, starting from 1, which corresponds to the 0th feature in the original feature set. RankLib does not use feature names during model training. +### Other RankLib models + +RankLib is a library that implements several other model types in addition to LambdaMART, such as MART, +RankNet, RankBoost, AdaRank, Coordinate Ascent, ListNet, and Random Forests. Each of these models has its own set of parameters and training process. + +For example, the RankNet model is a neural network that learns to predict the probability of a document being more relevant than another document. The model is trained using a pairwise loss function that compares the predicted relevance of two documents with the actual relevance. The model is serialized in a format similar to the following example: + +``` +## RankNet +## Epochs = 100 +## No. of features = 5 +## No. of hidden layers = 1 +... +## Layer 1: 10 neurons +1 2 +1 +10 +0 0 -0.013491530393429608 0.031183180961270988 0.06558792020112071 -0.006024092627087733 0.05729619574181734 -0.0017010373987742411 0.07684848696852313 -0.06570387602230028 0.04390491141617467 0.013371636736099578 +... 
+``` + +All these models can be used with the Learning to Rank plugin, provided that the model is serialized in the RankLib format. + ## XGBoost model training Unlike the RankLib model, the XGBoost model is serialized in a format specific to gradient-boosted decision trees, as shown in the following example: diff --git a/_search-plugins/searching-data/index.md b/_search-plugins/searching-data/index.md index 279958d97c..42ce7654a0 100644 --- a/_search-plugins/searching-data/index.md +++ b/_search-plugins/searching-data/index.md @@ -19,4 +19,4 @@ Feature | Description [Sort results]({{site.url}}{{site.baseurl}}/opensearch/search/sort/) | Allow sorting of results by different criteria. [Highlight query matches]({{site.url}}{{site.baseurl}}/opensearch/search/highlight/) | Highlight the search term in the results. [Retrieve inner hits]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/inner-hits/) | Retrieve underlying hits in nested and parent-join objects. -[Retrieve specific fields]({{site.url}}{{site.baseurl}}search-plugins/searching-data/retrieve-specific-fields/) | Retrieve only the specific fields +[Retrieve specific fields]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/retrieve-specific-fields/) | Retrieve only the specific fields diff --git a/_tools/k8s-operator.md b/_tools/k8s-operator.md index 7ee1c1adee..5027dcf304 100644 --- a/_tools/k8s-operator.md +++ b/_tools/k8s-operator.md @@ -63,40 +63,40 @@ Then install the OpenSearch Kubernetes Operator using the following steps: 3. Enter `make build manifests`. 4. Start a Kubernetes cluster. When using minikube, open a new terminal window and enter `minikube start`. Kubernetes will now use a containerized minikube cluster with a namespace called `default`. Make sure that `~/.kube/config` points to the cluster. -```yml -apiVersion: v1 -clusters: -- cluster: - certificate-authority: /Users/naarcha/.minikube/ca.crt - extensions: - - extension: - last-update: Mon, 29 Aug 2022 10:11:47 CDT - provider: minikube.sigs.k8s.io - version: v1.26.1 - name: cluster_info - server: https://127.0.0.1:61661 - name: minikube -contexts: -- context: - cluster: minikube - extensions: - - extension: - last-update: Mon, 29 Aug 2022 10:11:47 CDT - provider: minikube.sigs.k8s.io - version: v1.26.1 - name: context_info - namespace: default - user: minikube - name: minikube -current-context: minikube -kind: Config -preferences: {} -users: -- name: minikube - user: - client-certificate: /Users/naarcha/.minikube/profiles/minikube/client.crt - client-key: /Users/naarcha/.minikube/profiles/minikube/client.key -``` + ```yml + apiVersion: v1 + clusters: + - cluster: + certificate-authority: /Users/naarcha/.minikube/ca.crt + extensions: + - extension: + last-update: Mon, 29 Aug 2022 10:11:47 CDT + provider: minikube.sigs.k8s.io + version: v1.26.1 + name: cluster_info + server: https://127.0.0.1:61661 + name: minikube + contexts: + - context: + cluster: minikube + extensions: + - extension: + last-update: Mon, 29 Aug 2022 10:11:47 CDT + provider: minikube.sigs.k8s.io + version: v1.26.1 + name: context_info + namespace: default + user: minikube + name: minikube + current-context: minikube + kind: Config + preferences: {} + users: + - name: minikube + user: + client-certificate: /Users/naarcha/.minikube/profiles/minikube/client.crt + client-key: /Users/naarcha/.minikube/profiles/minikube/client.key + ``` 5. Enter `make install` to create the CustomResourceDefinition that runs in your Kubernetes cluster. 6. Start the OpenSearch Kubernetes Operator. 
Enter `make run`. @@ -146,4 +146,4 @@ kubectl delete -f opensearch-cluster.yaml To learn more about how to customize your Kubernetes OpenSearch cluster, including data persistence, authentication methods, and scaling, see the [OpenSearch Kubernetes Operator User Guide](https://github.com/Opster/opensearch-k8s-operator/blob/main/docs/userguide/main.md). -If you want to contribute to the development of the OpenSearch Kubernetes Operator, see the repo [design documents](https://github.com/Opster/opensearch-k8s-operator/blob/main/docs/designs/high-level.md). \ No newline at end of file +If you want to contribute to the development of the OpenSearch Kubernetes Operator, see the repo [design documents](https://github.com/Opster/opensearch-k8s-operator/blob/main/docs/designs/high-level.md). diff --git a/_tuning-your-cluster/availability-and-recovery/snapshots/searchable_snapshot.md b/_tuning-your-cluster/availability-and-recovery/snapshots/searchable_snapshot.md index d13955f3f0..7076c792e2 100644 --- a/_tuning-your-cluster/availability-and-recovery/snapshots/searchable_snapshot.md +++ b/_tuning-your-cluster/availability-and-recovery/snapshots/searchable_snapshot.md @@ -46,7 +46,7 @@ services: - node.search.cache.size=50gb ``` - +- Starting with version 2.18, k-NN indexes support searchable snapshots for the NMSLIB and Faiss engines. ## Create a searchable snapshot index @@ -109,4 +109,3 @@ The following are known limitations of the searchable snapshots feature: - Searching remote data can impact the performance of other queries running on the same node. We recommend that users provision dedicated nodes with the `search` role for performance-critical applications. - For better search performance, consider [force merging]({{site.url}}{{site.baseurl}}/api-reference/index-apis/force-merge/) indexes into a smaller number of segments before taking a snapshot. For the best performance, at the cost of using compute resources prior to snapshotting, force merge your index into one segment. - We recommend configuring a maximum ratio of remote data to local disk cache size using the `cluster.filecache.remote_data_ratio` setting. A ratio of 5 is a good starting point for most workloads to ensure good query performance. If the ratio is too large, then there may not be sufficient disk space to handle the search workload. For more details on the maximum ratio of remote data, see issue [#11676](https://github.com/opensearch-project/OpenSearch/issues/11676). -- k-NN native-engine-based indexes using `faiss` and `nmslib` engines are incompatible with searchable snapshots. diff --git a/_tuning-your-cluster/availability-and-recovery/workload-management/query-group-lifecycle-api.md b/_tuning-your-cluster/availability-and-recovery/workload-management/query-group-lifecycle-api.md index 83d66c9eb8..0fb0b0b65c 100644 --- a/_tuning-your-cluster/availability-and-recovery/workload-management/query-group-lifecycle-api.md +++ b/_tuning-your-cluster/availability-and-recovery/workload-management/query-group-lifecycle-api.md @@ -1,6 +1,6 @@ --- layout: default -title: Query group lifecycle API +title: Query Group Lifecycle API nav_order: 20 parent: Workload management grand_parent: Availability and recovery @@ -8,9 +8,9 @@ grand_parent: Availability and recovery # Query Group Lifecycle API -The Query Group Lifecycle API in creates, updates, retrieves, and deletes query groups. The API categorizes queries into specific groups, called _query groups_ based on desired resource limits. 
+The Query Group Lifecycle API creates, updates, retrieves, and deletes query groups. The API categorizes queries into specific groups, called _query groups_, based on desired resource limits. -## Paths and HTTP method +## Path and HTTP methods ```json PUT _wlm/query_group @@ -33,7 +33,7 @@ When creating a query group, make sure that the sum of the resource limits for a ## Example requests -The following requests show how to use the Query Group Lifecycle API. +The following example requests show how to use the Query Group Lifecycle API. ### Creating a query group @@ -48,6 +48,7 @@ PUT _wlm/query_group } } ``` +{% include copy-curl.html %} ### Updating a query group @@ -61,18 +62,21 @@ PUT _wlm/query_group/analytics } } ``` +{% include copy-curl.html %} ### Getting a query group ```json GET _wlm/query_group/analytics ``` +{% include copy-curl.html %} ### Deleting a query group ```json DELETE _wlm/query_group/analytics ``` +{% include copy-curl.html %} ## Example responses @@ -93,7 +97,7 @@ OpenSearch returns responses similar to the following. } ``` -### Updating query group +### Updating a query group ```json { diff --git a/_upgrade-to/index.md b/_upgrade-to/index.md index 0eea3d6209..696be88c21 100644 --- a/_upgrade-to/index.md +++ b/_upgrade-to/index.md @@ -1,6 +1,6 @@ --- layout: default -title: About the migration process +title: Upgrading OpenSearch nav_order: 1 nav_exclude: true permalink: /upgrade-to/ @@ -8,15 +8,14 @@ redirect_from: - /upgrade-to/index/ --- -# About the migration process +# Upgrading OpenSearch -The process of migrating from Elasticsearch OSS to OpenSearch varies depending on your current version of Elasticsearch OSS, installation type, tolerance for downtime, and cost-sensitivity. Rather than concrete steps to cover every situation, we have general guidance for the process. +The process of upgrading your OpenSearch version varies depending on your current version of OpenSearch, installation type, tolerance for downtime, and cost-sensitivity. If you are migrating to OpenSearch from another search solution, we provide a [Migration Assistant]({{site.url}}{{site.baseurl}}/migration-assistant/). -Three approaches exist: +Two upgrade approaches exist: -- Use a snapshot to [migrate your Elasticsearch OSS data]({{site.url}}{{site.baseurl}}/upgrade-to/snapshot-migrate/) to a new OpenSearch cluster. This method may incur downtime. -- Perform a [restart upgrade or a rolling upgrade]({{site.url}}{{site.baseurl}}/upgrade-to/upgrade-to/) on your existing nodes. A restart upgrade involves upgrading the entire cluster and restarting it, whereas a rolling upgrade requires upgrading and restarting nodes in the cluster one by one. -- Replace existing Elasticsearch OSS nodes with new OpenSearch nodes. Node replacement is most popular when upgrading [Docker clusters]({{site.url}}{{site.baseurl}}/upgrade-to/docker-upgrade-to/). +- Perform a [restart upgrade or a rolling upgrade]({{site.url}}{{site.baseurl}}/upgrade-to/upgrade-to/) on your existing nodes. A restart upgrade involves upgrading the entire cluster and restarting it, whereas a rolling upgrade requires upgrading and restarting nodes in the cluster one by one. +- Replace existing OpenSearch nodes with new OpenSearch nodes. Node replacement is most popular when upgrading [Docker clusters]({{site.url}}{{site.baseurl}}/upgrade-to/docker-upgrade-to/). Regardless of your approach, to safeguard against data loss, we recommend that you take a [snapshot]({{site.url}}{{site.baseurl}}/opensearch/snapshots/snapshot-restore) of all indexes prior to any migration.
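The following is a minimal sketch of that snapshot recommendation, assuming a snapshot repository has already been registered with the cluster; the repository name (`migration-backup`) and snapshot name (`pre-upgrade-snapshot`) are placeholders.

```json
PUT _snapshot/migration-backup/pre-upgrade-snapshot
{
  "indices": "*",
  "include_global_state": true
}
```

You can then poll `GET _snapshot/migration-backup/pre-upgrade-snapshot` and wait for the snapshot `state` to report `SUCCESS` before starting the upgrade or migration.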
diff --git a/_upgrade-to/upgrade-to.md b/_upgrade-to/upgrade-to.md index 340055b214..00950687a5 100644 --- a/_upgrade-to/upgrade-to.md +++ b/_upgrade-to/upgrade-to.md @@ -6,6 +6,10 @@ nav_order: 15 # Migrating from Elasticsearch OSS to OpenSearch + +OpenSearch provides a [Migration Assistant]({{site.url}}{{site.baseurl}}/migration-assistant/) to help you migrate from other search solutions. +{: .warning} + If you want to migrate from an existing Elasticsearch OSS cluster to OpenSearch and find the [snapshot approach]({{site.url}}{{site.baseurl}}/upgrade-to/snapshot-migrate/) unappealing, you can migrate your existing nodes from Elasticsearch OSS to OpenSearch. If your existing cluster runs an older version of Elasticsearch OSS, the first step is to upgrade to version 6.x or 7.x. diff --git a/images/migrations/migrations-architecture-overview.png b/images/migrations/migrations-architecture-overview.png new file mode 100644 index 0000000000..3002da3a87 Binary files /dev/null and b/images/migrations/migrations-architecture-overview.png differ
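Before choosing an upgrade or migration path, it helps to confirm which version the existing cluster is running. The following is a generic check, assuming the cluster is reachable on its default port without authentication; the `version.number` field in the response reports the running Elasticsearch OSS or OpenSearch version.

```json
GET /
```

If the reported version is older than 6.x, plan the intermediate upgrade to 6.x or 7.x described above before moving to OpenSearch.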