diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
index 815687fa17..61a535fe91 100644
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -1 +1 @@
-* @kolchfa-aws @Naarcha-AWS @vagimeli @AMoo-Miki @natebower @dlvenable @stephen-crawford @epugh
+* @kolchfa-aws @Naarcha-AWS @AMoo-Miki @natebower @dlvenable @epugh
diff --git a/.github/workflows/pr_checklist.yml b/.github/workflows/pr_checklist.yml
index b56174793e..e34d0cecb2 100644
--- a/.github/workflows/pr_checklist.yml
+++ b/.github/workflows/pr_checklist.yml
@@ -29,7 +29,7 @@ jobs:
with:
script: |
let assignee = context.payload.pull_request.user.login;
- const prOwners = ['Naarcha-AWS', 'kolchfa-aws', 'vagimeli', 'natebower'];
+ const prOwners = ['Naarcha-AWS', 'kolchfa-aws', 'natebower'];
if (!prOwners.includes(assignee)) {
assignee = 'kolchfa-aws'
@@ -40,4 +40,4 @@ jobs:
owner: context.repo.owner,
repo: context.repo.repo,
assignees: [assignee]
- });
\ No newline at end of file
+ });
diff --git a/.ruby-version b/.ruby-version
deleted file mode 100644
index 4772543317..0000000000
--- a/.ruby-version
+++ /dev/null
@@ -1 +0,0 @@
-3.3.2
diff --git a/MAINTAINERS.md b/MAINTAINERS.md
index 55b908e027..b06d367e21 100644
--- a/MAINTAINERS.md
+++ b/MAINTAINERS.md
@@ -9,14 +9,14 @@ This document lists the maintainers in this repo. See [opensearch-project/.githu
| Fanit Kolchina | [kolchfa-aws](https://github.com/kolchfa-aws) | Amazon |
| Nate Archer | [Naarcha-AWS](https://github.com/Naarcha-AWS) | Amazon |
| Nathan Bower | [natebower](https://github.com/natebower) | Amazon |
-| Melissa Vagi | [vagimeli](https://github.com/vagimeli) | Amazon |
| Miki Barahmand | [AMoo-Miki](https://github.com/AMoo-Miki) | Amazon |
| David Venable | [dlvenable](https://github.com/dlvenable) | Amazon |
-| Stephen Crawford | [stephen-crawford](https://github.com/stephen-crawford) | Amazon |
| Eric Pugh | [epugh](https://github.com/epugh) | OpenSource Connections |
## Emeritus
-| Maintainer | GitHub ID | Affiliation |
-| ---------------- | ----------------------------------------------- | ----------- |
-| Heather Halter | [hdhalter](https://github.com/hdhalter) | Amazon |
+| Maintainer | GitHub ID | Affiliation |
+| ---------------- | ------------------------------------------------------- | ----------- |
+| Heather Halter | [hdhalter](https://github.com/hdhalter) | Amazon |
+| Melissa Vagi | [vagimeli](https://github.com/vagimeli) | Amazon |
+| Stephen Crawford | [stephen-crawford](https://github.com/stephen-crawford) | Amazon |
\ No newline at end of file
diff --git a/README.md b/README.md
index 52321335c7..807e106309 100644
--- a/README.md
+++ b/README.md
@@ -24,7 +24,6 @@ If you encounter problems or have questions when contributing to the documentati
- [kolchfa-aws](https://github.com/kolchfa-aws)
- [Naarcha-AWS](https://github.com/Naarcha-AWS)
-- [vagimeli](https://github.com/vagimeli)
## Code of conduct
diff --git a/_about/version-history.md b/_about/version-history.md
index bfbd8e9f55..d1cf98c178 100644
--- a/_about/version-history.md
+++ b/_about/version-history.md
@@ -34,6 +34,7 @@ OpenSearch version | Release highlights | Release date
[2.0.1](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.0.1.md) | Includes bug fixes and maintenance updates for Alerting and Anomaly Detection. | 16 June 2022
[2.0.0](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.0.0.md) | Includes document-level monitors for alerting, OpenSearch Notifications plugins, and Geo Map Tiles in OpenSearch Dashboards. Also adds support for Lucene 9 and bug fixes for all OpenSearch plugins. For a full list of release highlights, see the Release Notes. | 26 May 2022
[2.0.0-rc1](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.0.0-rc1.md) | The Release Candidate for 2.0.0. This version allows you to preview the upcoming 2.0.0 release before the GA release. The preview release adds document-level alerting, support for Lucene 9, and the ability to use term lookup queries in document level security. | 03 May 2022
+[1.3.20](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.20.md) | Includes enhancements to Anomaly Detection Dashboards, bug fixes for Alerting and Dashboards Reports, and maintenance updates for several OpenSearch components. | 11 December 2024
[1.3.19](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.19.md) | Includes bug fixes and maintenance updates for OpenSearch security, OpenSearch security Dashboards, and anomaly detection. | 27 August 2024
[1.3.18](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.18.md) | Includes maintenance updates for OpenSearch security. | 16 July 2024
[1.3.17](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.17.md) | Includes maintenance updates for OpenSearch security and OpenSearch Dashboards security. | 06 June 2024
diff --git a/_analyzers/custom-analyzer.md b/_analyzers/custom-analyzer.md
new file mode 100644
index 0000000000..c456f3d826
--- /dev/null
+++ b/_analyzers/custom-analyzer.md
@@ -0,0 +1,312 @@
+---
+layout: default
+title: Creating a custom analyzer
+nav_order: 40
+parent: Analyzers
+---
+
+# Creating a custom analyzer
+
+To create a custom analyzer, specify a combination of the following components:
+
+- Character filters (zero or more)
+
+- Tokenizer (one)
+
+- Token filters (zero or more)
+
+## Configuration
+
+The following parameters can be used to configure a custom analyzer.
+
+| Parameter | Required/Optional | Description |
+|:--- | :--- | :--- |
+| `type` | Optional | The analyzer type. Default is `custom`. You can also specify a prebuilt analyzer using this parameter. |
+| `tokenizer` | Required | A tokenizer to be included in the analyzer. |
+| `char_filter` | Optional | A list of character filters to be included in the analyzer. |
+| `filter` | Optional | A list of token filters to be included in the analyzer. |
+| `position_increment_gap` | Optional | The extra spacing applied between values when indexing text fields that have multiple values. For more information, see [Position increment gap](#position-increment-gap). Default is `100`. |
+
+## Examples
+
+The following examples demonstrate various custom analyzer configurations.
+
+### Custom analyzer with a character filter for HTML stripping
+
+The following example analyzer removes HTML tags from text before tokenization:
+
+```json
+PUT simple_html_strip_analyzer_index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "html_strip_analyzer": {
+ "type": "custom",
+ "char_filter": ["html_strip"],
+ "tokenizer": "whitespace",
+ "filter": ["lowercase"]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET simple_html_strip_analyzer_index/_analyze
+{
+ "analyzer": "html_strip_analyzer",
+ "text": "<p>OpenSearch is <strong>awesome</strong>!</p>"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "opensearch",
+ "start_offset": 3,
+ "end_offset": 13,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "is",
+ "start_offset": 14,
+ "end_offset": 16,
+ "type": "word",
+ "position": 1
+ },
+ {
+ "token": "awesome!",
+ "start_offset": 25,
+ "end_offset": 42,
+ "type": "word",
+ "position": 2
+ }
+ ]
+}
+```
+
+### Custom analyzer with a mapping character filter for synonym replacement
+
+The following example analyzer replaces specific characters and patterns before applying the synonym filter:
+
+```json
+PUT mapping_analyzer_index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "synonym_mapping_analyzer": {
+ "type": "custom",
+ "char_filter": ["underscore_to_space"],
+ "tokenizer": "standard",
+ "filter": ["lowercase", "stop", "synonym_filter"]
+ }
+ },
+ "char_filter": {
+ "underscore_to_space": {
+ "type": "mapping",
+ "mappings": ["_ => ' '"]
+ }
+ },
+ "filter": {
+ "synonym_filter": {
+ "type": "synonym",
+ "synonyms": [
+ "quick, fast, speedy",
+ "big, large, huge"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET mapping_analyzer_index/_analyze
+{
+ "analyzer": "synonym_mapping_analyzer",
+ "text": "The slow_green_turtle is very large"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {"token": "slow","start_offset": 4,"end_offset": 8,"type": "<ALPHANUM>","position": 1},
+ {"token": "green","start_offset": 9,"end_offset": 14,"type": "<ALPHANUM>","position": 2},
+ {"token": "turtle","start_offset": 15,"end_offset": 21,"type": "<ALPHANUM>","position": 3},
+ {"token": "very","start_offset": 25,"end_offset": 29,"type": "<ALPHANUM>","position": 5},
+ {"token": "large","start_offset": 30,"end_offset": 35,"type": "<ALPHANUM>","position": 6},
+ {"token": "big","start_offset": 30,"end_offset": 35,"type": "SYNONYM","position": 6},
+ {"token": "huge","start_offset": 30,"end_offset": 35,"type": "SYNONYM","position": 6}
+ ]
+}
+```
+
+### Custom analyzer with a custom pattern-based character filter for number normalization
+
+The following example analyzer normalizes phone numbers by removing dashes and spaces and applies edge n-grams to the normalized text to support partial matches:
+
+```json
+PUT advanced_pattern_replace_analyzer_index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "phone_number_analyzer": {
+ "type": "custom",
+ "char_filter": ["phone_normalization"],
+ "tokenizer": "standard",
+ "filter": ["lowercase", "edge_ngram"]
+ }
+ },
+ "char_filter": {
+ "phone_normalization": {
+ "type": "pattern_replace",
+ "pattern": "[-\\s]",
+ "replacement": ""
+ }
+ },
+ "filter": {
+ "edge_ngram": {
+ "type": "edge_ngram",
+ "min_gram": 3,
+ "max_gram": 10
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET advanced_pattern_replace_analyzer_index/_analyze
+{
+ "analyzer": "phone_number_analyzer",
+ "text": "123-456 7890"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {"token": "123","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
+ {"token": "1234","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
+ {"token": "12345","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
+ {"token": "123456","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
+ {"token": "1234567","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
+ {"token": "12345678","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
+ {"token": "123456789","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
+ {"token": "1234567890","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0}
+ ]
+}
+```
+
+## Position increment gap
+
+The `position_increment_gap` parameter sets a positional gap between terms when indexing multi-valued fields, such as arrays. This gap ensures that phrase queries don't match terms across separate values unless explicitly allowed. For example, a default gap of 100 specifies that terms in different array entries are 100 positions apart, preventing unintended matches in phrase searches. You can adjust this value or set it to `0` in order to allow phrases to span across array values.
+
+The following example demonstrates the effect of `position_increment_gap` using a `match_phrase` query.
+
+1. Index a document in a `test-index`:
+
+ ```json
+ PUT test-index/_doc/1
+ {
+ "names": [ "Slow green", "turtle swims"]
+ }
+ ```
+ {% include copy-curl.html %}
+
+1. Query the document using a `match_phrase` query:
+
+ ```json
+ GET test-index/_search
+ {
+ "query": {
+ "match_phrase": {
+ "names": {
+ "query": "green turtle"
+ }
+ }
+ }
+ }
+ ```
+ {% include copy-curl.html %}
+
+ The response returns no hits because the distance between the terms `green` and `turtle` is `100` (the default `position_increment_gap`).
+
+1. Now query the document using a `match_phrase` query with a `slop` parameter that is higher than the `position_increment_gap`:
+
+ ```json
+ GET test-index/_search
+ {
+ "query": {
+ "match_phrase": {
+ "names": {
+ "query": "green turtle",
+ "slop": 101
+ }
+ }
+ }
+ }
+ ```
+ {% include copy-curl.html %}
+
+ The response contains the matching document:
+
+ ```json
+ {
+ "took": 4,
+ "timed_out": false,
+ "_shards": {
+ "total": 1,
+ "successful": 1,
+ "skipped": 0,
+ "failed": 0
+ },
+ "hits": {
+ "total": {
+ "value": 1,
+ "relation": "eq"
+ },
+ "max_score": 0.010358453,
+ "hits": [
+ {
+ "_index": "test-index",
+ "_id": "1",
+ "_score": 0.010358453,
+ "_source": {
+ "names": [
+ "Slow green",
+ "turtle swims"
+ ]
+ }
+ }
+ ]
+ }
+ }
+ ```
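+
+As an alternative to increasing `slop`, you can set `position_increment_gap` to `0` in the field mapping so that phrases can span array values. The following request is a sketch that uses a hypothetical index named `test-index-gap` but is otherwise analogous to the preceding example:
+
+```json
+PUT test-index-gap
+{
+  "mappings": {
+    "properties": {
+      "names": {
+        "type": "text",
+        "position_increment_gap": 0
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+With this mapping, the `match_phrase` query for `green turtle` matches without a `slop` parameter because adjacent array values are indexed with no positional gap between them.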
diff --git a/_analyzers/index.md b/_analyzers/index.md
index def6563f3e..1dc38b2cd4 100644
--- a/_analyzers/index.md
+++ b/_analyzers/index.md
@@ -51,7 +51,7 @@ For a list of supported analyzers, see [Analyzers]({{site.url}}{{site.baseurl}}/
## Custom analyzers
-If needed, you can combine tokenizers, token filters, and character filters to create a custom analyzer.
+If needed, you can combine tokenizers, token filters, and character filters to create a custom analyzer. For more information, see [Creating a custom analyzer]({{site.url}}{{site.baseurl}}/analyzers/custom-analyzer/).
## Text analysis at indexing time and query time
diff --git a/_analyzers/language-analyzers/index.md b/_analyzers/language-analyzers/index.md
index 89a4a42254..cc53c1cdac 100644
--- a/_analyzers/language-analyzers/index.md
+++ b/_analyzers/language-analyzers/index.md
@@ -1,7 +1,7 @@
---
layout: default
title: Language analyzers
-nav_order: 100
+nav_order: 140
parent: Analyzers
has_children: true
has_toc: true
diff --git a/_analyzers/normalizers.md b/_analyzers/normalizers.md
index b89659f814..52841d2571 100644
--- a/_analyzers/normalizers.md
+++ b/_analyzers/normalizers.md
@@ -1,7 +1,7 @@
---
layout: default
title: Normalizers
-nav_order: 100
+nav_order: 110
---
# Normalizers
diff --git a/_analyzers/supported-analyzers/fingerprint.md b/_analyzers/supported-analyzers/fingerprint.md
new file mode 100644
index 0000000000..267e16c039
--- /dev/null
+++ b/_analyzers/supported-analyzers/fingerprint.md
@@ -0,0 +1,115 @@
+---
+layout: default
+title: Fingerprint analyzer
+parent: Analyzers
+nav_order: 60
+---
+
+# Fingerprint analyzer
+
+The `fingerprint` analyzer creates a text fingerprint. The analyzer sorts and deduplicates the terms (tokens) generated from the input and then concatenates them using a separator. It is commonly used for data deduplication because it produces the same output for similar inputs containing the same words, regardless of word order.
+
+The `fingerprint` analyzer comprises the following components:
+
+- Standard tokenizer
+- Lowercase token filter
+- ASCII folding token filter
+- Stop token filter
+- Fingerprint token filter
+
+## Parameters
+
+The `fingerprint` analyzer can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`separator` | Optional | String | Specifies the character used to concatenate the terms after they have been tokenized, sorted, and deduplicated. Default is an empty space (` `).
+`max_output_size` | Optional | Integer | Defines the maximum size of the output token. If the concatenated fingerprint exceeds this size, it will be truncated. Default is `255`.
+`stopwords` | Optional | String or list of strings | A custom or predefined list of stopwords. Default is `_none_`.
+`stopwords_path` | Optional | String | The path (absolute or relative to the config directory) to the file containing a list of stopwords.
+
+
+## Example
+
+Use the following command to create an index named `my_custom_fingerprint_index` with a `fingerprint` analyzer:
+
+```json
+PUT /my_custom_fingerprint_index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "my_custom_fingerprint_analyzer": {
+ "type": "fingerprint",
+ "separator": "-",
+ "max_output_size": 50,
+ "stopwords": ["to", "the", "over", "and"]
+ }
+ }
+ }
+ },
+ "mappings": {
+ "properties": {
+ "my_field": {
+ "type": "text",
+ "analyzer": "my_custom_fingerprint_analyzer"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_custom_fingerprint_index/_analyze
+{
+ "analyzer": "my_custom_fingerprint_analyzer",
+ "text": "The slow turtle swims over to the dog"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "dog-slow-swims-turtle",
+ "start_offset": 0,
+ "end_offset": 37,
+ "type": "fingerprint",
+ "position": 0
+ }
+ ]
+}
+```
+
+## Further customization
+
+If further customization is needed, you can define an analyzer with additional `fingerprint` analyzer components:
+
+```json
+PUT /custom_fingerprint_analyzer
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "custom_fingerprint": {
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "asciifolding",
+ "fingerprint"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
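+
+To check the output of this custom configuration, you can run the same `_analyze` request against it. Because this configuration does not remove stopwords and uses the default space separator, the resulting fingerprint is expected to differ slightly from the earlier example:
+
+```json
+POST /custom_fingerprint_analyzer/_analyze
+{
+  "analyzer": "custom_fingerprint",
+  "text": "The slow turtle swims over to the dog"
+}
+```
+{% include copy-curl.html %}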
diff --git a/_analyzers/supported-analyzers/index.md b/_analyzers/supported-analyzers/index.md
index 43e41b8d6a..b54660478f 100644
--- a/_analyzers/supported-analyzers/index.md
+++ b/_analyzers/supported-analyzers/index.md
@@ -18,14 +18,14 @@ The following table lists the built-in analyzers that OpenSearch provides. The l
Analyzer | Analysis performed | Analyzer output
:--- | :--- | :---
-**Standard** (default) | - Parses strings into tokens at word boundaries - Removes most punctuation - Converts tokens to lowercase | [`it’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
-**Simple** | - Parses strings into tokens on any non-letter character - Removes non-letter characters - Converts tokens to lowercase | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `to`, `opensearch`]
-**Whitespace** | - Parses strings into tokens on white space | [`It’s`, `fun`, `to`, `contribute`, `a`,`brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`]
-**Stop** | - Parses strings into tokens on any non-letter character - Removes non-letter characters - Removes stop words - Converts tokens to lowercase | [`s`, `fun`, `contribute`, `brand`, `new`, `pr`, `opensearch`]
-**Keyword** (no-op) | - Outputs the entire string unchanged | [`It’s fun to contribute a brand-new PR or 2 to OpenSearch!`]
-**Pattern** | - Parses strings into tokens using regular expressions - Supports converting strings to lowercase - Supports removing stop words | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
+[**Standard**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/standard/) (default) | - Parses strings into tokens at word boundaries - Removes most punctuation - Converts tokens to lowercase | [`it’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
+[**Simple**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/simple/) | - Parses strings into tokens on any non-letter character - Removes non-letter characters - Converts tokens to lowercase | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `to`, `opensearch`]
+[**Whitespace**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/whitespace/) | - Parses strings into tokens on white space | [`It’s`, `fun`, `to`, `contribute`, `a`,`brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`]
+[**Stop**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/stop/) | - Parses strings into tokens on any non-letter character - Removes non-letter characters - Removes stop words - Converts tokens to lowercase | [`s`, `fun`, `contribute`, `brand`, `new`, `pr`, `opensearch`]
+[**Keyword**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/keyword/) (no-op) | - Outputs the entire string unchanged | [`It’s fun to contribute a brand-new PR or 2 to OpenSearch!`]
+[**Pattern**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/pattern/)| - Parses strings into tokens using regular expressions - Supports converting strings to lowercase - Supports removing stop words | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
[**Language**]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/index/) | Performs analysis specific to a certain language (for example, `english`). | [`fun`, `contribut`, `brand`, `new`, `pr`, `2`, `opensearch`]
-**Fingerprint** | - Parses strings on any non-letter character - Normalizes characters by converting them to ASCII - Converts tokens to lowercase - Sorts, deduplicates, and concatenates tokens into a single token - Supports removing stop words | [`2 a brand contribute fun it's new opensearch or pr to`] Note that the apostrophe was converted to its ASCII counterpart.
+[**Fingerprint**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/fingerprint/) | - Parses strings on any non-letter character - Normalizes characters by converting them to ASCII - Converts tokens to lowercase - Sorts, deduplicates, and concatenates tokens into a single token - Supports removing stop words | [`2 a brand contribute fun it's new opensearch or pr to`] Note that the apostrophe was converted to its ASCII counterpart.
## Language analyzers
@@ -37,5 +37,5 @@ The following table lists the additional analyzers that OpenSearch supports.
| Analyzer | Analysis performed |
|:---------------|:---------------------------------------------------------------------------------------------------------|
-| `phone` | An [index analyzer]({{site.url}}{{site.baseurl}}/analyzers/index-analyzers/) for parsing phone numbers. |
-| `phone-search` | A [search analyzer]({{site.url}}{{site.baseurl}}/analyzers/search-analyzers/) for parsing phone numbers. |
+| [`phone`]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/phone-analyzers/#the-phone-analyzer) | An [index analyzer]({{site.url}}{{site.baseurl}}/analyzers/index-analyzers/) for parsing phone numbers. |
+| [`phone-search`]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/phone-analyzers/#the-phone-search-analyzer) | A [search analyzer]({{site.url}}{{site.baseurl}}/analyzers/search-analyzers/) for parsing phone numbers. |
diff --git a/_analyzers/supported-analyzers/keyword.md b/_analyzers/supported-analyzers/keyword.md
new file mode 100644
index 0000000000..00c314d0c4
--- /dev/null
+++ b/_analyzers/supported-analyzers/keyword.md
@@ -0,0 +1,78 @@
+---
+layout: default
+title: Keyword analyzer
+parent: Analyzers
+nav_order: 80
+---
+
+# Keyword analyzer
+
+The `keyword` analyzer doesn't tokenize text at all. Instead, it treats the entire input as a single token. The `keyword` analyzer is often used for fields containing email addresses, URLs, or product IDs and in other cases where tokenization is not desirable.
+
+## Example
+
+Use the following command to create an index named `my_keyword_index` with a `keyword` analyzer:
+
+```json
+PUT /my_keyword_index
+{
+ "mappings": {
+ "properties": {
+ "my_field": {
+ "type": "text",
+ "analyzer": "keyword"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Configuring a custom analyzer
+
+Use the following command to configure an index with a custom analyzer that is equivalent to the `keyword` analyzer:
+
+```json
+PUT /my_custom_keyword_index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "my_keyword_analyzer": {
+ "tokenizer": "keyword"
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_custom_keyword_index/_analyze
+{
+ "analyzer": "my_keyword_analyzer",
+ "text": "Just one token"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "Just one token",
+ "start_offset": 0,
+ "end_offset": 14,
+ "type": "word",
+ "position": 0
+ }
+ ]
+}
+```
diff --git a/_analyzers/supported-analyzers/pattern.md b/_analyzers/supported-analyzers/pattern.md
new file mode 100644
index 0000000000..bc3cb9a306
--- /dev/null
+++ b/_analyzers/supported-analyzers/pattern.md
@@ -0,0 +1,97 @@
+---
+layout: default
+title: Pattern analyzer
+parent: Analyzers
+nav_order: 90
+---
+
+# Pattern analyzer
+
+The `pattern` analyzer allows you to define a custom analyzer that uses a regular expression (regex) to split input text into tokens. It also provides options for applying regex flags, converting tokens to lowercase, and filtering out stopwords.
+
+## Parameters
+
+The `pattern` analyzer can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`pattern` | Optional | String | A [Java regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) used to tokenize the input. Default is `\W+`.
+`flags` | Optional | String | A string containing pipe-separated [Java regex flags](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary) that modify the behavior of the regular expression.
+`lowercase` | Optional | Boolean | Whether to convert tokens to lowercase. Default is `true`.
+`stopwords` | Optional | String or list of strings | A string specifying a predefined list of stopwords (such as `_english_`) or an array specifying a custom list of stopwords. Default is `_none_`.
+`stopwords_path` | Optional | String | The path (absolute or relative to the config directory) to the file containing a list of stopwords.
+
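+The `flags` value is passed to the Java regex engine as pipe-separated flag names, for example, `CASE_INSENSITIVE|COMMENTS`. The following sketch uses a hypothetical index named `my_pattern_flags_index` and splits text on the literal separator `sep`, matched case insensitively:
+
+```json
+PUT /my_pattern_flags_index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_flags_analyzer": {
+          "type": "pattern",
+          "pattern": "sep",
+          "flags": "CASE_INSENSITIVE"
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+With this analyzer, input such as `oneSEPtwoSEPthree` is expected to produce the tokens `one`, `two`, and `three`.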
+
+## Example
+
+Use the following command to create an index named `my_pattern_index` with a `pattern` analyzer:
+
+```json
+PUT /my_pattern_index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "my_pattern_analyzer": {
+ "type": "pattern",
+ "pattern": "\\W+",
+ "lowercase": true,
+ "stopwords": ["and", "is"]
+ }
+ }
+ }
+ },
+ "mappings": {
+ "properties": {
+ "my_field": {
+ "type": "text",
+ "analyzer": "my_pattern_analyzer"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_pattern_index/_analyze
+{
+ "analyzer": "my_pattern_analyzer",
+ "text": "OpenSearch is fast and scalable"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "opensearch",
+ "start_offset": 0,
+ "end_offset": 10,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "fast",
+ "start_offset": 14,
+ "end_offset": 18,
+ "type": "word",
+ "position": 2
+ },
+ {
+ "token": "scalable",
+ "start_offset": 23,
+ "end_offset": 31,
+ "type": "word",
+ "position": 4
+ }
+ ]
+}
+```
diff --git a/_analyzers/supported-analyzers/phone-analyzers.md b/_analyzers/supported-analyzers/phone-analyzers.md
index f24b7cf328..d94bfe192f 100644
--- a/_analyzers/supported-analyzers/phone-analyzers.md
+++ b/_analyzers/supported-analyzers/phone-analyzers.md
@@ -1,6 +1,6 @@
---
layout: default
-title: Phone number
+title: Phone number analyzers
parent: Analyzers
nav_order: 140
---
diff --git a/_analyzers/supported-analyzers/simple.md b/_analyzers/supported-analyzers/simple.md
new file mode 100644
index 0000000000..29f8f9a533
--- /dev/null
+++ b/_analyzers/supported-analyzers/simple.md
@@ -0,0 +1,99 @@
+---
+layout: default
+title: Simple analyzer
+parent: Analyzers
+nav_order: 100
+---
+
+# Simple analyzer
+
+The `simple` analyzer is a very basic analyzer that breaks text into terms at non-letter characters and lowercases the terms. Unlike the `standard` analyzer, the `simple` analyzer treats everything except for alphabetic characters as delimiters, meaning that it does not recognize numbers, punctuation, or special characters as part of the tokens.
+
+## Example
+
+Use the following command to create an index named `my_simple_index` with a `simple` analyzer:
+
+```json
+PUT /my_simple_index
+{
+ "mappings": {
+ "properties": {
+ "my_field": {
+ "type": "text",
+ "analyzer": "simple"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
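+Because the `simple` analyzer treats every non-letter character as a delimiter, numbers and punctuation never appear in the output. For example, the following `_analyze` request (the sample text is illustrative) is expected to produce only the tokens `the`, `turtle`, `is`, `years`, and `old`:
+
+```json
+POST /my_simple_index/_analyze
+{
+  "analyzer": "simple",
+  "text": "The turtle is 7 years old!"
+}
+```
+{% include copy-curl.html %}
+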
+## Configuring a custom analyzer
+
+Use the following command to configure an index with a custom analyzer that is equivalent to a `simple` analyzer with an added `html_strip` character filter:
+
+```json
+PUT /my_custom_simple_index
+{
+ "settings": {
+ "analysis": {
+ "char_filter": {
+ "html_strip": {
+ "type": "html_strip"
+ }
+ },
+ "tokenizer": {
+ "my_lowercase_tokenizer": {
+ "type": "lowercase"
+ }
+ },
+ "analyzer": {
+ "my_custom_simple_analyzer": {
+ "type": "custom",
+ "char_filter": ["html_strip"],
+ "tokenizer": "my_lowercase_tokenizer",
+ "filter": ["lowercase"]
+ }
+ }
+ }
+ },
+ "mappings": {
+ "properties": {
+ "my_field": {
+ "type": "text",
+ "analyzer": "my_custom_simple_analyzer"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_custom_simple_index/_analyze
+{
+ "analyzer": "my_custom_simple_analyzer",
+ "text": "<p>The slow turtle swims over to dogs</p>"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {"token": "the","start_offset": 3,"end_offset": 6,"type": "word","position": 0},
+ {"token": "slow","start_offset": 7,"end_offset": 11,"type": "word","position": 1},
+ {"token": "turtle","start_offset": 12,"end_offset": 18,"type": "word","position": 2},
+ {"token": "swims","start_offset": 19,"end_offset": 24,"type": "word","position": 3},
+ {"token": "over","start_offset": 25,"end_offset": 29,"type": "word","position": 4},
+ {"token": "to","start_offset": 30,"end_offset": 32,"type": "word","position": 5},
+ {"token": "dogs","start_offset": 33,"end_offset": 37,"type": "word","position": 6}
+ ]
+}
+```
diff --git a/_analyzers/supported-analyzers/standard.md b/_analyzers/supported-analyzers/standard.md
new file mode 100644
index 0000000000..d5c3650d5d
--- /dev/null
+++ b/_analyzers/supported-analyzers/standard.md
@@ -0,0 +1,97 @@
+---
+layout: default
+title: Standard analyzer
+parent: Analyzers
+nav_order: 50
+---
+
+# Standard analyzer
+
+The `standard` analyzer is the default analyzer used when no other analyzer is specified. It is designed to provide a basic and efficient approach to generic text processing.
+
+This analyzer consists of the following tokenizers and token filters:
+
+- `standard` tokenizer: Removes most punctuation and splits text on spaces and other common delimiters.
+- `lowercase` token filter: Converts all tokens to lowercase, ensuring case-insensitive matching.
+- `stop` token filter: Removes common stopwords, such as "the", "is", and "and", from the tokenized output.
+
+## Example
+
+Use the following command to create an index named `my_standard_index` with a `standard` analyzer:
+
+```json
+PUT /my_standard_index
+{
+ "mappings": {
+ "properties": {
+ "my_field": {
+ "type": "text",
+ "analyzer": "standard"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Parameters
+
+You can configure a `standard` analyzer with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`max_token_length` | Optional | Integer | Sets the maximum length of the produced token. If this length is exceeded, the token is split into multiple tokens at the length configured in `max_token_length`. Default is `255`.
+`stopwords` | Optional | String or list of strings | A string specifying a predefined list of stopwords (such as `_english_`) or an array specifying a custom list of stopwords. Default is `_none_`.
+`stopwords_path` | Optional | String | The path (absolute or relative to the config directory) to the file containing a list of stop words.
+
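+For example, the following sketch configures a `standard` analyzer with a shorter `max_token_length` and the predefined English stopword list (the index name `my_standard_params_index` is hypothetical):
+
+```json
+PUT /my_standard_params_index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_standard_analyzer": {
+          "type": "standard",
+          "max_token_length": 5,
+          "stopwords": "_english_"
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}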
+
+## Configuring a custom analyzer
+
+Use the following command to configure an index with a custom analyzer that is equivalent to the `standard` analyzer:
+
+```json
+PUT /my_custom_index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "my_custom_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "stop"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_custom_index/_analyze
+{
+ "analyzer": "my_custom_analyzer",
+ "text": "The slow turtle swims away"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {"token": "slow","start_offset": 4,"end_offset": 8,"type": "<ALPHANUM>","position": 1},
+ {"token": "turtle","start_offset": 9,"end_offset": 15,"type": "<ALPHANUM>","position": 2},
+ {"token": "swims","start_offset": 16,"end_offset": 21,"type": "<ALPHANUM>","position": 3},
+ {"token": "away","start_offset": 22,"end_offset": 26,"type": "<ALPHANUM>","position": 4}
+ ]
+}
+```
diff --git a/_analyzers/supported-analyzers/stop.md b/_analyzers/supported-analyzers/stop.md
new file mode 100644
index 0000000000..df62c7fe58
--- /dev/null
+++ b/_analyzers/supported-analyzers/stop.md
@@ -0,0 +1,177 @@
+---
+layout: default
+title: Stop analyzer
+parent: Analyzers
+nav_order: 110
+---
+
+# Stop analyzer
+
+The `stop` analyzer removes stopwords from text based on a predefined list. This analyzer consists of a `lowercase` tokenizer and a `stop` token filter.
+
+## Parameters
+
+You can configure a `stop` analyzer with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`stopwords` | Optional | String or list of strings | A string specifying a predefined list of stopwords (such as `_english_`) or an array specifying a custom list of stopwords. Default is `_english_`.
+`stopwords_path` | Optional | String | The path (absolute or relative to the config directory) to the file containing a list of stopwords.
+
+## Example
+
+Use the following command to create an index named `my_stop_index` with a `stop` analyzer:
+
+```json
+PUT /my_stop_index
+{
+ "mappings": {
+ "properties": {
+ "my_field": {
+ "type": "text",
+ "analyzer": "stop"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Configuring a custom analyzer
+
+Use the following command to configure an index with a custom analyzer that is equivalent to a `stop` analyzer:
+
+```json
+PUT /my_custom_stop_analyzer_index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "my_custom_stop_analyzer": {
+ "tokenizer": "lowercase",
+ "filter": [
+ "stop"
+ ]
+ }
+ }
+ }
+ },
+ "mappings": {
+ "properties": {
+ "my_field": {
+ "type": "text",
+ "analyzer": "my_custom_stop_analyzer"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_custom_stop_analyzer_index/_analyze
+{
+ "analyzer": "my_custom_stop_analyzer",
+ "text": "The large turtle is green and brown"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "large",
+ "start_offset": 4,
+ "end_offset": 9,
+ "type": "word",
+ "position": 1
+ },
+ {
+ "token": "turtle",
+ "start_offset": 10,
+ "end_offset": 16,
+ "type": "word",
+ "position": 2
+ },
+ {
+ "token": "green",
+ "start_offset": 20,
+ "end_offset": 25,
+ "type": "word",
+ "position": 4
+ },
+ {
+ "token": "brown",
+ "start_offset": 30,
+ "end_offset": 35,
+ "type": "word",
+ "position": 6
+ }
+ ]
+}
+```
+
+## Specifying stopwords
+
+The following example request specifies a custom list of stopwords:
+
+```json
+PUT /my_new_custom_stop_index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "my_custom_stop_analyzer": {
+ "type": "stop",
+ "stopwords": ["is", "and", "was"]
+ }
+ }
+ }
+ },
+ "mappings": {
+ "properties": {
+ "description": {
+ "type": "text",
+ "analyzer": "my_custom_stop_analyzer"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+The following example request specifies a path to the file containing stopwords:
+
+```json
+PUT /my_new_custom_stop_index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "my_custom_stop_analyzer": {
+ "type": "stop",
+ "stopwords_path": "stopwords.txt"
+ }
+ }
+ }
+ },
+ "mappings": {
+ "properties": {
+ "description": {
+ "type": "text",
+ "analyzer": "my_custom_stop_analyzer"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+In this example, the file is located in the config directory. You can also specify a full path to the file.
\ No newline at end of file
diff --git a/_analyzers/supported-analyzers/whitespace.md b/_analyzers/supported-analyzers/whitespace.md
new file mode 100644
index 0000000000..4691b4f733
--- /dev/null
+++ b/_analyzers/supported-analyzers/whitespace.md
@@ -0,0 +1,87 @@
+---
+layout: default
+title: Whitespace analyzer
+parent: Analyzers
+nav_order: 120
+---
+
+# Whitespace analyzer
+
+The `whitespace` analyzer breaks text into tokens based only on white space characters (for example, spaces and tabs). It does not apply any transformations, such as lowercasing or removing stopwords, so the original case of the text is retained and punctuation is included as part of the tokens.
+
+## Example
+
+Use the following command to create an index named `my_whitespace_index` with a `whitespace` analyzer:
+
+```json
+PUT /my_whitespace_index
+{
+ "mappings": {
+ "properties": {
+ "my_field": {
+ "type": "text",
+ "analyzer": "whitespace"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Configuring a custom analyzer
+
+Use the following command to configure an index with a custom analyzer that is equivalent to a `whitespace` analyzer with an added `lowercase` token filter:
+
+```json
+PUT /my_custom_whitespace_index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "my_custom_whitespace_analyzer": {
+ "type": "custom",
+ "tokenizer": "whitespace",
+ "filter": ["lowercase"]
+ }
+ }
+ }
+ },
+ "mappings": {
+ "properties": {
+ "my_field": {
+ "type": "text",
+ "analyzer": "my_custom_whitespace_analyzer"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_custom_whitespace_index/_analyze
+{
+ "analyzer": "my_custom_whitespace_analyzer",
+ "text": "The SLOW turtle swims away! 123"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {"token": "the","start_offset": 0,"end_offset": 3,"type": "word","position": 0},
+ {"token": "slow","start_offset": 4,"end_offset": 8,"type": "word","position": 1},
+ {"token": "turtle","start_offset": 9,"end_offset": 15,"type": "word","position": 2},
+ {"token": "swims","start_offset": 16,"end_offset": 21,"type": "word","position": 3},
+ {"token": "away!","start_offset": 22,"end_offset": 27,"type": "word","position": 4},
+ {"token": "123","start_offset": 28,"end_offset": 31,"type": "word","position": 5}
+ ]
+}
+```
diff --git a/_analyzers/token-filters/flatten-graph.md b/_analyzers/token-filters/flatten-graph.md
new file mode 100644
index 0000000000..8d51c57400
--- /dev/null
+++ b/_analyzers/token-filters/flatten-graph.md
@@ -0,0 +1,109 @@
+---
+layout: default
+title: Flatten graph
+parent: Token filters
+nav_order: 150
+---
+
+# Flatten graph token filter
+
+The `flatten_graph` token filter is used to handle complex token relationships that occur when multiple tokens are generated at the same position in a graph structure. Some token filters, like `synonym_graph` and `word_delimiter_graph`, generate multi-position tokens---tokens that overlap or span multiple positions. These token graphs are useful for search queries but are not directly supported during indexing. The `flatten_graph` token filter resolves multi-position tokens into a linear sequence of tokens. Flattening the graph ensures compatibility with the indexing process.
+
+Token graph flattening is a lossy process. Whenever possible, avoid using the `flatten_graph` filter. Instead, apply graph token filters exclusively in search analyzers, removing the need for the `flatten_graph` filter.
+{: .important}
+
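+For reference, the following sketch (using the assumed names `graph_search_index` and `my_synonym_graph_analyzer`) shows the recommended search-time-only approach. The field is indexed using the `standard` analyzer, and the `synonym_graph` filter runs only in the search analyzer, so no flattening is needed:
+
+```json
+PUT /graph_search_index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "my_synonym_graph": {
+          "type": "synonym_graph",
+          "synonyms": ["ssd, solid state drive"]
+        }
+      },
+      "analyzer": {
+        "my_synonym_graph_analyzer": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": ["lowercase", "my_synonym_graph"]
+        }
+      }
+    }
+  },
+  "mappings": {
+    "properties": {
+      "my_field": {
+        "type": "text",
+        "analyzer": "standard",
+        "search_analyzer": "my_synonym_graph_analyzer"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+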
+## Example
+
+The following example request creates a new index named `test_index` and configures an analyzer with a `flatten_graph` filter:
+
+```json
+PUT /test_index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "my_index_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "my_custom_filter",
+ "flatten_graph"
+ ]
+ }
+ },
+ "filter": {
+ "my_custom_filter": {
+ "type": "word_delimiter_graph",
+ "catenate_all": true
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /test_index/_analyze
+{
+ "analyzer": "my_index_analyzer",
+ "text": "OpenSearch helped many employers"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "OpenSearch",
+ "start_offset": 0,
+ "end_offset": 10,
+ "type": "<ALPHANUM>",
+ "position": 0,
+ "positionLength": 2
+ },
+ {
+ "token": "Open",
+ "start_offset": 0,
+ "end_offset": 4,
+ "type": "<ALPHANUM>",
+ "position": 0
+ },
+ {
+ "token": "Search",
+ "start_offset": 4,
+ "end_offset": 10,
+ "type": "<ALPHANUM>",
+ "position": 1
+ },
+ {
+ "token": "helped",
+ "start_offset": 11,
+ "end_offset": 17,
+ "type": "<ALPHANUM>",
+ "position": 2
+ },
+ {
+ "token": "many",
+ "start_offset": 18,
+ "end_offset": 22,
+ "type": "<ALPHANUM>",
+ "position": 3
+ },
+ {
+ "token": "employers",
+ "start_offset": 23,
+ "end_offset": 32,
+ "type": "<ALPHANUM>",
+ "position": 4
+ }
+ ]
+}
+```
diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md
index 14abeab567..875e94db5a 100644
--- a/_analyzers/token-filters/index.md
+++ b/_analyzers/token-filters/index.md
@@ -17,7 +17,7 @@ The following table lists all token filters that OpenSearch supports.
Token filter | Underlying Lucene token filter| Description
[`apostrophe`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/apostrophe/) | [ApostropheFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token containing an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following it.
[`asciifolding`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/asciifolding/) | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters.
-`cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens.
+[`cjk_bigram`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/cjk-bigram/) | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens.
[`cjk_width`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/cjk-width/) | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules: - Folds full-width ASCII character variants into their equivalent basic Latin characters. - Folds half-width katakana character variants into their equivalent kana characters.
[`classic`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/classic) | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms.
[`common_grams`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/common_gram/) | [CommonGramsFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams.
@@ -29,18 +29,18 @@ Token filter | Underlying Lucene token filter| Description
[`edge_ngram`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/edge-ngram/) | [EdgeNGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html) | Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between `min_gram` and `max_gram`. Optionally, keeps the original token.
[`elision`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/elision/) | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane).
[`fingerprint`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/fingerprint/) | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token.
-`flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing.
+[`flatten_graph`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/flatten-graph/) | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing.
[`hunspell`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/hunspell/) | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell allows a word to have multiple stems, this filter can emit multiple tokens for each consumed token. Requires the configuration of one or more language-specific Hunspell dictionaries.
[`hyphenation_decompounder`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/hyphenation-decompounder/) | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list.
[`keep_types`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-types/) | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type.
[`keep_words`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-words/) | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list.
[`keyword_marker`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keyword-marker/) | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed.
-`keyword_repeat` | [KeywordRepeatFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html) | Emits each incoming token twice: once as a keyword and once as a non-keyword.
-`kstem` | [KStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/KStemFilter.html) | Provides kstem-based stemming for the English language. Combines algorithmic stemming with a built-in dictionary.
-`kuromoji_completion` | [JapaneseCompletionFilter](https://lucene.apache.org/core/9_10_0/analysis/kuromoji/org/apache/lucene/analysis/ja/JapaneseCompletionFilter.html) | Adds Japanese romanized terms to the token stream (in addition to the original tokens). Usually used to support autocomplete on Japanese search terms. Note that the filter has a `mode` parameter, which should be set to `index` when used in an index analyzer and `query` when used in a search analyzer. Requires the `analysis-kuromoji` plugin. For information about installing the plugin, see [Additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins).
-`length` | [LengthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html) | Removes tokens whose lengths are shorter or longer than the length range specified by `min` and `max`.
-`limit` | [LimitTokenCountFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilter.html) | Limits the number of output tokens. A common use case is to limit the size of document field values based on token count.
-`lowercase` | [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to lowercase. The default [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) is for the English language. You can set the `language` parameter to `greek` (uses [GreekLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/el/GreekLowerCaseFilter.html)), `irish` (uses [IrishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ga/IrishLowerCaseFilter.html)), or `turkish` (uses [TurkishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/TurkishLowerCaseFilter.html)).
+[`keyword_repeat`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keyword-repeat/) | [KeywordRepeatFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html) | Emits each incoming token twice: once as a keyword and once as a non-keyword.
+[`kstem`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/kstem/) | [KStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/KStemFilter.html) | Provides KStem-based stemming for the English language. Combines algorithmic stemming with a built-in dictionary.
+[`kuromoji_completion`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/kuromoji-completion/) | [JapaneseCompletionFilter](https://lucene.apache.org/core/9_10_0/analysis/kuromoji/org/apache/lucene/analysis/ja/JapaneseCompletionFilter.html) | Adds Japanese romanized terms to a token stream (in addition to the original tokens). Usually used to support autocomplete of Japanese search terms. Note that the filter has a `mode` parameter that should be set to `index` when used in an index analyzer and `query` when used in a search analyzer. Requires the `analysis-kuromoji` plugin. For information about installing the plugin, see [Additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins).
+[`length`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/length/) | [LengthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html) | Removes tokens that are shorter or longer than the length range specified by `min` and `max`.
+[`limit`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/limit/) | [LimitTokenCountFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilter.html) | Limits the number of output tokens. For example, document field value sizes can be limited based on the token count.
+[`lowercase`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/lowercase/) | [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to lowercase. The default [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) processes the English language. To process other languages, set the `language` parameter to `greek` (uses [GreekLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/el/GreekLowerCaseFilter.html)), `irish` (uses [IrishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ga/IrishLowerCaseFilter.html)), or `turkish` (uses [TurkishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/TurkishLowerCaseFilter.html)).
[`min_hash`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/min-hash/) | [MinHashFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/minhash/MinHashFilter.html) | Uses the [MinHash technique](https://en.wikipedia.org/wiki/MinHash) to estimate document similarity. Performs the following operations on a token stream sequentially: 1. Hashes each token in the stream. 2. Assigns the hashes to buckets, keeping only the smallest hashes of each bucket. 3. Outputs the smallest hash from each bucket as a token stream.
[`multiplexer`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/multiplexer/) | N/A | Emits multiple tokens at the same position. Runs each token through each of the specified filter lists separately and outputs the results as separate tokens.
[`ngram`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/ngram/) | [NGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html) | Tokenizes the given token into n-grams of lengths between `min_gram` and `max_gram`.
@@ -51,17 +51,17 @@ Token filter | Underlying Lucene token filter| Description
[`porter_stem`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/porter-stem/) | [PorterStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/PorterStemFilter.html) | Uses the [Porter stemming algorithm](https://tartarus.org/martin/PorterStemmer/) to perform algorithmic stemming for the English language.
[`predicate_token_filter`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/predicate-token-filter/) | N/A | Removes tokens that do not match the specified predicate script. Supports only inline Painless scripts.
[`remove_duplicates`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/remove-duplicates/) | [RemoveDuplicatesTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilter.html) | Removes duplicate tokens that are in the same position.
-`reverse` | [ReverseStringFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html) | Reverses the string corresponding to each token in the token stream. For example, the token `dog` becomes `god`.
-`shingle` | [ShingleFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/shingle/ShingleFilter.html) | Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but apply to words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`].
-`snowball` | N/A | Stems words using a [Snowball-generated stemmer](https://snowballstem.org/). You can use the `snowball` token filter with the following languages in the `language` field: `Arabic`, `Armenian`, `Basque`, `Catalan`, `Danish`, `Dutch`, `English`, `Estonian`, `Finnish`, `French`, `German`, `German2`, `Hungarian`, `Irish`, `Italian`, `Kp`, `Lithuanian`, `Lovins`, `Norwegian`, `Porter`, `Portuguese`, `Romanian`, `Russian`, `Spanish`, `Swedish`, `Turkish`.
-`stemmer` | N/A | Provides algorithmic stemming for the following languages in the `language` field: `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `Lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`, `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`.
-`stemmer_override` | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
-`stop` | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream.
+[`reverse`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/reverse/) | [ReverseStringFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html) | Reverses the string corresponding to each token in the token stream. For example, the token `dog` becomes `god`.
+[`shingle`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/shingle/) | [ShingleFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/shingle/ShingleFilter.html) | Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but are generated using words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`].
+[`snowball`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/snowball/) | N/A | Stems words using a [Snowball-generated stemmer](https://snowballstem.org/). The `snowball` token filter supports using the following languages in the `language` field: `Arabic`, `Armenian`, `Basque`, `Catalan`, `Danish`, `Dutch`, `English`, `Estonian`, `Finnish`, `French`, `German`, `German2`, `Hungarian`, `Irish`, `Italian`, `Kp`, `Lithuanian`, `Lovins`, `Norwegian`, `Porter`, `Portuguese`, `Romanian`, `Russian`, `Spanish`, `Swedish`, `Turkish`.
+[`stemmer`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/stemmer/) | N/A | Provides algorithmic stemming for the following languages in the `language` field: `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`, `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`.
+[`stemmer_override`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/stemmer-override/) | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
+[`stop`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/stop/) | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream.
[`synonym`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/synonym/) | N/A | Supplies a synonym list for the analysis process. The synonym list is provided using a configuration file.
[`synonym_graph`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/synonym-graph/) | N/A | Supplies a synonym list, including multiword synonyms, for the analysis process.
-`trim` | [TrimFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html) | Trims leading and trailing white space from each token in a stream.
-`truncate` | [TruncateTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html) | Truncates tokens whose length exceeds the specified character limit.
-`unique` | N/A | Ensures each token is unique by removing duplicate tokens from a stream.
-`uppercase` | [UpperCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to uppercase.
-`word_delimiter` | [WordDelimiterFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html) | Splits tokens at non-alphanumeric characters and performs normalization based on the specified rules.
-`word_delimiter_graph` | [WordDelimiterGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html) | Splits tokens at non-alphanumeric characters and performs normalization based on the specified rules. Assigns multi-position tokens a `positionLength` attribute.
+[`trim`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/trim/) | [TrimFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html) | Trims leading and trailing white space characters from each token in a stream.
+[`truncate`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/truncate/) | [TruncateTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html) | Truncates tokens with lengths exceeding the specified character limit.
+[`unique`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/unique/) | N/A | Ensures that each token is unique by removing duplicate tokens from a stream.
+[`uppercase`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/uppercase/) | [UpperCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/UpperCaseFilter.html) | Converts tokens to uppercase.
+[`word_delimiter`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/word-delimiter/) | [WordDelimiterFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html) | Splits tokens on non-alphanumeric characters and performs normalization based on the specified rules.
+[`word_delimiter_graph`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/word-delimiter-graph/) | [WordDelimiterGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html) | Splits tokens on non-alphanumeric characters and performs normalization based on the specified rules. Assigns a `positionLength` attribute to multi-position tokens.
diff --git a/_analyzers/token-filters/keyword-repeat.md b/_analyzers/token-filters/keyword-repeat.md
new file mode 100644
index 0000000000..5ba15a037c
--- /dev/null
+++ b/_analyzers/token-filters/keyword-repeat.md
@@ -0,0 +1,160 @@
+---
+layout: default
+title: Keyword repeat
+parent: Token filters
+nav_order: 210
+---
+
+# Keyword repeat token filter
+
+The `keyword_repeat` token filter emits the keyword version of a token into a token stream. This filter is typically used when you want to retain both the original token and its modified version after further token transformations, such as stemming or synonym expansion. The duplicated tokens allow the original, unchanged version of the token to remain in the final analysis alongside the modified versions.
+
+Place the `keyword_repeat` token filter before stemming filters. Because stemming is not applied to every token, the stream may contain duplicate tokens in the same position after stemming. To remove these duplicates, add the `remove_duplicates` token filter after the stemmer, as shown in the sketch following this note.
+{: .note}
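+
+The following is a minimal sketch of that recommendation. It uses only built-in filters (`lowercase`, `keyword_repeat`, `porter_stem`, and `remove_duplicates`); the index and analyzer names are hypothetical placeholders and are not part of the main example below:
+
+```json
+PUT /keyword_repeat_dedup_sketch
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "stem_and_keep_original": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "keyword_repeat",
+            "porter_stem",
+            "remove_duplicates"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}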
+
+
+## Example
+
+The following example request creates a new index named `my_index` and configures an analyzer with a `keyword_repeat` filter:
+
+```json
+PUT /my_index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "my_kstem": {
+ "type": "kstem"
+ },
+ "my_lowercase": {
+ "type": "lowercase"
+ }
+ },
+ "analyzer": {
+ "my_custom_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "my_lowercase",
+ "keyword_repeat",
+ "my_kstem"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_index/_analyze
+{
+ "analyzer": "my_custom_analyzer",
+ "text": "Stopped quickly"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "stopped",
+ "start_offset": 0,
+ "end_offset": 7,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "stop",
+ "start_offset": 0,
+ "end_offset": 7,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "quickly",
+ "start_offset": 8,
+ "end_offset": 15,
+ "type": "",
+ "position": 1
+ },
+ {
+ "token": "quick",
+ "start_offset": 8,
+ "end_offset": 15,
+ "type": "",
+ "position": 1
+ }
+ ]
+}
+```
+
+You can further examine the impact of the `keyword_repeat` token filter by adding the following parameters to the `_analyze` query:
+
+```json
+POST /my_index/_analyze
+{
+ "analyzer": "my_custom_analyzer",
+ "text": "Stopped quickly",
+ "explain": true,
+ "attributes": "keyword"
+}
+```
+{% include copy-curl.html %}
+
+The response includes detailed information, such as tokenization, filtering, and the application of specific token filters:
+
+```json
+{
+  "detail": {
+    "custom_analyzer": true,
+    "charfilters": [],
+    "tokenizer": {
+      "name": "standard",
+      "tokens": [
+        {"token": "Stopped","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0},
+        {"token": "quickly","start_offset": 8,"end_offset": 15,"type": "<ALPHANUM>","position": 1}
+      ]
+    },
+    "tokenfilters": [
+      {
+        "name": "my_lowercase",
+        "tokens": [
+          {"token": "stopped","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0},
+          {"token": "quickly","start_offset": 8,"end_offset": 15,"type": "<ALPHANUM>","position": 1}
+        ]
+      },
+      {
+        "name": "keyword_repeat",
+        "tokens": [
+          {"token": "stopped","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0,"keyword": true},
+          {"token": "stopped","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0,"keyword": false},
+          {"token": "quickly","start_offset": 8,"end_offset": 15,"type": "<ALPHANUM>","position": 1,"keyword": true},
+          {"token": "quickly","start_offset": 8,"end_offset": 15,"type": "<ALPHANUM>","position": 1,"keyword": false}
+        ]
+      },
+      {
+        "name": "my_kstem",
+        "tokens": [
+          {"token": "stopped","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0,"keyword": true},
+          {"token": "stop","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0,"keyword": false},
+          {"token": "quickly","start_offset": 8,"end_offset": 15,"type": "<ALPHANUM>","position": 1,"keyword": true},
+          {"token": "quick","start_offset": 8,"end_offset": 15,"type": "<ALPHANUM>","position": 1,"keyword": false}
+        ]
+      }
+    ]
+  }
+}
+```
\ No newline at end of file
diff --git a/_analyzers/token-filters/kstem.md b/_analyzers/token-filters/kstem.md
new file mode 100644
index 0000000000..d13fd2c675
--- /dev/null
+++ b/_analyzers/token-filters/kstem.md
@@ -0,0 +1,92 @@
+---
+layout: default
+title: KStem
+parent: Token filters
+nav_order: 220
+---
+
+# KStem token filter
+
+The `kstem` token filter is a stemming filter used to reduce words to their root forms. The filter is a lightweight algorithmic stemmer designed for the English language that performs the following stemming operations:
+
+- Reduces plurals to their singular form.
+- Converts different verb tenses to their base form.
+- Removes common derivational endings, such as "-ing" or "-ed".
+
+The `kstem` token filter is equivalent to a `stemmer` filter configured with the `light_english` language. It provides more conservative stemming than other stemming filters, such as `porter_stem`.
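+
+For reference, the following sketch (with hypothetical index, analyzer, and filter names) shows the equivalent `stemmer` configuration:
+
+```json
+PUT /light_english_example
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "light_english_stemmer": {
+          "type": "stemmer",
+          "language": "light_english"
+        }
+      },
+      "analyzer": {
+        "light_english_analyzer": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "light_english_stemmer"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}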
+
+The `kstem` token filter is based on the Lucene KStemFilter. For more information, see the [Lucene documentation](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/KStemFilter.html).
+
+## Example
+
+The following example request creates a new index named `my_kstem_index` and configures an analyzer with a `kstem` filter:
+
+```json
+PUT /my_kstem_index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "kstem_filter": {
+ "type": "kstem"
+ }
+ },
+ "analyzer": {
+ "my_kstem_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "kstem_filter"
+ ]
+ }
+ }
+ }
+ },
+ "mappings": {
+ "properties": {
+ "content": {
+ "type": "text",
+ "analyzer": "my_kstem_analyzer"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_kstem_index/_analyze
+{
+ "analyzer": "my_kstem_analyzer",
+ "text": "stops stopped"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "stop",
+ "start_offset": 0,
+ "end_offset": 5,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "stop",
+ "start_offset": 6,
+ "end_offset": 13,
+ "type": "",
+ "position": 1
+ }
+ ]
+}
+```
\ No newline at end of file
diff --git a/_analyzers/token-filters/kuromoji-completion.md b/_analyzers/token-filters/kuromoji-completion.md
new file mode 100644
index 0000000000..24833e92e1
--- /dev/null
+++ b/_analyzers/token-filters/kuromoji-completion.md
@@ -0,0 +1,127 @@
+---
+layout: default
+title: Kuromoji completion
+parent: Token filters
+nav_order: 230
+---
+
+# Kuromoji completion token filter
+
+The `kuromoji_completion` token filter is used to stem Katakana words in Japanese, which are often used to represent foreign words or loanwords. This filter is especially useful for autocompletion or suggest queries, in which partial matches on Katakana words can be expanded to include their full forms.
+
+To use this token filter, you must first install the `analysis-kuromoji` plugin on all nodes by running `bin/opensearch-plugin install analysis-kuromoji` and then restart the cluster. For more information about installing additional plugins, see [Additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/index/).
+
+## Example
+
+The following example request creates a new index named `kuromoji_sample` and configures an analyzer with a `kuromoji_completion` filter:
+
+```json
+PUT kuromoji_sample
+{
+ "settings": {
+ "index": {
+ "analysis": {
+ "analyzer": {
+ "my_analyzer": {
+ "tokenizer": "kuromoji_tokenizer",
+ "filter": [
+ "my_katakana_stemmer"
+ ]
+ }
+ },
+ "filter": {
+ "my_katakana_stemmer": {
+ "type": "kuromoji_completion"
+ }
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer with text that translates to "use a computer":
+
+```json
+POST /kuromoji_sample/_analyze
+{
+ "analyzer": "my_analyzer",
+ "text": "コンピューターを使う"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "コンピューター", // The original Katakana word "computer".
+ "start_offset": 0,
+ "end_offset": 7,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "konpyuーtaー", // Romanized version (Romaji) of "コンピューター".
+ "start_offset": 0,
+ "end_offset": 7,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "konnpyuーtaー", // Another possible romanized version of "コンピューター" (with a slight variation in the spelling).
+ "start_offset": 0,
+ "end_offset": 7,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "を", // A Japanese particle, "wo" or "o"
+ "start_offset": 7,
+ "end_offset": 8,
+ "type": "word",
+ "position": 1
+ },
+ {
+ "token": "wo", // Romanized form of the particle "を" (often pronounced as "o").
+ "start_offset": 7,
+ "end_offset": 8,
+ "type": "word",
+ "position": 1
+ },
+ {
+ "token": "o", // Another version of the romanization.
+ "start_offset": 7,
+ "end_offset": 8,
+ "type": "word",
+ "position": 1
+ },
+ {
+ "token": "使う", // The verb "use" in Kanji.
+ "start_offset": 8,
+ "end_offset": 10,
+ "type": "word",
+ "position": 2
+ },
+ {
+ "token": "tukau", // Romanized version of "使う"
+ "start_offset": 8,
+ "end_offset": 10,
+ "type": "word",
+ "position": 2
+ },
+ {
+ "token": "tsukau", // Another romanized version of "使う", where "tsu" is more phonetically correct
+ "start_offset": 8,
+ "end_offset": 10,
+ "type": "word",
+ "position": 2
+ }
+ ]
+}
+```
\ No newline at end of file
diff --git a/_analyzers/token-filters/length.md b/_analyzers/token-filters/length.md
new file mode 100644
index 0000000000..f6c5dcc706
--- /dev/null
+++ b/_analyzers/token-filters/length.md
@@ -0,0 +1,91 @@
+---
+layout: default
+title: Length
+parent: Token filters
+nav_order: 240
+---
+
+# Length token filter
+
+The `length` token filter is used to remove tokens that don't meet specified length criteria (minimum and maximum values) from the token stream.
+
+## Parameters
+
+The `length` token filter can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`min` | Optional | Integer | The minimum token length. Default is `0`.
+`max` | Optional | Integer | The maximum token length. Default is `Integer.MAX_VALUE` (`2147483647`).
+
+
+## Example
+
+The following example request creates a new index named `my_index` and configures an analyzer with a `length` filter:
+
+```json
+PUT my_index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "only_keep_4_to_10_characters": {
+ "tokenizer": "whitespace",
+ "filter": [ "length_4_to_10" ]
+ }
+ },
+ "filter": {
+ "length_4_to_10": {
+ "type": "length",
+ "min": 4,
+ "max": 10
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /my_index/_analyze
+{
+ "analyzer": "only_keep_4_to_10_characters",
+ "text": "OpenSearch is a great tool!"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "OpenSearch",
+ "start_offset": 0,
+ "end_offset": 10,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "great",
+ "start_offset": 16,
+ "end_offset": 21,
+ "type": "word",
+ "position": 3
+ },
+ {
+ "token": "tool!",
+ "start_offset": 22,
+ "end_offset": 27,
+ "type": "word",
+ "position": 4
+ }
+ ]
+}
+```
diff --git a/_analyzers/token-filters/limit.md b/_analyzers/token-filters/limit.md
new file mode 100644
index 0000000000..a849f5f06b
--- /dev/null
+++ b/_analyzers/token-filters/limit.md
@@ -0,0 +1,89 @@
+---
+layout: default
+title: Limit
+parent: Token filters
+nav_order: 250
+---
+
+# Limit token filter
+
+The `limit` token filter is used to limit the number of tokens passed through the analysis chain.
+
+## Parameters
+
+The `limit` token filter can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`max_token_count` | Optional | Integer | The maximum number of tokens to be generated. Default is `1`.
+`consume_all_tokens` | Optional | Boolean | (Expert-level setting) If `true`, the filter processes the entire token stream produced by the tokenizer, even though the output still contains at most `max_token_count` tokens. Default is `false`.
+
+## Example
+
+The following example request creates a new index named `my_index` and configures an analyzer with a `limit` filter:
+
+```json
+PUT my_index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "three_token_limit": {
+ "tokenizer": "standard",
+ "filter": [ "custom_token_limit" ]
+ }
+ },
+ "filter": {
+ "custom_token_limit": {
+ "type": "limit",
+ "max_token_count": 3
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /my_index/_analyze
+{
+ "analyzer": "three_token_limit",
+ "text": "OpenSearch is a powerful and flexible search engine."
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "OpenSearch",
+ "start_offset": 0,
+ "end_offset": 10,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "is",
+ "start_offset": 11,
+ "end_offset": 13,
+ "type": "",
+ "position": 1
+ },
+ {
+ "token": "a",
+ "start_offset": 14,
+ "end_offset": 15,
+ "type": "",
+ "position": 2
+ }
+ ]
+}
+```
diff --git a/_analyzers/token-filters/lowercase.md b/_analyzers/token-filters/lowercase.md
new file mode 100644
index 0000000000..89f0f219fa
--- /dev/null
+++ b/_analyzers/token-filters/lowercase.md
@@ -0,0 +1,82 @@
+---
+layout: default
+title: Lowercase
+parent: Token filters
+nav_order: 260
+---
+
+# Lowercase token filter
+
+The `lowercase` token filter is used to convert all characters in the token stream to lowercase, making searches case insensitive.
+
+## Parameters
+
+The `lowercase` token filter can be configured with the following parameter.
+
+Parameter | Required/Optional | Description
+:--- | :--- | :---
+ `language` | Optional | Specifies a language-specific token filter. Valid values are: - [`greek`](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/el/GreekLowerCaseFilter.html) - [`irish`](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/ga/IrishLowerCaseFilter.html) - [`turkish`](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/tr/TurkishLowerCaseFilter.html). Default is the [Lucene LowerCaseFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html).
+
+## Example
+
+The following example request creates a new index named `custom_lowercase_example`. It configures an analyzer with a `lowercase` filter and specifies `greek` as the `language`:
+
+```json
+PUT /custom_lowercase_example
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "greek_lowercase_example": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": ["greek_lowercase"]
+ }
+ },
+ "filter": {
+ "greek_lowercase": {
+ "type": "lowercase",
+ "language": "greek"
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /custom_lowercase_example/_analyze
+{
+ "analyzer": "greek_lowercase_example",
+ "text": "Αθήνα ΕΛΛΑΔΑ"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "αθηνα",
+ "start_offset": 0,
+ "end_offset": 5,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "ελλαδα",
+ "start_offset": 6,
+ "end_offset": 12,
+ "type": "",
+ "position": 1
+ }
+ ]
+}
+```
diff --git a/_analyzers/token-filters/reverse.md b/_analyzers/token-filters/reverse.md
new file mode 100644
index 0000000000..dc48f07e77
--- /dev/null
+++ b/_analyzers/token-filters/reverse.md
@@ -0,0 +1,86 @@
+---
+layout: default
+title: Reverse
+parent: Token filters
+nav_order: 360
+---
+
+# Reverse token filter
+
+The `reverse` token filter reverses the order of the characters in each token, making suffix information accessible at the beginning of the reversed tokens during analysis.
+
+This makes the filter useful for suffix-based searches, such as in the following scenarios (a query sketch follows the list):
+
+- **Suffix matching**: Searching for words based on their suffixes, such as identifying words with a specific ending (for example, `-tion` or `-ing`).
+- **File extension searches**: Searching for files by their extensions, such as `.txt` or `.jpg`.
+- **Custom sorting or ranking**: By reversing tokens, you can implement unique sorting or ranking logic based on suffixes.
+- **Autocomplete for suffixes**: Implementing autocomplete suggestions that use suffixes rather than prefixes.
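+
+The following query sketch illustrates suffix matching. It assumes a hypothetical `filename` text field mapped to an analyzer that includes the `reverse` filter (such as `my_reverse_analyzer`, defined in the next section). Because indexed tokens are stored reversed, searching for file names ending in `.txt` becomes a `prefix` query for the reversed suffix `txt.` (the `prefix` query value is not analyzed, so you supply the reversed suffix yourself):
+
+```json
+GET /my-reverse-index/_search
+{
+  "query": {
+    "prefix": {
+      "filename": "txt."
+    }
+  }
+}
+```
+{% include copy-curl.html %}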
+
+
+## Example
+
+The following example request creates a new index named `my-reverse-index` and configures an analyzer with a `reverse` filter:
+
+```json
+PUT /my-reverse-index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "reverse_filter": {
+ "type": "reverse"
+ }
+ },
+ "analyzer": {
+ "my_reverse_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "reverse_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /my-reverse-index/_analyze
+{
+ "analyzer": "my_reverse_analyzer",
+ "text": "hello world"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "olleh",
+ "start_offset": 0,
+ "end_offset": 5,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "dlrow",
+ "start_offset": 6,
+ "end_offset": 11,
+ "type": "",
+ "position": 1
+ }
+ ]
+}
+```
\ No newline at end of file
diff --git a/_analyzers/token-filters/shingle.md b/_analyzers/token-filters/shingle.md
new file mode 100644
index 0000000000..ea961bf3e0
--- /dev/null
+++ b/_analyzers/token-filters/shingle.md
@@ -0,0 +1,120 @@
+---
+layout: default
+title: Shingle
+parent: Token filters
+nav_order: 370
+---
+
+# Shingle token filter
+
+The `shingle` token filter is used to generate word n-grams, or _shingles_, from input text. For example, for the string `slow green turtle`, the `shingle` filter creates the following one- and two-word shingles: `slow`, `slow green`, `green`, `green turtle`, and `turtle`.
+
+This token filter is often used in conjunction with other filters to enhance search accuracy by indexing phrases rather than individual tokens. For more information, see [Phrase suggester]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/did-you-mean/#phrase-suggester).
+
+## Parameters
+
+The `shingle` token filter can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`min_shingle_size` | Optional | Integer | The minimum number of tokens to concatenate. Default is `2`.
+`max_shingle_size` | Optional | Integer | The maximum number of tokens to concatenate. Default is `2`.
+`output_unigrams` | Optional | Boolean | Whether to include unigrams (individual tokens) as output. Default is `true`.
+`output_unigrams_if_no_shingles` | Optional | Boolean | Whether to output unigrams if no shingles are generated. Default is `false`.
+`token_separator` | Optional | String | A separator used to concatenate tokens into a shingle. Default is a space (`" "`).
+`filler_token` | Optional | String | A token inserted into empty positions or gaps between tokens. Default is an underscore (`_`).
+
+If `output_unigrams` and `output_unigrams_if_no_shingles` are both set to `true`, `output_unigrams_if_no_shingles` is ignored.
+{: .note}
+
+## Example
+
+The following example request creates a new index named `my-shingle-index` and configures an analyzer with a `shingle` filter:
+
+```json
+PUT /my-shingle-index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "my_shingle_filter": {
+ "type": "shingle",
+ "min_shingle_size": 2,
+ "max_shingle_size": 2,
+ "output_unigrams": true
+ }
+ },
+ "analyzer": {
+ "my_shingle_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "my_shingle_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /my-shingle-index/_analyze
+{
+ "analyzer": "my_shingle_analyzer",
+ "text": "slow green turtle"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "slow",
+ "start_offset": 0,
+ "end_offset": 4,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "slow green",
+ "start_offset": 0,
+ "end_offset": 10,
+ "type": "shingle",
+ "position": 0,
+ "positionLength": 2
+ },
+ {
+ "token": "green",
+ "start_offset": 5,
+ "end_offset": 10,
+ "type": "",
+ "position": 1
+ },
+ {
+ "token": "green turtle",
+ "start_offset": 5,
+ "end_offset": 17,
+ "type": "shingle",
+ "position": 1,
+ "positionLength": 2
+ },
+ {
+ "token": "turtle",
+ "start_offset": 11,
+ "end_offset": 17,
+ "type": "",
+ "position": 2
+ }
+ ]
+}
+```
\ No newline at end of file
diff --git a/_analyzers/token-filters/snowball.md b/_analyzers/token-filters/snowball.md
new file mode 100644
index 0000000000..149486e727
--- /dev/null
+++ b/_analyzers/token-filters/snowball.md
@@ -0,0 +1,108 @@
+---
+layout: default
+title: Snowball
+parent: Token filters
+nav_order: 380
+---
+
+# Snowball token filter
+
+The `snowball` token filter is a stemming filter based on the [Snowball](https://snowballstem.org/) algorithm. It supports many languages and is more efficient and accurate than the Porter stemming algorithm.
+
+## Parameters
+
+The `snowball` token filter can be configured with a `language` parameter that accepts the following values:
+
+- `Arabic`
+- `Armenian`
+- `Basque`
+- `Catalan`
+- `Danish`
+- `Dutch`
+- `English` (default)
+- `Estonian`
+- `Finnish`
+- `French`
+- `German`
+- `German2`
+- `Hungarian`
+- `Italian`
+- `Irish`
+- `Kp`
+- `Lithuanian`
+- `Lovins`
+- `Norwegian`
+- `Porter`
+- `Portuguese`
+- `Romanian`
+- `Russian`
+- `Spanish`
+- `Swedish`
+- `Turkish`
+
+## Example
+
+The following example request creates a new index named `my-snowball-index` and configures an analyzer with a `snowball` filter:
+
+```json
+PUT /my-snowball-index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "my_snowball_filter": {
+ "type": "snowball",
+ "language": "English"
+ }
+ },
+ "analyzer": {
+ "my_snowball_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "my_snowball_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /my-snowball-index/_analyze
+{
+ "analyzer": "my_snowball_analyzer",
+ "text": "running runners"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "run",
+ "start_offset": 0,
+ "end_offset": 7,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "runner",
+ "start_offset": 8,
+ "end_offset": 15,
+ "type": "",
+ "position": 1
+ }
+ ]
+}
+```
\ No newline at end of file
diff --git a/_analyzers/token-filters/stemmer-override.md b/_analyzers/token-filters/stemmer-override.md
new file mode 100644
index 0000000000..c06f673714
--- /dev/null
+++ b/_analyzers/token-filters/stemmer-override.md
@@ -0,0 +1,139 @@
+---
+layout: default
+title: Stemmer override
+parent: Token filters
+nav_order: 400
+---
+
+# Stemmer override token filter
+
+The `stemmer_override` token filter allows you to define custom stemming rules that override the behavior of default stemmers like Porter or Snowball. This can be useful when you want to apply specific stemming behavior to certain words that might not be modified correctly by the standard stemming algorithms.
+
+## Parameters
+
+The `stemmer_override` token filter must be configured with exactly one of the following parameters.
+
+Parameter | Data type | Description
+:--- | :--- | :---
+`rules` | String | Defines the override rules directly in the settings.
+`rules_path` | String | Specifies the path to the file containing custom rules (mappings). The path can be either an absolute path or a path relative to the config directory.
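+
+For example, the following sketch (with a hypothetical index name and file path) configures the filter using `rules_path` instead of inline rules. The referenced file uses the same `pattern => replacement` syntax as the `rules` parameter, one rule per line:
+
+```json
+PUT /my-rules-path-index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "my_stemmer_override_filter": {
+          "type": "stemmer_override",
+          "rules_path": "analysis/stemmer_override_rules.txt"
+        }
+      },
+      "analyzer": {
+        "my_custom_analyzer": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "my_stemmer_override_filter"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}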
+
+## Example
+
+The following example request creates a new index named `my-index` and configures an analyzer with a `stemmer_override` filter:
+
+```json
+PUT /my-index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "my_stemmer_override_filter": {
+ "type": "stemmer_override",
+ "rules": [
+ "running, runner => run",
+ "bought => buy",
+ "best => good"
+ ]
+ }
+ },
+ "analyzer": {
+ "my_custom_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "my_stemmer_override_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /my-index/_analyze
+{
+ "analyzer": "my_custom_analyzer",
+ "text": "I am a runner and bought the best shoes"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "i",
+ "start_offset": 0,
+ "end_offset": 1,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "am",
+ "start_offset": 2,
+ "end_offset": 4,
+ "type": "",
+ "position": 1
+ },
+ {
+ "token": "a",
+ "start_offset": 5,
+ "end_offset": 6,
+ "type": "",
+ "position": 2
+ },
+ {
+ "token": "run",
+ "start_offset": 7,
+ "end_offset": 13,
+ "type": "",
+ "position": 3
+ },
+ {
+ "token": "and",
+ "start_offset": 14,
+ "end_offset": 17,
+ "type": "",
+ "position": 4
+ },
+ {
+ "token": "buy",
+ "start_offset": 18,
+ "end_offset": 24,
+ "type": "",
+ "position": 5
+ },
+ {
+ "token": "the",
+ "start_offset": 25,
+ "end_offset": 28,
+ "type": "",
+ "position": 6
+ },
+ {
+ "token": "good",
+ "start_offset": 29,
+ "end_offset": 33,
+ "type": "",
+ "position": 7
+ },
+ {
+ "token": "shoes",
+ "start_offset": 34,
+ "end_offset": 39,
+ "type": "",
+ "position": 8
+ }
+ ]
+}
+```
\ No newline at end of file
diff --git a/_analyzers/token-filters/stemmer.md b/_analyzers/token-filters/stemmer.md
new file mode 100644
index 0000000000..dd1344fcbc
--- /dev/null
+++ b/_analyzers/token-filters/stemmer.md
@@ -0,0 +1,118 @@
+---
+layout: default
+title: Stemmer
+parent: Token filters
+nav_order: 390
+---
+
+# Stemmer token filter
+
+The `stemmer` token filter reduces words to their root or base form (also known as their _stem_).
+
+## Parameters
+
+The `stemmer` token filter can be configured with a `language` parameter that accepts the following values:
+
+- Arabic: `arabic`
+- Armenian: `armenian`
+- Basque: `basque`
+- Bengali: `bengali`
+- Brazilian Portuguese: `brazilian`
+- Bulgarian: `bulgarian`
+- Catalan: `catalan`
+- Czech: `czech`
+- Danish: `danish`
+- Dutch: `dutch, dutch_kp`
+- English: `english` (default), `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`
+- Estonian: `estonian`
+- Finnish: `finnish`, `light_finnish`
+- French: `light_french`, `french`, `minimal_french`
+- Galician: `galician`, `minimal_galician` (plural step only)
+- German: `light_german`, `german`, `german2`, `minimal_german`
+- Greek: `greek`
+- Hindi: `hindi`
+- Hungarian: `hungarian, light_hungarian`
+- Indonesian: `indonesian`
+- Irish: `irish`
+- Italian: `light_italian, italian`
+- Kurdish (Sorani): `sorani`
+- Latvian: `latvian`
+- Lithuanian: `lithuanian`
+- Norwegian (Bokmål): `norwegian`, `light_norwegian`, `minimal_norwegian`
+- Norwegian (Nynorsk): `light_nynorsk`, `minimal_nynorsk`
+- Portuguese: `light_portuguese`, `minimal_portuguese`, `portuguese`, `portuguese_rslp`
+- Romanian: `romanian`
+- Russian: `russian`, `light_russian`
+- Spanish: `light_spanish`, `spanish`
+- Swedish: `swedish`, `light_swedish`
+- Turkish: `turkish`
+
+You can also use the `name` parameter as an alias for the `language` parameter. If both are set, the `name` parameter is ignored.
+{: .note}
+
+## Example
+
+The following example request creates a new index named `my-stemmer-index` and configures an analyzer with a `stemmer` filter:
+
+```json
+PUT /my-stemmer-index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "my_english_stemmer": {
+ "type": "stemmer",
+ "language": "english"
+ }
+ },
+ "analyzer": {
+ "my_stemmer_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "my_english_stemmer"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /my-stemmer-index/_analyze
+{
+ "analyzer": "my_stemmer_analyzer",
+ "text": "running runs"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "run",
+ "start_offset": 0,
+ "end_offset": 7,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "run",
+ "start_offset": 8,
+ "end_offset": 12,
+ "type": "",
+ "position": 1
+ }
+ ]
+}
+```
\ No newline at end of file
diff --git a/_analyzers/token-filters/stop.md b/_analyzers/token-filters/stop.md
new file mode 100644
index 0000000000..8f3e01b72d
--- /dev/null
+++ b/_analyzers/token-filters/stop.md
@@ -0,0 +1,111 @@
+---
+layout: default
+title: Stop
+parent: Token filters
+nav_order: 410
+---
+
+# Stop token filter
+
+The `stop` token filter is used to remove common words (also known as _stopwords_) from a token stream during analysis. Stopwords are typically articles and prepositions, such as `a` or `for`. These words are not significantly meaningful in search queries and are often excluded to improve search efficiency and relevance.
+
+The default list of English stopwords includes the following words: `a`, `an`, `and`, `are`, `as`, `at`, `be`, `but`, `by`, `for`, `if`, `in`, `into`, `is`, `it`, `no`, `not`, `of`, `on`, `or`, `such`, `that`, `the`, `their`, `then`, `there`, `these`, `they`, `this`, `to`, `was`, `will`, and `with`.
+
+## Parameters
+
+The `stop` token filter can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`stopwords` | Optional | String | Specifies either a custom array of stopwords or a language for which to fetch the predefined Lucene stopword list: - [`_arabic_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/ar/stopwords.txt) - [`_armenian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/hy/stopwords.txt) - [`_basque_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/eu/stopwords.txt) - [`_bengali_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/bn/stopwords.txt) - [`_brazilian_` (Brazilian Portuguese)](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/br/stopwords.txt) - [`_bulgarian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/bg/stopwords.txt) - [`_catalan_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/ca/stopwords.txt) - [`_cjk_` (Chinese, Japanese, and Korean)](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/cjk/stopwords.txt) - [`_czech_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/cz/stopwords.txt) - [`_danish_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/danish_stop.txt) - [`_dutch_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/dutch_stop.txt) - [`_english_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L48) (Default) - [`_estonian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/et/stopwords.txt) - [`_finnish_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/finnish_stop.txt) - [`_french_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/french_stop.txt) - [`_galician_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/gl/stopwords.txt) - [`_german_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/german_stop.txt) - [`_greek_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/el/stopwords.txt) - [`_hindi_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/hi/stopwords.txt) - [`_hungarian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/hungarian_stop.txt) - [`_indonesian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/id/stopwords.txt) - [`_irish_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/ga/stopwords.txt) - [`_italian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/italian_stop.txt) - [`_latvian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/lv/stopwords.txt) - [`_lithuanian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/lt/stopwords.txt) - [`_norwegian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/norwegian_stop.txt) - [`_persian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/fa/stopwords.txt) - [`_portuguese_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/portuguese_stop.txt) - [`_romanian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt) - [`_russian_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/russian_stop.txt) - [`_sorani_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/ckb/stopwords.txt) - [`_spanish_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/spanish_stop.txt) - [`_swedish_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/swedish_stop.txt) - [`_thai_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/th/stopwords.txt) - [`_turkish_`](https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/tr/stopwords.txt)
+`stopwords_path` | Optional | String | Specifies the file path (absolute or relative to the config directory) of the file containing custom stopwords.
+`ignore_case` | Optional | Boolean | If `true`, stopwords will be matched regardless of their case. Default is `false`.
+`remove_trailing` | Optional | Boolean | If `true`, trailing stopwords will be removed during analysis. Default is `true`.
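+
+As a sketch of the custom-list form (the index and filter names are hypothetical), you can also supply the stopwords inline as an array instead of referencing a predefined language list:
+
+```json
+PUT /my-custom-stopword-index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "my_custom_stop_filter": {
+          "type": "stop",
+          "stopwords": [ "and", "is", "the" ]
+        }
+      },
+      "analyzer": {
+        "my_custom_stop_analyzer": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "my_custom_stop_filter"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}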
+
+## Example
+
+The following example request creates a new index named `my-stopword-index` and configures an analyzer with a `stop` filter that uses the predefined stopword list for the English language:
+
+```json
+PUT /my-stopword-index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "my_stop_filter": {
+ "type": "stop",
+ "stopwords": "_english_"
+ }
+ },
+ "analyzer": {
+ "my_stop_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "my_stop_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /my-stopword-index/_analyze
+{
+ "analyzer": "my_stop_analyzer",
+ "text": "A quick dog jumps over the turtle"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "quick",
+ "start_offset": 2,
+ "end_offset": 7,
+ "type": "",
+ "position": 1
+ },
+ {
+ "token": "dog",
+ "start_offset": 8,
+ "end_offset": 11,
+ "type": "",
+ "position": 2
+ },
+ {
+ "token": "jumps",
+ "start_offset": 12,
+ "end_offset": 17,
+ "type": "",
+ "position": 3
+ },
+ {
+ "token": "over",
+ "start_offset": 18,
+ "end_offset": 22,
+ "type": "",
+ "position": 4
+ },
+ {
+ "token": "turtle",
+ "start_offset": 27,
+ "end_offset": 33,
+ "type": "",
+ "position": 6
+ }
+ ]
+}
+```
\ No newline at end of file
diff --git a/_analyzers/token-filters/synonym-graph.md b/_analyzers/token-filters/synonym-graph.md
index 75c7c79151..d8e763d1fc 100644
--- a/_analyzers/token-filters/synonym-graph.md
+++ b/_analyzers/token-filters/synonym-graph.md
@@ -19,7 +19,7 @@ Parameter | Required/Optional | Data type | Description
`synonyms_path` | Either `synonyms` or `synonyms_path` must be specified | String | The file path to a file containing synonym rules (either an absolute path or a path relative to the config directory).
`lenient` | Optional | Boolean | Whether to ignore exceptions when loading the rule configurations. Default is `false`.
`format` | Optional | String | Specifies the format used to determine how OpenSearch defines and interprets synonyms. Valid values are: - `solr` - [`wordnet`](https://wordnet.princeton.edu/). Default is `solr`.
-`expand` | Optional | Boolean | Whether to expand equivalent synonym rules. Default is `false`. For example: If `synonyms` are defined as `"quick, fast"` and `expand` is set to `true`, then the synonym rules are configured as follows: - `quick => quick` - `quick => fast` - `fast => quick` - `fast => fast` If `expand` is set to `false`, the synonym rules are configured as follows: - `quick => quick` - `fast => quick`
+`expand` | Optional | Boolean | Whether to expand equivalent synonym rules. Default is `true`. For example: If `synonyms` are defined as `"quick, fast"` and `expand` is set to `true`, then the synonym rules are configured as follows: - `quick => quick` - `quick => fast` - `fast => quick` - `fast => fast` If `expand` is set to `false`, the synonym rules are configured as follows: - `quick => quick` - `fast => quick`
## Example: Solr format
diff --git a/_analyzers/token-filters/synonym.md b/_analyzers/token-filters/synonym.md
index a6865b14d7..a1dfff845d 100644
--- a/_analyzers/token-filters/synonym.md
+++ b/_analyzers/token-filters/synonym.md
@@ -2,7 +2,7 @@
layout: default
title: Synonym
parent: Token filters
-nav_order: 420
+nav_order: 415
---
# Synonym token filter
@@ -19,7 +19,7 @@ Parameter | Required/Optional | Data type | Description
`synonyms_path` | Either `synonyms` or `synonyms_path` must be specified | String | The file path to a file containing synonym rules (either an absolute path or a path relative to the config directory).
`lenient` | Optional | Boolean | Whether to ignore exceptions when loading the rule configurations. Default is `false`.
`format` | Optional | String | Specifies the format used to determine how OpenSearch defines and interprets synonyms. Valid values are: - `solr` - [`wordnet`](https://wordnet.princeton.edu/). Default is `solr`.
-`expand` | Optional | Boolean | Whether to expand equivalent synonym rules. Default is `false`. For example: If `synonyms` are defined as `"quick, fast"` and `expand` is set to `true`, then the synonym rules are configured as follows: - `quick => quick` - `quick => fast` - `fast => quick` - `fast => fast` If `expand` is set to `false`, the synonym rules are configured as follows: - `quick => quick` - `fast => quick`
+`expand` | Optional | Boolean | Whether to expand equivalent synonym rules. Default is `true`. For example: If `synonyms` are defined as `"quick, fast"` and `expand` is set to `true`, then the synonym rules are configured as follows: - `quick => quick` - `quick => fast` - `fast => quick` - `fast => fast` If `expand` is set to `false`, the synonym rules are configured as follows: - `quick => quick` - `fast => quick`
## Example: Solr format
diff --git a/_analyzers/token-filters/trim.md b/_analyzers/token-filters/trim.md
new file mode 100644
index 0000000000..cdfebed52f
--- /dev/null
+++ b/_analyzers/token-filters/trim.md
@@ -0,0 +1,93 @@
+---
+layout: default
+title: Trim
+parent: Token filters
+nav_order: 430
+---
+
+# Trim token filter
+
+The `trim` token filter removes leading and trailing white space characters from tokens.
+
+Many popular tokenizers, such as `standard`, `keyword`, and `whitespace` tokenizers, automatically strip leading and trailing white space characters during tokenization. When using these tokenizers, there is no need to configure an additional `trim` token filter.
+{: .note}
+
+
+## Example
+
+The following example request creates a new index named `my_pattern_trim_index` and configures an analyzer with a `trim` filter and a `pattern` tokenizer, which does not remove leading and trailing white space characters:
+
+```json
+PUT /my_pattern_trim_index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "my_trim_filter": {
+ "type": "trim"
+ }
+ },
+ "tokenizer": {
+ "my_pattern_tokenizer": {
+ "type": "pattern",
+ "pattern": ","
+ }
+ },
+ "analyzer": {
+ "my_pattern_trim_analyzer": {
+ "type": "custom",
+ "tokenizer": "my_pattern_tokenizer",
+ "filter": [
+ "lowercase",
+ "my_trim_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /my_pattern_trim_index/_analyze
+{
+ "analyzer": "my_pattern_trim_analyzer",
+ "text": " OpenSearch , is , powerful "
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "opensearch",
+ "start_offset": 0,
+ "end_offset": 12,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "is",
+ "start_offset": 13,
+ "end_offset": 18,
+ "type": "word",
+ "position": 1
+ },
+ {
+ "token": "powerful",
+ "start_offset": 19,
+ "end_offset": 32,
+ "type": "word",
+ "position": 2
+ }
+ ]
+}
+```
diff --git a/_analyzers/token-filters/truncate.md b/_analyzers/token-filters/truncate.md
new file mode 100644
index 0000000000..16d1452901
--- /dev/null
+++ b/_analyzers/token-filters/truncate.md
@@ -0,0 +1,107 @@
+---
+layout: default
+title: Truncate
+parent: Token filters
+nav_order: 440
+---
+
+# Truncate token filter
+
+The `truncate` token filter shortens tokens that exceed a specified length, trimming them to the configured maximum number of characters. Tokens at or below the limit pass through unchanged.
+
+## Parameters
+
+The `truncate` token filter can be configured with the following parameter.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`length` | Optional | Integer | Specifies the maximum length of the generated token. Default is `10`.
+
+## Example
+
+The following example request creates a new index named `truncate_example` and configures an analyzer with a `truncate` filter:
+
+```json
+PUT /truncate_example
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "truncate_filter": {
+ "type": "truncate",
+ "length": 5
+ }
+ },
+ "analyzer": {
+ "truncate_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "truncate_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /truncate_example/_analyze
+{
+ "analyzer": "truncate_analyzer",
+ "text": "OpenSearch is powerful and scalable"
+}
+
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "opens",
+ "start_offset": 0,
+ "end_offset": 10,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "is",
+ "start_offset": 11,
+ "end_offset": 13,
+ "type": "",
+ "position": 1
+ },
+ {
+ "token": "power",
+ "start_offset": 14,
+ "end_offset": 22,
+ "type": "",
+ "position": 2
+ },
+ {
+ "token": "and",
+ "start_offset": 23,
+ "end_offset": 26,
+ "type": "",
+ "position": 3
+ },
+ {
+ "token": "scala",
+ "start_offset": 27,
+ "end_offset": 35,
+ "type": "",
+ "position": 4
+ }
+ ]
+}
+```
diff --git a/_analyzers/token-filters/unique.md b/_analyzers/token-filters/unique.md
new file mode 100644
index 0000000000..c4dfcbab16
--- /dev/null
+++ b/_analyzers/token-filters/unique.md
@@ -0,0 +1,106 @@
+---
+layout: default
+title: Unique
+parent: Token filters
+nav_order: 450
+---
+
+# Unique token filter
+
+The `unique` token filter ensures that only unique tokens are kept during the analysis process, removing duplicate tokens that appear within a single field or text block.
+
+## Parameters
+
+The `unique` token filter can be configured with the following parameter.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`only_on_same_position` | Optional | Boolean | If `true`, the token filter acts as a `remove_duplicates` token filter and only removes tokens that are in the same position. Default is `false`.
+
+## Example
+
+The following example request creates a new index named `unique_example` and configures an analyzer with a `unique` filter:
+
+```json
+PUT /unique_example
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "unique_filter": {
+ "type": "unique",
+ "only_on_same_position": false
+ }
+ },
+ "analyzer": {
+ "unique_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "unique_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /unique_example/_analyze
+{
+ "analyzer": "unique_analyzer",
+ "text": "OpenSearch OpenSearch is powerful powerful and scalable"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "opensearch",
+ "start_offset": 0,
+ "end_offset": 10,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "is",
+ "start_offset": 22,
+ "end_offset": 24,
+ "type": "",
+ "position": 1
+ },
+ {
+ "token": "powerful",
+ "start_offset": 25,
+ "end_offset": 33,
+ "type": "",
+ "position": 2
+ },
+ {
+ "token": "and",
+ "start_offset": 43,
+ "end_offset": 46,
+ "type": "",
+ "position": 3
+ },
+ {
+ "token": "scalable",
+ "start_offset": 47,
+ "end_offset": 55,
+ "type": "",
+ "position": 4
+ }
+ ]
+}
+```
diff --git a/_analyzers/token-filters/uppercase.md b/_analyzers/token-filters/uppercase.md
new file mode 100644
index 0000000000..5026892400
--- /dev/null
+++ b/_analyzers/token-filters/uppercase.md
@@ -0,0 +1,83 @@
+---
+layout: default
+title: Uppercase
+parent: Token filters
+nav_order: 460
+---
+
+# Uppercase token filter
+
+The `uppercase` token filter is used to convert all tokens (words) to uppercase during analysis.
+
+## Example
+
+The following example request creates a new index named `uppercase_example` and configures an analyzer with an `uppercase` filter:
+
+```json
+PUT /uppercase_example
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "uppercase_filter": {
+ "type": "uppercase"
+ }
+ },
+ "analyzer": {
+ "uppercase_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "uppercase_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /uppercase_example/_analyze
+{
+ "analyzer": "uppercase_analyzer",
+ "text": "OpenSearch is powerful"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "OPENSEARCH",
+ "start_offset": 0,
+ "end_offset": 10,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "IS",
+ "start_offset": 11,
+ "end_offset": 13,
+ "type": "",
+ "position": 1
+ },
+ {
+ "token": "POWERFUL",
+ "start_offset": 14,
+ "end_offset": 22,
+ "type": "",
+ "position": 2
+ }
+ ]
+}
+```
diff --git a/_analyzers/token-filters/word-delimiter-graph.md b/_analyzers/token-filters/word-delimiter-graph.md
new file mode 100644
index 0000000000..b901f5a0e5
--- /dev/null
+++ b/_analyzers/token-filters/word-delimiter-graph.md
@@ -0,0 +1,164 @@
+---
+layout: default
+title: Word delimiter graph
+parent: Token filters
+nav_order: 480
+---
+
+# Word delimiter graph token filter
+
+The `word_delimiter_graph` token filter splits tokens on predefined characters and also offers optional token normalization based on customizable rules.
+
+The `word_delimiter_graph` filter is used to remove punctuation from complex identifiers like part numbers or product IDs. In such cases, it is best used with the `keyword` tokenizer. For hyphenated words, use the `synonym_graph` token filter instead of the `word_delimiter_graph` filter because users frequently search for these terms both with and without hyphens.
+{: .note}
+
+By default, the filter applies the following rules.
+
+| Description | Input | Output |
+|:---|:---|:---|
+| Treats non-alphanumeric characters as delimiters. | `ultra-fast` | `ultra`, `fast` |
+| Removes delimiters at the beginning or end of tokens. | `Z99++'Decoder'`| `Z99`, `Decoder` |
+| Splits tokens when there is a transition between uppercase and lowercase letters. | `OpenSearch` | `Open`, `Search` |
+| Splits tokens when there is a transition between letters and numbers. | `T1000` | `T`, `1000` |
+| Removes the possessive ('s) from the end of tokens. | `John's` | `John` |
+
+It's important **not** to use tokenizers that strip punctuation, like the `standard` tokenizer, with this filter. Doing so may prevent proper token splitting and interfere with options like `catenate_all` or `preserve_original`. We recommend using this filter with a `keyword` or `whitespace` tokenizer.
+{: .important}
+
+## Parameters
+
+You can configure the `word_delimiter_graph` token filter using the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`adjust_offsets` | Optional | Boolean | Determines whether the token offsets should be recalculated for split or concatenated tokens. When `true`, the filter adjusts the token offsets to accurately represent the token's position within the token stream. This adjustment ensures that the token's location in the text aligns with its modified form after processing, which is particularly useful for applications like highlighting or phrase queries. When `false`, the offsets remain unchanged, which may result in misalignment when the processed tokens are mapped back to their positions in the original text. If your analyzer uses filters like `trim` that change the token lengths without changing their offsets, we recommend setting this parameter to `false`. Default is `true`.
+`catenate_all` | Optional | Boolean | Produces concatenated tokens from a sequence of alphanumeric parts. For example, `"quick-fast-200"` becomes `[ quickfast200, quick, fast, 200 ]`. Default is `false`.
+`catenate_numbers` | Optional | Boolean | Concatenates numerical sequences. For example, `"10-20-30"` becomes `[ 102030, 10, 20, 30 ]`. Default is `false`.
+`catenate_words` | Optional | Boolean | Concatenates alphabetic words. For example, `"high-speed-level"` becomes `[ highspeedlevel, high, speed, level ]`. Default is `false`.
+`generate_number_parts` | Optional | Boolean | If `true`, numeric tokens (tokens consisting of numbers only) are included in the output. Default is `true`.
+`generate_word_parts` | Optional | Boolean | If `true`, alphabetical tokens (tokens consisting of alphabetic characters only) are included in the output. Default is `true`.
+`ignore_keywords` | Optional | Boolean | Whether to process tokens marked as keywords. Default is `false`.
+`preserve_original` | Optional | Boolean | Keeps the original token (which may include non-alphanumeric delimiters) alongside the generated tokens in the output. For example, `"auto-drive-300"` becomes `[ auto-drive-300, auto, drive, 300 ]`. If `true`, the filter generates multi-position tokens, which are not supported for indexing, so either avoid using this filter in an index analyzer or add the `flatten_graph` filter after it. Default is `false`.
+`protected_words` | Optional | Array of strings | Specifies tokens that should not be split.
+`protected_words_path` | Optional | String | Specifies a path (absolute or relative to the config directory) to a file containing a list of tokens that should not be split, with one token per line.
+`split_on_case_change` | Optional | Boolean | Splits tokens where consecutive letters have different cases (one is lowercase and the other is uppercase). For example, `"OpenSearch"` becomes `[ Open, Search ]`. Default is `true`.
+`split_on_numerics` | Optional | Boolean | Splits tokens where there are consecutive letters and numbers. For example, `"v8engine"` becomes `[ v, 8, engine ]`. Default is `true`.
+`stem_english_possessive` | Optional | Boolean | Removes English possessive endings, such as `'s`. Default is `true`.
+`type_table` | Optional | Array of strings | A custom map that specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For example, to treat a hyphen (`-`) as an alphanumeric character, specify `["- => ALPHA"]` so that words are not split on hyphens. Valid types are: - `ALPHA`: alphabetical - `ALPHANUM`: alphanumeric - `DIGIT`: numeric - `LOWER`: lowercase alphabetical - `SUBWORD_DELIM`: non-alphanumeric delimiter - `UPPER`: uppercase alphabetical
+`type_table_path` | Optional | String | Specifies a path (absolute or relative to the config directory) to a file containing a custom character map. The map specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For valid types, see `type_table`.
+
+## Example
+
+The following example request creates a new index named `my-custom-index` and configures an analyzer with a `word_delimiter_graph` filter:
+
+```json
+PUT /my-custom-index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "custom_analyzer": {
+ "tokenizer": "keyword",
+ "filter": [ "custom_word_delimiter_filter" ]
+ }
+ },
+ "filter": {
+ "custom_word_delimiter_filter": {
+ "type": "word_delimiter_graph",
+ "split_on_case_change": true,
+ "split_on_numerics": true,
+ "stem_english_possessive": true
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /my-custom-index/_analyze
+{
+ "analyzer": "custom_analyzer",
+ "text": "FastCar's Model2023"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "Fast",
+ "start_offset": 0,
+ "end_offset": 4,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "Car",
+ "start_offset": 4,
+ "end_offset": 7,
+ "type": "word",
+ "position": 1
+ },
+ {
+ "token": "Model",
+ "start_offset": 10,
+ "end_offset": 15,
+ "type": "word",
+ "position": 2
+ },
+ {
+ "token": "2023",
+ "start_offset": 15,
+ "end_offset": 19,
+ "type": "word",
+ "position": 3
+ }
+ ]
+}
+```
+
+
+## Differences between the word_delimiter_graph and word_delimiter filters
+
+
+Both the `word_delimiter_graph` and `word_delimiter` token filters generate tokens spanning multiple positions when any of the following parameters are set to `true`:
+
+- `catenate_all`
+- `catenate_numbers`
+- `catenate_words`
+- `preserve_original`
+
+To illustrate the differences between these filters, consider the input text `Pro-XT500`.
+
+
+### word_delimiter_graph
+
+
+The `word_delimiter_graph` filter assigns a `positionLength` attribute to multi-position tokens, indicating how many positions a token spans. This ensures that the filter always generates valid token graphs, making it suitable for use in advanced token graph scenarios. Although token graphs with multi-position tokens are not supported for indexing, they can still be useful in search scenarios. For example, queries like `match_phrase` can use these graphs to generate multiple subqueries from a single input string. For the example input text, the `word_delimiter_graph` filter generates the following tokens:
+
+- `Pro` (position 1)
+- `XT500` (position 2)
+- `ProXT500` (position 1, `positionLength`: 2)
+
+The `positionLength` attribute enables the production of a valid token graph that can be used in advanced queries.
+
+
+### word_delimiter
+
+
+In contrast, the `word_delimiter` filter does not assign a `positionLength` attribute to multi-position tokens, leading to invalid graphs when these tokens are present. For the example input text, the `word_delimiter` filter generates the following tokens:
+
+- `Pro` (position 1)
+- `XT500` (position 2)
+- `ProXT500` (position 1, no `positionLength`)
+
+The lack of a `positionLength` attribute results in a token graph that is invalid for token streams containing multi-position tokens.
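+
+To inspect these attributes yourself, you can pass an inline filter definition to the `_analyze` API and set `explain` to `true`. The following request is a sketch of such a check; `catenate_all` is enabled so that the multi-position token is produced, and the detailed output should include per-token attributes such as `positionLength`:
+
+```json
+GET /_analyze
+{
+  "tokenizer": "keyword",
+  "filter": [
+    {
+      "type": "word_delimiter_graph",
+      "catenate_all": true
+    }
+  ],
+  "text": "Pro-XT500",
+  "explain": true
+}
+```
+{% include copy-curl.html %}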
\ No newline at end of file
diff --git a/_analyzers/token-filters/word-delimiter.md b/_analyzers/token-filters/word-delimiter.md
new file mode 100644
index 0000000000..77a71f28fb
--- /dev/null
+++ b/_analyzers/token-filters/word-delimiter.md
@@ -0,0 +1,128 @@
+---
+layout: default
+title: Word delimiter
+parent: Token filters
+nav_order: 470
+---
+
+# Word delimiter token filter
+
+The `word_delimiter` token filter splits tokens on predefined characters and also offers optional token normalization based on customizable rules.
+
+We recommend using the `word_delimiter_graph` filter instead of the `word_delimiter` filter whenever possible because the `word_delimiter` filter sometimes produces invalid token graphs. For more information about the differences between the two filters, see [Differences between the `word_delimiter_graph` and `word_delimiter` filters]({{site.url}}{{site.baseurl}}/analyzers/token-filters/word-delimiter-graph/#differences-between-the-word_delimiter_graph-and-word_delimiter-filters).
+{: .important}
+
+The `word_delimiter` filter is used to remove punctuation from complex identifiers like part numbers or product IDs. In such cases, it is best used with the `keyword` tokenizer. For hyphenated words, use the `synonym_graph` token filter instead of the `word_delimiter` filter because users frequently search for these terms both with and without hyphens.
+{: .note}
+
+By default, the filter applies the following rules.
+
+| Description | Input | Output |
+|:---|:---|:---|
+| Treats non-alphanumeric characters as delimiters. | `ultra-fast` | `ultra`, `fast` |
+| Removes delimiters at the beginning or end of tokens. | `Z99++'Decoder'`| `Z99`, `Decoder` |
+| Splits tokens when there is a transition between uppercase and lowercase letters. | `OpenSearch` | `Open`, `Search` |
+| Splits tokens when there is a transition between letters and numbers. | `T1000` | `T`, `1000` |
+| Removes the possessive ('s) from the end of tokens. | `John's` | `John` |
+
+It's important **not** to use tokenizers that strip punctuation, like the `standard` tokenizer, with this filter. Doing so may prevent proper token splitting and interfere with options like `catenate_all` or `preserve_original`. We recommend using this filter with a `keyword` or `whitespace` tokenizer.
+{: .important}
+
+## Parameters
+
+You can configure the `word_delimiter` token filter using the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`catenate_all` | Optional | Boolean | Produces concatenated tokens from a sequence of alphanumeric parts. For example, `"quick-fast-200"` becomes `[ quickfast200, quick, fast, 200 ]`. Default is `false`.
+`catenate_numbers` | Optional | Boolean | Concatenates numerical sequences. For example, `"10-20-30"` becomes `[ 102030, 10, 20, 30 ]`. Default is `false`.
+`catenate_words` | Optional | Boolean | Concatenates alphabetic words. For example, `"high-speed-level"` becomes `[ highspeedlevel, high, speed, level ]`. Default is `false`.
+`generate_number_parts` | Optional | Boolean | If `true`, numeric tokens (tokens consisting of numbers only) are included in the output. Default is `true`.
+`generate_word_parts` | Optional | Boolean | If `true`, alphabetical tokens (tokens consisting of alphabetic characters only) are included in the output. Default is `true`.
+`preserve_original` | Optional | Boolean | Keeps the original token (which may include non-alphanumeric delimiters) alongside the generated tokens in the output. For example, `"auto-drive-300"` becomes `[ auto-drive-300, auto, drive, 300 ]`. If `true`, the filter generates multi-position tokens, which are not supported for indexing, so either avoid using this filter in an index analyzer or add the `flatten_graph` filter after it. Default is `false`.
+`protected_words` | Optional | Array of strings | Specifies tokens that should not be split.
+`protected_words_path` | Optional | String | Specifies a path (absolute or relative to the config directory) to a file containing a list of tokens that should not be split, with one token per line.
+`split_on_case_change` | Optional | Boolean | Splits tokens where consecutive letters have different cases (one is lowercase and the other is uppercase). For example, `"OpenSearch"` becomes `[ Open, Search ]`. Default is `true`.
+`split_on_numerics` | Optional | Boolean | Splits tokens where there are consecutive letters and numbers. For example, `"v8engine"` becomes `[ v, 8, engine ]`. Default is `true`.
+`stem_english_possessive` | Optional | Boolean | Removes English possessive endings, such as `'s`. Default is `true`.
+`type_table` | Optional | Array of strings | A custom map that specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For example, to treat a hyphen (`-`) as an alphanumeric character, specify `["- => ALPHA"]` so that words are not split on hyphens. Valid types are: - `ALPHA`: alphabetical - `ALPHANUM`: alphanumeric - `DIGIT`: numeric - `LOWER`: lowercase alphabetical - `SUBWORD_DELIM`: non-alphanumeric delimiter - `UPPER`: uppercase alphabetical
+`type_table_path` | Optional | String | Specifies a path (absolute or relative to the config directory) to a file containing a custom character map. The map specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For valid types, see `type_table`.
+
+## Example
+
+The following example request creates a new index named `my-custom-index` and configures an analyzer with a `word_delimiter` filter:
+
+```json
+PUT /my-custom-index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "custom_analyzer": {
+ "tokenizer": "keyword",
+ "filter": [ "custom_word_delimiter_filter" ]
+ }
+ },
+ "filter": {
+ "custom_word_delimiter_filter": {
+ "type": "word_delimiter",
+ "split_on_case_change": true,
+ "split_on_numerics": true,
+ "stem_english_possessive": true
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /my-custom-index/_analyze
+{
+ "analyzer": "custom_analyzer",
+ "text": "FastCar's Model2023"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "Fast",
+ "start_offset": 0,
+ "end_offset": 4,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "Car",
+ "start_offset": 4,
+ "end_offset": 7,
+ "type": "word",
+ "position": 1
+ },
+ {
+ "token": "Model",
+ "start_offset": 10,
+ "end_offset": 15,
+ "type": "word",
+ "position": 2
+ },
+ {
+ "token": "2023",
+ "start_offset": 15,
+ "end_offset": 19,
+ "type": "word",
+ "position": 3
+ }
+ ]
+}
+```
diff --git a/_analyzers/tokenizers/index.md b/_analyzers/tokenizers/index.md
index e5ac796c12..f5b5ff0f25 100644
--- a/_analyzers/tokenizers/index.md
+++ b/_analyzers/tokenizers/index.md
@@ -2,7 +2,7 @@
layout: default
title: Tokenizers
nav_order: 60
-has_children: false
+has_children: true
has_toc: false
redirect_from:
- /analyzers/tokenizers/index/
@@ -56,7 +56,7 @@ Tokenizer | Description | Example
`keyword` | - No-op tokenizer - Outputs the entire string unchanged - Can be combined with token filters, like lowercase, to normalize terms | `My repo` becomes `My repo`
`pattern` | - Uses a regular expression pattern to parse text into terms on a word separator or to capture matching text as terms - Uses [Java regular expressions](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) | `https://opensearch.org/forum` becomes [`https`, `opensearch`, `org`, `forum`] because by default the tokenizer splits terms at word boundaries (`\W+`) Can be configured with a regex pattern
`simple_pattern` | - Uses a regular expression pattern to return matching text as terms - Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html) - Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | Returns an empty array by default Must be configured with a pattern because the pattern defaults to an empty string
-`simple_pattern_split` | - Uses a regular expression pattern to split the text at matches rather than returning the matches as terms - Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html) - Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | No-op by default Must be configured with a pattern
+`simple_pattern_split` | - Uses a regular expression pattern to split the text on matches rather than returning the matches as terms - Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html) - Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | No-op by default Must be configured with a pattern
`char_group` | - Parses on a set of configurable characters - Faster than tokenizers that run regular expressions | No-op by default Must be configured with a list of characters
`path_hierarchy` | - Parses text on the path separator (by default, `/`) and returns a full path to each component in the tree hierarchy | `one/two/three` becomes [`one`, `one/two`, `one/two/three`]
diff --git a/_analyzers/tokenizers/lowercase.md b/_analyzers/tokenizers/lowercase.md
new file mode 100644
index 0000000000..5542ecbf50
--- /dev/null
+++ b/_analyzers/tokenizers/lowercase.md
@@ -0,0 +1,93 @@
+---
+layout: default
+title: Lowercase
+parent: Tokenizers
+nav_order: 70
+---
+
+# Lowercase tokenizer
+
+The `lowercase` tokenizer breaks text into terms whenever it encounters a character that is not a letter and then lowercases each term. Functionally, this is identical to configuring a `letter` tokenizer with a `lowercase` token filter. However, using a `lowercase` tokenizer is more efficient because the tokenizer actions are performed in a single step.
+
+## Example usage
+
+The following example request creates a new index named `my-lowercase-index` and configures an analyzer with a `lowercase` tokenizer:
+
+```json
+PUT /my-lowercase-index
+{
+ "settings": {
+ "analysis": {
+ "tokenizer": {
+ "my_lowercase_tokenizer": {
+ "type": "lowercase"
+ }
+ },
+ "analyzer": {
+ "my_lowercase_analyzer": {
+ "type": "custom",
+ "tokenizer": "my_lowercase_tokenizer"
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my-lowercase-index/_analyze
+{
+ "analyzer": "my_lowercase_analyzer",
+ "text": "This is a Test. OpenSearch 123!"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "this",
+ "start_offset": 0,
+ "end_offset": 4,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "is",
+ "start_offset": 5,
+ "end_offset": 7,
+ "type": "word",
+ "position": 1
+ },
+ {
+ "token": "a",
+ "start_offset": 8,
+ "end_offset": 9,
+ "type": "word",
+ "position": 2
+ },
+ {
+ "token": "test",
+ "start_offset": 10,
+ "end_offset": 14,
+ "type": "word",
+ "position": 3
+ },
+ {
+ "token": "opensearch",
+ "start_offset": 16,
+ "end_offset": 26,
+ "type": "word",
+ "position": 4
+ }
+ ]
+}
+```
diff --git a/_analyzers/tokenizers/ngram.md b/_analyzers/tokenizers/ngram.md
new file mode 100644
index 0000000000..08ac456267
--- /dev/null
+++ b/_analyzers/tokenizers/ngram.md
@@ -0,0 +1,111 @@
+---
+layout: default
+title: N-gram
+parent: Tokenizers
+nav_order: 80
+---
+
+# N-gram tokenizer
+
+The `ngram` tokenizer splits text into overlapping n-grams (sequences of characters) of a specified length. This tokenizer is particularly useful when you want to perform partial word matching or autocomplete search functionality because it generates substrings (character n-grams) of the original input text.
+
+## Example usage
+
+The following example request creates a new index named `my_index` and configures an analyzer with an `ngram` tokenizer:
+
+```json
+PUT /my_index
+{
+ "settings": {
+ "analysis": {
+ "tokenizer": {
+ "my_ngram_tokenizer": {
+ "type": "ngram",
+ "min_gram": 3,
+ "max_gram": 4,
+ "token_chars": ["letter", "digit"]
+ }
+ },
+ "analyzer": {
+ "my_ngram_analyzer": {
+ "type": "custom",
+ "tokenizer": "my_ngram_tokenizer"
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_index/_analyze
+{
+ "analyzer": "my_ngram_analyzer",
+ "text": "OpenSearch"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {"token": "Sea","start_offset": 0,"end_offset": 3,"type": "word","position": 0},
+ {"token": "Sear","start_offset": 0,"end_offset": 4,"type": "word","position": 1},
+ {"token": "ear","start_offset": 1,"end_offset": 4,"type": "word","position": 2},
+ {"token": "earc","start_offset": 1,"end_offset": 5,"type": "word","position": 3},
+ {"token": "arc","start_offset": 2,"end_offset": 5,"type": "word","position": 4},
+ {"token": "arch","start_offset": 2,"end_offset": 6,"type": "word","position": 5},
+ {"token": "rch","start_offset": 3,"end_offset": 6,"type": "word","position": 6}
+ ]
+}
+```
+
+## Parameters
+
+The `ngram` tokenizer can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`min_gram` | Optional | Integer | The minimum length of the n-grams. Default is `1`.
+`max_gram` | Optional | Integer | The maximum length of the n-grams. Default is `2`.
+`token_chars` | Optional | List of strings | The character classes to be included in tokenization. Valid values are: - `letter` - `digit` - `whitespace` - `punctuation` - `symbol` - `custom` (You must also specify the `custom_token_chars` parameter) Default is an empty list (`[]`), which retains all the characters.
+`custom_token_chars` | Optional | String | Custom characters to be included in the tokens.
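+
+For example, the following sketch configures a tokenizer that also treats the plus and minus signs as token characters; the index name `my-custom-chars-index` is illustrative:
+
+```json
+PUT /my-custom-chars-index
+{
+  "settings": {
+    "analysis": {
+      "tokenizer": {
+        "my_custom_ngram_tokenizer": {
+          "type": "ngram",
+          "min_gram": 3,
+          "max_gram": 4,
+          "token_chars": ["letter", "digit", "custom"],
+          "custom_token_chars": "+-"
+        }
+      },
+      "analyzer": {
+        "my_custom_ngram_analyzer": {
+          "type": "custom",
+          "tokenizer": "my_custom_ngram_tokenizer"
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}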
+
+### Maximum difference between `min_gram` and `max_gram`
+
+The maximum difference between `min_gram` and `max_gram` is configured using the index-level `index.max_ngram_diff` setting and defaults to `1`.
+
+The following example request creates an index with a custom `index.max_ngram_diff` setting:
+
+```json
+PUT /my-index
+{
+ "settings": {
+ "index.max_ngram_diff": 2,
+ "analysis": {
+ "tokenizer": {
+ "my_ngram_tokenizer": {
+ "type": "ngram",
+ "min_gram": 3,
+ "max_gram": 5,
+ "token_chars": ["letter", "digit"]
+ }
+ },
+ "analyzer": {
+ "my_ngram_analyzer": {
+ "type": "custom",
+ "tokenizer": "my_ngram_tokenizer"
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
diff --git a/_analyzers/tokenizers/path-hierarchy.md b/_analyzers/tokenizers/path-hierarchy.md
new file mode 100644
index 0000000000..a6609f30cd
--- /dev/null
+++ b/_analyzers/tokenizers/path-hierarchy.md
@@ -0,0 +1,182 @@
+---
+layout: default
+title: Path hierarchy
+parent: Tokenizers
+nav_order: 90
+---
+
+# Path hierarchy tokenizer
+
+The `path_hierarchy` tokenizer tokenizes file-system-like paths (or similar hierarchical structures) by breaking them down into tokens at each hierarchy level. This tokenizer is particularly useful when working with hierarchical data such as file paths, URLs, or any other delimited paths.
+
+## Example usage
+
+The following example request creates a new index named `my_index` and configures an analyzer with a `path_hierarchy` tokenizer:
+
+```json
+PUT /my_index
+{
+ "settings": {
+ "analysis": {
+ "tokenizer": {
+ "my_path_tokenizer": {
+ "type": "path_hierarchy"
+ }
+ },
+ "analyzer": {
+ "my_path_analyzer": {
+ "type": "custom",
+ "tokenizer": "my_path_tokenizer"
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_index/_analyze
+{
+ "analyzer": "my_path_analyzer",
+ "text": "/users/john/documents/report.txt"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "/users",
+ "start_offset": 0,
+ "end_offset": 6,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "/users/john",
+ "start_offset": 0,
+ "end_offset": 11,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "/users/john/documents",
+ "start_offset": 0,
+ "end_offset": 21,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "/users/john/documents/report.txt",
+ "start_offset": 0,
+ "end_offset": 32,
+ "type": "word",
+ "position": 0
+ }
+ ]
+}
+```
+
+## Parameters
+
+The `path_hierarchy` tokenizer can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`delimiter` | Optional | String | Specifies the character used to separate path components. Default is `/`.
+`replacement` | Optional | String | Configures the character used to replace the delimiter in the tokens. Default is the `delimiter` value.
+`buffer_size` | Optional | Integer | Specifies the number of characters read into the term buffer in a single pass. Default is `1024`.
+`reverse` | Optional | Boolean | If `true`, generates tokens in reverse order. Default is `false`.
+`skip` | Optional | Integer | Specifies the number of initial tokens (levels) to skip when tokenizing. Default is `0`.
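+
+## Example using the reverse parameter
+
+Setting `reverse` to `true` emits suffixes of the path that all end with the leaf component instead of prefixes that start at the root. The following request is a minimal sketch that applies this setting using an inline tokenizer definition in the `_analyze` API:
+
+```json
+POST /_analyze
+{
+  "tokenizer": {
+    "type": "path_hierarchy",
+    "reverse": true
+  },
+  "text": "/users/john/documents/report.txt"
+}
+```
+{% include copy-curl.html %}
+
+For this input, the generated terms should include suffixes such as `documents/report.txt` and `report.txt` rather than root-anchored prefixes such as `/users` and `/users/john`.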
+
+## Example using delimiter and replacement parameters
+
+The following example request configures custom `delimiter` and `replacement` parameters:
+
+```json
+PUT /my_index
+{
+ "settings": {
+ "analysis": {
+ "tokenizer": {
+ "my_path_tokenizer": {
+ "type": "path_hierarchy",
+ "delimiter": "\\",
+ "replacement": "\\"
+ }
+ },
+ "analyzer": {
+ "my_path_analyzer": {
+ "type": "custom",
+ "tokenizer": "my_path_tokenizer"
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_index/_analyze
+{
+ "analyzer": "my_path_analyzer",
+ "text": "C:\\users\\john\\documents\\report.txt"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "C:",
+ "start_offset": 0,
+ "end_offset": 2,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": """C:\users""",
+ "start_offset": 0,
+ "end_offset": 8,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": """C:\users\john""",
+ "start_offset": 0,
+ "end_offset": 13,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": """C:\users\john\documents""",
+ "start_offset": 0,
+ "end_offset": 23,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": """C:\users\john\documents\report.txt""",
+ "start_offset": 0,
+ "end_offset": 34,
+ "type": "word",
+ "position": 0
+ }
+ ]
+}
+```
\ No newline at end of file
diff --git a/_analyzers/tokenizers/pattern.md b/_analyzers/tokenizers/pattern.md
new file mode 100644
index 0000000000..036dd9050f
--- /dev/null
+++ b/_analyzers/tokenizers/pattern.md
@@ -0,0 +1,167 @@
+---
+layout: default
+title: Pattern
+parent: Tokenizers
+nav_order: 100
+---
+
+# Pattern tokenizer
+
+The `pattern` tokenizer is a highly flexible tokenizer that allows you to split text into tokens based on a custom Java regular expression. Unlike the `simple_pattern` and `simple_pattern_split` tokenizers, which use Lucene regular expressions, the `pattern` tokenizer can handle more complex and detailed regex patterns, offering greater control over how the text is tokenized.
+
+## Example usage
+
+The following example request creates a new index named `my_index` and configures an analyzer with a `pattern` tokenizer. The tokenizer splits text on `-`, `_`, or `.` characters:
+
+```json
+PUT /my_index
+{
+ "settings": {
+ "analysis": {
+ "tokenizer": {
+ "my_pattern_tokenizer": {
+ "type": "pattern",
+ "pattern": "[-_.]"
+ }
+ },
+ "analyzer": {
+ "my_pattern_analyzer": {
+ "type": "custom",
+ "tokenizer": "my_pattern_tokenizer"
+ }
+ }
+ }
+ },
+ "mappings": {
+ "properties": {
+ "content": {
+ "type": "text",
+ "analyzer": "my_pattern_analyzer"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_index/_analyze
+{
+ "analyzer": "my_pattern_analyzer",
+ "text": "OpenSearch-2024_v1.2"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "OpenSearch",
+ "start_offset": 0,
+ "end_offset": 10,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "2024",
+ "start_offset": 11,
+ "end_offset": 15,
+ "type": "word",
+ "position": 1
+ },
+ {
+ "token": "v1",
+ "start_offset": 16,
+ "end_offset": 18,
+ "type": "word",
+ "position": 2
+ },
+ {
+ "token": "2",
+ "start_offset": 19,
+ "end_offset": 20,
+ "type": "word",
+ "position": 3
+ }
+ ]
+}
+```
+
+## Parameters
+
+The `pattern` tokenizer can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`pattern` | Optional | String | The pattern used to split text into tokens, specified using a [Java regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). Default is `\W+`.
+`flags` | Optional | String | Configures pipe-separated [flags](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary) to apply to the regular expression, for example, `"CASE_INSENSITIVE\|MULTILINE\|DOTALL"`.
+`group` | Optional | Integer | Specifies the capture group to be used as a token. Default is `-1` (split on a match).
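+
+## Example using the flags parameter
+
+The following request is a minimal sketch that uses the `flags` parameter to make the pattern case insensitive; the index name `my_index_flags` is illustrative:
+
+```json
+PUT /my_index_flags
+{
+  "settings": {
+    "analysis": {
+      "tokenizer": {
+        "my_case_insensitive_tokenizer": {
+          "type": "pattern",
+          "pattern": "\\s*and\\s*",
+          "flags": "CASE_INSENSITIVE"
+        }
+      },
+      "analyzer": {
+        "my_case_insensitive_analyzer": {
+          "type": "custom",
+          "tokenizer": "my_case_insensitive_tokenizer"
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Because the pattern is applied case insensitively, both `and` and `AND` act as separators, so text such as `apples AND oranges and bananas` should produce the tokens `apples`, `oranges`, and `bananas`.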
+
+## Example using a group parameter
+
+The following example request configures a `group` parameter that captures only the second group:
+
+```json
+PUT /my_index_group2
+{
+ "settings": {
+ "analysis": {
+ "tokenizer": {
+ "my_pattern_tokenizer": {
+ "type": "pattern",
+ "pattern": "([a-zA-Z]+)(\\d+)",
+ "group": 2
+ }
+ },
+ "analyzer": {
+ "my_pattern_analyzer": {
+ "type": "custom",
+ "tokenizer": "my_pattern_tokenizer"
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_index_group2/_analyze
+{
+ "analyzer": "my_pattern_analyzer",
+ "text": "abc123def456ghi"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "123",
+ "start_offset": 3,
+ "end_offset": 6,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "456",
+ "start_offset": 9,
+ "end_offset": 12,
+ "type": "word",
+ "position": 1
+ }
+ ]
+}
+```
\ No newline at end of file
diff --git a/_analyzers/tokenizers/simple-pattern-split.md b/_analyzers/tokenizers/simple-pattern-split.md
new file mode 100644
index 0000000000..25367f25b5
--- /dev/null
+++ b/_analyzers/tokenizers/simple-pattern-split.md
@@ -0,0 +1,105 @@
+---
+layout: default
+title: Simple pattern split
+parent: Tokenizers
+nav_order: 120
+---
+
+# Simple pattern split tokenizer
+
+The `simple_pattern_split` tokenizer uses a regular expression to split text into tokens. The regular expression defines the pattern used to determine where to split the text. Any matching pattern in the text is used as a delimiter, and the text between delimiters becomes a token. Use this tokenizer when you want to define delimiters and tokenize the rest of the text based on a pattern.
+
+The tokenizer uses the matched parts of the input text (based on the regular expression) only as delimiters or boundaries to split the text into terms. The matched portions are not included in the resulting terms. For example, if the tokenizer is configured to split text at dot characters (`.`) and the input text is `one.two.three`, then the generated terms are `one`, `two`, and `three`. The dot characters themselves are not included in the resulting terms.
+
+## Example usage
+
+The following example request creates a new index named `my_index` and configures an analyzer with a `simple_pattern_split` tokenizer. The tokenizer is configured to split text on hyphens:
+
+```json
+PUT /my_index
+{
+ "settings": {
+ "analysis": {
+ "tokenizer": {
+ "my_pattern_split_tokenizer": {
+ "type": "simple_pattern_split",
+ "pattern": "-"
+ }
+ },
+ "analyzer": {
+ "my_pattern_split_analyzer": {
+ "type": "custom",
+ "tokenizer": "my_pattern_split_tokenizer"
+ }
+ }
+ }
+ },
+ "mappings": {
+ "properties": {
+ "content": {
+ "type": "text",
+ "analyzer": "my_pattern_split_analyzer"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_index/_analyze
+{
+ "analyzer": "my_pattern_split_analyzer",
+ "text": "OpenSearch-2024-10-09"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "OpenSearch",
+ "start_offset": 0,
+ "end_offset": 10,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "2024",
+ "start_offset": 11,
+ "end_offset": 15,
+ "type": "word",
+ "position": 1
+ },
+ {
+ "token": "10",
+ "start_offset": 16,
+ "end_offset": 18,
+ "type": "word",
+ "position": 2
+ },
+ {
+ "token": "09",
+ "start_offset": 19,
+ "end_offset": 21,
+ "type": "word",
+ "position": 3
+ }
+ ]
+}
+```
+
+## Parameters
+
+The `simple_pattern_split` tokenizer can be configured with the following parameter.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`pattern` | Optional | String | The pattern used to split text into tokens, specified using a [Lucene regular expression](https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/util/automaton/RegExp.html). Default is an empty string, which returns the input text as one token.
\ No newline at end of file
diff --git a/_analyzers/tokenizers/simple-pattern.md b/_analyzers/tokenizers/simple-pattern.md
new file mode 100644
index 0000000000..eacddd6992
--- /dev/null
+++ b/_analyzers/tokenizers/simple-pattern.md
@@ -0,0 +1,89 @@
+---
+layout: default
+title: Simple pattern
+parent: Tokenizers
+nav_order: 110
+---
+
+# Simple pattern tokenizer
+
+The `simple_pattern` tokenizer identifies matching sequences in text based on a regular expression and uses those sequences as tokens. It extracts terms that match the regular expression. Use this tokenizer when you want to directly extract specific patterns as terms.
+
+## Example usage
+
+The following example request creates a new index named `my_index` and configures an analyzer with a `simple_pattern` tokenizer. The tokenizer extracts numeric terms from text:
+
+```json
+PUT /my_index
+{
+ "settings": {
+ "analysis": {
+ "tokenizer": {
+ "my_pattern_tokenizer": {
+ "type": "simple_pattern",
+ "pattern": "\\d+"
+ }
+ },
+ "analyzer": {
+ "my_pattern_analyzer": {
+ "type": "custom",
+ "tokenizer": "my_pattern_tokenizer"
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_index/_analyze
+{
+ "analyzer": "my_pattern_analyzer",
+ "text": "OpenSearch-2024-10-09"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "2024",
+ "start_offset": 11,
+ "end_offset": 15,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "10",
+ "start_offset": 16,
+ "end_offset": 18,
+ "type": "word",
+ "position": 1
+ },
+ {
+ "token": "09",
+ "start_offset": 19,
+ "end_offset": 21,
+ "type": "word",
+ "position": 2
+ }
+ ]
+}
+```
+
+## Parameters
+
+The `simple_pattern` tokenizer can be configured with the following parameter.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`pattern` | Optional | String | The pattern used to match and extract tokens, specified using a [Lucene regular expression](https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/util/automaton/RegExp.html). Default is an empty string, which produces no tokens.
+
diff --git a/_analyzers/tokenizers/standard.md b/_analyzers/tokenizers/standard.md
new file mode 100644
index 0000000000..c10f25802b
--- /dev/null
+++ b/_analyzers/tokenizers/standard.md
@@ -0,0 +1,111 @@
+---
+layout: default
+title: Standard
+parent: Tokenizers
+nav_order: 130
+---
+
+# Standard tokenizer
+
+The `standard` tokenizer is the default tokenizer in OpenSearch. It tokenizes text based on word boundaries using a grammar-based approach that recognizes letters, digits, and other characters like punctuation. It is highly versatile and suitable for many languages because it uses Unicode text segmentation rules ([UAX#29](https://unicode.org/reports/tr29/)) to break text into tokens.
+
+## Example usage
+
+The following example request creates a new index named `my_index` and configures an analyzer with a `standard` tokenizer:
+
+```json
+PUT /my_index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "my_standard_analyzer": {
+ "type": "standard"
+ }
+ }
+ }
+ },
+ "mappings": {
+ "properties": {
+ "content": {
+ "type": "text",
+ "analyzer": "my_standard_analyzer"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_index/_analyze
+{
+ "analyzer": "my_standard_analyzer",
+ "text": "OpenSearch is powerful, fast, and scalable."
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "opensearch",
+ "start_offset": 0,
+ "end_offset": 10,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "is",
+ "start_offset": 11,
+ "end_offset": 13,
+ "type": "",
+ "position": 1
+ },
+ {
+ "token": "powerful",
+ "start_offset": 14,
+ "end_offset": 22,
+ "type": "",
+ "position": 2
+ },
+ {
+ "token": "fast",
+ "start_offset": 24,
+ "end_offset": 28,
+ "type": "",
+ "position": 3
+ },
+ {
+ "token": "and",
+ "start_offset": 30,
+ "end_offset": 33,
+ "type": "",
+ "position": 4
+ },
+ {
+ "token": "scalable",
+ "start_offset": 34,
+ "end_offset": 42,
+ "type": "",
+ "position": 5
+ }
+ ]
+}
+```
+
+## Parameters
+
+The `standard` tokenizer can be configured with the following parameter.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`max_token_length` | Optional | Integer | Sets the maximum length of the produced token. If this length is exceeded, the token is split into multiple tokens at the length configured in `max_token_length`. Default is `255`.
+
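+## Example using the max_token_length parameter
+
+The following request is a minimal sketch showing how `max_token_length` affects tokenization; the index name `my_short_token_index` and the limit of `5` are illustrative:
+
+```json
+PUT /my_short_token_index
+{
+  "settings": {
+    "analysis": {
+      "tokenizer": {
+        "short_standard_tokenizer": {
+          "type": "standard",
+          "max_token_length": 5
+        }
+      },
+      "analyzer": {
+        "short_standard_analyzer": {
+          "type": "custom",
+          "tokenizer": "short_standard_tokenizer"
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+With this setting, a term longer than five characters, such as `OpenSearch`, is emitted as multiple tokens of at most five characters (for example, `OpenS` and `earch`).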
diff --git a/_analyzers/tokenizers/thai.md b/_analyzers/tokenizers/thai.md
new file mode 100644
index 0000000000..4afb14a9eb
--- /dev/null
+++ b/_analyzers/tokenizers/thai.md
@@ -0,0 +1,108 @@
+---
+layout: default
+title: Thai
+parent: Tokenizers
+nav_order: 140
+---
+
+# Thai tokenizer
+
+The `thai` tokenizer tokenizes Thai language text. Because words in the Thai language are not separated by spaces, the tokenizer must identify word boundaries based on language-specific rules.
+
+## Example usage
+
+The following example request creates a new index named `thai_index` and configures an analyzer with a `thai` tokenizer:
+
+```json
+PUT /thai_index
+{
+ "settings": {
+ "analysis": {
+ "tokenizer": {
+ "thai_tokenizer": {
+ "type": "thai"
+ }
+ },
+ "analyzer": {
+ "thai_analyzer": {
+ "type": "custom",
+ "tokenizer": "thai_tokenizer"
+ }
+ }
+ }
+ },
+ "mappings": {
+ "properties": {
+ "content": {
+ "type": "text",
+ "analyzer": "thai_analyzer"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /thai_index/_analyze
+{
+ "analyzer": "thai_analyzer",
+ "text": "ฉันชอบไปเที่ยวที่เชียงใหม่"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "ฉัน",
+ "start_offset": 0,
+ "end_offset": 3,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "ชอบ",
+ "start_offset": 3,
+ "end_offset": 6,
+ "type": "word",
+ "position": 1
+ },
+ {
+ "token": "ไป",
+ "start_offset": 6,
+ "end_offset": 8,
+ "type": "word",
+ "position": 2
+ },
+ {
+ "token": "เที่ยว",
+ "start_offset": 8,
+ "end_offset": 14,
+ "type": "word",
+ "position": 3
+ },
+ {
+ "token": "ที่",
+ "start_offset": 14,
+ "end_offset": 17,
+ "type": "word",
+ "position": 4
+ },
+ {
+ "token": "เชียงใหม่",
+ "start_offset": 17,
+ "end_offset": 26,
+ "type": "word",
+ "position": 5
+ }
+ ]
+}
+```
diff --git a/_analyzers/tokenizers/uax-url-email.md b/_analyzers/tokenizers/uax-url-email.md
new file mode 100644
index 0000000000..34336a4f55
--- /dev/null
+++ b/_analyzers/tokenizers/uax-url-email.md
@@ -0,0 +1,84 @@
+---
+layout: default
+title: UAX URL email
+parent: Tokenizers
+nav_order: 150
+---
+
+# UAX URL email tokenizer
+
+In addition to regular text, the `uax_url_email` tokenizer is designed to handle URLs, email addresses, and domain names. It is based on the Unicode Text Segmentation algorithm ([UAX #29](https://www.unicode.org/reports/tr29/)), which allows it to correctly tokenize complex text, including URLs and email addresses.
+
+## Example usage
+
+The following example request creates a new index named `my_index` and configures an analyzer with a `uax_url_email` tokenizer:
+
+```json
+PUT /my_index
+{
+ "settings": {
+ "analysis": {
+ "tokenizer": {
+ "uax_url_email_tokenizer": {
+ "type": "uax_url_email"
+ }
+ },
+ "analyzer": {
+ "my_uax_analyzer": {
+ "type": "custom",
+ "tokenizer": "uax_url_email_tokenizer"
+ }
+ }
+ }
+ },
+ "mappings": {
+ "properties": {
+ "content": {
+ "type": "text",
+ "analyzer": "my_uax_analyzer"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_index/_analyze
+{
+ "analyzer": "my_uax_analyzer",
+ "text": "Contact us at support@example.com or visit https://example.com for details."
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {"token": "Contact","start_offset": 0,"end_offset": 7,"type": "","position": 0},
+ {"token": "us","start_offset": 8,"end_offset": 10,"type": "","position": 1},
+ {"token": "at","start_offset": 11,"end_offset": 13,"type": "","position": 2},
+ {"token": "support@example.com","start_offset": 14,"end_offset": 33,"type": "","position": 3},
+ {"token": "or","start_offset": 34,"end_offset": 36,"type": "","position": 4},
+ {"token": "visit","start_offset": 37,"end_offset": 42,"type": "","position": 5},
+ {"token": "https://example.com","start_offset": 43,"end_offset": 62,"type": "","position": 6},
+ {"token": "for","start_offset": 63,"end_offset": 66,"type": "","position": 7},
+ {"token": "details","start_offset": 67,"end_offset": 74,"type": "","position": 8}
+ ]
+}
+```
+
+## Parameters
+
+The `uax_url_email` tokenizer can be configured with the following parameter.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`max_token_length` | Optional | Integer | Sets the maximum length of the produced token. If this length is exceeded, the token is split into multiple tokens at the length configured in `max_token_length`. Default is `255`.
+
diff --git a/_analyzers/tokenizers/whitespace.md b/_analyzers/tokenizers/whitespace.md
new file mode 100644
index 0000000000..fb168304a7
--- /dev/null
+++ b/_analyzers/tokenizers/whitespace.md
@@ -0,0 +1,110 @@
+---
+layout: default
+title: Whitespace
+parent: Tokenizers
+nav_order: 160
+---
+
+# Whitespace tokenizer
+
+The `whitespace` tokenizer splits text on white space characters, such as spaces, tabs, and new lines. It treats each word separated by white space as a token and does not perform any additional analysis or normalization like lowercasing or punctuation removal.
+
+## Example usage
+
+The following example request creates a new index named `my_index` and configures an analyzer with a `whitespace` tokenizer:
+
+```json
+PUT /my_index
+{
+ "settings": {
+ "analysis": {
+ "tokenizer": {
+ "whitespace_tokenizer": {
+ "type": "whitespace"
+ }
+ },
+ "analyzer": {
+ "my_whitespace_analyzer": {
+ "type": "custom",
+ "tokenizer": "whitespace_tokenizer"
+ }
+ }
+ }
+ },
+ "mappings": {
+ "properties": {
+ "content": {
+ "type": "text",
+ "analyzer": "my_whitespace_analyzer"
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_index/_analyze
+{
+ "analyzer": "my_whitespace_analyzer",
+ "text": "OpenSearch is fast! Really fast."
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "OpenSearch",
+ "start_offset": 0,
+ "end_offset": 10,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "is",
+ "start_offset": 11,
+ "end_offset": 13,
+ "type": "word",
+ "position": 1
+ },
+ {
+ "token": "fast!",
+ "start_offset": 14,
+ "end_offset": 19,
+ "type": "word",
+ "position": 2
+ },
+ {
+ "token": "Really",
+ "start_offset": 20,
+ "end_offset": 26,
+ "type": "word",
+ "position": 3
+ },
+ {
+ "token": "fast.",
+ "start_offset": 27,
+ "end_offset": 32,
+ "type": "word",
+ "position": 4
+ }
+ ]
+}
+```
+
+## Parameters
+
+The `whitespace` tokenizer can be configured with the following parameter.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`max_token_length` | Optional | Integer | Sets the maximum length of the produced token. If this length is exceeded, the token is split into multiple tokens at the length configured in `max_token_length`. Default is `255`.
+
diff --git a/_api-reference/common-parameters.md b/_api-reference/common-parameters.md
index 5b536ad992..ac3efbf4bf 100644
--- a/_api-reference/common-parameters.md
+++ b/_api-reference/common-parameters.md
@@ -123,4 +123,17 @@ Kilometers | `km` or `kilometers`
Meters | `m` or `meters`
Centimeters | `cm` or `centimeters`
Millimeters | `mm` or `millimeters`
-Nautical miles | `NM`, `nmi`, or `nauticalmiles`
\ No newline at end of file
+Nautical miles | `NM`, `nmi`, or `nauticalmiles`
+
+## `X-Opaque-Id` header
+
+You can specify an opaque identifier for any request using the `X-Opaque-Id` header. OpenSearch uses this identifier to track tasks and deduplicate deprecation warnings in server-side logs. Because the identifier is meant to differentiate between the callers sending requests to your OpenSearch cluster, do not specify a unique value per request.
+
+#### Example request
+
+The following request adds an opaque ID to the request:
+
+```json
+curl -H "X-Opaque-Id: my-curl-client-1" -XGET localhost:9200/_tasks
+```
+{% include copy.html %}
diff --git a/_api-reference/document-apis/update-document.md b/_api-reference/document-apis/update-document.md
index ff17940cdb..8500fec101 100644
--- a/_api-reference/document-apis/update-document.md
+++ b/_api-reference/document-apis/update-document.md
@@ -14,6 +14,14 @@ redirect_from:
If you need to update a document's fields in your index, you can use the update document API operation. You can do so by specifying the new data you want to be in your index or by including a script in your request body, which OpenSearch runs to update the document. By default, the update operation only updates a document that exists in the index. If a document does not exist, the API returns an error. To _upsert_ a document (update the document that exists or index a new one), use the [upsert](#using-the-upsert-operation) operation.
+You cannot explicitly specify an ingest pipeline when calling the Update Document API. If a `default_pipeline` or `final_pipeline` is defined in your index, the following behavior applies:
+
+- **Upsert operations**: When indexing a new document, the `default_pipeline` and `final_pipeline` defined in the index are executed as specified.
+- **Update operations**: When updating an existing document, ingest pipeline execution is not recommended because it may produce erroneous results. Support for running ingest pipelines during update operations is deprecated and will be removed in version 3.0.0. If your index has a defined ingest pipeline, the update document operation will return the following deprecation warning:
+```
+the index [sample-index1] has a default ingest pipeline or a final ingest pipeline, the support of the ingest pipelines for update operation causes unexpected result and will be removed in 3.0.0
+```
+
## Path and HTTP methods
```json
diff --git a/_benchmark/reference/commands/aggregate.md b/_benchmark/reference/commands/aggregate.md
index 17612f1164..a891bf3edf 100644
--- a/_benchmark/reference/commands/aggregate.md
+++ b/_benchmark/reference/commands/aggregate.md
@@ -69,9 +69,30 @@ Aggregate test execution ID: aggregate_results_geonames_9aafcfb8-d3b7-4583-864e
-------------------------------
```
-The results will be aggregated into one test execution and stored under the ID shown in the output:
+The results will be aggregated into one test execution and stored under the ID shown in the output.
+### Additional options
- `--test-execution-id`: Define a unique ID for the aggregated test execution.
- `--results-file`: Write the aggregated results to the provided file.
- `--workload-repository`: Define the repository from which OpenSearch Benchmark will load workloads (default is `default`).
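+
+For example, the following command sketch aggregates two earlier test executions and writes the combined results to a file. The test execution IDs and the file path are placeholders, and the `--test-executions` flag name is an assumption based on the aggregate command usage; verify it against your OpenSearch Benchmark version:
+
+```bash
+opensearch-benchmark aggregate \
+  --test-executions="<test_execution_id_1>,<test_execution_id_2>" \
+  --test-execution-id="my-aggregated-results" \
+  --results-file="/home/me/aggregated_results.md"
+```
+{% include copy.html %}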
+## Aggregated results
+
+Aggregated results include the following information:
+
+- **Relative standard deviation (RSD)**: For each metric, an additional `mean_rsd` value shows the spread of results across test executions.
+- **Overall min/max values**: Instead of averaging the minimum and maximum values, the aggregated results include `overall_min` and `overall_max` values, which reflect the true minimum and maximum across all test runs.
+- **Storage**: Aggregated test results are stored in a separate `aggregated_results` folder alongside the `test_executions` folder.
+
+The following example shows aggregated results:
+
+```json
+ "throughput": {
+ "overall_min": 29056.890292903263,
+ "mean": 50115.8603858536,
+ "median": 50099.54349684457,
+ "overall_max": 72255.15946248993,
+ "unit": "docs/s",
+ "mean_rsd": 59.426059705973664
+ },
+```
diff --git a/_config.yml b/_config.yml
index 0e45176320..3c6f737cc8 100644
--- a/_config.yml
+++ b/_config.yml
@@ -31,9 +31,6 @@ collections:
install-and-configure:
permalink: /:collection/:path/
output: true
- upgrade-to:
- permalink: /:collection/:path/
- output: true
im-plugin:
permalink: /:collection/:path/
output: true
@@ -94,6 +91,9 @@ collections:
data-prepper:
permalink: /:collection/:path/
output: true
+ migration-assistant:
+ permalink: /:collection/:path/
+ output: true
tools:
permalink: /:collection/:path/
output: true
@@ -137,11 +137,6 @@ opensearch_collection:
install-and-configure:
name: Install and upgrade
nav_fold: true
- upgrade-to:
- name: Migrate to OpenSearch
- # nav_exclude: true
- nav_fold: true
- # search_exclude: true
im-plugin:
name: Managing Indexes
nav_fold: true
@@ -213,6 +208,12 @@ clients_collection:
name: Clients
nav_fold: true
+migration_assistant_collection:
+ collections:
+ migration-assistant:
+ name: Migration Assistant
+ nav_fold: true
+
benchmark_collection:
collections:
benchmark:
@@ -252,6 +253,12 @@ defaults:
values:
section: "benchmark"
section-name: "Benchmark"
+ -
+ scope:
+ path: "_migration-assistant"
+ values:
+ section: "migration-assistant"
+ section-name: "Migration Assistant"
# Enable or disable the site search
# By default, just-the-docs enables its JSON file-based search. We also have an OpenSearch-driven search functionality.
diff --git a/_data-prepper/pipelines/configuration/processors/anomaly-detector.md b/_data-prepper/pipelines/configuration/processors/anomaly-detector.md
index 9628bb6caf..ba574bdf7d 100644
--- a/_data-prepper/pipelines/configuration/processors/anomaly-detector.md
+++ b/_data-prepper/pipelines/configuration/processors/anomaly-detector.md
@@ -53,6 +53,7 @@ You can configure `random_cut_forest` mode with the following options.
| `sample_size` | `256` | 100--2500 | The sample size used in the ML algorithm. |
| `time_decay` | `0.1` | 0--1.0 | The time decay value used in the ML algorithm. Used as the mathematical expression `timeDecay` divided by `SampleSize` in the ML algorithm. |
| `type` | `metrics` | N/A | The type of data sent to the algorithm. |
+| `output_after` | `32` | N/A | Specifies the number of events to process before outputting any detected anomalies. |
| `version` | `1.0` | N/A | The algorithm version number. |
## Usage
diff --git a/_data-prepper/pipelines/configuration/sources/s3.md b/_data-prepper/pipelines/configuration/sources/s3.md
index db92718a36..7ca27ee500 100644
--- a/_data-prepper/pipelines/configuration/sources/s3.md
+++ b/_data-prepper/pipelines/configuration/sources/s3.md
@@ -104,7 +104,7 @@ Option | Required | Type | Description
`s3_select` | No | [s3_select](#s3_select) | The Amazon S3 Select configuration.
`scan` | No | [scan](#scan) | The S3 scan configuration.
`delete_s3_objects_on_read` | No | Boolean | When `true`, the S3 scan attempts to delete S3 objects after all events from the S3 object are successfully acknowledged by all sinks. `acknowledgments` should be enabled when deleting S3 objects. Default is `false`.
-`workers` | No | Integer | Configures the number of worker threads that the source uses to read data from S3. Leave this value as the default unless your S3 objects are less than 1 MB in size. Performance may decrease for larger S3 objects. This setting affects SQS-based sources and S3-Scan sources. Default is `1`.
+`workers` | No | Integer | Configures the number of worker threads (1--10) that the source uses to read data from S3. Leave this value as the default unless your S3 objects are less than 1 MB in size. Performance may decrease for larger S3 objects. This setting affects SQS-based sources and S3-Scan sources. Default is `1`.
diff --git a/_data-prepper/pipelines/expression-syntax.md b/_data-prepper/pipelines/expression-syntax.md
index 383b54c19b..07f68ee58e 100644
--- a/_data-prepper/pipelines/expression-syntax.md
+++ b/_data-prepper/pipelines/expression-syntax.md
@@ -30,6 +30,9 @@ The following table lists the supported operators. Operators are listed in order
|----------------------|-------------------------------------------------------|---------------|
| `()` | Priority expression | Left to right |
| `not` `+` `-`| Unary logical NOT Unary positive Unary negative | Right to left |
+| `*`, `/` | Multiplication and division operators | Left to right |
+| `+`, `-` | Addition and subtraction operators | Left to right |
+| `+` | String concatenation operator | Left to right |
| `<`, `<=`, `>`, `>=` | Relational operators | Left to right |
| `==`, `!=` | Equality operators | Left to right |
| `and`, `or` | Conditional expression | Left to right |
@@ -78,7 +81,6 @@ Conditional expressions allow you to combine multiple expressions or values usin
or
not
```
-{% include copy-curl.html %}
The following are some example conditional expressions:
@@ -91,9 +93,64 @@ not /status_code in {200, 202}
```
{% include copy-curl.html %}
+### Arithmetic expressions
+
+Arithmetic expressions enable basic mathematical operations like addition, subtraction, multiplication, and division. These expressions can be combined with conditional expressions to create more complex conditional statements. The available arithmetic operators are `+`, `-`, `*`, and `/`. The syntax for using the arithmetic operators is as follows:
+
+```
+ +
+ -
+ *
+ /
+```
+
+The following are example arithmetic expressions:
+
+```
+/value + length(/message)
+/bytes / 1024
+/value1 - /value2
+/TimeInSeconds * 1000
+```
+{% include copy-curl.html %}
+
+The following are example arithmetic expressions used in conditional expressions:
+
+```
+/value + length(/message) > 200
+/bytes / 1024 < 10
+/value1 - /value2 != /value3 + /value4
+```
+{% include copy-curl.html %}
+
+### String concatenation expressions
+
+String concatenation expressions enable you to combine strings to create new strings. These concatenated strings can also be used within conditional expressions. The syntax for using string concatenation is as follows:
+
+```
+ +
+```
+
+The following are example string concatenation expressions:
+
+```
+/name + "suffix"
+"prefix" + /name
+"time of " + /timeInMs + " ms"
+```
+{% include copy-curl.html %}
+
+The following are example string concatenation expressions that can be used in conditional expressions:
+
+```
+/service + ".com" == /url
+"www." + /service != /url
+```
+{% include copy-curl.html %}
+
### Reserved symbols
-Reserved symbols are symbols that are not currently used in the expression syntax but are reserved for possible future functionality or extensions. Reserved symbols include `^`, `*`, `/`, `%`, `+`, `-`, `xor`, `=`, `+=`, `-=`, `*=`, `/=`, `%=`, `++`, `--`, and `${}`.
+The following symbols are reserved for possible future functionality or extensions: `^`, `%`, `xor`, `=`, `+=`, `-=`, `*=`, `/=`, `%=`, `++`, `--`, and `${}`.
## Syntax components
@@ -170,6 +227,9 @@ White space is optional around relational operators, regex equality operators, e
| `()` | Priority expression | Yes | `/a==(/b==200)` `/a in ({200})` | `/status in({200})` |
| `in`, `not in` | Set operators | Yes | `/a in {200}` `/a not in {400}` | `/a in{200, 202}` `/a not in{400}` |
| `<`, `<=`, `>`, `>=` | Relational operators | No | `/status < 300` `/status>=300` | |
+| `+` | String concatenation operator | No | `/status_code + /message + "suffix"` | |
+| `+`, `-` | Arithmetic addition and subtraction operators | No | `/status_code + length(/message) - 2` | |
+| `*`, `/` | Multiplication and division operators | No | `/status_code * length(/message) / 3` | |
| `=~`, `!~` | Regex equality operators | No | `/msg =~ "^\w*$"` `/msg=~"^\w*$"` | |
| `==`, `!=` | Equality operators | No | `/status == 200` `/status_code==200` | |
| `and`, `or`, `not` | Conditional operators | Yes | `/a<300 and /b>200` | `/b<300and/b>200` |
diff --git a/_data/top_nav.yml b/_data/top_nav.yml
index 51d8138680..6552d90359 100644
--- a/_data/top_nav.yml
+++ b/_data/top_nav.yml
@@ -63,6 +63,8 @@ items:
url: /docs/latest/clients/
- label: Benchmark
url: /docs/latest/benchmark/
+ - label: Migration Assistant
+ url: /docs/latest/migration-assistant/
- label: Platform
url: /platform/index.html
children:
diff --git a/_field-types/supported-field-types/flat-object.md b/_field-types/supported-field-types/flat-object.md
index c9e59710e1..65d7c6dc8e 100644
--- a/_field-types/supported-field-types/flat-object.md
+++ b/_field-types/supported-field-types/flat-object.md
@@ -56,7 +56,8 @@ The flat object field type supports the following queries:
- [Multi-match]({{site.url}}{{site.baseurl}}/query-dsl/full-text/multi-match/)
- [Query string]({{site.url}}{{site.baseurl}}/query-dsl/full-text/query-string/)
- [Simple query string]({{site.url}}{{site.baseurl}}/query-dsl/full-text/simple-query-string/)
-- [Exists]({{site.url}}{{site.baseurl}}/query-dsl/term/exists/)
+- [Exists]({{site.url}}{{site.baseurl}}/query-dsl/term/exists/)
+- [Wildcard]({{site.url}}{{site.baseurl}}/query-dsl/term/wildcard/)
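+
+The following is an example wildcard query on a flat object subfield. This is a sketch only: the index name and field path are illustrative and assume an index containing a flat object field named `issue`.
+
+```json
+GET /test-index/_search
+{
+  "query": {
+    "wildcard": {
+      "issue.labels.category.type": {
+        "value": "API*"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+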
## Limitations
@@ -243,4 +244,4 @@ PUT /test-index/
```
{% include copy-curl.html %}
-Because `issue.number` is not part of the flat object, you can use it to aggregate and sort documents.
\ No newline at end of file
+Because `issue.number` is not part of the flat object, you can use it to aggregate and sort documents.
diff --git a/_im-plugin/index-transforms/transforms-apis.md b/_im-plugin/index-transforms/transforms-apis.md
index 37d2c035b5..7e0803c38b 100644
--- a/_im-plugin/index-transforms/transforms-apis.md
+++ b/_im-plugin/index-transforms/transforms-apis.md
@@ -177,8 +177,8 @@ The update operation supports the following query parameters:
Parameter | Description | Required
:---| :--- | :---
-`seq_no` | Only perform the transform operation if the last operation that changed the transform job has the specified sequence number. | Yes
-`primary_term` | Only perform the transform operation if the last operation that changed the transform job has the specified sequence term. | Yes
+`if_seq_no` | Only perform the transform operation if the last operation that changed the transform job has the specified sequence number. | Yes
+`if_primary_term` | Only perform the transform operation if the last operation that changed the transform job has the specified sequence term. | Yes
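+
+For example, an update request using these parameters might look like the following. The transform ID and sequence values are illustrative, and the request body (omitted here) contains the full transform definition:
+
+```json
+PUT _plugins/_transform/sample_transform?if_seq_no=13&if_primary_term=1
+```
+{% include copy-curl.html %}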
### Request body fields
diff --git a/_includes/cards.html b/_includes/cards.html
index 6d958e61a5..3fa1809506 100644
--- a/_includes/cards.html
+++ b/_includes/cards.html
@@ -30,8 +30,14 @@
Measure performance metrics for your OpenSearch cluster
Documentation →
+
+
+
+
Migration Assistant
+
Migrate to OpenSearch
+
Documentation →
+
-
diff --git a/_install-and-configure/install-opensearch/index.md b/_install-and-configure/install-opensearch/index.md
index 94c259667a..bfaf9897d6 100644
--- a/_install-and-configure/install-opensearch/index.md
+++ b/_install-and-configure/install-opensearch/index.md
@@ -29,7 +29,7 @@ The OpenSearch distribution for Linux ships with a compatible [Adoptium JDK](htt
OpenSearch Version | Compatible Java Versions | Bundled Java Version
:---------- | :-------- | :-----------
1.0--1.2.x | 11, 15 | 15.0.1+9
-1.3.x | 8, 11, 14 | 11.0.24+8
+1.3.x | 8, 11, 14 | 11.0.25+9
2.0.0--2.11.x | 11, 17 | 17.0.2+8
2.12.0+ | 11, 17, 21 | 21.0.5+11
diff --git a/_install-and-configure/plugins.md b/_install-and-configure/plugins.md
index e96b29e822..055d451081 100644
--- a/_install-and-configure/plugins.md
+++ b/_install-and-configure/plugins.md
@@ -10,9 +10,9 @@ redirect_from:
# Installing plugins
-OpenSearch comprises of a number of plugins that add features and capabilities to the core platform. The plugins available to you are dependent on how OpenSearch was installed and which plugins were subsequently added or removed. For example, the minimal distribution of OpenSearch enables only core functionality, such as indexing and search. Using the minimal distribution of OpenSearch is beneficial when you are working in a testing environment, have custom plugins, or are intending to integrate OpenSearch with other services.
+OpenSearch includes a number of plugins that add features and capabilities to the core platform. The plugins available to you are dependent on how OpenSearch was installed and which plugins were subsequently added or removed. For example, the minimal distribution of OpenSearch enables only core functionality, such as indexing and search. Using the minimal distribution of OpenSearch is beneficial when you are working in a testing environment, have custom plugins, or are intending to integrate OpenSearch with other services.
-The standard distribution of OpenSearch has much more functionality included. You can choose to add additional plugins or remove any of the plugins you don't need.
+The standard distribution of OpenSearch includes many more plugins offering much more functionality. You can choose to add additional plugins or remove any of the plugins you don't need.
For a list of the available plugins, see [Available plugins](#available-plugins).
diff --git a/_layouts/default.html b/_layouts/default.html
index d4d40d8cc4..7f2bf0a2a8 100755
--- a/_layouts/default.html
+++ b/_layouts/default.html
@@ -87,6 +87,8 @@
{% assign section = site.clients_collection.collections %}
{% elsif page.section == "benchmark" %}
{% assign section = site.benchmark_collection.collections %}
+ {% elsif page.section == "migration-assistant" %}
+ {% assign section = site.migration_assistant_collection.collections %}
{% endif %}
{% if section %}
diff --git a/_migration-assistant/deploying-migration-assistant/configuration-options.md b/_migration-assistant/deploying-migration-assistant/configuration-options.md
new file mode 100644
index 0000000000..7097d7e90e
--- /dev/null
+++ b/_migration-assistant/deploying-migration-assistant/configuration-options.md
@@ -0,0 +1,175 @@
+---
+layout: default
+title: Configuration options
+nav_order: 15
+parent: Deploying Migration Assistant
+---
+
+# Configuration options
+
+This page outlines the configuration options for three key migration scenarios:
+
+1. **Metadata migration**
+2. **Backfill migration with `Reindex-from-Snapshot` (RFS)**
+3. **Live capture migration with Capture and Replay (C&R)**
+
+Each of these migrations depends on either a snapshot or a capture proxy. The following example `cdk.context.json` configurations are used by AWS Cloud Development Kit (AWS CDK) to deploy and configure Migration Assistant for OpenSearch, shown as separate blocks for each migration type. If you are performing a migration applicable to multiple scenarios, these options can be combined.
+
+
+For a complete list of configuration options, see [opensearch-migrations-options.md](https://github.com/opensearch-project/opensearch-migrations/blob/main/deployment/cdk/opensearch-service-migration/options.md). If you need a configuration option that is not found on this page, create an issue in the [OpenSearch Migrations repository](https://github.com/opensearch-project/opensearch-migrations/issues).
+{: .tip }
+
+The source cluster endpoint, target cluster endpoint, and existing virtual private cloud (VPC) options must be configured so that the migration tools can function properly.
+
+## Shared configuration options
+
+Each migration configuration shares the following options.
+
+
+| Name | Example | Description |
+| :--- | :--- | :--- |
+| `sourceClusterEndpoint` | `"https://source-cluster.elb.us-east-1.endpoint.com"` | The endpoint for the source cluster. |
+| `targetClusterEndpoint` | `"https://vpc-demo-opensearch-cluster-cv6hggdb66ybpk4kxssqt6zdhu.us-west-2.es.amazonaws.com:443"` | The endpoint for the target cluster. Required if using an existing target cluster for the migration instead of creating a new one. |
+| `vpcId` | `"vpc-123456789abcdefgh"` | The ID of the existing VPC in which the migration resources will be deployed. The VPC must have at least two private subnets that span two Availability Zones. |
+
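+The following minimal `cdk.context.json` sketch shows how these shared options fit together. The context block name (`migration-assistant`) and the `stage` value are illustrative, and the angle-bracket values are placeholders:
+
+```json
+{
+  "migration-assistant": {
+    "stage": "dev",
+    "vpcId": <VPC_ID>,
+    "sourceClusterEndpoint": <SOURCE_CLUSTER_ENDPOINT>,
+    "targetClusterEndpoint": <TARGET_CLUSTER_ENDPOINT>
+  }
+}
+```
+{% include copy.html %}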
+
+## Backfill migration using RFS
+
+The following `cdk.context.json` configuration performs a backfill migration using RFS:
+
+```json
+{
+  "backfill-migration": {
+    "stage": "dev",
+    "vpcId": <VPC_ID>,
+    "sourceCluster": {
+      "endpoint": <SOURCE_CLUSTER_ENDPOINT>,
+      "version": "ES 7.10",
+      "auth": {"type": "none"}
+    },
+    "targetCluster": {
+      "endpoint": <TARGET_CLUSTER_ENDPOINT>,
+      "auth": {
+        "type": "basic",
+        "username": <USERNAME>,
+        "passwordFromSecretArn": <PASSWORD_SECRET_ARN>
+      }
+    },
+    "reindexFromSnapshotServiceEnabled": true,
+    "reindexFromSnapshotExtraArgs": "",
+    "artifactBucketRemovalPolicy": "DESTROY"
+  }
+}
+```
+{% include copy.html %}
+
+Performing an RFS backfill migration requires an existing snapshot.
+
+
+The RFS configuration uses the following options. All options are optional.
+
+| Name | Example | Description |
+| :--- | :--- | :--- |
+| `reindexFromSnapshotServiceEnabled` | `true` | Enables deployment and configuration of the RFS ECS service. |
+| `reindexFromSnapshotExtraArgs` | `"--target-aws-region us-east-1 --target-aws-service-signing-name es"` | Extra arguments for the Document Migration command, with space separation. See [RFS Extra Arguments](https://github.com/opensearch-project/opensearch-migrations/blob/main/DocumentsFromSnapshotMigration/README.md#arguments) for more information. You can pass `--no-insecure` to remove the `--insecure` flag. |
+
+To view all available arguments for `reindexFromSnapshotExtraArgs`, see the [Snapshot migrations README](https://github.com/opensearch-project/opensearch-migrations/blob/main/DocumentsFromSnapshotMigration/README.md#arguments). In many cases, no extra arguments are needed.
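+
+For example, to sign requests for an Amazon OpenSearch Service domain target, the example arguments from the preceding table might be combined as follows. This is a sketch of the relevant fragment only, and the Region is illustrative:
+
+```json
+"reindexFromSnapshotServiceEnabled": true,
+"reindexFromSnapshotExtraArgs": "--target-aws-region us-east-1 --target-aws-service-signing-name es"
+```
+{% include copy.html %}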
+
+## Live capture migration with C&R
+
+The following sample `cdk.context.json` configuration performs a live capture migration with C&R:
+
+```json
+{
+  "live-capture-migration": {
+    "stage": "dev",
+    "vpcId": <VPC_ID>,
+    "sourceCluster": {
+      "endpoint": <SOURCE_CLUSTER_ENDPOINT>,
+      "version": "ES 7.10",
+      "auth": {"type": "none"}
+    },
+    "targetCluster": {
+      "endpoint": <TARGET_CLUSTER_ENDPOINT>,
+      "auth": {
+        "type": "basic",
+        "username": <USERNAME>,
+        "passwordFromSecretArn": <PASSWORD_SECRET_ARN>
+      }
+    },
+    "captureProxyServiceEnabled": true,
+    "captureProxyExtraArgs": "",
+    "trafficReplayerServiceEnabled": true,
+    "trafficReplayerExtraArgs": "",
+    "artifactBucketRemovalPolicy": "DESTROY"
+  }
+}
+```
+{% include copy.html %}
+
+Performing a live capture migration requires that a Capture Proxy be configured to capture incoming traffic and send it to the target cluster using the Traffic Replayer service. The following table describes the related configuration options. In many cases, no extra arguments are needed for either service.
+
+
+| Name | Example | Description |
+| :--- | :--- | :--- |
+| `captureProxyServiceEnabled` | `true` | Enables the Capture Proxy service deployment using an AWS CloudFormation stack. |
+| `captureProxyExtraArgs` | `"--suppressCaptureForHeaderMatch user-agent .*elastic-java/7.17.0.*"` | Extra arguments for the Capture Proxy command, including options specified by the [Capture Proxy](https://github.com/opensearch-project/opensearch-migrations/blob/main/TrafficCapture/trafficCaptureProxyServer/src/main/java/org/opensearch/migrations/trafficcapture/proxyserver/CaptureProxy.java). |
+| `trafficReplayerServiceEnabled` | `true` | Enables the Traffic Replayer service deployment using a CloudFormation stack. |
+| `trafficReplayerExtraArgs` | `"--sigv4-auth-header-service-region es,us-east-1 --speedup-factor 5"` | Extra arguments for the Traffic Replayer command, including options for auth headers and other parameters specified by the [Traffic Replayer](https://github.com/opensearch-project/opensearch-migrations/blob/main/TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/TrafficReplayer.java). |
+
+
+For arguments available in `captureProxyExtraArgs`, see the `@Parameter` fields in [`CaptureProxy.java`](https://github.com/opensearch-project/opensearch-migrations/blob/main/TrafficCapture/trafficCaptureProxyServer/src/main/java/org/opensearch/migrations/trafficcapture/proxyserver/CaptureProxy.java). For `trafficReplayerExtraArgs`, see the `@Parameter` fields in [`TrafficReplayer.java`](https://github.com/opensearch-project/opensearch-migrations/blob/main/TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/TrafficReplayer.java).
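+
+As an illustration, the example values from the preceding table might be combined in `cdk.context.json` as follows. This is a sketch of the relevant fragment only:
+
+```json
+"captureProxyServiceEnabled": true,
+"captureProxyExtraArgs": "--suppressCaptureForHeaderMatch user-agent .*elastic-java/7.17.0.*",
+"trafficReplayerServiceEnabled": true,
+"trafficReplayerExtraArgs": "--sigv4-auth-header-service-region es,us-east-1 --speedup-factor 5"
+```
+{% include copy.html %}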
+
+
+## Cluster authentication options
+
+Both the source and target clusters can use one of the following authentication methods: no authentication, authentication limited to the VPC, basic authentication with a username and password, or AWS Signature Version 4 authentication scoped to a user or role.
+
+### No authentication
+
+```json
+ "sourceCluster": {
+ "endpoint": ,
+ "version": "ES 7.10",
+ "auth": {"type": "none"}
+ }
+```
+{% include copy.html %}
+
+### Basic authentication
+
+```json
+ "sourceCluster": {
+ "endpoint": ,
+ "version": "ES 7.10",
+ "auth": {
+ "type": "basic",
+ "username": ,
+ "passwordFromSecretArn":
+ }
+ }
+```
+{% include copy.html %}
+
+### Signature Version 4 authentication
+
+```json
+ "sourceCluster": {
+ "endpoint": ,
+ "version": "ES 7.10",
+ "auth": {
+ "type": "sigv4",
+ "region": "us-east-1",
+ "serviceSigningName": "es"
+ }
+ }
+```
+{% include copy.html %}
+
+The `serviceSigningName` can be `es` for an Elasticsearch or OpenSearch domain, or `aoss` for an OpenSearch Serverless collection.
+
+All of these authentication options apply to both source and target clusters.
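+
+For example, a target cluster that is an OpenSearch Serverless collection might be configured as follows. This is a sketch: the endpoint is a placeholder and the Region is illustrative:
+
+```json
+  "targetCluster": {
+    "endpoint": <TARGET_CLUSTER_ENDPOINT>,
+    "auth": {
+      "type": "sigv4",
+      "region": "us-east-1",
+      "serviceSigningName": "aoss"
+    }
+  }
+```
+{% include copy.html %}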
+
+## Network configuration
+
+The migration tooling expects the source cluster, target cluster, and migration resources to exist in the same VPC. If this is not the case, manual networking setup outside of this documentation is likely required.
diff --git a/_migration-assistant/deploying-migration-assistant/getting-started-data-migration.md b/_migration-assistant/deploying-migration-assistant/getting-started-data-migration.md
new file mode 100644
index 0000000000..f260a28701
--- /dev/null
+++ b/_migration-assistant/deploying-migration-assistant/getting-started-data-migration.md
@@ -0,0 +1,355 @@
+---
+layout: default
+title: Getting started with data migration
+parent: Deploying Migration Assistant
+nav_order: 10
+redirect_from:
+ - /upgrade-to/upgrade-to/
+ - /upgrade-to/snapshot-migrate/
+ - /migration-assistant/getting-started-with-data-migration/
+---
+
+# Getting started with data migration
+
+This quickstart outlines how to deploy Migration Assistant for OpenSearch and execute an existing data migration using `Reindex-from-Snapshot` (RFS). It uses AWS for illustrative purposes. However, the steps can be modified for use with other cloud providers.
+
+## Prerequisites and assumptions
+
+Before using this quickstart, make sure you fulfill the following prerequisites:
+
+* Verify that your migration path [is supported]({{site.url}}{{site.baseurl}}/migration-assistant/is-migration-assistant-right-for-you/#migration-paths). Note that we test with the exact versions specified, but you should be able to migrate data on alternative minor versions as long as the major version is supported.
+* The source cluster must be deployed and have the Amazon Simple Storage Service (Amazon S3) plugin installed.
+* The target cluster must be deployed.
+
+The steps in this guide assume the following:
+
+* In this guide, a snapshot will be taken and stored in Amazon S3. The following assumptions are made about this snapshot (for an example snapshot request, see the sketch after this list):
+ * The `_source` flag is enabled on all indexes to be migrated.
+ * The snapshot includes the global cluster state (`include_global_state` is `true`).
+ * Shard sizes of up to approximately 80 GB are supported. Larger shards cannot be migrated. If this presents challenges for your migration, contact the [migration team](https://opensearch.slack.com/archives/C054JQ6UJFK).
+* Migration Assistant will be installed in the same AWS Region and have access to both the source snapshot and target cluster.
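+
+If you are creating the snapshot manually rather than through the migration tooling, a request similar to the following satisfies these assumptions. This is a sketch only: the repository and snapshot names are illustrative, and the repository must already be registered using the S3 plugin:
+
+```json
+PUT _snapshot/migration_assistant_repo/snapshot_1
+{
+  "include_global_state": true
+}
+```
+{% include copy-curl.html %}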
+
+---
+
+## Step 1: Install Bootstrap on an Amazon EC2 instance (~10 minutes)
+
+To begin your migration, use the following steps to install a `bootstrap` box on an Amazon Elastic Compute Cloud (Amazon EC2) instance. The instance uses AWS CloudFormation to create and manage the stack.
+
+1. Log in to the target AWS account in which you want to deploy Migration Assistant.
+2. From the browser where you are logged in to your target AWS account, right-click [here](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?templateURL=https://solutions-reference.s3.amazonaws.com/migration-assistant-for-amazon-opensearch-service/latest/migration-assistant-for-amazon-opensearch-service.template&redirectId=SolutionWeb) to load the CloudFormation template from a new browser tab.
+3. Follow the CloudFormation stack wizard:
+ * **Stack Name:** `MigrationBootstrap`
+ * **Stage Name:** `dev`
+ * Choose **Next** after each step > **Acknowledge** > **Submit**.
+4. Verify that the Bootstrap stack exists and is set to `CREATE_COMPLETE`. This process takes around 10 minutes to complete.
+
+---
+
+## Step 2: Set up Bootstrap instance access (~5 minutes)
+
+Use the following steps to set up Bootstrap instance access:
+
+1. After deployment, find the EC2 instance ID for the `bootstrap-dev-instance`.
+2. Create an AWS Identity and Access Management (IAM) policy using the following snippet, replacing `<aws-region>`, `<aws-account-id>`, `<instance-id>`, and `<stage>` with your information:
+
+ ```json
+ {
+   "Version": "2012-10-17",
+   "Statement": [
+     {
+       "Effect": "Allow",
+       "Action": "ssm:StartSession",
+       "Resource": [
+         "arn:aws:ec2:<aws-region>:<aws-account-id>:instance/<instance-id>",
+         "arn:aws:ssm:<aws-region>:<aws-account-id>:document/BootstrapShellDoc-<stage>-<aws-region>"
+       ]
+     }
+   ]
+ }
+ ```
+ {% include copy.html %}
+
+3. Name the policy, for example, `SSM-OSMigrationBootstrapAccess`, and then create the policy by selecting **Create policy**.
+
+---
+
+## Step 3: Log in to Bootstrap and build Migration Assistant (~15 minutes)
+
+Next, log in to Bootstrap and build Migration Assistant using the following steps.
+
+### Prerequisites
+
+To use these steps, make sure you fulfill the following prerequisites:
+
+* The AWS Command Line Interface (AWS CLI) and AWS Session Manager plugin are installed on your instance.
+* The AWS credentials are configured (`aws configure`) for your instance.
+
+### Steps
+
+1. Load AWS credentials into your terminal.
+2. Log in to the instance using the following command, replacing `<instance-id>` and `<aws-region>` with your instance ID and AWS Region:
+
+ ```bash
+ aws ssm start-session --document-name BootstrapShellDoc-<stage>-<aws-region> --target <instance-id> --region <aws-region> [--profile <profile-name>]
+ ```
+ {% include copy.html %}
+
+3. Once logged in, run the following command from the shell of the Bootstrap instance in the `/opensearch-migrations` directory:
+
+ ```bash
+ ./initBootstrap.sh && cd deployment/cdk/opensearch-service-migration
+ ```
+ {% include copy.html %}
+
+4. After a successful build, note the path for infrastructure deployment, which will be used in the next step.
+
+---
+
+## Step 4: Configure and deploy RFS (~20 minutes)
+
+Use the following steps to configure and deploy RFS:
+
+1. Add the target cluster password to AWS Secrets Manager as an unstructured string. Be sure to copy the secret Amazon Resource Name (ARN) for use during deployment.
+2. From the same shell as the Bootstrap instance, modify the `cdk.context.json` file located in the `/opensearch-migrations/deployment/cdk/opensearch-service-migration` directory:
+
+ ```json
+ {
+ "migration-assistant": {
+ "vpcId": "",
+ "targetCluster": {
+ "endpoint": "",
+ "auth": {
+ "type": "basic",
+ "username": "",
+ "passwordFromSecretArn": ""
+ }
+ },
+ "sourceCluster": {
+ "endpoint": "