
Reorganize built-in analyzer section (#8922) (#8924)
opensearch-trigger-bot[bot] authored Dec 10, 2024
1 parent d4c56d9 commit b496855
Showing 18 changed files with 35 additions and 28 deletions.
2 changes: 1 addition & 1 deletion _analyzers/custom-analyzer.md
@@ -1,7 +1,7 @@
---
layout: default
title: Creating a custom analyzer
nav_order: 90
nav_order: 40
parent: Analyzers
---

2 changes: 1 addition & 1 deletion _analyzers/language-analyzers/index.md
@@ -1,7 +1,7 @@
---
layout: default
title: Language analyzers
nav_order: 100
nav_order: 140
parent: Analyzers
has_children: true
has_toc: true
_analyzers/supported-analyzers/fingerprint.md
@@ -1,7 +1,8 @@
---
layout: default
title: Fingerprint analyzer
nav_order: 110
parent: Analyzers
nav_order: 60
---

# Fingerprint analyzer
18 changes: 9 additions & 9 deletions _analyzers/supported-analyzers/index.md
@@ -18,14 +18,14 @@ The following table lists the built-in analyzers that OpenSearch provides. The l

Analyzer | Analysis performed | Analyzer output
:--- | :--- | :---
**Standard** (default) | - Parses strings into tokens at word boundaries <br> - Removes most punctuation <br> - Converts tokens to lowercase | [`it’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
**Simple** | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Converts tokens to lowercase | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `to`, `opensearch`]
**Whitespace** | - Parses strings into tokens on white space | [`It’s`, `fun`, `to`, `contribute`, `a`,`brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`]
**Stop** | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Removes stop words <br> - Converts tokens to lowercase | [`s`, `fun`, `contribute`, `brand`, `new`, `pr`, `opensearch`]
**Keyword** (no-op) | - Outputs the entire string unchanged | [`It’s fun to contribute a brand-new PR or 2 to OpenSearch!`]
**Pattern** | - Parses strings into tokens using regular expressions <br> - Supports converting strings to lowercase <br> - Supports removing stop words | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
[**Standard**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/standard/) (default) | - Parses strings into tokens at word boundaries <br> - Removes most punctuation <br> - Converts tokens to lowercase | [`it’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
[**Simple**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/simple/) | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Converts tokens to lowercase | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `to`, `opensearch`]
[**Whitespace**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/whitespace/) | - Parses strings into tokens on white space | [`It’s`, `fun`, `to`, `contribute`, `a`,`brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`]
[**Stop**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/stop/) | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Removes stop words <br> - Converts tokens to lowercase | [`s`, `fun`, `contribute`, `brand`, `new`, `pr`, `opensearch`]
[**Keyword**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/keyword/) (no-op) | - Outputs the entire string unchanged | [`It’s fun to contribute a brand-new PR or 2 to OpenSearch!`]
[**Pattern**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/pattern/)| - Parses strings into tokens using regular expressions <br> - Supports converting strings to lowercase <br> - Supports removing stop words | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
[**Language**]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/index/) | Performs analysis specific to a certain language (for example, `english`). | [`fun`, `contribut`, `brand`, `new`, `pr`, `2`, `opensearch`]
**Fingerprint** | - Parses strings on any non-letter character <br> - Normalizes characters by converting them to ASCII <br> - Converts tokens to lowercase <br> - Sorts, deduplicates, and concatenates tokens into a single token <br> - Supports removing stop words | [`2 a brand contribute fun it's new opensearch or pr to`] <br> Note that the apostrophe was converted to its ASCII counterpart.
[**Fingerprint**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/fingerprint/) | - Parses strings on any non-letter character <br> - Normalizes characters by converting them to ASCII <br> - Converts tokens to lowercase <br> - Sorts, deduplicates, and concatenates tokens into a single token <br> - Supports removing stop words | [`2 a brand contribute fun it's new opensearch or pr to`] <br> Note that the apostrophe was converted to its ASCII counterpart.
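
A quick way to compare these analyzers is the `_analyze` API. The following minimal sketch runs the sample sentence implied by the table's outputs through the `standard` analyzer:

```json
GET /_analyze
{
  "analyzer": "standard",
  "text": "It’s fun to contribute a brand-new PR or 2 to OpenSearch!"
}
```

Swapping `standard` for `simple`, `whitespace`, `stop`, `keyword`, `pattern`, or `fingerprint` should reproduce the corresponding rows of the table.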

## Language analyzers

@@ -37,5 +37,5 @@ The following table lists the additional analyzers that OpenSearch supports.

| Analyzer | Analysis performed |
|:---------------|:---------------------------------------------------------------------------------------------------------|
| `phone` | An [index analyzer]({{site.url}}{{site.baseurl}}/analyzers/index-analyzers/) for parsing phone numbers. |
| `phone-search` | A [search analyzer]({{site.url}}{{site.baseurl}}/analyzers/search-analyzers/) for parsing phone numbers. |
| [`phone`]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/phone-analyzers/#the-phone-analyzer) | An [index analyzer]({{site.url}}{{site.baseurl}}/analyzers/index-analyzers/) for parsing phone numbers. |
| [`phone-search`]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/phone-analyzers/#the-phone-search-analyzer) | A [search analyzer]({{site.url}}{{site.baseurl}}/analyzers/search-analyzers/) for parsing phone numbers. |
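
Both phone analyzers are provided by a plugin rather than by core OpenSearch. The following is a minimal sketch of attaching the `phone` analyzer to an index, assuming the `analysis-phonenumber` plugin is installed; the index and analyzer names are hypothetical:

```json
PUT /phones
{
  "settings": {
    "analysis": {
      "analyzer": {
        "phone_number": {
          "type": "phone"
        }
      }
    }
  }
}
```

A field mapping would then typically use `phone_number` as its `analyzer` and a similarly defined analyzer of type `phone-search` as its `search_analyzer`.
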
_analyzers/supported-analyzers/keyword.md
@@ -1,6 +1,7 @@
---
layout: default
title: Keyword analyzer
parent: Analyzers
nav_order: 80
---

_analyzers/supported-analyzers/pattern.md
@@ -1,6 +1,7 @@
---
layout: default
title: Pattern analyzer
parent: Analyzers
nav_order: 90
---

2 changes: 1 addition & 1 deletion _analyzers/supported-analyzers/phone-analyzers.md
@@ -1,6 +1,6 @@
---
layout: default
title: Phone number
title: Phone number analyzers
parent: Analyzers
nav_order: 140
---
_analyzers/supported-analyzers/simple.md
@@ -1,7 +1,8 @@
---
layout: default
title: Simple analyzer
nav_order: 50
parent: Analyzers
nav_order: 100
---

# Simple analyzer
_analyzers/supported-analyzers/standard.md
@@ -1,7 +1,8 @@
---
layout: default
title: Standard analyzer
nav_order: 40
parent: Analyzers
nav_order: 50
---

# Standard analyzer
_analyzers/supported-analyzers/stop.md
@@ -1,7 +1,8 @@
---
layout: default
title: Stop analyzer
nav_order: 70
parent: Analyzers
nav_order: 110
---

# Stop analyzer
_analyzers/supported-analyzers/whitespace.md
@@ -1,7 +1,8 @@
---
layout: default
title: Whitespace analyzer
nav_order: 60
parent: Analyzers
nav_order: 120
---

# Whitespace analyzer
4 changes: 2 additions & 2 deletions _analyzers/token-filters/index.md
@@ -63,5 +63,5 @@ Token filter | Underlying Lucene token filter| Description
[`truncate`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/truncate/) | [TruncateTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html) | Truncates tokens with lengths exceeding the specified character limit.
`unique` | N/A | Ensures each token is unique by removing duplicate tokens from a stream.
[`uppercase`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/uppercase/) | [UpperCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to uppercase.
[`word_delimiter`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/word-delimiter/) | [WordDelimiterFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html) | Splits tokens at non-alphanumeric characters and performs normalization based on the specified rules.
[`word_delimiter_graph`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/word-delimiter-graph/) | [WordDelimiterGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html) | Splits tokens at non-alphanumeric characters and performs normalization based on the specified rules. Assigns a `positionLength` attribute to multi-position tokens.
[`word_delimiter`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/word-delimiter/) | [WordDelimiterFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html) | Splits tokens on non-alphanumeric characters and performs normalization based on the specified rules.
[`word_delimiter_graph`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/word-delimiter-graph/) | [WordDelimiterGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html) | Splits tokens on non-alphanumeric characters and performs normalization based on the specified rules. Assigns a `positionLength` attribute to multi-position tokens.
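
Token filters such as these can be tried inline with the `_analyze` API before being added to an index. A minimal sketch using the `unique` filter from the table above, with hypothetical sample text:

```json
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "unique"],
  "text": "OpenSearch opensearch OPENSEARCH"
}
```

After lowercasing, all three tokens are identical, so `unique` removes the duplicates and a single `opensearch` token remains.
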
4 changes: 2 additions & 2 deletions _analyzers/token-filters/word-delimiter-graph.md
@@ -7,7 +7,7 @@ nav_order: 480

# Word delimiter graph token filter

The `word_delimiter_graph` token filter is used to split tokens at predefined characters and also offers optional token normalization based on customizable rules.
The `word_delimiter_graph` token filter is used to split tokens on predefined characters and also offers optional token normalization based on customizable rules.

The `word_delimiter_graph` filter is used to remove punctuation from complex identifiers like part numbers or product IDs. In such cases, it is best used with the `keyword` tokenizer. For hyphenated words, use the `synonym_graph` token filter instead of the `word_delimiter_graph` filter because users frequently search for these terms both with and without hyphens.
{: .note}
@@ -44,7 +44,7 @@ Parameter | Required/Optional | Data type | Description
`split_on_case_change` | Optional | Boolean | Splits tokens where consecutive letters have different cases (one is lowercase and the other is uppercase). For example, `"OpenSearch"` becomes `[ Open, Search ]`. Default is `true`.
`split_on_numerics` | Optional | Boolean | Splits tokens where there are consecutive letters and numbers. For example `"v8engine"` will become `[ v, 8, engine ]`. Default is `true`.
`stem_english_possessive` | Optional | Boolean | Removes English possessive endings, such as `'s`. Default is `true`.
`type_table` | Optional | Array of strings | A custom map that specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For example, to treat a hyphen (`-`) as an alphanumeric character, specify `["- => ALPHA"]` so that words are not split at hyphens. Valid types are: <br> - `ALPHA`: alphabetical <br> - `ALPHANUM`: alphanumeric <br> - `DIGIT`: numeric <br> - `LOWER`: lowercase alphabetical <br> - `SUBWORD_DELIM`: non-alphanumeric delimiter <br> - `UPPER`: uppercase alphabetical
`type_table` | Optional | Array of strings | A custom map that specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For example, to treat a hyphen (`-`) as an alphanumeric character, specify `["- => ALPHA"]` so that words are not split on hyphens. Valid types are: <br> - `ALPHA`: alphabetical <br> - `ALPHANUM`: alphanumeric <br> - `DIGIT`: numeric <br> - `LOWER`: lowercase alphabetical <br> - `SUBWORD_DELIM`: non-alphanumeric delimiter <br> - `UPPER`: uppercase alphabetical
`type_table_path` | Optional | String | Specifies a path (absolute or relative to the config directory) to a file containing a custom character map. The map specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For valid types, see `type_table`.
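
As an illustration of the `type_table` parameter described above, the following minimal sketch defines a custom filter that treats hyphens as alphanumeric characters (the index, filter, and analyzer names are hypothetical):

```json
PUT /hyphen_index
{
  "settings": {
    "analysis": {
      "filter": {
        "keep_hyphens": {
          "type": "word_delimiter_graph",
          "type_table": ["- => ALPHA"]
        }
      },
      "analyzer": {
        "hyphen_analyzer": {
          "tokenizer": "keyword",
          "filter": ["keep_hyphens"]
        }
      }
    }
  }
}
```

With this configuration, a term like `wi-fi` is kept as a single token instead of being split on the hyphen.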

## Example
4 changes: 2 additions & 2 deletions _analyzers/token-filters/word-delimiter.md
@@ -7,7 +7,7 @@ nav_order: 470

# Word delimiter token filter

The `word_delimiter` token filter is used to split tokens at predefined characters and also offers optional token normalization based on customizable rules.
The `word_delimiter` token filter is used to split tokens on predefined characters and also offers optional token normalization based on customizable rules.

We recommend using the `word_delimiter_graph` filter instead of the `word_delimiter` filter whenever possible because the `word_delimiter` filter sometimes produces invalid token graphs. For more information about the differences between the two filters, see [Differences between the `word_delimiter_graph` and `word_delimiter` filters]({{site.url}}{{site.baseurl}}/analyzers/token-filters/word-delimiter-graph/#differences-between-the-word_delimiter_graph-and-word_delimiter-filters).
{: .important}
@@ -45,7 +45,7 @@ Parameter | Required/Optional | Data type | Description
`split_on_case_change` | Optional | Boolean | Splits tokens where consecutive letters have different cases (one is lowercase and the other is uppercase). For example, `"OpenSearch"` becomes `[ Open, Search ]`. Default is `true`.
`split_on_numerics` | Optional | Boolean | Splits tokens where there are consecutive letters and numbers. For example `"v8engine"` will become `[ v, 8, engine ]`. Default is `true`.
`stem_english_possessive` | Optional | Boolean | Removes English possessive endings, such as `'s`. Default is `true`.
`type_table` | Optional | Array of strings | A custom map that specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For example, to treat a hyphen (`-`) as an alphanumeric character, specify `["- => ALPHA"]` so that words are not split at hyphens. Valid types are: <br> - `ALPHA`: alphabetical <br> - `ALPHANUM`: alphanumeric <br> - `DIGIT`: numeric <br> - `LOWER`: lowercase alphabetical <br> - `SUBWORD_DELIM`: non-alphanumeric delimiter <br> - `UPPER`: uppercase alphabetical
`type_table` | Optional | Array of strings | A custom map that specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For example, to treat a hyphen (`-`) as an alphanumeric character, specify `["- => ALPHA"]` so that words are not split on hyphens. Valid types are: <br> - `ALPHA`: alphabetical <br> - `ALPHANUM`: alphanumeric <br> - `DIGIT`: numeric <br> - `LOWER`: lowercase alphabetical <br> - `SUBWORD_DELIM`: non-alphanumeric delimiter <br> - `UPPER`: uppercase alphabetical
`type_table_path` | Optional | String | Specifies a path (absolute or relative to the config directory) to a file containing a custom character map. The map specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For valid types, see `type_table`.
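
The default behavior of `split_on_case_change` described above can be verified directly with the `_analyze` API; a minimal sketch:

```json
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": ["word_delimiter"],
  "text": "OpenSearch"
}
```

As in the parameter description, the case change splits the input into `[ Open, Search ]`.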

## Example
2 changes: 1 addition & 1 deletion _analyzers/tokenizers/index.md
@@ -56,7 +56,7 @@ Tokenizer | Description | Example
`keyword` | - No-op tokenizer <br> - Outputs the entire string unchanged <br> - Can be combined with token filters, like lowercase, to normalize terms | `My repo` <br>becomes<br> `My repo`
`pattern` | - Uses a regular expression pattern to parse text into terms on a word separator or to capture matching text as terms <br> - Uses [Java regular expressions](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) | `https://opensearch.org/forum` <br>becomes<br> [`https`, `opensearch`, `org`, `forum`] because by default the tokenizer splits terms at word boundaries (`\W+`)<br> Can be configured with a regex pattern
`simple_pattern` | - Uses a regular expression pattern to return matching text as terms <br> - Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html) <br> - Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | Returns an empty array by default <br> Must be configured with a pattern because the pattern defaults to an empty string
`simple_pattern_split` | - Uses a regular expression pattern to split the text at matches rather than returning the matches as terms <br> - Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html) <br> - Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | No-op by default<br> Must be configured with a pattern
`simple_pattern_split` | - Uses a regular expression pattern to split the text on matches rather than returning the matches as terms <br> - Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html) <br> - Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | No-op by default<br> Must be configured with a pattern
`char_group` | - Parses on a set of configurable characters <br> - Faster than tokenizers that run regular expressions | No-op by default<br> Must be configured with a list of characters
`path_hierarchy` | - Parses text on the path separator (by default, `/`) and returns a full path to each component in the tree hierarchy | `one/two/three` <br>becomes<br> [`one`, `one/two`, `one/two/three`]
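
The `path_hierarchy` behavior in the last row is straightforward to verify with the `_analyze` API; a minimal sketch:

```json
GET /_analyze
{
  "tokenizer": "path_hierarchy",
  "text": "one/two/three"
}
```

This returns the cumulative paths `one`, `one/two`, and `one/two/three`, matching the example in the table.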

4 changes: 2 additions & 2 deletions _analyzers/tokenizers/pattern.md
@@ -11,7 +11,7 @@ The `pattern` tokenizer is a highly flexible tokenizer that allows you to split

## Example usage

The following example request creates a new index named `my_index` and configures an analyzer with a `pattern` tokenizer. The tokenizer splits text at `-`, `_`, or `.` characters:
The following example request creates a new index named `my_index` and configures an analyzer with a `pattern` tokenizer. The tokenizer splits text on `-`, `_`, or `.` characters:

```json
PUT /my_index
@@ -102,7 +102,7 @@ Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`pattern` | Optional | String | The pattern used to split text into tokens, specified using a [Java regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). Default is `\W+`.
`flags` | Optional | String | Configures pipe-separated [flags](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary) to apply to the regular expression, for example, `"CASE_INSENSITIVE|MULTILINE|DOTALL"`.
`group` | Optional | Integer | Specifies the capture group to be used as a token. Default is `-1` (split at a match).
`group` | Optional | Integer | Specifies the capture group to be used as a token. Default is `-1` (split on a match).

## Example using a group parameter

2 changes: 1 addition & 1 deletion _analyzers/tokenizers/simple-pattern-split.md
@@ -13,7 +13,7 @@ The tokenizer uses the matched parts of the input text (based on the regular exp

## Example usage

The following example request creates a new index named `my_index` and configures an analyzer with a `simple_pattern_split` tokenizer. The tokenizer is configured to split text at hyphens:
The following example request creates a new index named `my_index` and configures an analyzer with a `simple_pattern_split` tokenizer. The tokenizer is configured to split text on hyphens:

```json
PUT /my_index
2 changes: 1 addition & 1 deletion _analyzers/tokenizers/whitespace.md
@@ -7,7 +7,7 @@ nav_order: 160

# Whitespace tokenizer

The `whitespace` tokenizer splits text at white space characters, such as spaces, tabs, and new lines. It treats each word separated by white space as a token and does not perform any additional analysis or normalization like lowercasing or punctuation removal.
The `whitespace` tokenizer splits text on white space characters, such as spaces, tabs, and new lines. It treats each word separated by white space as a token and does not perform any additional analysis or normalization like lowercasing or punctuation removal.

## Example usage
