From 645780d0b7f1a947dcf3f6acaa6d8e483e8f0661 Mon Sep 17 00:00:00 2001 From: Fanit Kolchina Date: Tue, 10 Dec 2024 12:45:33 -0500 Subject: [PATCH 1/2] Reorganize built-in analyzer section Signed-off-by: Fanit Kolchina --- _analyzers/custom-analyzer.md | 2 +- _analyzers/language-analyzers/index.md | 2 +- .../{ => supported-analyzers}/fingerprint.md | 3 ++- _analyzers/supported-analyzers/index.md | 14 +++++++------- _analyzers/{ => supported-analyzers}/keyword.md | 1 + _analyzers/{ => supported-analyzers}/pattern.md | 1 + _analyzers/supported-analyzers/phone-analyzers.md | 2 +- _analyzers/{ => supported-analyzers}/simple.md | 3 ++- _analyzers/{ => supported-analyzers}/standard.md | 3 ++- _analyzers/{ => supported-analyzers}/stop.md | 3 ++- _analyzers/{ => supported-analyzers}/whitespace.md | 3 ++- _analyzers/token-filters/index.md | 4 ++-- _analyzers/token-filters/word-delimiter-graph.md | 4 ++-- _analyzers/token-filters/word-delimiter.md | 4 ++-- _analyzers/tokenizers/index.md | 2 +- _analyzers/tokenizers/pattern.md | 4 ++-- _analyzers/tokenizers/simple-pattern-split.md | 2 +- _analyzers/tokenizers/whitespace.md | 2 +- 18 files changed, 33 insertions(+), 26 deletions(-) rename _analyzers/{ => supported-analyzers}/fingerprint.md (98%) rename _analyzers/{ => supported-analyzers}/keyword.md (98%) rename _analyzers/{ => supported-analyzers}/pattern.md (99%) rename _analyzers/{ => supported-analyzers}/simple.md (98%) rename _analyzers/{ => supported-analyzers}/standard.md (98%) rename _analyzers/{ => supported-analyzers}/stop.md (99%) rename _analyzers/{ => supported-analyzers}/whitespace.md (98%) diff --git a/_analyzers/custom-analyzer.md b/_analyzers/custom-analyzer.md index b808268f66..c456f3d826 100644 --- a/_analyzers/custom-analyzer.md +++ b/_analyzers/custom-analyzer.md @@ -1,7 +1,7 @@ --- layout: default title: Creating a custom analyzer -nav_order: 90 +nav_order: 40 parent: Analyzers --- diff --git a/_analyzers/language-analyzers/index.md b/_analyzers/language-analyzers/index.md index 89a4a42254..cc53c1cdac 100644 --- a/_analyzers/language-analyzers/index.md +++ b/_analyzers/language-analyzers/index.md @@ -1,7 +1,7 @@ --- layout: default title: Language analyzers -nav_order: 100 +nav_order: 140 parent: Analyzers has_children: true has_toc: true diff --git a/_analyzers/fingerprint.md b/_analyzers/supported-analyzers/fingerprint.md similarity index 98% rename from _analyzers/fingerprint.md rename to _analyzers/supported-analyzers/fingerprint.md index dd8027f037..267e16c039 100644 --- a/_analyzers/fingerprint.md +++ b/_analyzers/supported-analyzers/fingerprint.md @@ -1,7 +1,8 @@ --- layout: default title: Fingerprint analyzer -nav_order: 110 +parent: Analyzers +nav_order: 60 --- # Fingerprint analyzer diff --git a/_analyzers/supported-analyzers/index.md b/_analyzers/supported-analyzers/index.md index 43e41b8d6a..f67ba68635 100644 --- a/_analyzers/supported-analyzers/index.md +++ b/_analyzers/supported-analyzers/index.md @@ -18,14 +18,14 @@ The following table lists the built-in analyzers that OpenSearch provides. The l Analyzer | Analysis performed | Analyzer output :--- | :--- | :--- -**Standard** (default) | - Parses strings into tokens at word boundaries
- Removes most punctuation
- Converts tokens to lowercase | [`it’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`] -**Simple** | - Parses strings into tokens on any non-letter character
- Removes non-letter characters
- Converts tokens to lowercase | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `to`, `opensearch`] -**Whitespace** | - Parses strings into tokens on white space | [`It’s`, `fun`, `to`, `contribute`, `a`,`brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`] -**Stop** | - Parses strings into tokens on any non-letter character
- Removes non-letter characters
- Removes stop words
- Converts tokens to lowercase | [`s`, `fun`, `contribute`, `brand`, `new`, `pr`, `opensearch`] -**Keyword** (no-op) | - Outputs the entire string unchanged | [`It’s fun to contribute a brand-new PR or 2 to OpenSearch!`] -**Pattern** | - Parses strings into tokens using regular expressions
- Supports converting strings to lowercase
- Supports removing stop words | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`] +[**Standard**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/standard/) (default) | - Parses strings into tokens at word boundaries
- Removes most punctuation
- Converts tokens to lowercase | [`it’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`] +[**Simple**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/simple/) | - Parses strings into tokens on any non-letter character
- Removes non-letter characters
- Converts tokens to lowercase | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `to`, `opensearch`] +[**Whitespace**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/whitespace/) | - Parses strings into tokens on white space | [`It’s`, `fun`, `to`, `contribute`, `a`,`brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`] +[**Stop**](({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/stop/)) | - Parses strings into tokens on any non-letter character
- Removes non-letter characters
- Removes stop words
- Converts tokens to lowercase | [`s`, `fun`, `contribute`, `brand`, `new`, `pr`, `opensearch`] +[**Keyword**](({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/keyword/)) (no-op) | - Outputs the entire string unchanged | [`It’s fun to contribute a brand-new PR or 2 to OpenSearch!`] +[**Pattern**](({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/pattern/))| - Parses strings into tokens using regular expressions
- Supports converting strings to lowercase
- Supports removing stop words | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`] [**Language**]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/index/) | Performs analysis specific to a certain language (for example, `english`). | [`fun`, `contribut`, `brand`, `new`, `pr`, `2`, `opensearch`] -**Fingerprint** | - Parses strings on any non-letter character
- Normalizes characters by converting them to ASCII
- Converts tokens to lowercase
- Sorts, deduplicates, and concatenates tokens into a single token
- Supports removing stop words | [`2 a brand contribute fun it's new opensearch or pr to`]
Note that the apostrophe was converted to its ASCII counterpart. +[**Fingerprint**](({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/fingerprint/)) | - Parses strings on any non-letter character
- Normalizes characters by converting them to ASCII
- Converts tokens to lowercase
- Sorts, deduplicates, and concatenates tokens into a single token
- Supports removing stop words | [`2 a brand contribute fun it's new opensearch or pr to`]
Note that the apostrophe was converted to its ASCII counterpart. ## Language analyzers diff --git a/_analyzers/keyword.md b/_analyzers/supported-analyzers/keyword.md similarity index 98% rename from _analyzers/keyword.md rename to _analyzers/supported-analyzers/keyword.md index 3aec99d1d4..00c314d0c4 100644 --- a/_analyzers/keyword.md +++ b/_analyzers/supported-analyzers/keyword.md @@ -1,6 +1,7 @@ --- layout: default title: Keyword analyzer +parent: Analyzers nav_order: 80 --- diff --git a/_analyzers/pattern.md b/_analyzers/supported-analyzers/pattern.md similarity index 99% rename from _analyzers/pattern.md rename to _analyzers/supported-analyzers/pattern.md index 0d67999b82..bc3cb9a306 100644 --- a/_analyzers/pattern.md +++ b/_analyzers/supported-analyzers/pattern.md @@ -1,6 +1,7 @@ --- layout: default title: Pattern analyzer +parent: Analyzers nav_order: 90 --- diff --git a/_analyzers/supported-analyzers/phone-analyzers.md b/_analyzers/supported-analyzers/phone-analyzers.md index f24b7cf328..d94bfe192f 100644 --- a/_analyzers/supported-analyzers/phone-analyzers.md +++ b/_analyzers/supported-analyzers/phone-analyzers.md @@ -1,6 +1,6 @@ --- layout: default -title: Phone number +title: Phone number analyzers parent: Analyzers nav_order: 140 --- diff --git a/_analyzers/simple.md b/_analyzers/supported-analyzers/simple.md similarity index 98% rename from _analyzers/simple.md rename to _analyzers/supported-analyzers/simple.md index edfa7f58a6..29f8f9a533 100644 --- a/_analyzers/simple.md +++ b/_analyzers/supported-analyzers/simple.md @@ -1,7 +1,8 @@ --- layout: default title: Simple analyzer -nav_order: 50 +parent: Analyzers +nav_order: 100 --- # Simple analyzer diff --git a/_analyzers/standard.md b/_analyzers/supported-analyzers/standard.md similarity index 98% rename from _analyzers/standard.md rename to _analyzers/supported-analyzers/standard.md index e4a7a70fbc..d5c3650d5d 100644 --- a/_analyzers/standard.md +++ b/_analyzers/supported-analyzers/standard.md @@ -1,7 +1,8 @@ --- layout: default title: Standard analyzer -nav_order: 40 +parent: Analyzers +nav_order: 50 --- # Standard analyzer diff --git a/_analyzers/stop.md b/_analyzers/supported-analyzers/stop.md similarity index 99% rename from _analyzers/stop.md rename to _analyzers/supported-analyzers/stop.md index 68dc554473..df62c7fe58 100644 --- a/_analyzers/stop.md +++ b/_analyzers/supported-analyzers/stop.md @@ -1,7 +1,8 @@ --- layout: default title: Stop analyzer -nav_order: 70 +parent: Analyzers +nav_order: 110 --- # Stop analyzer diff --git a/_analyzers/whitespace.md b/_analyzers/supported-analyzers/whitespace.md similarity index 98% rename from _analyzers/whitespace.md rename to _analyzers/supported-analyzers/whitespace.md index 67fee61295..4691b4f733 100644 --- a/_analyzers/whitespace.md +++ b/_analyzers/supported-analyzers/whitespace.md @@ -1,7 +1,8 @@ --- layout: default title: Whitespace analyzer -nav_order: 60 +parent: Analyzers +nav_order: 120 --- # Whitespace analyzer diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index b06489c805..875e94db5a 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -63,5 +63,5 @@ Token filter | Underlying Lucene token filter| Description [`truncate`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/truncate/) | [TruncateTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html) | Truncates tokens with lengths exceeding the specified character 
limit.
 [`unique`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/unique/) | N/A | Ensures that each token is unique by removing duplicate tokens from a stream.
 [`uppercase`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/uppercase/) | [UpperCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to uppercase.
-[`word_delimiter`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/word-delimiter/) | [WordDelimiterFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html) | Splits tokens at non-alphanumeric characters and performs normalization based on the specified rules.
-[`word_delimiter_graph`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/word-delimiter-graph/) | [WordDelimiterGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html) | Splits tokens at non-alphanumeric characters and performs normalization based on the specified rules. Assigns a `positionLength` attribute to multi-position tokens.
+[`word_delimiter`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/word-delimiter/) | [WordDelimiterFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html) | Splits tokens on non-alphanumeric characters and performs normalization based on the specified rules.
+[`word_delimiter_graph`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/word-delimiter-graph/) | [WordDelimiterGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html) | Splits tokens on non-alphanumeric characters and performs normalization based on the specified rules. Assigns a `positionLength` attribute to multi-position tokens.
diff --git a/_analyzers/token-filters/word-delimiter-graph.md b/_analyzers/token-filters/word-delimiter-graph.md
index ac734bebeb..b901f5a0e5 100644
--- a/_analyzers/token-filters/word-delimiter-graph.md
+++ b/_analyzers/token-filters/word-delimiter-graph.md
@@ -7,7 +7,7 @@ nav_order: 480
 
 # Word delimiter graph token filter
 
-The `word_delimiter_graph` token filter is used to split tokens at predefined characters and also offers optional token normalization based on customizable rules.
+The `word_delimiter_graph` token filter is used to split tokens on predefined characters and also offers optional token normalization based on customizable rules.
 
 The `word_delimiter_graph` filter is used to remove punctuation from complex identifiers like part numbers or product IDs. In such cases, it is best used with the `keyword` tokenizer. For hyphenated words, use the `synonym_graph` token filter instead of the `word_delimiter_graph` filter because users frequently search for these terms both with and without hyphens.
 {: .note}
@@ -44,7 +44,7 @@ Parameter | Required/Optional | Data type | Description
 `split_on_case_change` | Optional | Boolean | Splits tokens where consecutive letters have different cases (one is lowercase and the other is uppercase). For example, `"OpenSearch"` becomes `[ Open, Search ]`. Default is `true`.
 `split_on_numerics` | Optional | Boolean | Splits tokens where there are consecutive letters and numbers. For example `"v8engine"` will become `[ v, 8, engine ]`. Default is `true`.
 `stem_english_possessive` | Optional | Boolean | Removes English possessive endings, such as `'s`. <br>
Default is `true`. -`type_table` | Optional | Array of strings | A custom map that specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For example, to treat a hyphen (`-`) as an alphanumeric character, specify `["- => ALPHA"]` so that words are not split at hyphens. Valid types are:
- `ALPHA`: alphabetical
- `ALPHANUM`: alphanumeric
- `DIGIT`: numeric
- `LOWER`: lowercase alphabetical
- `SUBWORD_DELIM`: non-alphanumeric delimiter
- `UPPER`: uppercase alphabetical +`type_table` | Optional | Array of strings | A custom map that specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For example, to treat a hyphen (`-`) as an alphanumeric character, specify `["- => ALPHA"]` so that words are not split on hyphens. Valid types are:
- `ALPHA`: alphabetical
- `ALPHANUM`: alphanumeric
- `DIGIT`: numeric
- `LOWER`: lowercase alphabetical
- `SUBWORD_DELIM`: non-alphanumeric delimiter
- `UPPER`: uppercase alphabetical
 `type_table_path` | Optional | String | Specifies a path (absolute or relative to the config directory) to a file containing a custom character map. The map specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For valid types, see `type_table`.
 
 ## Example
diff --git a/_analyzers/token-filters/word-delimiter.md b/_analyzers/token-filters/word-delimiter.md
index d820fae2a0..77a71f28fb 100644
--- a/_analyzers/token-filters/word-delimiter.md
+++ b/_analyzers/token-filters/word-delimiter.md
@@ -7,7 +7,7 @@ nav_order: 470
 
 # Word delimiter token filter
 
-The `word_delimiter` token filter is used to split tokens at predefined characters and also offers optional token normalization based on customizable rules.
+The `word_delimiter` token filter is used to split tokens on predefined characters and also offers optional token normalization based on customizable rules.
 
 We recommend using the `word_delimiter_graph` filter instead of the `word_delimiter` filter whenever possible because the `word_delimiter` filter sometimes produces invalid token graphs. For more information about the differences between the two filters, see [Differences between the `word_delimiter_graph` and `word_delimiter` filters]({{site.url}}{{site.baseurl}}/analyzers/token-filters/word-delimiter-graph/#differences-between-the-word_delimiter_graph-and-word_delimiter-filters).
 {: .important}
@@ -45,7 +45,7 @@ Parameter | Required/Optional | Data type | Description
 `split_on_case_change` | Optional | Boolean | Splits tokens where consecutive letters have different cases (one is lowercase and the other is uppercase). For example, `"OpenSearch"` becomes `[ Open, Search ]`. Default is `true`.
 `split_on_numerics` | Optional | Boolean | Splits tokens where there are consecutive letters and numbers. For example `"v8engine"` will become `[ v, 8, engine ]`. Default is `true`.
 `stem_english_possessive` | Optional | Boolean | Removes English possessive endings, such as `'s`. Default is `true`.
-`type_table` | Optional | Array of strings | A custom map that specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For example, to treat a hyphen (`-`) as an alphanumeric character, specify `["- => ALPHA"]` so that words are not split at hyphens. Valid types are:<br>
- `ALPHA`: alphabetical
- `ALPHANUM`: alphanumeric
- `DIGIT`: numeric
- `LOWER`: lowercase alphabetical
- `SUBWORD_DELIM`: non-alphanumeric delimiter
- `UPPER`: uppercase alphabetical +`type_table` | Optional | Array of strings | A custom map that specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For example, to treat a hyphen (`-`) as an alphanumeric character, specify `["- => ALPHA"]` so that words are not split on hyphens. Valid types are:
- `ALPHA`: alphabetical
- `ALPHANUM`: alphanumeric
- `DIGIT`: numeric
- `LOWER`: lowercase alphabetical
- `SUBWORD_DELIM`: non-alphanumeric delimiter
- `UPPER`: uppercase alphabetical `type_table_path` | Optional | String | Specifies a path (absolute or relative to the config directory) to a file containing a custom character map. The map specifies how to treat characters and whether to treat them as delimiters, which avoids unwanted splitting. For valid types, see `type_table`. ## Example diff --git a/_analyzers/tokenizers/index.md b/_analyzers/tokenizers/index.md index 1f9e49c855..f5b5ff0f25 100644 --- a/_analyzers/tokenizers/index.md +++ b/_analyzers/tokenizers/index.md @@ -56,7 +56,7 @@ Tokenizer | Description | Example `keyword` | - No-op tokenizer
- Outputs the entire string unchanged
- Can be combined with token filters, like lowercase, to normalize terms | `My repo`
becomes
`My repo` `pattern` | - Uses a regular expression pattern to parse text into terms on a word separator or to capture matching text as terms
- Uses [Java regular expressions](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) | `https://opensearch.org/forum`
becomes
[`https`, `opensearch`, `org`, `forum`] because by default the tokenizer splits terms at word boundaries (`\W+`)
Can be configured with a regex pattern `simple_pattern` | - Uses a regular expression pattern to return matching text as terms
- Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html)
- Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | Returns an empty array by default
Must be configured with a pattern because the pattern defaults to an empty string -`simple_pattern_split` | - Uses a regular expression pattern to split the text at matches rather than returning the matches as terms
- Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html)
- Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | No-op by default
Must be configured with a pattern +`simple_pattern_split` | - Uses a regular expression pattern to split the text on matches rather than returning the matches as terms
- Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html)
- Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | No-op by default
Must be configured with a pattern `char_group` | - Parses on a set of configurable characters
- Faster than tokenizers that run regular expressions | No-op by default
Must be configured with a list of characters `path_hierarchy` | - Parses text on the path separator (by default, `/`) and returns a full path to each component in the tree hierarchy | `one/two/three`
becomes
[`one`, `one/two`, `one/two/three`] diff --git a/_analyzers/tokenizers/pattern.md b/_analyzers/tokenizers/pattern.md index f422d8c805..036dd9050f 100644 --- a/_analyzers/tokenizers/pattern.md +++ b/_analyzers/tokenizers/pattern.md @@ -11,7 +11,7 @@ The `pattern` tokenizer is a highly flexible tokenizer that allows you to split ## Example usage -The following example request creates a new index named `my_index` and configures an analyzer with a `pattern` tokenizer. The tokenizer splits text at `-`, `_`, or `.` characters: +The following example request creates a new index named `my_index` and configures an analyzer with a `pattern` tokenizer. The tokenizer splits text on `-`, `_`, or `.` characters: ```json PUT /my_index @@ -102,7 +102,7 @@ Parameter | Required/Optional | Data type | Description :--- | :--- | :--- | :--- `pattern` | Optional | String | The pattern used to split text into tokens, specified using a [Java regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). Default is `\W+`. `flags` | Optional | String | Configures pipe-separated [flags](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary) to apply to the regular expression, for example, `"CASE_INSENSITIVE|MULTILINE|DOTALL"`. -`group` | Optional | Integer | Specifies the capture group to be used as a token. Default is `-1` (split at a match). +`group` | Optional | Integer | Specifies the capture group to be used as a token. Default is `-1` (split on a match). ## Example using a group parameter diff --git a/_analyzers/tokenizers/simple-pattern-split.md b/_analyzers/tokenizers/simple-pattern-split.md index 1fd130082e..25367f25b5 100644 --- a/_analyzers/tokenizers/simple-pattern-split.md +++ b/_analyzers/tokenizers/simple-pattern-split.md @@ -13,7 +13,7 @@ The tokenizer uses the matched parts of the input text (based on the regular exp ## Example usage -The following example request creates a new index named `my_index` and configures an analyzer with a `simple_pattern_split` tokenizer. The tokenizer is configured to split text at hyphens: +The following example request creates a new index named `my_index` and configures an analyzer with a `simple_pattern_split` tokenizer. The tokenizer is configured to split text on hyphens: ```json PUT /my_index diff --git a/_analyzers/tokenizers/whitespace.md b/_analyzers/tokenizers/whitespace.md index 604eeeb6a0..fb168304a7 100644 --- a/_analyzers/tokenizers/whitespace.md +++ b/_analyzers/tokenizers/whitespace.md @@ -7,7 +7,7 @@ nav_order: 160 # Whitespace tokenizer -The `whitespace` tokenizer splits text at white space characters, such as spaces, tabs, and new lines. It treats each word separated by white space as a token and does not perform any additional analysis or normalization like lowercasing or punctuation removal. +The `whitespace` tokenizer splits text on white space characters, such as spaces, tabs, and new lines. It treats each word separated by white space as a token and does not perform any additional analysis or normalization like lowercasing or punctuation removal. 
## Example usage From aae13ff11047d60646f0610e914e23e8ddc1041b Mon Sep 17 00:00:00 2001 From: Fanit Kolchina Date: Tue, 10 Dec 2024 12:50:54 -0500 Subject: [PATCH 2/2] Remove extra parentheses Signed-off-by: Fanit Kolchina --- _analyzers/supported-analyzers/index.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/_analyzers/supported-analyzers/index.md b/_analyzers/supported-analyzers/index.md index f67ba68635..b54660478f 100644 --- a/_analyzers/supported-analyzers/index.md +++ b/_analyzers/supported-analyzers/index.md @@ -21,11 +21,11 @@ Analyzer | Analysis performed | Analyzer output [**Standard**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/standard/) (default) | - Parses strings into tokens at word boundaries
- Removes most punctuation
- Converts tokens to lowercase | [`it’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`] [**Simple**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/simple/) | - Parses strings into tokens on any non-letter character
- Removes non-letter characters
- Converts tokens to lowercase | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `to`, `opensearch`] [**Whitespace**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/whitespace/) | - Parses strings into tokens on white space | [`It’s`, `fun`, `to`, `contribute`, `a`,`brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`] -[**Stop**](({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/stop/)) | - Parses strings into tokens on any non-letter character
- Removes non-letter characters
- Removes stop words
- Converts tokens to lowercase | [`s`, `fun`, `contribute`, `brand`, `new`, `pr`, `opensearch`] -[**Keyword**](({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/keyword/)) (no-op) | - Outputs the entire string unchanged | [`It’s fun to contribute a brand-new PR or 2 to OpenSearch!`] -[**Pattern**](({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/pattern/))| - Parses strings into tokens using regular expressions
- Supports converting strings to lowercase
- Supports removing stop words | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`] +[**Stop**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/stop/) | - Parses strings into tokens on any non-letter character
- Removes non-letter characters
- Removes stop words
- Converts tokens to lowercase | [`s`, `fun`, `contribute`, `brand`, `new`, `pr`, `opensearch`]
+[**Keyword**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/keyword/) (no-op) | - Outputs the entire string unchanged | [`It’s fun to contribute a brand-new PR or 2 to OpenSearch!`]
+[**Pattern**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/pattern/) | - Parses strings into tokens using regular expressions<br>
- Supports converting strings to lowercase
- Supports removing stop words | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`] [**Language**]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/index/) | Performs analysis specific to a certain language (for example, `english`). | [`fun`, `contribut`, `brand`, `new`, `pr`, `2`, `opensearch`] -[**Fingerprint**](({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/fingerprint/)) | - Parses strings on any non-letter character
- Normalizes characters by converting them to ASCII
- Converts tokens to lowercase
- Sorts, deduplicates, and concatenates tokens into a single token
- Supports removing stop words | [`2 a brand contribute fun it's new opensearch or pr to`]
Note that the apostrophe was converted to its ASCII counterpart. +[**Fingerprint**]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/fingerprint/) | - Parses strings on any non-letter character
- Normalizes characters by converting them to ASCII
- Converts tokens to lowercase
- Sorts, deduplicates, and concatenates tokens into a single token
- Supports removing stop words | [`2 a brand contribute fun it's new opensearch or pr to`]
Note that the apostrophe was converted to its ASCII counterpart. ## Language analyzers @@ -37,5 +37,5 @@ The following table lists the additional analyzers that OpenSearch supports. | Analyzer | Analysis performed | |:---------------|:---------------------------------------------------------------------------------------------------------| -| `phone` | An [index analyzer]({{site.url}}{{site.baseurl}}/analyzers/index-analyzers/) for parsing phone numbers. | -| `phone-search` | A [search analyzer]({{site.url}}{{site.baseurl}}/analyzers/search-analyzers/) for parsing phone numbers. | +| [`phone`]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/phone-analyzers/#the-phone-analyzer) | An [index analyzer]({{site.url}}{{site.baseurl}}/analyzers/index-analyzers/) for parsing phone numbers. | +| [`phone-search`]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/phone-analyzers/#the-phone-search-analyzer) | A [search analyzer]({{site.url}}{{site.baseurl}}/analyzers/search-analyzers/) for parsing phone numbers. |
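
To spot-check the "Analyzer output" column in the reorganized table, the `_analyze` API can be run against any of the built-in analyzers. A minimal sketch, assuming a running OpenSearch cluster (swap `standard` for any other analyzer name listed in the table):

```json
POST _analyze
{
  "analyzer": "standard",
  "text": "It’s fun to contribute a brand-new PR or 2 to OpenSearch!"
}
```

For the `standard` analyzer, the tokens returned in the response should match the table's first row: [`it’s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`].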