From bb41c1f06c766e00508c72a4c2480e149c754b49 Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" Date: Mon, 25 Nov 2024 15:41:38 +0000 Subject: [PATCH] add synonym graph token filter docs #8448 (#8458) * add synonym graph token filter docs #8448 Signed-off-by: Anton Rubin * updating parameter table Signed-off-by: Anton Rubin * Doc review Signed-off-by: Fanit Kolchina * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: Anton Rubin Signed-off-by: Fanit Kolchina Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Fanit Kolchina Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower (cherry picked from commit d0a28b32e2621627806e3dc7e47b9840fcd419b3) Signed-off-by: github-actions[bot] --- _analyzers/token-filters/index.md | 2 +- _analyzers/token-filters/synonym-graph.md | 180 ++++++++++++++++++++++ 2 files changed, 181 insertions(+), 1 deletion(-) create mode 100644 _analyzers/token-filters/synonym-graph.md diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index 39dcdbdc93..10861aaf40 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -58,7 +58,7 @@ Normalization | `arabic_normalization`: [ArabicNormalizer](https://lucene.apache `stemmer_override` | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed. `stop` | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream. [`synonym`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/synonym/) | N/A | Supplies a synonym list for the analysis process. The synonym list is provided using a configuration file. -`synonym_graph` | N/A | Supplies a synonym list, including multiword synonyms, for the analysis process. 
+[`synonym_graph`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/synonym-graph/) | N/A | Supplies a synonym list, including multiword synonyms, for the analysis process. `trim` | [TrimFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html) | Trims leading and trailing white space from each token in a stream. `truncate` | [TruncateTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html) | Truncates tokens whose length exceeds the specified character limit. `unique` | N/A | Ensures each token is unique by removing duplicate tokens from a stream. diff --git a/_analyzers/token-filters/synonym-graph.md b/_analyzers/token-filters/synonym-graph.md new file mode 100644 index 0000000000..75c7c79151 --- /dev/null +++ b/_analyzers/token-filters/synonym-graph.md @@ -0,0 +1,180 @@ +--- +layout: default +title: Synonym graph +parent: Token filters +nav_order: 420 +--- + +# Synonym graph token filter + +The `synonym_graph` token filter is a more advanced version of the `synonym` token filter. It supports multiword synonyms and processes synonyms across multiple tokens, making it ideal for phrases or scenarios in which relationships between tokens are important. + +## Parameters + +The `synonym_graph` token filter can be configured with the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`synonyms` | Either `synonyms` or `synonyms_path` must be specified | String | A list of synonym rules defined directly in the configuration. +`synonyms_path` | Either `synonyms` or `synonyms_path` must be specified | String | The file path to a file containing synonym rules (either an absolute path or a path relative to the config directory). +`lenient` | Optional | Boolean | Whether to ignore exceptions when loading the rule configurations. Default is `false`. 
+`format` | Optional | String | Specifies the format used to determine how OpenSearch defines and interprets synonyms. Valid values are:
- `solr`
- [`wordnet`](https://wordnet.princeton.edu/).
Default is `solr`. +`expand` | Optional | Boolean | Whether to expand equivalent synonym rules. Default is `true`.

For example:
If `synonyms` are defined as `"quick, fast"` and `expand` is set to `true`, then the synonym rules are configured as follows:
- `quick => quick`
- `quick => fast`
- `fast => quick`
- `fast => fast`

If `expand` is set to `false`, the synonym rules are configured as follows:
- `quick => quick`
- `fast => quick` + +## Example: Solr format + +The following example request creates a new index named `my-index` and configures an analyzer with a `synonym_graph` filter. The filter is configured with the default `solr` rule format: + +```json +PUT /my-index +{ + "settings": { + "analysis": { + "filter": { + "my_synonym_graph_filter": { + "type": "synonym_graph", + "synonyms": [ + "sports car, race car", + "fast car, speedy vehicle", + "luxury car, premium vehicle", + "electric car, EV" + ] + } + }, + "analyzer": { + "my_synonym_graph_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "my_synonym_graph_filter" + ] + } + } + } + } +} + +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /my-index/_analyze +{ + "analyzer": "my_synonym_graph_analyzer", + "text": "I just bought a sports car and it is a fast car." +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + {"token": "i","start_offset": 0,"end_offset": 1,"type": "<ALPHANUM>","position": 0}, + {"token": "just","start_offset": 2,"end_offset": 6,"type": "<ALPHANUM>","position": 1}, + {"token": "bought","start_offset": 7,"end_offset": 13,"type": "<ALPHANUM>","position": 2}, + {"token": "a","start_offset": 14,"end_offset": 15,"type": "<ALPHANUM>","position": 3}, + {"token": "race","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 4}, + {"token": "sports","start_offset": 16,"end_offset": 22,"type": "<ALPHANUM>","position": 4,"positionLength": 2}, + {"token": "car","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 5,"positionLength": 2}, + {"token": "car","start_offset": 23,"end_offset": 26,"type": "<ALPHANUM>","position": 6}, + {"token": "and","start_offset": 27,"end_offset": 30,"type": "<ALPHANUM>","position": 7}, + {"token": "it","start_offset": 31,"end_offset": 33,"type": "<ALPHANUM>","position": 8},
+ {"token": "is","start_offset": 34,"end_offset": 36,"type": "<ALPHANUM>","position": 9}, + {"token": "a","start_offset": 37,"end_offset": 38,"type": "<ALPHANUM>","position": 10}, + {"token": "speedy","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 11}, + {"token": "fast","start_offset": 39,"end_offset": 43,"type": "<ALPHANUM>","position": 11,"positionLength": 2}, + {"token": "vehicle","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 12,"positionLength": 2}, + {"token": "car","start_offset": 44,"end_offset": 47,"type": "<ALPHANUM>","position": 13} + ] +} +``` + +## Example: WordNet format + +The following example request creates a new index named `my-wordnet-index` and configures an analyzer with a `synonym_graph` filter. The filter is configured with the [`wordnet`](https://wordnet.princeton.edu/) rule format: + +```json +PUT /my-wordnet-index +{ + "settings": { + "analysis": { + "filter": { + "my_synonym_graph_filter": { + "type": "synonym_graph", + "format": "wordnet", + "synonyms": [ + "s(100000001, 1, 'sports car', n, 1, 0).", + "s(100000001, 2, 'race car', n, 1, 0).", + "s(100000001, 3, 'fast car', n, 1, 0).", + "s(100000001, 4, 'speedy vehicle', n, 1, 0)." + ] + } + }, + "analyzer": { + "my_synonym_graph_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "my_synonym_graph_filter" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /my-wordnet-index/_analyze +{ + "analyzer": "my_synonym_graph_analyzer", + "text": "I just bought a sports car and it is a fast car."
+} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + {"token": "i","start_offset": 0,"end_offset": 1,"type": "<ALPHANUM>","position": 0}, + {"token": "just","start_offset": 2,"end_offset": 6,"type": "<ALPHANUM>","position": 1}, + {"token": "bought","start_offset": 7,"end_offset": 13,"type": "<ALPHANUM>","position": 2}, + {"token": "a","start_offset": 14,"end_offset": 15,"type": "<ALPHANUM>","position": 3}, + {"token": "race","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 4}, + {"token": "fast","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 4,"positionLength": 2}, + {"token": "speedy","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 4,"positionLength": 3}, + {"token": "sports","start_offset": 16,"end_offset": 22,"type": "<ALPHANUM>","position": 4,"positionLength": 4}, + {"token": "car","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 5,"positionLength": 4}, + {"token": "car","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 6,"positionLength": 3}, + {"token": "vehicle","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 7,"positionLength": 2}, + {"token": "car","start_offset": 23,"end_offset": 26,"type": "<ALPHANUM>","position": 8}, + {"token": "and","start_offset": 27,"end_offset": 30,"type": "<ALPHANUM>","position": 9}, + {"token": "it","start_offset": 31,"end_offset": 33,"type": "<ALPHANUM>","position": 10}, + {"token": "is","start_offset": 34,"end_offset": 36,"type": "<ALPHANUM>","position": 11}, + {"token": "a","start_offset": 37,"end_offset": 38,"type": "<ALPHANUM>","position": 12}, + {"token": "sports","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 13}, + {"token": "race","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 13,"positionLength": 2}, + {"token": "speedy","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 13,"positionLength": 3}, + {"token": "fast","start_offset": 39,"end_offset": 43,"type": "<ALPHANUM>","position": 13,"positionLength": 4},
+ {"token": "car","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 14,"positionLength": 4}, + {"token": "car","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 15,"positionLength": 3}, + {"token": "vehicle","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 16,"positionLength": 2}, + {"token": "car","start_offset": 44,"end_offset": 47,"type": "<ALPHANUM>","position": 17} + ] +} +```
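
The `expand` parameter described in the parameter table can be illustrated with a minimal sketch that sets it to `false` explicitly. The index name `my-no-expand-index`, the filter name `my_no_expand_filter`, and the analyzer name `my_no_expand_analyzer` are hypothetical and are not part of the page added by this patch. According to the rules listed in the table, `fast` should then be rewritten to `quick`, while `quick` is left unchanged:

```json
PUT /my-no-expand-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_no_expand_filter": {
          "type": "synonym_graph",
          "expand": false,
          "synonyms": [
            "quick, fast"
          ]
        }
      },
      "analyzer": {
        "my_no_expand_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_no_expand_filter"
          ]
        }
      }
    }
  }
}
```

To check the effect, the same `_analyze` request pattern used in the examples above can be sent to this index with the `my_no_expand_analyzer` analyzer and a text containing `fast`.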