add synonym graph token filter docs #8448 (#8458)

* add synonym graph token filter docs #8448
  Signed-off-by: Anton Rubin <[email protected]>
* updating parameter table
  Signed-off-by: Anton Rubin <[email protected]>
* Doc review
  Signed-off-by: Fanit Kolchina <[email protected]>
* Apply suggestions from code review
  Co-authored-by: Nathan Bower <[email protected]>
  Signed-off-by: kolchfa-aws <[email protected]>

---------

Signed-off-by: Anton Rubin <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
(cherry picked from commit d0a28b3)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
opensearch-project · Nov 25, 2024 · bb41c1f
1 parent fe080a6
commit bb41c1f

Showing 2 changed files with 181 additions and 1 deletion.
diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md
@@ -58,7 +58,7 @@ Normalization | `arabic_normalization`: [ArabicNormalizer](https://lucene.apache
 `stemmer_override` | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
 `stop` | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream.
 [`synonym`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/synonym/) | N/A | Supplies a synonym list for the analysis process. The synonym list is provided using a configuration file.
-`synonym_graph` | N/A | Supplies a synonym list, including multiword synonyms, for the analysis process.
+[`synonym_graph`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/synonym-graph/) | N/A | Supplies a synonym list, including multiword synonyms, for the analysis process.
 `trim` | [TrimFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html) | Trims leading and trailing white space from each token in a stream. 
 `truncate` | [TruncateTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html) | Truncates tokens whose length exceeds the specified character limit. 
 `unique` | N/A | Ensures each token is unique by removing duplicate tokens from a stream. 

diff --git a/_analyzers/token-filters/synonym-graph.md b/_analyzers/token-filters/synonym-graph.md
@@ -0,0 +1,180 @@
+---
+layout: default
+title: Synonym graph
+parent: Token filters
+nav_order: 420
+---
+
+# Synonym graph token filter
+
+The `synonym_graph` token filter is a more advanced version of the `synonym` token filter. It supports multiword synonyms and processes synonyms across multiple tokens, making it ideal for phrases or scenarios in which relationships between tokens are important.
+
+## Parameters
+
+The `synonym_graph` token filter can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :--- 
+`synonyms` | Either `synonyms` or `synonyms_path` must be specified | Array of strings | A list of synonym rules defined directly in the configuration.
+`synonyms_path` | Either `synonyms` or `synonyms_path` must be specified | String | The file path to a file containing synonym rules (either an absolute path or a path relative to the config directory).
+`lenient` | Optional | Boolean | Whether to ignore exceptions when loading the rule configurations. Default is `false`.
+`format` | Optional | String | Specifies the format used to determine how OpenSearch defines and interprets synonyms. Valid values are:<br>- `solr` <br>- [`wordnet`](https://wordnet.princeton.edu/). <br> Default is `solr`.
+`expand` | Optional | Boolean | Whether to expand equivalent synonym rules. Default is `true`.<br><br>For example: <br>If `synonyms` are defined as `"quick, fast"` and `expand` is set to `true`, then the synonym rules are configured as follows:<br>- `quick => quick`<br>- `quick => fast`<br>- `fast => quick`<br>- `fast => fast`<br><br>If `expand` is set to `false`, the synonym rules are configured as follows:<br>- `quick => quick`<br>- `fast => quick`
+
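The effect of `expand` on an equivalence rule such as `"quick, fast"` can be modeled in a few lines of Python. This is an illustrative sketch only, not the OpenSearch implementation: the helper name `expand_rule` is hypothetical, and it handles only comma-separated equivalence rules, not explicit `=>` mappings.

```python
from typing import List, Tuple


def expand_rule(rule: str, expand: bool) -> List[Tuple[str, str]]:
    """Model how one comma-separated rule becomes (input, output) mappings."""
    terms = [t.strip() for t in rule.split(",") if t.strip()]
    if not terms:
        return []
    if expand:
        # Every term maps to every term in the rule, including itself.
        return [(src, dst) for src in terms for dst in terms]
    # Without expansion, every term maps to the first term only.
    return [(src, terms[0]) for src in terms]


if __name__ == "__main__":
    for flag in (True, False):
        print(f"expand={flag}")
        for src, dst in expand_rule("quick, fast", expand=flag):
            print(f"  {src} => {dst}")
```

Running the script prints the same two rule sets listed in the `expand` row of the parameter table.
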
+## Example: Solr format
+
+The following example request creates a new index named `my-index` and configures an analyzer with a `synonym_graph` filter. The filter is configured with the default `solr` rule format:
+
+```json
+PUT /my-index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "my_synonym_graph_filter": {
+          "type": "synonym_graph",
+          "synonyms": [
+            "sports car, race car",
+            "fast car, speedy vehicle",
+            "luxury car, premium vehicle",
+            "electric car, EV"
+          ]
+        }
+      },
+      "analyzer": {
+        "my_synonym_graph_analyzer": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "my_synonym_graph_filter"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /my-index/_analyze
+{
+  "analyzer": "my_synonym_graph_analyzer",
+  "text": "I just bought a sports car and it is a fast car."
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+  "tokens": [
+    {"token": "i","start_offset": 0,"end_offset": 1,"type": "<ALPHANUM>","position": 0},
+    {"token": "just","start_offset": 2,"end_offset": 6,"type": "<ALPHANUM>","position": 1},
+    {"token": "bought","start_offset": 7,"end_offset": 13,"type": "<ALPHANUM>","position": 2},
+    {"token": "a","start_offset": 14,"end_offset": 15,"type": "<ALPHANUM>","position": 3},
+    {"token": "race","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 4},
+    {"token": "sports","start_offset": 16,"end_offset": 22,"type": "<ALPHANUM>","position": 4,"positionLength": 2},
+    {"token": "car","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 5,"positionLength": 2},
+    {"token": "car","start_offset": 23,"end_offset": 26,"type": "<ALPHANUM>","position": 6},
+    {"token": "and","start_offset": 27,"end_offset": 30,"type": "<ALPHANUM>","position": 7},
+    {"token": "it","start_offset": 31,"end_offset": 33,"type": "<ALPHANUM>","position": 8},
+    {"token": "is","start_offset": 34,"end_offset": 36,"type": "<ALPHANUM>","position": 9},
+    {"token": "a","start_offset": 37,"end_offset": 38,"type": "<ALPHANUM>","position": 10},
+    {"token": "speedy","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 11},
+    {"token": "fast","start_offset": 39,"end_offset": 43,"type": "<ALPHANUM>","position": 11,"positionLength": 2},
+    {"token": "vehicle","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 12,"positionLength": 2},
+    {"token": "car","start_offset": 44,"end_offset": 47,"type": "<ALPHANUM>","position": 13}
+  ]
+}
+```
+
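One way to read the response above is as a graph: each token is an edge that starts at `position` and ends at `position + positionLength`, where a missing `positionLength` is taken to be `1`. The following Python sketch walks that graph for the tokens covering `sports car` (positions 4 through 7). The function name `enumerate_paths` and the tuple layout are assumptions made for illustration and are not part of any OpenSearch client.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Subset of the response above: (token, position, positionLength).
# positionLength is 1 where the response omits it.
TOKENS = [
    ("race", 4, 1),
    ("sports", 4, 2),
    ("car", 5, 2),
    ("car", 6, 1),
]


def enumerate_paths(tokens: List[Tuple[str, int, int]], start: int, end: int) -> List[str]:
    """Return every phrase spelled by a path from position `start` to `end`."""
    # Map each start position to the (token, end position) edges leaving it.
    edges: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
    for text, pos, length in tokens:
        edges[pos].append((text, pos + length))

    paths: List[str] = []

    def walk(pos: int, words: List[str]) -> None:
        if pos == end:
            paths.append(" ".join(words))
            return
        for text, nxt in edges.get(pos, []):
            walk(nxt, words + [text])

    walk(start, [])
    return paths


if __name__ == "__main__":
    print(enumerate_paths(TOKENS, start=4, end=7))  # ['race car', 'sports car']
```

The two paths correspond to the two sides of the `"sports car, race car"` rule, which is how the graph keeps multiword synonyms aligned as alternatives rather than as a flat list of tokens.
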
+## Example: WordNet format
+
+The following example request creates a new index named `my-wordnet-index` and configures an analyzer with a `synonym_graph` filter. The filter is configured with the [`wordnet`](https://wordnet.princeton.edu/) rule format:
+
+```json
+PUT /my-wordnet-index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "my_synonym_graph_filter": {
+          "type": "synonym_graph",
+          "format": "wordnet",
+          "synonyms": [
+            "s(100000001, 1, 'sports car', n, 1, 0).",
+            "s(100000001, 2, 'race car', n, 1, 0).",
+            "s(100000001, 3, 'fast car', n, 1, 0).",
+            "s(100000001, 4, 'speedy vehicle', n, 1, 0)."
+          ]
+        }
+      },
+      "analyzer": {
+        "my_synonym_graph_analyzer": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "my_synonym_graph_filter"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
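In the WordNet Prolog format, each `s(...)` line lists a synset ID first, then a word number, then the quoted word. Lines that share a synset ID belong to the same synonym set. The following Python sketch groups the four rules from the request above by synset ID and prints the comparable Solr-style rule. The regular expression and the helper name `group_by_synset` are assumptions for illustration; the sketch does not handle escaped quotes inside words and is not the parser that OpenSearch uses.

```python
import re
from typing import Dict, List

# s(synset_id, w_num, 'word', ss_type, sense_number, tag_count).
WORDNET_LINE = re.compile(
    r"s\((\d+),\s*(\d+),\s*'([^']*)',\s*(\w+),\s*(\d+),\s*(\d+)\)\."
)

RULES = [
    "s(100000001, 1, 'sports car', n, 1, 0).",
    "s(100000001, 2, 'race car', n, 1, 0).",
    "s(100000001, 3, 'fast car', n, 1, 0).",
    "s(100000001, 4, 'speedy vehicle', n, 1, 0).",
]


def group_by_synset(lines: List[str]) -> Dict[str, List[str]]:
    """Group the quoted words by synset ID, preserving input order."""
    groups: Dict[str, List[str]] = {}
    for line in lines:
        match = WORDNET_LINE.fullmatch(line.strip())
        if match is None:
            raise ValueError(f"not a WordNet s() line: {line!r}")
        synset_id, word = match.group(1), match.group(3)
        groups.setdefault(synset_id, []).append(word)
    return groups


if __name__ == "__main__":
    for synset_id, words in group_by_synset(RULES).items():
        print(synset_id, "->", ", ".join(words))
```

Because all four lines share the synset ID `100000001`, they form a single group, so `sports car`, `race car`, `fast car`, and `speedy vehicle` are all treated as synonyms of one another.
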
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /my-wordnet-index/_analyze
+{
+  "analyzer": "my_synonym_graph_analyzer",
+  "text": "I just bought a sports car and it is a fast car."
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+  "tokens": [
+    {"token": "i","start_offset": 0,"end_offset": 1,"type": "<ALPHANUM>","position": 0},
+    {"token": "just","start_offset": 2,"end_offset": 6,"type": "<ALPHANUM>","position": 1},
+    {"token": "bought","start_offset": 7,"end_offset": 13,"type": "<ALPHANUM>","position": 2},
+    {"token": "a","start_offset": 14,"end_offset": 15,"type": "<ALPHANUM>","position": 3},
+    {"token": "race","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 4},
+    {"token": "fast","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 4,"positionLength": 2},
+    {"token": "speedy","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 4,"positionLength": 3},
+    {"token": "sports","start_offset": 16,"end_offset": 22,"type": "<ALPHANUM>","position": 4,"positionLength": 4},
+    {"token": "car","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 5,"positionLength": 4},
+    {"token": "car","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 6,"positionLength": 3},
+    {"token": "vehicle","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 7,"positionLength": 2},
+    {"token": "car","start_offset": 23,"end_offset": 26,"type": "<ALPHANUM>","position": 8},
+    {"token": "and","start_offset": 27,"end_offset": 30,"type": "<ALPHANUM>","position": 9},
+    {"token": "it","start_offset": 31,"end_offset": 33,"type": "<ALPHANUM>","position": 10},
+    {"token": "is","start_offset": 34,"end_offset": 36,"type": "<ALPHANUM>","position": 11},
+    {"token": "a","start_offset": 37,"end_offset": 38,"type": "<ALPHANUM>","position": 12},
+    {"token": "sports","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 13},
+    {"token": "race","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 13,"positionLength": 2},
+    {"token": "speedy","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 13,"positionLength": 3},
+    {"token": "fast","start_offset": 39,"end_offset": 43,"type": "<ALPHANUM>","position": 13,"positionLength": 4},
+    {"token": "car","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 14,"positionLength": 4},
+    {"token": "car","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 15,"positionLength": 3},
+    {"token": "vehicle","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 16,"positionLength": 2},
+    {"token": "car","start_offset": 44,"end_offset": 47,"type": "<ALPHANUM>","position": 17}
+  ]
+}
+```