Merge branch 'main' into innerhit

opensearch-project · Dec 11, 2024 · 8ed6777
2 parents fe4894e + 23729b7
Showing 131 changed files with 6,319 additions and 2,061 deletions.
diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
@@ -1 +1 @@
-*  @kolchfa-aws @Naarcha-AWS @vagimeli @AMoo-Miki @natebower @dlvenable @stephen-crawford @epugh
+*  @kolchfa-aws @Naarcha-AWS @AMoo-Miki @natebower @dlvenable @epugh
diff --git a/.github/workflows/pr_checklist.yml b/.github/workflows/pr_checklist.yml
@@ -29,7 +29,7 @@ jobs:
         with:
           script: |
             let assignee = context.payload.pull_request.user.login;
-            const prOwners = ['Naarcha-AWS', 'kolchfa-aws', 'vagimeli', 'natebower'];
+            const prOwners = ['Naarcha-AWS', 'kolchfa-aws', 'natebower'];
             
             if (!prOwners.includes(assignee)) {
               assignee = 'kolchfa-aws'
@@ -40,4 +40,4 @@ jobs:
                 owner: context.repo.owner,
                 repo: context.repo.repo,
                 assignees: [assignee]
-              });
+              });
diff --git a/.ruby-version b/.ruby-version
diff --git a/MAINTAINERS.md b/MAINTAINERS.md
@@ -9,14 +9,14 @@ This document lists the maintainers in this repo. See [opensearch-project/.githu
 | Fanit Kolchina   | [kolchfa-aws](https://github.com/kolchfa-aws)   | Amazon      |
 | Nate Archer      | [Naarcha-AWS](https://github.com/Naarcha-AWS)   | Amazon      |
 | Nathan Bower     | [natebower](https://github.com/natebower)       | Amazon      |
-| Melissa Vagi     | [vagimeli](https://github.com/vagimeli)         | Amazon      |
 | Miki Barahmand   | [AMoo-Miki](https://github.com/AMoo-Miki)       | Amazon      |
 | David Venable    | [dlvenable](https://github.com/dlvenable)       | Amazon      | 
-| Stephen Crawford | [stephen-crawford](https://github.com/stephen-crawford) | Amazon      |
 | Eric Pugh        | [epugh](https://github.com/epugh)               | OpenSource Connections  | 
 
 ## Emeritus
 
-| Maintainer       | GitHub ID                                       | Affiliation |
-| ---------------- | ----------------------------------------------- | ----------- |
-| Heather Halter   | [hdhalter](https://github.com/hdhalter)         | Amazon      |
+| Maintainer       | GitHub ID                                               | Affiliation |
+| ---------------- | ------------------------------------------------------- | ----------- |
+| Heather Halter   | [hdhalter](https://github.com/hdhalter)                 | Amazon      |
+| Melissa Vagi     | [vagimeli](https://github.com/vagimeli)                 | Amazon      |
+| Stephen Crawford | [stephen-crawford](https://github.com/stephen-crawford) | Amazon      |
diff --git a/README.md b/README.md
@@ -24,7 +24,6 @@ If you encounter problems or have questions when contributing to the documentati
 
 - [kolchfa-aws](https://github.com/kolchfa-aws)
 - [Naarcha-AWS](https://github.com/Naarcha-AWS)
-- [vagimeli](https://github.com/vagimeli)
 
 
 ## Code of conduct

diff --git a/_about/version-history.md b/_about/version-history.md
@@ -34,6 +34,7 @@ OpenSearch version | Release highlights | Release date
 [2.0.1](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.0.1.md) | Includes bug fixes and maintenance updates for Alerting and Anomaly Detection. | 16 June 2022
 [2.0.0](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.0.0.md) | Includes document-level monitors for alerting, OpenSearch Notifications plugins, and Geo Map Tiles in OpenSearch Dashboards. Also adds support for Lucene 9 and bug fixes for all OpenSearch plugins. For a full list of release highlights, see the Release Notes. | 26 May 2022
 [2.0.0-rc1](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.0.0-rc1.md) | The Release Candidate for 2.0.0. This version allows you to preview the upcoming 2.0.0 release before the GA release. The preview release adds document-level alerting, support for Lucene 9, and the ability to use term lookup queries in document level security. | 03 May 2022
+[1.3.20](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.20.md) | Includes enhancements to Anomaly Detection Dashboards, bug fixes for Alerting and Dashboards Reports, and maintenance updates for several OpenSearch components. | 11 December 2024
 [1.3.19](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.19.md) | Includes bug fixes and maintenance updates for OpenSearch security, OpenSearch security Dashboards, and anomaly detection. | 27 August 2024
 [1.3.18](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.18.md) | Includes maintenance updates for OpenSearch security. | 16 July 2024
 [1.3.17](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.17.md) | Includes maintenance updates for OpenSearch security and OpenSearch Dashboards security. | 06 June 2024

diff --git a/_analyzers/custom-analyzer.md b/_analyzers/custom-analyzer.md
@@ -0,0 +1,312 @@
+---
+layout: default
+title: Creating a custom analyzer
+nav_order: 40
+parent: Analyzers
+---
+
+# Creating a custom analyzer
+
+To create a custom analyzer, specify a combination of the following components:
+
+- Character filters (zero or more)
+
+- Tokenizer (one)
+
+- Token filters (zero or more)
+
+## Configuration
+
+The following parameters can be used to configure a custom analyzer.
+
+| Parameter                | Required/Optional | Description  |
+|:--- | :--- | :--- |
+| `type`                   | Optional          | The analyzer type. Default is `custom`. You can also specify a prebuilt analyzer using this parameter.              |
+| `tokenizer`              | Required          | A tokenizer to be included in the analyzer. |
+| `char_filter`            | Optional          | A list of character filters to be included in the analyzer. |
+| `filter`                 | Optional          | A list of token filters to be included in the analyzer. |
+| `position_increment_gap` | Optional          | The extra spacing applied between values when indexing text fields that have multiple values. For more information, see [Position increment gap](#position-increment-gap). Default is `100`. |
+
+## Examples
+
+The following examples demonstrate various custom analyzer configurations.
+
+### Custom analyzer with a character filter for HTML stripping
+
+The following example analyzer removes HTML tags from text before tokenization:
+
+```json
+PUT simple_html_strip_analyzer_index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "html_strip_analyzer": {
+          "type": "custom",
+          "char_filter": ["html_strip"],
+          "tokenizer": "whitespace",
+          "filter": ["lowercase"]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET simple_html_strip_analyzer_index/_analyze
+{
+  "analyzer": "html_strip_analyzer",
+  "text": "<p>OpenSearch is <strong>awesome</strong>!</p>"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+  "tokens": [
+    {
+      "token": "opensearch",
+      "start_offset": 3,
+      "end_offset": 13,
+      "type": "word",
+      "position": 0
+    },
+    {
+      "token": "is",
+      "start_offset": 14,
+      "end_offset": 16,
+      "type": "word",
+      "position": 1
+    },
+    {
+      "token": "awesome!",
+      "start_offset": 25,
+      "end_offset": 42,
+      "type": "word",
+      "position": 2
+    }
+  ]
+}
+```
+
+### Custom analyzer with a mapping character filter for synonym replacement
+
+The following example analyzer uses a `mapping` character filter to replace underscores with spaces before tokenization and then applies lowercase, stop word, and synonym token filters:
+
+```json
+PUT mapping_analyzer_index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "synonym_mapping_analyzer": {
+          "type": "custom",
+          "char_filter": ["underscore_to_space"],
+          "tokenizer": "standard",
+          "filter": ["lowercase", "stop", "synonym_filter"]
+        }
+      },
+      "char_filter": {
+        "underscore_to_space": {
+          "type": "mapping",
+          "mappings": ["_ => ' '"]
+        }
+      },
+      "filter": {
+        "synonym_filter": {
+          "type": "synonym",
+          "synonyms": [
+            "quick, fast, speedy",
+            "big, large, huge"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET mapping_analyzer_index/_analyze
+{
+  "analyzer": "synonym_mapping_analyzer",
+  "text": "The slow_green_turtle is very large"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+  "tokens": [
+    {"token": "slow","start_offset": 4,"end_offset": 8,"type": "<ALPHANUM>","position": 1},
+    {"token": "green","start_offset": 9,"end_offset": 14,"type": "<ALPHANUM>","position": 2},
+    {"token": "turtle","start_offset": 15,"end_offset": 21,"type": "<ALPHANUM>","position": 3},
+    {"token": "very","start_offset": 25,"end_offset": 29,"type": "<ALPHANUM>","position": 5},
+    {"token": "large","start_offset": 30,"end_offset": 35,"type": "<ALPHANUM>","position": 6},
+    {"token": "big","start_offset": 30,"end_offset": 35,"type": "SYNONYM","position": 6},
+    {"token": "huge","start_offset": 30,"end_offset": 35,"type": "SYNONYM","position": 6}
+  ]
+}
+```
+
+### Custom analyzer with a custom pattern-based character filter for number normalization
+
+The following example analyzer normalizes phone numbers by removing dashes and spaces and applies edge n-grams to the normalized text to support partial matches:
+
+```json
+PUT advanced_pattern_replace_analyzer_index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "phone_number_analyzer": {
+          "type": "custom",
+          "char_filter": ["phone_normalization"],
+          "tokenizer": "standard",
+          "filter": ["lowercase", "edge_ngram"]
+        }
+      },
+      "char_filter": {
+        "phone_normalization": {
+          "type": "pattern_replace",
+          "pattern": "[-\\s]",
+          "replacement": ""
+        }
+      },
+      "filter": {
+        "edge_ngram": {
+          "type": "edge_ngram",
+          "min_gram": 3,
+          "max_gram": 10
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET advanced_pattern_replace_analyzer_index/_analyze
+{
+  "analyzer": "phone_number_analyzer",
+  "text": "123-456 7890"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+  "tokens": [
+    {"token": "123","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
+    {"token": "1234","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
+    {"token": "12345","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
+    {"token": "123456","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
+    {"token": "1234567","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
+    {"token": "12345678","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
+    {"token": "123456789","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
+    {"token": "1234567890","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0}
+  ]
+}
+```
+
+## Position increment gap
+
+The `position_increment_gap` parameter sets a positional gap between terms when indexing multi-valued fields, such as arrays. This gap ensures that phrase queries don't match terms across separate values unless explicitly allowed. For example, a default gap of 100 specifies that terms in different array entries are 100 positions apart, preventing unintended matches in phrase searches. You can adjust this value or set it to `0` in order to allow phrases to span across array values.
+
+The following example demonstrates the effect of `position_increment_gap` using a `match_phrase` query.
+
+1. Index a document in a `test-index`:
+
+    ```json
+    PUT test-index/_doc/1
+    {
+      "names": ["Slow green", "turtle swims"]
+    }
+    ```
+    {% include copy-curl.html %}
+
+1. Query the document using a `match_phrase` query:
+
+    ```json
+    GET test-index/_search
+    {
+      "query": {
+        "match_phrase": {
+          "names": {
+            "query": "green turtle" 
+          }
+        }
+      }
+    }
+    ```
+    {% include copy-curl.html %}
+
+    The response returns no hits because the distance between the terms `green` and `turtle` is `100` (the default `position_increment_gap`).
+
+1. Now query the document using a `match_phrase` query with a `slop` parameter that is higher than the `position_increment_gap`:
+
+    ```json
+    GET test-index/_search
+    {
+      "query": {
+        "match_phrase": {
+          "names": {
+            "query": "green turtle",
+            "slop": 101
+          }
+        }
+      }
+    }
+    ```
+    {% include copy-curl.html %}
+
+    The response contains the matching document:
+
+    ```json
+    {
+      "took": 4,
+      "timed_out": false,
+      "_shards": {
+        "total": 1,
+        "successful": 1,
+        "skipped": 0,
+        "failed": 0
+      },
+      "hits": {
+        "total": {
+          "value": 1,
+          "relation": "eq"
+        },
+        "max_score": 0.010358453,
+        "hits": [
+          {
+            "_index": "test-index",
+            "_id": "1",
+            "_score": 0.010358453,
+            "_source": {
+              "names": [
+                "Slow green",
+                "turtle swims"
+              ]
+            }
+          }
+        ]
+      }
+    }
+    ```
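
Note on the position increment gap walkthrough in `_analyzers/custom-analyzer.md`: the arithmetic behind the two `match_phrase` results can be sketched in a few lines of Python. This is a simplified model and not OpenSearch or Lucene code. The function names `assign_positions` and `phrase_matches` are hypothetical, tokenization is assumed to be a plain lowercase whitespace split, and the exact position bookkeeping (the gap is added once between consecutive array values) is an assumption made for illustration.

```python
def assign_positions(values, gap=100):
    """Return (token, position) pairs for a multi-valued text field.

    Assumption: tokens within one value get consecutive positions, and
    `gap` extra positions are skipped between consecutive values.
    """
    positions = []
    position = -1  # so that the first token lands at position 0
    for i, value in enumerate(values):
        if i > 0:
            position += gap  # the position_increment_gap between array entries
        for token in value.lower().split():
            position += 1
            positions.append((token, position))
    return positions


def phrase_matches(positions, first, second, slop=0):
    """Simplified in-order, two-term phrase check.

    The phrase matches when `second` follows `first` with at most `slop`
    extra positions between them.
    """
    firsts = [p for t, p in positions if t == first]
    seconds = [p for t, p in positions if t == second]
    return any(0 <= s - f - 1 <= slop for f in firsts for s in seconds)


names = ["Slow green", "turtle swims"]
default_gap = assign_positions(names)          # gap of 100 between values
no_gap = assign_positions(names, gap=0)        # values treated as adjacent

print(default_gap)
print(phrase_matches(default_gap, "green", "turtle"))            # no slop
print(phrase_matches(default_gap, "green", "turtle", slop=101))  # slop above the gap
print(phrase_matches(no_gap, "green", "turtle"))                 # gap set to 0
```

Under these assumptions, `green` and `turtle` end up roughly 100 positions apart with the default gap, so the phrase only matches once `slop` exceeds the gap or the gap is set to `0`, which mirrors the behavior described in the documentation steps.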
diff --git a/_analyzers/index.md b/_analyzers/index.md
@@ -51,7 +51,7 @@ For a list of supported analyzers, see [Analyzers]({{site.url}}{{site.baseurl}}/
 
 ## Custom analyzers
 
-If needed, you can combine tokenizers, token filters, and character filters to create a custom analyzer.
+If needed, you can combine tokenizers, token filters, and character filters to create a custom analyzer. For more information, see [Creating a custom analyzer]({{site.url}}{{site.baseurl}}/analyzers/custom-analyzer/).
 
 ## Text analysis at indexing time and query time
 

diff --git a/_analyzers/language-analyzers/index.md b/_analyzers/language-analyzers/index.md
@@ -1,7 +1,7 @@
 ---
 layout: default
 title: Language analyzers
-nav_order: 100
+nav_order: 140
 parent: Analyzers
 has_children: true
 has_toc: true

diff --git a/_analyzers/normalizers.md b/_analyzers/normalizers.md
@@ -1,7 +1,7 @@
 ---
 layout: default
 title: Normalizers
-nav_order: 100
+nav_order: 110
 ---
 
 # Normalizers