
Commit
Merge branch 'main' into innerhit
heemin32 authored Dec 11, 2024
2 parents fe4894e + 23729b7 commit 8ed6777
Showing 131 changed files with 6,319 additions and 2,061 deletions.
2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -1 +1 @@
- * @kolchfa-aws @Naarcha-AWS @vagimeli @AMoo-Miki @natebower @dlvenable @stephen-crawford @epugh
+ * @kolchfa-aws @Naarcha-AWS @AMoo-Miki @natebower @dlvenable @epugh
4 changes: 2 additions & 2 deletions .github/workflows/pr_checklist.yml
@@ -29,7 +29,7 @@ jobs:
with:
script: |
let assignee = context.payload.pull_request.user.login;
- const prOwners = ['Naarcha-AWS', 'kolchfa-aws', 'vagimeli', 'natebower'];
+ const prOwners = ['Naarcha-AWS', 'kolchfa-aws', 'natebower'];
if (!prOwners.includes(assignee)) {
assignee = 'kolchfa-aws'
@@ -40,4 +40,4 @@ jobs:
owner: context.repo.owner,
repo: context.repo.repo,
assignees: [assignee]
- });
+ });
1 change: 0 additions & 1 deletion .ruby-version

This file was deleted.

10 changes: 5 additions & 5 deletions MAINTAINERS.md
@@ -9,14 +9,14 @@ This document lists the maintainers in this repo. See [opensearch-project/.githu
| Fanit Kolchina | [kolchfa-aws](https://github.com/kolchfa-aws) | Amazon |
| Nate Archer | [Naarcha-AWS](https://github.com/Naarcha-AWS) | Amazon |
| Nathan Bower | [natebower](https://github.com/natebower) | Amazon |
- | Melissa Vagi | [vagimeli](https://github.com/vagimeli) | Amazon |
| Miki Barahmand | [AMoo-Miki](https://github.com/AMoo-Miki) | Amazon |
| David Venable | [dlvenable](https://github.com/dlvenable) | Amazon |
- | Stephen Crawford | [stephen-crawford](https://github.com/stephen-crawford) | Amazon |
| Eric Pugh | [epugh](https://github.com/epugh) | OpenSource Connections |

## Emeritus

- | Maintainer | GitHub ID | Affiliation |
- | ---------------- | ----------------------------------------------- | ----------- |
- | Heather Halter | [hdhalter](https://github.com/hdhalter) | Amazon |
+ | Maintainer | GitHub ID | Affiliation |
+ | ---------------- | ------------------------------------------------------- | ----------- |
+ | Heather Halter | [hdhalter](https://github.com/hdhalter) | Amazon |
+ | Melissa Vagi | [vagimeli](https://github.com/vagimeli) | Amazon |
+ | Stephen Crawford | [stephen-crawford](https://github.com/stephen-crawford) | Amazon |
1 change: 0 additions & 1 deletion README.md
@@ -24,7 +24,6 @@ If you encounter problems or have questions when contributing to the documentati

- [kolchfa-aws](https://github.com/kolchfa-aws)
- [Naarcha-AWS](https://github.com/Naarcha-AWS)
- - [vagimeli](https://github.com/vagimeli)


## Code of conduct
1 change: 1 addition & 0 deletions _about/version-history.md
@@ -34,6 +34,7 @@ OpenSearch version | Release highlights | Release date
[2.0.1](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.0.1.md) | Includes bug fixes and maintenance updates for Alerting and Anomaly Detection. | 16 June 2022
[2.0.0](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.0.0.md) | Includes document-level monitors for alerting, OpenSearch Notifications plugins, and Geo Map Tiles in OpenSearch Dashboards. Also adds support for Lucene 9 and bug fixes for all OpenSearch plugins. For a full list of release highlights, see the Release Notes. | 26 May 2022
[2.0.0-rc1](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.0.0-rc1.md) | The Release Candidate for 2.0.0. This version allows you to preview the upcoming 2.0.0 release before the GA release. The preview release adds document-level alerting, support for Lucene 9, and the ability to use term lookup queries in document level security. | 03 May 2022
+ [1.3.20](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.20.md) | Includes enhancements to Anomaly Detection Dashboards, bug fixes for Alerting and Dashboards Reports, and maintenance updates for several OpenSearch components. | 11 December 2024
[1.3.19](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.19.md) | Includes bug fixes and maintenance updates for OpenSearch security, OpenSearch security Dashboards, and anomaly detection. | 27 August 2024
[1.3.18](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.18.md) | Includes maintenance updates for OpenSearch security. | 16 July 2024
[1.3.17](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.17.md) | Includes maintenance updates for OpenSearch security and OpenSearch Dashboards security. | 06 June 2024
312 changes: 312 additions & 0 deletions _analyzers/custom-analyzer.md
@@ -0,0 +1,312 @@
---
layout: default
title: Creating a custom analyzer
nav_order: 40
parent: Analyzers
---

# Creating a custom analyzer

To create a custom analyzer, specify a combination of the following components:

- Character filters (zero or more)
- Tokenizer (one)
- Token filters (zero or more)
## Configuration

The following parameters can be used to configure a custom analyzer.

| Parameter | Required/Optional | Description |
|:--- | :--- | :--- |
| `type` | Optional | The analyzer type. Default is `custom`. You can also specify a prebuilt analyzer using this parameter. |
| `tokenizer` | Required | A tokenizer to be included in the analyzer. |
| `char_filter` | Optional | A list of character filters to be included in the analyzer. |
| `filter` | Optional | A list of token filters to be included in the analyzer. |
| `position_increment_gap` | Optional | The extra spacing applied between values when indexing text fields that have multiple values. For more information, see [Position increment gap](#position-increment-gap). Default is `100`. |
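
As a quick illustration of how these parameters fit together, the following minimal sketch defines a custom analyzer and applies it to a text field. The index name (`my_custom_analyzer_index`), analyzer name (`my_custom_analyzer`), and field name (`content`) are illustrative:

```json
PUT my_custom_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"],
          "position_increment_gap": 100
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
```
{% include copy-curl.html %}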

## Examples

The following examples demonstrate various custom analyzer configurations.

### Custom analyzer with a character filter for HTML stripping

The following example analyzer removes HTML tags from text before tokenization:

```json
PUT simple_html_strip_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_strip_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Use the following request to examine the tokens generated using the analyzer:

```json
GET simple_html_strip_analyzer_index/_analyze
{
  "analyzer": "html_strip_analyzer",
  "text": "<p>OpenSearch is <strong>awesome</strong>!</p>"
}
```
{% include copy-curl.html %}

The response contains the generated tokens. Note that the offsets refer to positions in the original HTML input, which is why the token `awesome!` has an end offset of `42`:

```json
{
  "tokens": [
    {
      "token": "opensearch",
      "start_offset": 3,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 14,
      "end_offset": 16,
      "type": "word",
      "position": 1
    },
    {
      "token": "awesome!",
      "start_offset": 25,
      "end_offset": 42,
      "type": "word",
      "position": 2
    }
  ]
}
```

### Custom analyzer with a mapping character filter for synonym replacement

The following example analyzer replaces specific characters and patterns before applying the synonym filter:

```json
PUT mapping_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "synonym_mapping_analyzer": {
          "type": "custom",
          "char_filter": ["underscore_to_space"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "synonym_filter"]
        }
      },
      "char_filter": {
        "underscore_to_space": {
          "type": "mapping",
          "mappings": ["_ => ' '"]
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "quick, fast, speedy",
            "big, large, huge"
          ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Use the following request to examine the tokens generated using the analyzer:

```json
GET mapping_analyzer_index/_analyze
{
  "analyzer": "synonym_mapping_analyzer",
  "text": "The slow_green_turtle is very large"
}
```
{% include copy-curl.html %}

The response contains the generated tokens. Note that `The` and `is` are removed as stopwords by the `stop` filter, which is why positions `0` and `4` do not appear:

```json
{
"tokens": [
{"token": "slow","start_offset": 4,"end_offset": 8,"type": "<ALPHANUM>","position": 1},
{"token": "green","start_offset": 9,"end_offset": 14,"type": "<ALPHANUM>","position": 2},
{"token": "turtle","start_offset": 15,"end_offset": 21,"type": "<ALPHANUM>","position": 3},
{"token": "very","start_offset": 25,"end_offset": 29,"type": "<ALPHANUM>","position": 5},
{"token": "large","start_offset": 30,"end_offset": 35,"type": "<ALPHANUM>","position": 6},
{"token": "big","start_offset": 30,"end_offset": 35,"type": "SYNONYM","position": 6},
{"token": "huge","start_offset": 30,"end_offset": 35,"type": "SYNONYM","position": 6}
]
}
```

### Custom analyzer with a custom pattern-based character filter for number normalization

The following example analyzer normalizes phone numbers by removing dashes and spaces and applies edge n-grams to the normalized text to support partial matches:

```json
PUT advanced_pattern_replace_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "phone_number_analyzer": {
          "type": "custom",
          "char_filter": ["phone_normalization"],
          "tokenizer": "standard",
          "filter": ["lowercase", "edge_ngram"]
        }
      },
      "char_filter": {
        "phone_normalization": {
          "type": "pattern_replace",
          "pattern": "[-\\s]",
          "replacement": ""
        }
      },
      "filter": {
        "edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Use the following request to examine the tokens generated using the analyzer:

```json
GET advanced_pattern_replace_analyzer_index/_analyze
{
  "analyzer": "phone_number_analyzer",
  "text": "123-456 7890"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
"tokens": [
{"token": "123","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
{"token": "1234","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
{"token": "12345","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
{"token": "123456","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
{"token": "1234567","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
{"token": "12345678","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
{"token": "123456789","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
{"token": "1234567890","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0}
]
}
```
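
Because the edge n-grams are generated at indexing time, a text field that uses this analyzer can match partial phone numbers typed as prefixes. The following sketch shows one way to set this up; the `phones` index, the `phone` field, and the separate search analyzer (which normalizes the query but skips the edge n-gram filter so that query terms aren't expanded) are all illustrative, not part of the preceding example:

```json
PUT phones
{
  "settings": {
    "analysis": {
      "analyzer": {
        "phone_number_analyzer": {
          "type": "custom",
          "char_filter": ["phone_normalization"],
          "tokenizer": "standard",
          "filter": ["lowercase", "edge_ngram"]
        },
        "phone_number_search_analyzer": {
          "type": "custom",
          "char_filter": ["phone_normalization"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      },
      "char_filter": {
        "phone_normalization": {
          "type": "pattern_replace",
          "pattern": "[-\\s]",
          "replacement": ""
        }
      },
      "filter": {
        "edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "phone": {
        "type": "text",
        "analyzer": "phone_number_analyzer",
        "search_analyzer": "phone_number_search_analyzer"
      }
    }
  }
}
```
{% include copy-curl.html %}

With this mapping, a document containing `123-456 7890` matches a `match` query for a prefix such as `123-456`: the query is normalized to `123456`, which was indexed as one of the edge n-grams.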

## Position increment gap

The `position_increment_gap` parameter sets a positional gap between terms when indexing multi-valued fields, such as arrays. This gap ensures that phrase queries don't match terms across separate values unless explicitly allowed. For example, the default gap of `100` specifies that terms in different array entries are 100 positions apart, preventing unintended matches in phrase searches. You can adjust this value, or set it to `0` to allow phrases to span array values (see the sketch at the end of this section).

The following example demonstrates the effect of `position_increment_gap` using a `match_phrase` query.

1. Index a document into an index named `test-index`:

```json
PUT test-index/_doc/1
{
  "names": [ "Slow green", "turtle swims" ]
}
```
{% include copy-curl.html %}

1. Query the document using a `match_phrase` query:

```json
GET test-index/_search
{
  "query": {
    "match_phrase": {
      "names": {
        "query": "green turtle"
      }
    }
  }
}
```
{% include copy-curl.html %}

The response returns no hits because the distance between the terms `green` and `turtle` is `100` (the default `position_increment_gap`).

1. Now query the document using a `match_phrase` query with a `slop` parameter that is higher than the `position_increment_gap`:

```json
GET test-index/_search
{
  "query": {
    "match_phrase": {
      "names": {
        "query": "green turtle",
        "slop": 101
      }
    }
  }
}
```
{% include copy-curl.html %}

The response contains the matching document:

```json
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.010358453,
    "hits": [
      {
        "_index": "test-index",
        "_id": "1",
        "_score": 0.010358453,
        "_source": {
          "names": [
            "Slow green",
            "turtle swims"
          ]
        }
      }
    ]
  }
}
```
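
To instead allow phrases to match across array values, you can set `position_increment_gap` to `0` in a custom analyzer. The following minimal sketch is illustrative; the `test-index-no-gap` index and `no_gap_analyzer` names are not part of the example above:

```json
PUT test-index-no-gap
{
  "settings": {
    "analysis": {
      "analyzer": {
        "no_gap_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"],
          "position_increment_gap": 0
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "names": {
        "type": "text",
        "analyzer": "no_gap_analyzer"
      }
    }
  }
}
```
{% include copy-curl.html %}

With this setting, `turtle` immediately follows `green` in the token stream, so the `match_phrase` query for `green turtle` matches without any `slop`.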
2 changes: 1 addition & 1 deletion _analyzers/index.md
@@ -51,7 +51,7 @@ For a list of supported analyzers, see [Analyzers]({{site.url}}{{site.baseurl}}/

## Custom analyzers

- If needed, you can combine tokenizers, token filters, and character filters to create a custom analyzer.
+ If needed, you can combine tokenizers, token filters, and character filters to create a custom analyzer. For more information, see [Creating a custom analyzer]({{site.url}}{{site.baseurl}}/analyzers/custom-analyzer/).

## Text analysis at indexing time and query time

2 changes: 1 addition & 1 deletion _analyzers/language-analyzers/index.md
@@ -1,7 +1,7 @@
---
layout: default
title: Language analyzers
- nav_order: 100
+ nav_order: 140
parent: Analyzers
has_children: true
has_toc: true
2 changes: 1 addition & 1 deletion _analyzers/normalizers.md
@@ -1,7 +1,7 @@
---
layout: default
title: Normalizers
- nav_order: 100
+ nav_order: 110
---

# Normalizers