[Backport 2.18] add whitespace analyzer docs #8916

Merged 1 commit on Dec 10, 2024
86 changes: 86 additions & 0 deletions _analyzers/whitespace.md
---
layout: default
title: Whitespace analyzer
nav_order: 60
---

# Whitespace analyzer


The `whitespace` analyzer breaks text into tokens based only on white space characters (for example, spaces and tabs). It does not apply any transformations, such as converting text to lowercase or removing stop words, so the original case of the text is retained and punctuation is included as part of the tokens.
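You can observe this behavior directly by calling the `_analyze` API with the built-in `whitespace` analyzer (no index required); the sample text here is illustrative:

```json
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The QUICK brown fox! 42"
}
```
{% include copy-curl.html %}

The returned tokens are `The`, `QUICK`, `brown`, `fox!`, and `42`: the original case is preserved and the exclamation point remains attached to its token.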


## Example

Use the following command to create an index named `my_whitespace_index` with a `whitespace` analyzer:

```json
PUT /my_whitespace_index
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "whitespace"
      }
    }
  }
}
```
{% include copy-curl.html %}

## Configuring a custom analyzer

Use the following command to configure an index with a custom analyzer that is equivalent to a `whitespace` analyzer with an added `lowercase` token filter:

```json
PUT /my_custom_whitespace_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_custom_whitespace_analyzer"
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated by the analyzer:

```json
POST /my_custom_whitespace_index/_analyze
{
  "analyzer": "my_custom_whitespace_analyzer",
  "text": "The SLOW turtle swims away! 123"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "slow", "start_offset": 4, "end_offset": 8, "type": "word", "position": 1 },
    { "token": "turtle", "start_offset": 9, "end_offset": 15, "type": "word", "position": 2 },
    { "token": "swims", "start_offset": 16, "end_offset": 21, "type": "word", "position": 3 },
    { "token": "away!", "start_offset": 22, "end_offset": 27, "type": "word", "position": 4 },
    { "token": "123", "start_offset": 28, "end_offset": 31, "type": "word", "position": 5 }
  ]
}
```