[DOC] Document delimited term frequency token filter #4986

Closed
1 of 4 tasks
noCharger opened this issue Sep 7, 2023 · 9 comments · Fixed by #5043
Comments

@noCharger
Contributor

noCharger commented Sep 7, 2023

What do you want to do?

  • Request a change to existing documentation
  • Add new documentation
  • Report a technical problem with the documentation
  • Other

Tell us about your request. Provide a summary of the request and all versions that are affected.

Starting in version 2.10, OpenSearch supports a new token filter, delimited_term_freq. Let's add a reference covering all existing token filters and link it from pages such as https://opensearch.org/docs/latest/api-reference/analyze-apis/terminology/#token-filters and https://opensearch.org/docs/latest/analyzers/index/.
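
For context, the filter splits each token on a configurable delimiter and indexes the text before the delimiter as the term and the number after it as that term's frequency (for example, v1^30 indexes the term v1 with a frequency of 30). A minimal settings sketch, trimmed down from the full example later in this thread:

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_delimited_term_freq": {
          "type": "delimited_term_freq",
          "delimiter": "^"
        }
      }
    }
  }
}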

What other resources are available? Provide links to related issues, POCs, steps for testing, etc.
Feature request opensearch-project/OpenSearch#9413
PR opensearch-project/OpenSearch#9479

@noCharger noCharger moved this from 🆕 New to Now(This Quarter) in Search Project Board Sep 7, 2023
@kolchfa-aws kolchfa-aws self-assigned this Sep 7, 2023
@kolchfa-aws
Collaborator

We'll document this once we add at least a skeleton section for token filters.

@kolchfa-aws kolchfa-aws added the 1 - Backlog Issue: The issue is unassigned or assigned but not started label Sep 11, 2023
@macohen
Contributor

macohen commented Sep 14, 2023

What do you think of putting this under https://opensearch.org/docs/latest/analyzers/index/? I think a new section under Text Analysis called Token Filters would do it. I would prefer to start something there, even with just this one token filter, rather than document all of them now, and then improve the docs over time.

@macohen
Contributor

macohen commented Sep 14, 2023

Ran "find . -name "*.java" -exec grep "AbstractTokenFilterFactory" {} ; -print" on the OpenSearch repo and generated the output in this gist: https://gist.github.com/macohen/9f335a741677fac2e916cf980f8019fe. Probably could edit that down to a list of new issues to chip away.

@kolchfa-aws
Collaborator

Let's keep this issue scoped to just the new 2.10 delimited_term_freq token filter. We already have issues for documenting all token filters: #790 and #1483.

@kolchfa-aws kolchfa-aws changed the title [DOC] Document all token filter refs [DOC] Document delimited term frequency token filter Sep 14, 2023
@macohen
Contributor

macohen commented Sep 14, 2023

Sounds good to me. Thanks. I linked to the gist in #790 as well if it helps.

@hdhalter hdhalter added this to the v2.10 milestone Sep 14, 2023
@kolchfa-aws
Collaborator

Thank you, @macohen!

@msfroh msfroh moved this from Now(This Quarter) to 🏗 In progress in Search Project Board Sep 14, 2023
@noCharger
Contributor Author

noCharger commented Sep 15, 2023

Example requests

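# Create an index with a custom delimited_term_freq token filter (delimiter ^):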
PUT /test
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "my_delimited_term_freq": {
          "type": "delimited_term_freq",
          "delimiter": "^"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "f1": {
        "type": "keyword"
      },
      "f2": {
        "type": "text"
      }
    }
  }
}

# Analyze Text with Custom Token Filter:
POST /test/_analyze
{
  "text": "foo^3",
  "tokenizer": "keyword",
  "filter": ["my_delimited_term_freq"],
  "attributes": ["termFrequency"],
  "explain": true
}

# Analyze Text with Pre-configured Token Filter:
POST /_analyze
{
  "text": "foo|100",
  "tokenizer": "keyword",
  "filter": ["delimited_term_freq"],
  "attributes": ["termFrequency"],
  "explain": true
}
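
In both requests, explain: true returns per-stage token details, and the attributes parameter limits the extra token attributes in that output to termFrequency.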

@noCharger
Contributor Author

Response examples. The first response is from the custom my_delimited_term_freq filter (delimiter ^); the second is from the pre-configured delimited_term_freq filter, which splits on the default | delimiter.

{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "keyword",
      "tokens": [
        {
          "token": "foo^3",
          "start_offset": 0,
          "end_offset": 5,
          "type": "word",
          "position": 0,
          "termFrequency": 1
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "my_delimited_term_freq",
        "tokens": [
          {
            "token": "foo",
            "start_offset": 0,
            "end_offset": 5,
            "type": "word",
            "position": 0,
            "termFrequency": 3
          }
        ]
      }
    ]
  }
}

{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "keyword",
      "tokens": [
        {
          "token": "foo|100",
          "start_offset": 0,
          "end_offset": 7,
          "type": "word",
          "position": 0,
          "termFrequency": 1
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "delimited_term_freq",
        "tokens": [
          {
            "token": "foo",
            "start_offset": 0,
            "end_offset": 7,
            "type": "word",
            "position": 0,
            "termFrequency": 100
          }
        ]
      }
    ]
  }
}

@kolchfa-aws kolchfa-aws added 2 - In progress Issue/PR: The issue or PR is in progress. and removed 1 - Backlog Issue: The issue is unassigned or assigned but not started labels Sep 15, 2023
@noCharger
Contributor Author

noCharger commented Sep 19, 2023

Adding an end-to-end example that combines the delimited term frequency token filter with script scoring:


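# Create an index whose f2 field uses a custom analyzer with the delimited_term_freq filter (index_options: freqs keeps term frequencies):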
PUT /test
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "tokenizer": {
        "keyword_tokenizer": {
          "type": "keyword"
        }
      },
      "filter": {
        "my_delimited_term_freq": {
          "type": "delimited_term_freq",
          "delimiter": "^"
        }
      },
      "analyzer": {
        "custom_delimited_analyzer": {
          "tokenizer": "keyword_tokenizer",
          "filter": ["my_delimited_term_freq"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "f1": {
        "type": "keyword"
      },
      "f2": {
        "type": "text",
        "similarity": "BM25",
        "analyzer": "custom_delimited_analyzer",
        "index_options": "freqs"
      }
    }
  }
}

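# Index two documents; in doc1, f2 encodes the term v1 with a frequency of 30: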
POST /_bulk?refresh=true
{"index": {"_index": "test", "_id": "doc1"}}
{"f1": "v0|100", "f2": "v1^30"}
{"index": {"_index": "test", "_id": "doc2"}}
{"f2": "v2|100"}

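# Use a script score to return the indexed term frequency of v1 in field f2 for each document: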
GET /test/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "script_score": {
        "script": {
          "source": "termFreq(params.field, params.term)",
          "params": {
            "field": "f2",
            "term": "v1"
          }
        }
      }
    }
  }
}

Example response

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 30,
    "hits": [
      {
        "_index": "test",
        "_id": "doc1",
        "_score": 30,
        "_source": {
          "f1": "v0|100",
          "f2": "v1^30"
        }
      },
      {
        "_index": "test",
        "_id": "doc2",
        "_score": 0,
        "_source": {
          "f2": "v2|100"
        }
      }
    ]
  }
}
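
In this response, doc1 scores 30 because the custom analyzer indexed v1 with the delimited term frequency of 30, while doc2 scores 0 because it does not contain the term v1 (its value v2|100 is not split, since the custom filter's delimiter is ^).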

@hdhalter hdhalter added 3 - Done Issue is done/complete and removed 2 - In progress Issue/PR: The issue or PR is in progress. labels Sep 20, 2023
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Search Project Board Sep 22, 2023