[DOC] Document delimited term frequency token filter #4986

Closed
1 of 4 tasks
noCharger opened this issue Sep 7, 2023 · 9 comments · Fixed by #5043
Comments

@noCharger
Contributor

noCharger commented Sep 7, 2023

What do you want to do?

  • Request a change to existing documentation
  • Add new documentation
  • Report a technical problem with the documentation
  • Other

Tell us about your request. Provide a summary of the request and all versions that are affected.

Starting in version 2.10, OpenSearch supports a new token filter, delimited_term_freq. Let's add a reference covering all existing token filters and link it from pages such as https://opensearch.org/docs/latest/api-reference/analyze-apis/terminology/#token-filters and https://opensearch.org/docs/latest/analyzers/index/.
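
For context, the filter splits each token on a configurable delimiter and indexes the text before the delimiter as the term and the number after it as that term's frequency (for example, v1^30 indexes the term v1 with a frequency of 30). A minimal settings sketch, trimmed down from the full example later in this thread:

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_delimited_term_freq": {
          "type": "delimited_term_freq",
          "delimiter": "^"
        }
      }
    }
  }
}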

What other resources are available? Provide links to related issues, POCs, steps for testing, etc.
Feature request opensearch-project/OpenSearch#9413
PR opensearch-project/OpenSearch#9479

@noCharger noCharger moved this from 🆕 New to Now(This Quarter) in Search Project Board Sep 7, 2023
@kolchfa-aws kolchfa-aws self-assigned this Sep 7, 2023
@kolchfa-aws
Collaborator

We'll document this once we add at least a skeleton section for token filters.

@kolchfa-aws kolchfa-aws added the 1 - Backlog Issue: The issue is unassigned or assigned but not started label Sep 11, 2023
@macohen
Contributor

macohen commented Sep 14, 2023

What do you think of putting this under https://opensearch.org/docs/latest/analyzers/index/? I think a new section under Text Analysis called Token Filters would do it. I would prefer to start something there, even with just this one token filter, rather than document all of them now, and then improve the docs over time.

@macohen
Contributor

macohen commented Sep 14, 2023

Ran "find . -name "*.java" -exec grep "AbstractTokenFilterFactory" {} ; -print" on the OpenSearch repo and generated the output in this gist: https://gist.github.com/macohen/9f335a741677fac2e916cf980f8019fe. Probably could edit that down to a list of new issues to chip away.

@kolchfa-aws
Collaborator

Let's keep this issue scoped to just the new 2.10 delimited_term_freq token filter. We already have issues for documenting all token filters: #790 and #1483.

@kolchfa-aws kolchfa-aws changed the title [DOC] Document all token filter refs [DOC] Document delimited term frequency token filter Sep 14, 2023
@macohen
Contributor

macohen commented Sep 14, 2023

Sounds good to me. Thanks. I linked to the gist in #790 as well if it helps.

@hdhalter hdhalter added this to the v2.10 milestone Sep 14, 2023
@kolchfa-aws
Collaborator

Thank you, @macohen!

@msfroh msfroh moved this from Now(This Quarter) to 🏗 In progress in Search Project Board Sep 14, 2023
@noCharger
Contributor Author

noCharger commented Sep 15, 2023

Example requests

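# Create an index with a custom delimited_term_freq token filter (delimiter ^):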
PUT /test
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "my_delimited_term_freq": {
          "type": "delimited_term_freq",
          "delimiter": "^"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "f1": {
        "type": "keyword"
      },
      "f2": {
        "type": "text"
      }
    }
  }
}

# Analyze Text with Custom Token Filter:
POST /test/_analyze
{
  "text": "foo^3",
  "tokenizer": "keyword",
  "filter": ["my_delimited_term_freq"],
  "attributes": ["termFrequency"],
  "explain": true
}

# Analyze Text with Pre-configured Token Filter:
POST /_analyze
{
  "text": "foo|100",
  "tokenizer": "keyword",
  "filter": ["delimited_term_freq"],
  "attributes": ["termFrequency"],
  "explain": true
}
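
In both requests, explain: true returns per-stage token details, and the attributes parameter limits the extra token attributes in that output to termFrequency.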

@noCharger
Contributor Author

Response examples. The first response is from the custom my_delimited_term_freq filter (delimiter ^); the second is from the pre-configured delimited_term_freq filter, which splits on the default | delimiter.

{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "keyword",
      "tokens": [
        {
          "token": "foo^3",
          "start_offset": 0,
          "end_offset": 5,
          "type": "word",
          "position": 0,
          "termFrequency": 1
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "my_delimited_term_freq",
        "tokens": [
          {
            "token": "foo",
            "start_offset": 0,
            "end_offset": 5,
            "type": "word",
            "position": 0,
            "termFrequency": 3
          }
        ]
      }
    ]
  }
}

{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "keyword",
      "tokens": [
        {
          "token": "foo|100",
          "start_offset": 0,
          "end_offset": 7,
          "type": "word",
          "position": 0,
          "termFrequency": 1
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "delimited_term_freq",
        "tokens": [
          {
            "token": "foo",
            "start_offset": 0,
            "end_offset": 7,
            "type": "word",
            "position": 0,
            "termFrequency": 100
          }
        ]
      }
    ]
  }
}

@kolchfa-aws kolchfa-aws added 2 - In progress Issue/PR: The issue or PR is in progress. and removed 1 - Backlog Issue: The issue is unassigned or assigned but not started labels Sep 15, 2023
@noCharger
Contributor Author

noCharger commented Sep 19, 2023

Adding an end-to-end example that combines the delimited term frequency token filter with script scoring:


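# Create an index whose f2 field uses a custom analyzer with the delimited_term_freq filter (index_options: freqs keeps term frequencies):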
PUT /test
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "tokenizer": {
        "keyword_tokenizer": {
          "type": "keyword"
        }
      },
      "filter": {
        "my_delimited_term_freq": {
          "type": "delimited_term_freq",
          "delimiter": "^"
        }
      },
      "analyzer": {
        "custom_delimited_analyzer": {
          "tokenizer": "keyword_tokenizer",
          "filter": ["my_delimited_term_freq"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "f1": {
        "type": "keyword"
      },
      "f2": {
        "type": "text",
        "similarity": "BM25",
        "analyzer": "custom_delimited_analyzer",
        "index_options": "freqs"
      }
    }
  }
}

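# Index two documents; in doc1, f2 encodes the term v1 with a frequency of 30: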
POST /_bulk?refresh=true
{"index": {"_index": "test", "_id": "doc1"}}
{"f1": "v0|100", "f2": "v1^30"}
{"index": {"_index": "test", "_id": "doc2"}}
{"f2": "v2|100"}

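# Use a script score to return the indexed term frequency of v1 in field f2 for each document: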
GET /test/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "script_score": {
        "script": {
          "source": "termFreq(params.field, params.term)",
          "params": {
            "field": "f2",
            "term": "v1"
          }
        }
      }
    }
  }
}

Example response

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 30,
    "hits": [
      {
        "_index": "test",
        "_id": "doc1",
        "_score": 30,
        "_source": {
          "f1": "v0|100",
          "f2": "v1^30"
        }
      },
      {
        "_index": "test",
        "_id": "doc2",
        "_score": 0,
        "_source": {
          "f2": "v2|100"
        }
      }
    ]
  }
}
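
In this response, doc1 scores 30 because the custom analyzer indexed v1 with the delimited term frequency of 30, while doc2 scores 0 because it does not contain the term v1 (its value v2|100 is not split, since the custom filter's delimiter is ^).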

@hdhalter hdhalter added 3 - Done Issue is done/complete and removed 2 - In progress Issue/PR: The issue or PR is in progress. labels Sep 20, 2023
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Search Project Board Sep 22, 2023