[8.8] [DOCS] Adds section about tokens to ELSER conceptual (backport #2568) (#2572)

Co-authored-by: István Zoltán Szabó <[email protected]>
mergify[bot] and szabosteve authored Oct 18, 2023
1 parent f537b45 commit 7b38f8d
Showing 1 changed file with 19 additions and 4 deletions.
23 changes: 19 additions & 4 deletions docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc
@@ -20,13 +20,28 @@ meaning and user intent, rather than exact keyword matches.
 ELSER is an out-of-domain model which means it does not require fine-tuning on
 your own data, making it adaptable for various use cases out of the box.
 
+
+[discrete]
+[[elser-tokens]]
+== Tokens - not synonyms
+
 ELSER expands the indexed and searched passages into collections of terms that
 are learned to co-occur frequently within a diverse set of training data. The
 terms that the text is expanded into by the model _are not_ synonyms for the
-search terms; they are learned associations. These expanded terms are weighted
-as some of them are more significant than others. Then the {es}
-{ref}/rank-features.html[rank features field type] is used to store the terms
-and weights at index time, and to search against later.
+search terms; they are learned associations capturing relevance. These expanded
+terms are weighted as some of them are more significant than others. Then the
+{es} {ref}/rank-features.html[rank features] field type is used to store the
+terms and weights at index time, and to search against later.
+
+This approach provides a more understandable search experience compared to
+vector embeddings. However, attempting to directly interpret the tokens and
+weights can be misleading, as the expansion essentially results in a vector in a
+very high-dimensional space. Consequently, certain tokens, especially those with
+low weight, contain information that is intertwined with other low-weight tokens
+in the representation. In this regard, they function similarly to a dense vector
+representation, making it challenging to separate their individual
+contributions. This complexity can potentially lead to misinterpretations if not
+carefully considered during analysis.
 
 
 [discrete]
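As a rough illustration of the mechanism the added text describes, the sketch below shows how weighted term expansions could be stored and matched: each document and query becomes a map of tokens to weights, and overlapping tokens contribute the product of their weights to the score. This is a simplification for intuition only; the token names and weights are invented, not real ELSER output, and actual rank features scoring in {es} applies saturation functions rather than a raw dot product.

```python
# Hypothetical sketch: weighted term expansions matched by shared tokens.
# Tokens and weights are invented for illustration, not real ELSER output.

def score(doc_tokens: dict[str, float], query_tokens: dict[str, float]) -> float:
    """Sum the products of weights for tokens shared by the document and query."""
    shared = doc_tokens.keys() & query_tokens.keys()
    return sum(doc_tokens[t] * query_tokens[t] for t in shared)

# A passage expanded into learned, weighted tokens at index time.
doc = {"comfort": 1.8, "footwear": 1.2, "hiking": 2.1, "trail": 0.7}
# A query expanded the same way at search time.
query = {"hiking": 1.5, "boots": 1.1, "comfort": 0.4}

# Only "hiking" and "comfort" overlap, so only they contribute to the score.
print(round(score(doc, query), 2))
```

Note how "boots" in the query and "trail" in the document contribute nothing: relevance comes only from the learned tokens the two expansions share, weighted by how significant each token is on both sides.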
