
[RFC] High Level Approach and Design For Normalization and Score Combination #126

Closed
navneet1v opened this issue Feb 28, 2023 · 26 comments
Assignees: navneet1v
Labels: Enhancements, Features, neural-search, RFC, v2.10.0

@navneet1v
Collaborator

navneet1v commented Feb 28, 2023

Introduction

This issue describes the high-level directions proposed for score combination and normalization techniques to improve semantic/neural search queries in OpenSearch (meta issue: #123). The proposals reuse existing OpenSearch extension points as much as possible. We also try to ensure that the chosen directions are long term and provide different levels of customization so users can tune semantic search to their needs. The document also proposes a phased design and implementation plan that lets us improve and add new features with every phase.

For information on how normalization improves the overall quality of results, please refer to this OpenSearch blog on science benchmarks: https://opensearch.org/blog/semantic-science-benchmarks/

Current Architecture

For simplicity, let's consider a 3-node OpenSearch cluster with 2 data nodes and 1 coordinator node. The data nodes store the data and the coordinator node coordinates the request. At a very high level, OpenSearch works as follows.

  1. The request lands on the coordinator node, which rewrites the query and sends requests to all the required shards as part of the query phase.
  2. Each shard passes the query to Lucene (IndexSearcher) along with the relevant collectors. Lucene rewrites the query again and runs the leaf queries on the segments sequentially. It uses collectors such as TopDocsCollector to collect the documents with the top scores and returns the results to the coordinator node.
  3. Once the query results arrive at the coordinator node, we move to the next phase, the fetch phase. The fetch phase runs several sub-fetch phases; the important ones are the source phase and the score phase. The fetch phase sends requests to the data nodes again to get the sources for the document IDs and builds the response to send back to the client.

[Diagram: NormalizationInNeuralSearch, current high-level design]

Requirements

The customer here refers to OpenSearch users who want to run semantic search in their applications.

  1. As a customer, I want to combine results coming from a transformer-based query (k-NN query) and BM25-based queries to obtain better search results. GitHub link: [FEATURE] Hybrid search using keyword matching and kNN (k-NN#717, comment)
  2. As a customer, I want to experiment with different ways to combine and rank the results in the final search output. GitHub link: Discussion on combination of result sets from different query types (OpenSearch#4557)

Why can't these requirements be fulfilled with the current OpenSearch architecture?

Currently, OpenSearch uses a query-then-fetch model for search. First, the whole query object is passed to all shards to obtain the top results from each shard; then the fetch phase begins, which fetches the relevant information for those documents. A typical semantic search use case has two types of queries, a k-NN query and a text match query, which produce scores using different scoring methods.

First, it is difficult to combine result sets whose scores are produced via different scoring methods. To combine results effectively, the scores of the different queries need to be put on the same scale (see https://arxiv.org/abs/2210.11934). By this we mean a score needs to meet two requirements: (1) it indicates the document's relevance relative to the other documents scored in the same query, and (2) it is comparable with the relative relevance of results from other queries. For example, the k-NN score range may be 0-1 while BM25 scores range from 0 to Float.MAX. Hence any combination query clause such as bool suffers from these problems.
Second, it is not possible to consider global hits for re-ranking. Because scores are assigned at the shard level, any rescoring is done at the shard level. Hence, if we try to normalize the scores, the normalization is local rather than global.
Let's use the example below to understand the problem in more detail.

Example

Using the same cluster setup defined above, let's look at an example that illustrates the two problems. For this example, let's assume we have an index whose schema looks like this:

PUT product-info
{
    "mappings": {
        "properties": {
            "title": {
                "type": "text"
            },
            "description": {
                "type": "text"
            },
            "tile_and_descrption_knn": {
                "type": "knn_vector",
                "dimension": 768
            }
        }
    },
    "settings": {
        "index": {
            "refresh_interval": "-1",
            "number_of_shards": "2",
            "knn": "true",
            "number_of_replicas": "0",
            "default_pipeline": "text-embedding-trec-covid"
        }
    }
}

The title and description fields hold the product title and description. The title_and_description_knn field is a k-NN vector field that holds a 768-dimension dense vector created using a dense vector model.

Query
We are using the bool query clause to combine the results of k-NN (a neural query converted to a k-NN query) and a text-based search query. Bool query should clauses have their scores combined: the more matching clauses, the better.

POST product-info/_search
{
    "query" : {
        "bool": {
            "should": [
                {
                    "multi-match": {
                        "query": "sofa-set for living room",
                        "fields": ["tile", "description"]
                    }    
                },
                {
                    "neural": {
                        "tile_and_descrption_knn": {
                            "query_text": "sofa-set for living room",
                            "model_id": "dMGrLIQB8JeDTsKXjXNn"
                        }
                    }
                }
            ]
        }
    }
}

Score Computation Examples at Different Levels
Combination using the above query: Because k-NN and BM25 scores are on different scales, even if one of the queries behaves badly for a document, as for document d8, that document can still come back as the first result because of its BM25 score. The standard boolean combination does not take advantage of relative ranking. Documents like d7 remain lower in the results even though they have good scores in both BM25 and k-NN. This problem becomes more pronounced because BM25 scores are unbounded.

[Screenshot: per-document scores combined without normalization]

The way to solve this problem of joining scores of queries that run on different scales is to first normalize the scores of both queries and then combine them. More details can be found here: https://arxiv.org/abs/2210.11934.

Using normalization: To see how normalization works, let's look at the table below. It shows two types of normalization: one done per shard/data node (local normalization) and one done at the coordinator node (global normalization).

Note: the normalization was done using only the scores present here, hence some documents have scores of 0 after normalization.

[Screenshot: per-shard (local) and coordinator-level (global) normalized scores]

Final Sorted Documents to be returned based on above examples:

[Screenshot: final sorted documents]

If we focus on document d8, we can see how its position changes without normalization, with local normalization, and with global normalization. Because local normalization only considers scores at the per-shard level, it can suffer when one of the scores is the shard's lowest. With global normalization (normalization at the coordinator node level), we look at scores from the whole corpus, which smooths out this problem, because worse scores may come from other shards. We ran experiments to verify this.
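For illustration only, here is a minimal sketch of the difference between per-shard (local) and coordinator-level (global) min-max normalization for a single sub-query; the shard names, document IDs, and scores are made-up assumptions, not the plugin implementation:

shard_results = {
    "shard-1": {"d1": 12.0, "d2": 7.5, "d8": 22.0},
    "shard-2": {"d5": 3.0, "d7": 18.0, "d9": 9.0},
}

def min_max(score, lo, hi):
    # guard against the degenerate case where all scores are equal
    return 0.0 if hi == lo else (score - lo) / (hi - lo)

# Local normalization: min/max computed per shard
local_norm = {}
for scores in shard_results.values():
    lo, hi = min(scores.values()), max(scores.values())
    local_norm.update({doc: min_max(s, lo, hi) for doc, s in scores.items()})

# Global normalization: min/max computed over all shard results at the coordinator
all_scores = [s for scores in shard_results.values() for s in scores.values()]
lo, hi = min(all_scores), max(all_scores)
global_norm = {doc: min_max(s, lo, hi)
               for scores in shard_results.values() for doc, s in scores.items()}

print(local_norm["d2"], global_norm["d2"])  # 0.0 locally vs. ~0.24 globally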

System Requirements

Functional Requirements:

  1. System should be able to normalize the scores of each sub-query by using the relevant corpus level information.
  2. System should be able to combine normalized scores of these different subqueries by using different techniques provided as input and by considering global rankings.
  3. System should be able to get information such as the max score and min score of the whole corpus, as these are required attributes for normalization.
  4. System should be able to combine (not necessarily sum) the results returned from different shards at a global level.
  5. The solution should be generic enough to combine the scores of subqueries and not limited to k-NN and text matching.

Good to have, but not P0:

  1. The proposed solution should be able to support features like pagination, scripting, explain, etc.
  2. The solution shouldn't degrade latency or CPU consumption compared to a normal query that does combination at the shard level.

High Level Directions for Building the Solution

If we look at the requirements, we see that we need solutions at different levels of the search API flow. We can divide the flow into 3 parts:

  1. Define the API input that can be used to run the sub-queries independently and return their results to the coordinator node separately.
  2. Obtain the relevant information, such as max and min scores, at the global level for the provided sub-queries and perform normalization.
  3. Define a component that normalizes all the results obtained from the different shards and combines them based on algorithms provided by the user. After that, let the fetch phase run as before to get the sources for the document IDs.

Defining _search API Input

The proposal for the input is to use the _search API and define a new compound query clause. This new compound query clause will hold an array of queries that will be executed in parallel at the data node level. The name of the new query clause is not yet decided. Its interface is inspired by the dis_max query clause, but dis_max runs the queries sequentially. The new query clause will make sure that scores are calculated at the shard level independently for each sub-query. The sub-query rewriting will be done at the coordinator level to avoid duplicate computation.

Note: the interfaces defined here are not final. They will be refined as part of the LLD GitHub proposals, but we first want to make sure that we align on the high-level approach.

POST <index-name>/_search
{
    "query": {
        "<new-compound-query-clause>": {
            "queries": [
                { /* neural query */ }, // added as an example
                { /* standard text search */ } // if a user wants to boost or otherwise update
                                               // the scores, they need to do it in this query clause
            ],
            ... other fields will be added in the next sections
        }
    }
}

Pros:

  1. From the customer's standpoint, all their API calls remain the same; they only need to update the body of the request.
  2. From a cohesion standpoint, since this is search functionality, it makes sense to include it in the _search API to provide a unified experience for customers who search via OpenSearch.
  3. Less maintenance and a consistent output format, as the new compound query is integrated with the _search API.
  4. Integration with other search capabilities like Explain Query, Pagination, _msearch will be possible, rather than reinventing the wheel.

Cons:

  1. From an implementation standpoint, we need to define new concepts in OpenSearch, like a new query clause, which will require customer education on how to use it.

Alternatives Considered

Alternative 1: Implement a new REST handler instead of creating a new compound query
The idea here is to create a new REST handler that defines the list of queries whose scores need to be normalized and combined.
Pros:

  1. This will provide flexibility for the team to do experiments without touching core capabilities of OpenSearch.
  2. Easier implementation, as the new REST handler is limited to the neural-search plugin.

Cons:

  1. Duplicate code and interfaces, as we would be re-implementing existing _search API functionality (size, from/to, source field inclusion, scripting, etc.).
  2. A higher learning curve and harder adoption for customers who are already using the _search API for other search workloads.

Obtaining Relevant Information for Normalization and Score Combination

This section talks about how OpenSearch will get the relevant information required for normalization. For example, say the customer has chosen min-max normalization; then for every sub-query we need the min and max score over the whole corpus.
During the query phase, OpenSearch uses a QueryPhaseSearcher class to run the query and collect the documents at the shard level using the TopDocsCollector interface. There is no extension point in QueryPhaseSearcher to plug in a different TopDocsCollector implementation; the only extension point is that a plugin can define a new QueryPhaseSearcher implementation. So we will define a new QueryPhaseSearcher implementation which will use a new TopDocsCollector at the shard level to gather the relevant information for normalization.
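For intuition only, the kind of per-sub-query statistics such a collector could gather at each shard, and how the coordinator could merge them into global values for min-max normalization, might look like the sketch below; the structure and numbers are illustrative assumptions, not the actual interface:

# hypothetical per-shard statistics for two sub-queries (values are made up)
shard_responses = [
    {"bm25": {"min": 1.3, "max": 14.2}, "knn": {"min": 0.41, "max": 0.92}},
    {"bm25": {"min": 0.7, "max": 9.8},  "knn": {"min": 0.35, "max": 0.88}},
]

# merge at the coordinator to get a global min/max per sub-query
global_stats = {}
for response in shard_responses:
    for sub_query, stats in response.items():
        agg = global_stats.setdefault(sub_query, {"min": float("inf"), "max": float("-inf")})
        agg["min"] = min(agg["min"], stats["min"])
        agg["max"] = max(agg["max"], stats["max"])

print(global_stats)  # {'bm25': {'min': 0.7, 'max': 14.2}, 'knn': {'min': 0.35, 'max': 0.92}}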

Pros:

  1. It provides a cleaner way to modify the query phase at a shard level without adding extra round trips.
  2. The interface gives us full control over how to execute the query and return the results, so we can keep improving this over time.

Cons:

  1. As of 1/18/2023, only a single plugin can define a QueryPhaseSearcher in the whole OpenSearch installation. We need to fix this so that multiple QueryPhaseSearcher implementations can be added by plugins. This is similar to what we did for k-NN, where only one engine implementation could be defined for the whole OpenSearch installation and CCR and k-NN did not work together.
  2. We may get conflicts with the concurrent phase searcher, as it also defines a QueryPhaseSearcher.

Alternatives Considered

Alternative 1: Enhance the DFS query search type, or create a new search type, to get the information needed for normalization
The default search type in OpenSearch is query-then-fetch, which first queries the results and then fetches the actual source from the shards. DFS query-then-fetch is another search type that customers can set via a query parameter. In DFS query-then-fetch, OpenSearch first pre-queries each shard asking for term and document frequencies and sends this information to all shards, where scores are calculated using these global term/document frequencies.

We could build something similar, where we pre-query to find the min and max scores from all shards and then pass this information back so that each shard can normalize the scores for each sub-query.

Pros:

  1. This avoids adding new phases/transformers in between query and fetch phases.

Cons:

  1. An extra round trip to the data nodes is added, which increases latency. DFS query-then-fetch also incurs this extra latency.
  2. For DFS query-then-fetch, the information is already precomputed and present in the Lucene files (postings format files), but for normalization the score calculation actually has to run, so we would end up running the queries twice. For k-NN queries this pre-fetch can only be done by running the k-NN query itself.

Normalizing and Combining Scores using Search Pipeline

The idea here is to extend the search pipeline request and response transformers to create another type of transformer that is called after the query phase completes. We will use this transformer interface to perform the normalization and score combination for the document IDs returned from the query phase, as per the user's input. The transformed result will then be passed to the fetch phase, which runs as-is.
Below is a modified version of the API input proposed above. It adds the relevant fields for normalization and score combination.

Note: The interfaces defined here are not finalized. The interfaces will be refined as part of LLD github proposals. But we want to make sure that we align ourselves with high level approach.

PUT /_search_processing/pipeline/my_pipeline
{
  "description": "A pipeline that helps in doing the normalization",
  "<in-between-query-fetch-phase-processor>": [
    {
        "normalizaton-processor": {
            // we can bring in the normalization info from _search api to this place if required
            // It will be discussed as part of LLD.
        }
    }
  ]
}


POST <index-name>/_search?pipeline=my_pipeline
{
    "query": {
        "<new-compound-query-clause>": {
            "queries": [
                { /* neural query */ }, // added as an example
                { /* standard text search */ } // if a user wants to boost or otherwise update
                                               // the scores, they need to do it in this query clause
            ],
            // the items below could also be part of the processor
            "normalization-technique" : "min-max", // e.g. min-max; optional
            "combination" : {
                "algorithm" : "harmonic-mean", // any technique listed in the appendix: interleave, harmonic mean, etc.
                "parameters" : {
                    // list of all the parameters required by the above algorithm
                    "weights" : [0.4, 0.7] // a floating-point array which can be used in the algorithm
                }
            }
        }
    }
}

Alternatives Considered

Alternative 1: Create a new phase between the query and fetch phases
The high-level idea here is to create a phase that runs between the query and fetch phases and performs the normalization.
Pros:

  1. No specific pros that I can think of for this approach, apart from not having a dependency on search pipelines.

Cons:

  1. Currently there are no extension points in OpenSearch to create a new phase, so we would need to build everything from scratch.
  2. Problems will arise during implementation, because the code needs to identify which queries this new phase should run for, and we would need to implement fairly sophisticated logic for that.

Alternative 2: Create a fetch sub-phase that does the normalization and score combination
OpenSearch provides an extension point where plugins can add fetch sub-phases that run at the end, after all core sub-phases are executed. We could create a fetch sub-phase that does the normalization and score combination. The problem is that because we have multiple sub-queries, we would need to change the interfaces to make sure all the information required for normalization is passed in. This results in duplicated information and multiple hacks to pass data through the earlier fetch phases.
Pros:

  1. No new extension points need to be created, as adding new sub-phases to the fetch phase is already supported.

Cons:

  1. The order of execution of fetch sub-phases is not consistent: it depends on which plugin was registered first, and that plugin's fetch sub-phases run first. This creates inconsistency across clusters with different sets of plugins. (Code reference.)
  2. There is a source sub-phase that gets the source for all doc IDs; running it before normalization would make OpenSearch fetch sources for document IDs that we will not send in the response, which wastes computation.

Alternative 3: Use the SearchOperationListener interface
SearchOperationListener runs at the shard level, not the coordinator node level (see the code references). Since we need normalization to be done at the coordinator node level, we cannot use SearchOperationListener.

High Level Diagram

Based on the above 3 directions, below is the high-level flow diagram. The 2 sub-queries are provided as an example. There will be a limit on how many sub-queries a customer can define in the query clause; the maximum we will keep is 10. (There is no specific reason for this number; we just want limits in place to avoid long-running queries that lead to circuit breaker trips and cluster failures.)

[Diagram: high-level flow for normalization and score combination]

**There can be many sub-queries in the new compound query.**

Future Scope and Enhancements to be picked up in later phases

Implementing Pagination in new Compound Query Clause

The proposed design doesn't support paginated queries in the first phase, in order to reduce the scope of the phase-1 launch. We also have not done a deep dive on how pagination could be implemented or how the current pagination solution works.

Enhance the explain query functionality for new compound Query Clause

With the phase-1 implementation of the new query, we won't provide explain functionality for the query clause. The explain API provides information about why a specific document matches (or doesn't match) a query; it is a very useful API for customers to understand results and debug.

Enabling Parallel/Concurrent Search on Segments functionality for new Compound Query

The idea here is to enable parallel search for all the queries provided in the compound query to improve its performance. Parallel search on segments is already in the OpenSearch sandbox (https://github.com/opensearch-project/OpenSearch/tree/main/sandbox/plugins/concurrent-search).

Implement Script based Combination to allow customers to use any Score combination Techniques

The initial proposal provides customers with only a fixed set of functions to combine the scores. With script-based combination, customers can define custom scripts to combine the scores before we rank them.

Integrating the compound query with query-writing helpers like Querqy to provide a better experience for customers

The idea here is that the new compound query clause can become overwhelming, hence we want to integrate it with query-writing helpers like Querqy to make query writing easier for customers.

Launch Plan

Below is a high-level phased approach for building the feature. These phases are not set in stone and may change as we make progress on the implementation.

Phase-1

Given that we are defining a new compound query clause for OpenSearch, we will launch the features defined in this document and the high-level design behind a feature flag. High-level items:

  1. Customers can use the new compound query to do normalization; we will lay the groundwork and low-level interfaces for running the compound query.
  2. Add support in OpenSearch for more than one QueryPhaseSearcher implementation.
  3. Customers can use the standard normalization and score combination techniques defined in phase 1. See the appendix for the different normalization and score combination techniques that will be present in phase-1.
  4. Provide the first set of performance testing results for the new compound query.

Phase-2

The phase-2 will focus on these items:

  1. Solidify the compound query interfaces and go GA after incorporating customer feedback.
  2. Add the capability to run paginated queries with the new compound query clause.
  3. Add explain functionality for the new compound query clause.
  4. Integration into the different language-specific OpenSearch clients.

Phase-3

The phase-3 will focus on these items:

  1. Enable parallel segment search for the new compound query.
  2. Implement script-based combination to allow customers to use any score combination technique.

Phase-4

By this time we will have a good understanding of how customers are using the new compound query. This phase will focus on making it easier for customers to start using the new query clause. The item below helps with that:

  1. Integrate the compound query with Querqy (https://opensearch.org/docs/latest/search-plugins/querqy/index/) to provide a better experience for customers.

FAQ:

Do other products support combining scores at the global level?

Yes, Elasticsearch supports this feature, but it only combines the results of a k-NN query with a text match query; it is not generic. Also, Elasticsearch doesn't support normalizing scores globally. Reference: https://www.elastic.co/guide/en/elasticsearch/reference/master/knn-search.html#_combine_approximate_knn_with_other_features.
As per the documentation, the limitation is:
"Approximate kNN search always uses the dfs_query_then_fetch search type in order to gather the global top k matches across shards. You cannot set the search_type explicitly when running kNN search."

Example Query:

POST image-index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "mountain lake",
        "boost": 0.9
      }
    }
  },
  "knn": {
    "field": "image-vector",
    "query_vector": [54, 10, -2],
    "k": 5,
    "num_candidates": 50,
    "boost": 0.1
  },
  "size": 10
}

Open Questions

Question 1: Should we consider the case where sub-queries are normalized using different techniques, or where a customer wants to normalize only one sub-query and not the others?

After discussion, this is a very valid use case, but we don't have enough data to prove the hypothesis; it really depends on the customer's use case. I would suggest starting with normalization on all sub-queries, as the referenced blog shows that we should normalize all of them. Also, this is not a one-way door.

Question 2: How are we going to calculate the global min score for the queries?

We will have a min score from each shard, but as of now we don't have a way to find the global min score for k-NN queries; to do that we would need to run exact k-NN. I am still trying to find a way to do this. For text matching, since we iterate over all the segments, we will have the min score. A deeper dive is required to assess the feasibility of the solution.

Next Steps:

  1. I will work on creating POCs for the proposal to validate my understanding of query clauses, DocsCollector, QueryPhaseSearcher, etc.
  2. The API interface proposed for the _search API is not final; it captures my initial thought process for discussion. I will create more GitHub issues to discuss it in more detail.

Appendix

What is Normalization?

Normalization is a data transformation process that aligns data values to a common scale or distribution of values.
Normalization requires that you know or are able to accurately estimate the minimum and maximum observable values. You may be able to estimate these values from your available data.

What are the different ways to do normalization?

1. y = (x – min) / (max – min)

Score Combination Techniques

Now you have 2 or more result sets and you need to combine them. There are many ways to combine the results (geometric mean, arithmetic mean, etc.).

Approach 1: Normalized arithmetic mean

Assume we have 2 sets of results, results_a and results_b. Each result has a score and a document id. First, we will only consider the intersection of results in a and b (i.e. results_c = results_a ∩ results_b). Then, each document in results_c will have 2 scores: one from a and one from b. To combine the scores, we will first normalize all scores in results_a and results_b, and then take the arithmetic mean of them:

score = (norm(score_a) + norm(score_b)) / 2

Approach 2: Normalized geometric mean

Similar to Approach 1, but instead of taking the arithmetic mean, we will take the geometric mean:
score = sqrt(norm(score_a) * norm(score_b))

Approach 3: Normalized harmonic mean

Similar to Approach 1, but instead of taking the arithmetic mean, we will take the harmonic mean:

score = 2 / (1/norm(score_a) + 1/norm(score_b))

Approach 4: Normalized Weighted Linear Combination

Instead of taking the mean of the scores, we can just try different weights for each score and combine them linearly.

score = w_a * norm(score_a) + w_b * norm(score_b)

Approach 5: Normalized Weighted Geometric Combination

Similar to above approach, but instead of combining with addition, we can combine with multiplication:

score = log(1 + w_a * norm(score_a)) + log(1 + w_b * norm(score_b))

This approach has previously been recommended for score combination with OpenSearch/Elasticsearch: elastic/elasticsearch#17116 (comment).

Approach 6: Interleave results

In this approach, we will produce the ranking by interleaving the results from each set together. So ranks 1, 3, 5, ... would come from results_a and ranks 2, 4, 6, ... would come from results_b.
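For illustration only, here is a minimal sketch of approaches 1-5; the scores, weights, and document IDs are made-up assumptions, not the plugin implementation, and documents missing from one sub-query are treated as having a score of 0 (as discussed later in this thread):

import math

# hypothetical normalized scores per sub-query (document id -> score in [0, 1])
norm_a = {"d1": 0.9, "d2": 0.4, "d3": 0.7}
norm_b = {"d1": 0.2, "d3": 0.8, "d4": 0.6}

def combine(norm_a, norm_b, w_a=0.5, w_b=0.5, method="arithmetic"):
    combined = {}
    for doc in set(norm_a) | set(norm_b):
        a = norm_a.get(doc, 0.0)  # a document missing from one sub-query gets 0
        b = norm_b.get(doc, 0.0)
        if method == "arithmetic":
            combined[doc] = (a + b) / 2
        elif method == "geometric":
            combined[doc] = math.sqrt(a * b)
        elif method == "harmonic":
            combined[doc] = 0.0 if a == 0 or b == 0 else 2 / (1 / a + 1 / b)
        elif method == "weighted_linear":
            combined[doc] = w_a * a + w_b * b
        elif method == "weighted_geometric":
            combined[doc] = math.log1p(w_a * a) + math.log1p(w_b * b)
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

print(combine(norm_a, norm_b, w_a=0.4, w_b=0.7, method="weighted_linear"))

Approach 6 (interleaving) operates on ranks rather than scores, so it alternates between the two ranked lists instead of merging score values.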

Reference Links:

  1. Meta Issue for Feature: [META] Score Combination and Normalization for Semantics Search. Score Normalization for k-NN and BM25 #123
  2. Compound Queries: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/compound-queries.html
  3. Dis_max Query: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl-dis-max-query.html
  4. DFS Query and Fetch: https://www.elastic.co/blog/understanding-query-then-fetch-vs-dfs-query-then-fetch
  5. Querqy: https://opensearch.org/docs/latest/search-plugins/querqy/index/
  6. Science Benchmarks: https://opensearch.org/blog/semantic-science-benchmarks/
navneet1v self-assigned this Feb 28, 2023
navneet1v added the RFC, Features, Enhancements, neural-search, and v2.8.0 labels Feb 28, 2023
@prasadnu

prasadnu commented Mar 7, 2023

When the KNN query and the lexical search query has completely different set of documents (no common docs), then knn and BM25 scores are calculated for all the documents independently, normalised and combined for every document on the co-ordinator node, then sorted and rendered. Is my understanding right about how this feature will work ?

@navneet1v
Collaborator Author

When the KNN query and the lexical search query has completely different set of documents (no common docs), then knn and BM25 scores are calculated for all the documents independently, normalised and combined for every document on the co-ordinator node, then sorted and rendered. Is my understanding right about how this feature will work ?

@prasadnu
So, even when there are common documents between the KNN query and the lexical search query, the normalization will happen for all the documents returned from all the shards, per query.

So when you say all the documents, do you mean all the documents returned from the shards or all the documents present in the cluster?

@prasadnu

prasadnu commented Mar 7, 2023

@navneet1v Thanks for the response. Let me elaborate a little, Under this new feature, say, I have a compound search query with 2 queries inside, one doing the KNN search and another doing the lexical search which gets triggered in parallel. When the final resultant documents from both the queries are completely disjoint (d1,d2,d3,d4 for knn search and d5,d6,d7,d8 for lexical search), then both the knn and bm25 scores are calculated for all the resultant documents (d1,d2,d3,d4,d5,d6,d7,d8), normalised and combined at the global level, then sorted and rendered. Is this the desired outcome ?

@navneet1v
Collaborator Author

navneet1v commented Mar 7, 2023

then both the knn and bm25 scores are calculated for all the resultant documents (d1,d2,d3,d4,d5,d6,d7,d8)

@prasadnu As per this proposal, the answer to this question is no. But this is a really interesting use case. We could calculate the scores for all the documents (both lexical and k-NN), but my worry is whether that would be optimal; in the worst case it could slow down the query. Please let me know if this kind of use case is likely to be prominent.

All other parts of your understanding are correct.

@prasadnu

prasadnu commented Mar 8, 2023

@navneet1v If both the scores are not calculated for all the resultant documents (the union set of docs from knn and lexical search), How can the scores be combined ?

Will the compound query only take the common list of documents (the docs that appear in both searches) and combine their scores? If yes, we cannot always expect both search queries to have docs in common; there may be scenarios where the neural search and the lexical search return disjoint sets of docs, and in those cases we cannot combine the scores when we don't have both scores calculated for each resulting doc.

I have attached a screenshot of how hybrid ranking could possibly work for disjoint result docs from the different queries.

[Screenshot: proposed hybrid ranking for disjoint result sets]

Please clarify.

@prasadnu

prasadnu commented Mar 8, 2023

In addition to the above, if there are common docs in the results of the queries, they can always be boosted to the top (based on their summed score) at the coordinator level.

@navneet1v
Collaborator Author

navneet1v commented Mar 8, 2023

@navneet1v If both the scores are not calculated for all the resultant documents (the union set of docs from knn and lexical search), How can the scores be combined ?

@prasadnu
What I have proposed here is that if a document is not present in one of the queries, then its score is considered 0 for that query. To add more, summing scores is not the only way to combine them; please refer to the Appendix section of this document for the different ways in which scores can be combined. If you want to give weight to a particular score, that can always be done. This is particularly useful for datasets where we believe the results of one query are more useful than the other because of the type of the dataset.

The reasoning behind this is that for a reasonably large corpus/index, fully disjoint results are very unlikely. Tagging @MilindShyani to provide more details.

@SeyedAlirezaFatemi

In approximate kNN, the min value might not be known unless an exact kNN search is run which might not be a good idea for large-scale kNN databases. I think for these cases there should be an option to provide a default minimum value (or maybe also a default max value) so the normalization can be done based on that. For example, if the metric is cosine similarity, we know that the min value is -1 so we can use that instead of an exact kNN search. Also, when there are documents that are not present in all queries, there should be a default score that would work as 0 in combining the scores.

I also wonder how problems like this can be handled since they come up in scenarios where approximate kNN is combined with other queries.

@navneet1v
Collaborator Author

navneet1v commented Mar 13, 2023

@SeyedAlirezaFatemi
Thanks for your valuable suggestions. Please find my response below.

In approximate kNN, the min value might not be known unless an exact kNN search is run which might not be a good idea for large-scale kNN databases.

Yes, I am aware of this; we never planned to use exact k-NN. I see there is some confusion around this, so I will update the proposal to make it clear that we are not going to use exact k-NN. To solve the problem, what we are proposing is that once all the shards find the doc IDs and return them to the coordinator node, we will take the min k-NN score from those returned scores at the coordinator node level. This will be a good approximation of the min score for k-NN.

I think for these cases there should be an option to provide a default minimum value (or maybe also a default max value) so the normalization can be done based on that.

This is a good idea for the min score; I will think it over and incorporate it as an alternative in the proposal.

Also, when there are documents that are not present in all queries, there should be a default score that would work as 0 in combining the scores.

The default score for a document that is not present in one of the queries will be 0. I thought this would be obvious because this is how query clauses like should work under bool, but I will update the proposal to make it clearer. I hope this helps.

I also wonder how problems like this can be handled since they come up in scenarios where approximate kNN is combined with other queries.

For this, I have added a section on future enhancements. In the first phase we are not planning to handle paginated queries.

@SeyedAlirezaFatemi

Yes, I am aware of this; we never planned to use exact k-NN. I see there is some confusion around this, so I will update the proposal to make it clear that we are not going to use exact k-NN. To solve the problem, what we are proposing is that once all the shards find the doc IDs and return them to the coordinator node, we will take the min k-NN score from those returned scores at the coordinator node level. This will be a good approximation of the min score for k-NN.

I understand that pagination is not the priority here (though I think it's a crucial aspect of this feature that would enable it to be used in many products since paging is present in many use cases), but if we use the min score of kNN that we got from the first top k results, then if we increase this k (by going to next pages or some other scenario), that min score will change and that means the kNN scores after normalization will change and that affects the ranking.

Let's assume we use this formula for score normalization of kNN scores:
x_normalized = (x – min) / (max – min)
and we get the top 10 scores from the index. In the worst case, if all the top 10 docs get the same score of 0.8, then both min and max of our score will be 0.8 and this formula will have a division by 0. Or assume the top 10 scores are between 1 and 0.9 and then the scores for the rest of the documents fall to 0. This changes the scores after normalization drastically if we increase the k from 10 to a higher number.

I think this shows that having that default min score can be really important.
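To make the concern concrete, here is a small sketch with made-up numbers showing how the normalized value of the same raw score shifts when the observed minimum changes with a larger k:

def min_max(x, lo, hi):
    return 0.0 if hi == lo else (x - lo) / (hi - lo)

# hypothetical kNN scores: the top-10 results are tightly clustered
top_10 = [1.0, 0.98, 0.97, 0.96, 0.95, 0.94, 0.93, 0.92, 0.91, 0.90]
min_with_larger_k = 0.05  # assumed observed minimum if many more results were retrieved

print(min_max(0.95, min(top_10), max(top_10)))        # 0.5
print(min_max(0.95, min_with_larger_k, max(top_10)))  # ~0.95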

@MilindShyani

I think it is important to consider why normalization is used in the first place. Normalization is useful when combining results from different retrievers (for us that's BM25 and kNN).

If we are just using kNN it doesn't matter what normalization we use (as long as it is a monotonic function). For instance in the example above one doc has a score 1 and nine docs have 0.9. After normalization one doc has 1 and the nine docs have score 0. Note that the ranking has not changed. Since 9 documents having a tie with score 0.9 is the same as 9 documents having a tie with score 0. The real problem comes when we want to combine. There the score being 0.9 or 0 does lead to a difference.

But during combination, we do not know what is the right way to decide the min score. What if the default min score is bigger than some score in which case we can get negative scores. Many transformers use dot product for retrieval and the score has no definite range -- how do we select the min/max there? How do we decide the min/max scores for kNN and BM25 -- should they be the same or different? These are tricky questions and the answers depend on what dataset is being used.

What we do instead is appeal to some weak law of large numbers. If we have enough documents per shard and each shard has a diverse enough set of documents, the chances of us getting the cases you mentioned (which are indeed problematic for combination) are pretty slim. In other words, on average we expect the results to be fair.

Hope that addresses some of your concerns. Please feel free to ask for clarification or leave comments.

@SeyedAlirezaFatemi

If we are just using kNN it doesn't matter what normalization we use (as long as it is a monotonic function). For instance in the example above one doc has a score 1 and nine docs have 0.9. After normalization one doc has 1 and the nine docs have score 0. Note that the ranking has not changed. Since 9 documents having a tie with score 0.9 is the same as 9 documents having a tie with score 0. The real problem comes when we want to combine. There the score being 0.9 or 0 does lead to a difference.

Thanks for the response. Sorry if I wasn't clear with that example. I meant that in the context of using kNN with some text match query. As you mentioned, if there is only a single kNN query, normalization doesn't make sense. But if there was also a text query combined, there would be a problem.

But during combination, we do not know what is the right way to decide the min score. What if the default min score is bigger than some score in which case we can get negative scores. Many transformers use dot product for retrieval and the score has no definite range -- how do we select the min/max there? How do we decide the min/max scores for kNN and BM25 -- should they be the same or different? These are tricky questions and the answers depend on what dataset is being used.

You are right that having this default min score isn't the best option for all scenarios. But at least for BM25 and cosine similarity, it does make sense (0 and -1 being the min for each respectively). Maybe it would be nice to have the feature to optionally provide the min/max score for each query part and if it is not defined in the query, then the min/max would be calculated using the retrieved data as you proposed.

@navneet1v
Collaborator Author

You are right that having this default min score isn't the best option for all scenarios. But at least for BM25 and cosine similarity, it does make sense (0 and -1 being the min for each respectively). Maybe it would be nice to have the feature to optionally provide the min/max score for each query part and if it is not defined in the query, then the min/max would be calculated using the retrieved data as you proposed.

@SeyedAlirezaFatemi Thanks for providing the feedback. I will keep this in mind while developing the API interface for the new query type. I would recommend creating a feature request for this so that we can discuss the use case and why it is needed in more detail.

@SeyedAlirezaFatemi

@navneet1v
I would say any product search can be a good example use case for pagination with this feature. Imagine with the user query, you want to search over product image embeddings (k-NN) and also match their name and description (text match query) and normalize and sum these scores together. If you want to have such a feature for an online store, pagination would be very important if we want the user to scroll through the products.

We can of course just show the n nearest neighbors and then show the results from text matching for the rest (in this scenario one would make a k-NN query to populate the first n items and then start over with another text matching query and fill the rest of the pages with that). But this is inferior to combining the queries together with normalization and weighting.

@navneet1v
Collaborator Author

Updated the Description to add the Science Benchmarks link for reference: https://opensearch.org/blog/semantic-science-benchmarks/

@consulthys

consulthys commented Sep 5, 2023

@navneet1v Thanks for leading this effort! Much needed if we want to be able to blend lexical and semantic search results together in a meaningful way.

You reference this keystone paper multiple times in this RFC, yet I don't see any mention of RRF (Reciprocal Rank Fusion) anywhere as a potential solution to blend results together, even though Elastic has demonstrated that it is a very viable solution to not have to mess with score normalization. Note that while "viable" doesn't mean "best", RRF is very simple to use and implement, which is also (very) important in terms of usability for the end users. PS: Even searching org-wide for rrf or reciprocal rank fusion in Github doesn't yield a single result 🤔

Can you provide any insights on why RRF is not being considered at all in this context even though OS is trying to solve the same issue that ES partly solved already?

Thanks much in advance!

@navneet1v
Collaborator Author

@navneet1v Thanks for leading this effort! Much needed if we want to be able to blend lexical and semantic search results together in a meaningful way.

You reference this keystone paper multiple times in this RFC, yet I don't see any mention of RRF (Reciprocal Rank Fusion) anywhere as a potential solution to blend results together, even though Elastic has demonstrated that it is a very viable solution to not have to mess with score normalization. Note that while "viable" doesn't mean "best", RRF is very simple to use and implement, which is also (very) important in terms of usability for the end users. PS: Even searching org-wide for rrf or reciprocal rank fusion in Github doesn't yield a single result 🤔

Can you provide any insights on why RRF is not being considered at all in this context even though OS is trying to solve the same issue that ES partly solved already?

Thanks much in advance!

Hi @consulthys, thanks for providing this info. Based on my reading and knowledge, RRF is supported in OpenSearch, similar to what is provided in the blog.

This feature actually tries to tackle one more key idea, which is putting the scores of 2 different types of queries on the same scale by looking at the whole corpus of results from different shards; basically, providing support for score normalization. Once the scores are normalized, there can be different ways to combine them. Right now we are enabling linear score combination, but I don't see a reason why this cannot be extended to support RRF combination.

I am adding @MilindShyani on this thread to provide more details.

Feel free to correct me if something is wrong.

@MilindShyani

@consulthys RRF is indeed a simple yet effective protocol with non-trivial benefits. I believe it is possible to execute rank fusion in most query DSL's out of the box. From what I gather, it should be easy to implement it within OpenSearch. But as you noted, we should highlight this somewhere (perhaps in a tutorial/blog), so that users can start executing hybrid searches right away.

However, given the diversity of search domains, there is no one-size-fits-all solution. For many domains, as has been shown in the literature, score combination can perform better. Combination requires score normalization, which is what this RFC is about.

@consulthys

Thanks @navneet1v and @MilindShyani for chiming in and shedding some light on this matter.

I know there's no one-size-fits-all solution and I'm not diminishing the benefits from this RFC whose goal is perfectly clear (score normalization), but if RRF is supposedly already supported in OpenSearch, yet there is no single mention of it anywhere, I think it is a BIG omission that should be fixed as soon as possible.

If any of you or anyone else could show an example of how RRF can be implemented (as of 2.9), I'm curious and it would be awesome, thanks in advance.

@consulthys

@navneet1v @MilindShyani
I'm also curious to know if this is supposed to make it into the 2.10 release due next Monday (Sep 11th) as planned in #123?

@navneet1v
Collaborator Author

@navneet1v @MilindShyani
I'm also curious to know if this is supposed to make it into the 2.10 release due next Monday (Sep 11th) as planned in #123?

Yes, this feature is planned for 2.10. The code is complete and merged. We are doing testing and benchmarking now.

@consulthys

@MilindShyani

I believe it is possible to execute rank fusion in most query DSL's out of the box. From what I gather, it should be easy to implement it within OpenSearch.

Can you show an example of how RRF can be implemented (as of 2.9), or refer me to someone who knows how to do it?
I'm curious and it would be awesome, thanks in advance.

@MilindShyani

MilindShyani commented Sep 13, 2023

Of course! Let me try.

Using a Python client (which unfortunately is not ideal), we could do something like this:

# assumes the opensearch-py client and a locally running cluster; "os" here is the
# OpenSearch client object, not the standard library "os" module
from opensearchpy import OpenSearch

os = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

keyword_query = {
    "query": {
        "match": {
            "text_field": "example text"
        }
    }
}

neural_query = {
    "query": {
        "knn": {
            "vector_field": {
                "vector": [0.1, 0.2, 0.3],  # Replace this with the actual vector
                "k": 10
            }
        }
    }
}

response_keyword = os.search(index="your_index", body=keyword_query)
response_neural = os.search(index="your_index", body=neural_query)

def reciprocal_rank_fusion(rank1, rank2, p=10):
    # each ranked list contributes 1 / (p + rank) to a document's fused score
    rrf_score = {}
    for i, doc_id in enumerate(rank1):
        rrf_score[doc_id] = rrf_score.get(doc_id, 0) + 1 / (p + i)
    for i, doc_id in enumerate(rank2):
        rrf_score[doc_id] = rrf_score.get(doc_id, 0) + 1 / (p + i)
    fused_rank = sorted(rrf_score.keys(), key=lambda x: rrf_score[x], reverse=True)
    return fused_rank

# In the above, p is a hyper-parameter that we can set by running experiments on some test queries

# Now create ranked lists from the responses
rank_keyword = [hit["_id"] for hit in response_keyword["hits"]["hits"]]
rank_neural = [hit["_id"] for hit in response_neural["hits"]["hits"]]
fused_rank = reciprocal_rank_fusion(rank_keyword, rank_neural)

Maybe @navneet1v could help us with how to do this using the query DSL?

@navneet1v
Collaborator Author

Resolving this GitHub issue as the changes for the 2.10 RC are finalized and merged. Please create a GitHub issue if there are any further questions.

@consulthys

Hi @MilindShyani

From what I gather, it should be easy to implement it within OpenSearch

Thank you for the example you shared, but that's a client-side solution outside of OpenSearch, not within it. I thought you were referring to a magical trick I didn't know about that would allow you to do it in a script or something.

Anyway, I'm eager to see OS 2.10 being released soon (maybe next Monday Sep 25th?) to try out hybrid search.

Thanks for your hard work

@navneet1v
Collaborator Author

@consulthys yes, the 2.10 release is on Monday, September 25th.

Thank you for the example you shared, but that's a client-side solution outside of OpenSearch, not within it. I thought you were referring to a magical trick I didn't know about that would allow you to do it in a script or something.

Currently, if we want to use RRF we need to do it outside of OpenSearch. But with this normalization feature we added some generic components, like the SearchPhaseResults processor, which are also going live in 2.10. I see 2 ways to do this:

  1. We should be able to extend that interface to create a processor that does this using the hybrid query.
  2. Or we can write a brand new processor that can do RRF.

If you are interested in that feature, please feel free to cut a GitHub issue or contribute. I see we have everything available in the plugin to build the RRF technique.
