Pagination inconsistency when combining kNN search with other fields #791
Comments
@ymartin-mw Thanks for your response.
Hi @SeyedAlirezaFatemi If possible, can I get a look at the data set? That would help me reproduce the issue rather than creating a dummy data set.
@navneet1v I will share some data that reproduces this issue soon.
This mapping:

    index_body = {
        "settings": {
            "index.knn": True,
            "index.knn.algo_param.ef_search": 64,
            "number_of_replicas": 0,
            "number_of_shards": 1,
            "analysis": {
                "analyzer": {"default": {"type": "standard", "stopwords": "_english_"}}
            },
        },
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 512,
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "nmslib",
                        "parameters": {
                            "ef_construction": 64,
                            "m": 16,
                        },
                    },
                },
            }
        },
    }

With this query:

    def query_opensearch(
        title: str,
        embedding: list[float],
        max_returned: int = 20,
        offset: int = 0,
    ) -> dict:
        page = int(offset / max_returned)
        knn_k = (page + 1) * max_returned
        query = {
            "size": max_returned,
            "from": offset,
            "query": {
                "bool": {
                    "should": [
                        {
                            "knn": {
                                "embedding": {
                                    "vector": embedding,
                                    "k": knn_k,
                                },
                            },
                        },
                        {
                            "match": {"title": title},
                        },
                    ]
                }
            },
            "_source": {
                "includes": ["title", "embedding"],
                "excludes": [],
            },
        }
        return client.search(body=query, index=index_name)

And this data: rep.zip

If you go through the first 10 pages with a page size of 5, you will see duplicates. @navneet1v I hope this is enough to reproduce the issue. Let me know if you need anything more.
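To make the "go through the first 10 pages" check repeatable, a small helper along these lines could count how often each document ID shows up across pages. This is only a sketch: `find_duplicate_ids` and `fetch_page` are hypothetical names, not part of the repro above, and it assumes each response has the usual `hits.hits[]._id` shape.

```python
from collections import Counter
from typing import Callable


def find_duplicate_ids(
    fetch_page: Callable[[int, int], dict],
    page_size: int = 5,
    pages: int = 10,
) -> dict[str, int]:
    """Walk `pages` pages and return the IDs that were seen more than once.

    `fetch_page(size, offset)` is a hypothetical callable that returns a
    search response shaped like {"hits": {"hits": [{"_id": ...}, ...]}}.
    """
    seen: Counter = Counter()
    for page in range(pages):
        response = fetch_page(page_size, page * page_size)
        for hit in response["hits"]["hits"]:
            seen[hit["_id"]] += 1
    # Keep only IDs that appeared on more than one page (or twice on a page).
    return {doc_id: count for doc_id, count in seen.items() if count > 1}
```

With the repro query it could be called as `find_duplicate_ids(lambda size, offset: query_opensearch(title, embedding, size, offset))`; an empty result would mean no duplicates were seen in the pages that were walked.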
@navneet1v Were you able to reproduce the issue? How can this problem be fixed? How do people use hybrid search with pagination in production? Does Elasticsearch also have a similar issue? Just setting the
@SeyedAlirezaFatemi I didn't get time to reproduce the issue with your dataset, as I was stuck on some other tasks. But I did get time to look into how BM25 (keyword search) pagination works and how k-NN search pagination could work. Please note that this is an initial investigation.
At least, I have not come across a use case where customers are using hybrid search with pagination. But with the Neural Search plugin, we started giving this capability to OpenSearch customers. Documentation
Elastic works in a very different manner from OpenSearch k-NN. You can view the answer for this here. So I am not sure whether they will face the same issue. Having said all of the above, I am still running experiments for 4 and 5, as in theory they should work. You can try 4 on your side and see whether you still see similar issues.
From a problem-fix standpoint, I would frame this as: what is the right way to paginate over k-NN results? The traditional ways we paginate in OpenSearch don't work here because of the algorithm and data structures used for k-NN.
Yes, increasing the value of k can increase latency, but if you go deep into the pages, latency and memory tend to increase even for keyword search.
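One way to read the advice in this thread is to stop growing k with the page number and instead keep it fixed at a value that covers the deepest page you intend to serve, so the kNN clause matches the same candidates on every page. The sketch below only builds the request body; `build_hybrid_query` and `fixed_k` are hypothetical names, and the default of 100 is an arbitrary assumption, not a recommended value.

```python
def build_hybrid_query(
    title: str,
    embedding: list[float],
    size: int,
    offset: int,
    fixed_k: int = 100,  # assumption: chosen to cover the deepest page served
) -> dict:
    """Build the bool/should query with a k that does not depend on the page."""
    if offset + size > fixed_k:
        # Beyond this depth the kNN clause could no longer cover the page.
        raise ValueError("offset + size must not exceed fixed_k")
    return {
        "size": size,
        "from": offset,
        "query": {
            "bool": {
                "should": [
                    {"knn": {"embedding": {"vector": embedding, "k": fixed_k}}},
                    {"match": {"title": title}},
                ]
            }
        },
    }
```

The trade-off is the one mentioned above: a larger k costs more per request, even for the first page.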
Yes, you're right. I tested with keeping
Correct me if I'm wrong, but I think it's the number of segments that matters, not the number of shards. I did a test on this where I had only one shard with some segments and when I set
My main use case with this type of query is to combine traditional text search with semantic search. I'm not using the neural search plugin since I think that only supports embedding text data and I have other data modalities. Pagination is a must in my case but I think it will be okay to just set
Thanks for confirming this. I will create a GitHub issue for fixing the k-NN documentation.
Thanks for showing interest in that RFC. It would be great if you could add some details about your use case to the RFC, including why pagination is a must-have feature for you. This will help us prioritize the pagination use case. You can still give weights to your queries using function_score, but normalization is not supported and will come as a feature as part of the RFC.
Yes, that's right for the nmslib and Faiss engines. Let me check that for the Lucene engine as well (ref: #682).
So, I did a deep dive into how Lucene does k-NN internally and how it returns the nearest neighbors. Yes, it is different from nmslib and Faiss. But the way to do pagination remains the same, which is setting a higher value of
Simplified Explanation:
Explanation in Depth:
What Lucene k-NN does is get the top k results from all the segments during the query rewrite and merge them to form the top k results. These k results are written down (the exact term they use is cached) in another query called DocAndScoreQuery, which simply stores each doc ID and its score. Hence a Lucene k-NN query ends up rewritten as a DocAndScoreQuery that holds at most k results per shard. Up to this point the Collector, which actually collects the doc IDs from each shard, has not come into the picture.
This rewrite of the query (not the OpenSearch QueryBuilder rewrite; it happens at the shard level) happens twice: once during OpenSearch's query preprocessing, where OpenSearch explicitly calls the query rewrite, and once during the actual search in Lucene's IndexSearcher, where we don't have control. Note: once a query has been rewritten to its most primitive form, it is not rewritten again.
Coming back: by the time the TopDocsCollector (which has a priority queue of max size from + size) starts collecting docs, the Lucene k-NN query has already been rewritten to a DocAndScoreQuery. The collector can therefore collect at most k docs per shard, because that is all the DocAndScoreQuery holds. This is not true for the faiss and nmslib engines.
documents returned from a shard for Lucene k-NN = size > k ? k : size
At the coordinator node level, total documents = number of shards * (documents returned from a shard for Lucene k-NN). The coordinator node then picks only size documents from that total.
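The per-shard and coordinator arithmetic described above can be written out as a few lines of Python. This is a toy model of the formulas as stated in this comment, not Lucene or OpenSearch code, and the function names are hypothetical.

```python
def lucene_docs_per_shard(size: int, k: int) -> int:
    """size > k ? k : size, i.e. a shard returns at most k documents."""
    return k if size > k else size


def coordinator_candidates(num_shards: int, size: int, k: int) -> int:
    """Total documents the coordinator node receives from all shards."""
    return num_shards * lucene_docs_per_shard(size, k)


def final_hit_count(num_shards: int, size: int, k: int) -> int:
    """The coordinator keeps at most `size` of the candidates it received."""
    return min(size, coordinator_candidates(num_shards, size, k))
```

For example, with 3 shards, size = 10 and k = 5, each shard contributes 5 documents, the coordinator sees 15 and keeps 10. With a single shard and the same size and k, only 5 documents can ever come back, which is why k has to be at least as large as the page depth being requested.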
Removing the bug tag. I will create a GitHub issue to fix the k-NN documentation on how to do pagination, and see if we can bring parity between the different engines.
What is the bug?
Assume you have the following index:
And you have this query which combines the approximate kNN with a match query in a weighted manner:
When paginating using this query, there will be duplicate and missing results. This is an example of what can happen:
Assume we have these two documents in the index.
doc1: text score: 4.1 + knn score: 0.4 = total score: 4.5
doc2: text score: 3.7 + knn score: 0.5 = total score: 4.2
For the first query, we set page_size=1 and from=0 (knn_k=1) and we get doc2 as the first page because for the first page, only doc2 will come out of the knn part of the boolean query and we will have these scores for the documents:
doc1: text score: 4.1 + knn score: 0 = total score: 4.1
doc2: text score: 3.7 + knn score: 0.5 = total score: 4.2
Now, if we go to the next page, meaning we set page_size=1 and from=1 (knn_k=2), we again get doc2 as the result. Because now both doc1 and doc2 will show up in the knn part of the query and we get these scores:
doc1: text score: 4.1 + knn score: 0.4 = total score: 4.5
doc2: text score: 3.7 + knn score: 0.5 = total score: 4.2
So, since we are asking for the second page (from=1), we again get doc2.
We are missing doc1 completely and getting doc2 twice. This problem happens for higher values of page_size too.
This happens if you combine a kNN query with another query (no matter if kNN or text query) in a boolean should.
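The two-document example can be reproduced without a cluster. The sketch below is a toy model, not OpenSearch: it assumes the kNN clause contributes a score only to its top knn_k documents and that the final score is the plain sum of the text and kNN scores quoted above. `simulate_page` is a hypothetical helper.

```python
# Scores taken from the doc1/doc2 example in this issue.
DOCS = {
    "doc1": {"text": 4.1, "knn": 0.4},
    "doc2": {"text": 3.7, "knn": 0.5},
}


def simulate_page(page_size: int, offset: int) -> list[str]:
    """Return the document names that land on the requested page."""
    # k grows with the page number, as in the query from this issue.
    page = offset // page_size
    knn_k = (page + 1) * page_size
    # Only the top-k documents by kNN score receive a kNN contribution.
    by_knn = sorted(DOCS, key=lambda name: DOCS[name]["knn"], reverse=True)
    knn_hits = set(by_knn[:knn_k])
    totals = {
        name: scores["text"] + (scores["knn"] if name in knn_hits else 0.0)
        for name, scores in DOCS.items()
    }
    ranked = sorted(totals, key=lambda name: totals[name], reverse=True)
    return ranked[offset:offset + page_size]


first_page = simulate_page(page_size=1, offset=0)
second_page = simulate_page(page_size=1, offset=1)
print(first_page, second_page)
```

Both pages come back as doc2 in this model, and doc1 never appears, because the ranking changes between the two requests when knn_k changes.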
What is the expected behavior?
Consistent paging without duplicate documents on different pages or missing documents in the results as a consequence.
What is your host/environment?
Ubuntu 22.04.2 LTS with OpenSearch 2.6.0 and the k-NN plugin.
Do you have any screenshots?
Here's another example