[BUG] Insufficient number of hits for nested knn queries with efficient filter #2347
Comments
Further investigation into the code may be required, but based on the reported issue, it appears that post-filtering is being utilized instead of efficient filtering internally. |
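To illustrate the distinction raised here, a toy brute-force sketch may help. It is not the k-NN plugin's implementation, and the helper names and sample documents are made up for illustration. It shows why post-filtering can return fewer than k hits while filtering before the search does not:

```python
from math import dist


def post_filter_knn(docs, query, k, account_id):
    """Take the k nearest docs first, then drop those failing the filter."""
    nearest = sorted(docs, key=lambda d: dist(d["vector"], query))[:k]
    return [d for d in nearest if d["accountId"] == account_id]


def pre_filter_knn(docs, query, k, account_id):
    """Restrict to matching docs first, then take the k nearest of those."""
    matching = [d for d in docs if d["accountId"] == account_id]
    return sorted(matching, key=lambda d: dist(d["vector"], query))[:k]


# Hypothetical documents: two accounts, 2-dimensional vectors.
docs = [
    {"id": 1, "accountId": "a", "vector": [0.0, 0.0]},
    {"id": 2, "accountId": "b", "vector": [0.1, 0.0]},
    {"id": 3, "accountId": "b", "vector": [0.2, 0.0]},
    {"id": 4, "accountId": "a", "vector": [5.0, 5.0]},
]
query = [0.0, 0.0]

post = post_filter_knn(docs, query, 2, "a")
pre = pre_filter_knn(docs, query, 2, "a")
```

With k = 2 and a filter on account "a", the post-filter variant keeps only document 1, while the pre-filter variant returns documents 1 and 4.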
@CorentinLimier Could you try the same query without using neural to see whether this is a knn issue or a neural issue? |
@heemin32 Thanks for your help! About this:
That could indeed be it, but what's odd is that for some other vectors (with the exact same query, index, and filter) we do get the correct number of hits, which would be highly unlikely with post-filtering. So would it use post-filtering for some vectors and efficient filtering for others?
I did run a query without neural and got the same result (posted in my initial message). Or did you mean something other than the two queries posted? Here is the query without neural:
Results are exactly the same for each value of k: 6 hits for k = 38, 32 hits for k = 1000, all hits for k = 10000. Thanks a lot 🙏 |
Oh, you were talking about the value in hits. Could you also check whether it returns only that many documents? Also, it would be nice if you could provide reproducible steps with a smaller data set. |
I do get the same number of documents returned as that hit count when size = k. Example:
GET /knowledge-index/_search?preference=_primary&explain=true&request_cache=false
{
  "size": 38,
  "_source": {
    "excludes": [
      "metadataEmbedding"
    ]
  },
  "query": {
    "nested": {
      "path": "metadataEmbedding",
      "query": {
        "knn": {
          "metadataEmbedding.knn": {
            "vector": [
              ...
            ],
            "k": 38,
            "filter": {
              "term": {
                "accountId": "..."
              }
            }
          }
        }
      }
    }
  }
}
Returns 6 hits (instead of k=38) and 6 documents (instead of size=38).
I will try to create a reproducible example; I understand that it will help. I first wanted to make sure that my issue was not a mistake on my part or a misunderstanding of the docs. From your understanding of the mapping and queries, do we agree that I should expect k documents even with nested fields and efficient filtering? Thanks a lot |
The exact hit number is a little more complex than the min of k and max doc. It is the sum of (min of k and max doc in a segment) over all segments. Still, 6 hits for k = 38 and 30 hits for k = 1000 is not expected behavior. It could be an issue with the lucene engine. Could you test it with the faiss engine if possible? |
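As a rough sketch of the arithmetic described above, assuming "max doc in a segment" means the number of candidate documents available in that segment (the function name and the segment counts below are hypothetical):

```python
def upper_bound_hits(k, docs_per_segment):
    """Sum over segments of min(k, docs in that segment)."""
    return sum(min(k, n) for n in docs_per_segment)


# Hypothetical single shard with three segments holding 100, 20 and 5 candidates.
total = upper_bound_hits(38, [100, 20, 5])
print(total)  # 63
```

Under that assumption, a shard with several segments can legitimately report more than k hits, but never fewer than min(k, matching documents).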
@heemin32 OK, I will try with faiss (by changing the engine attribute in the mapping, right?). I will probably be able to do so from the beginning of January. I will try to create a reproducible example as well. Does it make sense for this lucene issue to occur only with nested structures? Thank you very much, I will give you more details once I am able to work on it 🙏 |
Yes |
@buddharajusahil can you please take a look into this issue. |
Sure @navneet1v , please assign me this task. |
Hello, I didn't manage to create a reproducible example with a smaller dataset. I'm not sure whether it's a volume issue, and it's quite difficult to reproduce since in production it works as expected for some vectors and not for others. My next attempt will be to use the faiss engine as suggested by @heemin32 and see whether I observe differences in production. We monitor the number of results for each request, so I will be able to check whether the number of results is closer to what I expect. If you have any idea what the root cause of the bug could be, or how to reproduce the issue with a smaller dataset, I would be happy to help. The same goes if I can provide more details on the current situation in production. Thanks |
Hi @CorentinLimier I think this is related to an overall problem with efficient filtering. Can you try one thing: retry this attempt, but with a shard count of 2 specified in the index settings. Then try running the search multiple times; I believe you will get inconsistent results. |
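One way to check for inconsistency is to repeat the same search and tally the hit totals. A minimal sketch, assuming a hypothetical run_search callable that executes the query once and returns its hit total (it is stubbed here rather than wired to a real client):

```python
from collections import Counter
from itertools import cycle


def tally_hit_totals(run_search, runs=10):
    """Run the same search repeatedly and count how often each hit total occurs."""
    return Counter(run_search() for _ in range(runs))


# Stub standing in for a real search call; it alternates between two totals.
_fake_totals = cycle([38, 6])
tally = tally_hit_totals(lambda: next(_fake_totals), runs=4)

# A single distinct total means the results were consistent across runs.
consistent = len(tally) == 1
```

More than one key in the tally means the same query returned different hit totals across runs.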
@buddharajusahil I get consistent results even with 2 shards with this sample of data. In production, we actually have only one shard, but also one replica. I also tried these settings with this sample of data and couldn't reproduce :/ |
We switched the engine from faiss to lucene and results are by far better.
Note that for account1 provided 34 results while other words provided 38. Searches on account3 worked only for the word guidance. For account6, the search on quiet didn't work well.
With faiss:
Note that I also had issues with faiss on account2, where I got only 18 results for searches on trustworthy and collaboration, but it seems to be very rare to get fewer than k results (I understand that we often get more because faiss provides k * segment * shards).
Note as well the distribution of the number of documents per account:
I thought at first that accounts with a small number of documents would be the ones impacted, but that does not seem to be the case. I am still trying to reproduce with dummy data, but so far without success. |
This issue might be related to #2359? |
@buddharajusahil could you answer this question? |
@CorentinLimier @heemin32 I don't think this is the same issue as #2359 . I believe that issue can only occur in a multi shard setup, not on single shard. |
Hi @CorentinLimier do you know if this problem only started arising in version 2.17.1? Also, for the queries that produce incorrect results, are the results consistent, i.e. do you get the same result every time? And have you tried different filters to see whether that produces different results for the same query text? Appreciate the info thus far! |
@buddharajusahil Thank you for your answer!
Yes, at least for a certain amount of time, but since we index new documents every day I believe it can change. Without indexing new documents, I got the same results for the same queries executed multiple times, even with request_cache set to false.
Yes, some query_text values worked fine with some filters (producing results) and not with others (0 results where we expect some). On example provided above
I don't know if it started with 2.17.1 or if we had it with a prior version. But updating to 2.18.0 actually seems to fix the issue. The only condition is that we reindex the documents (if we just upgrade the version without dropping the index, the issue remains). Of course, I had previously tried recreating the index from scratch on 2.17, and the issue remained with that version. So I wonder if the issue on 2.17 is on the hnsw side when indexing 🤔 I will keep an eye on this since we will upgrade our production cluster to 2.18.0 next week. Thanks |
Hello 👋
What is the bug?
We use an index to store text documents for semantic search. Since the text is long, we chunk it into paragraphs and embed each one using the all-MiniLM-L6-v2 model. Each chunk is stored in a nested field of the document. Each document also has an account_id attribute that we use when querying (efficient filtering).
Then we run approximate knn queries with lucene hnsw.
Based on these documentation pages:
I expect that executing a knn query on this nested field with an efficient filter returns at least n hits, n being the minimum of k and the number of documents that match the efficient filter.
But for some specific input vector or query_text, we get fewer than n hits, and sometimes even 0. For the same filter with a different query, we get the correct n hits.
We have two other indices without a nested field (only one vector per document) with the same efficient filter, and they work as expected.
This seems similar to #2222 or #2339, except that the efficient filter is as simple as a term filter.
How can one reproduce the bug?
Error happens on specific queries so it's hard to reproduce.
Here is the mapping of the index :
Here is the query :
For k = 38, I get 6 hits
But for k = 1000 I get 32 hits, and k = 10000 (max value) 232 hits.
For another query_text value, I get different results where hits always equals k (or the maximum number of documents that match the filter, which is 232).
I get the same results when first converting the text to a vector and using the vector directly without the neural instruction:
What is the expected behavior?
Getting n hits, n being the minimum between k and the number of documents that match the efficient filter.
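The expectation can be written down as a small check. This is only a sketch using the figures reported in this issue (232 documents matching the filter); the function names are illustrative:

```python
def expected_hits(k, matching_docs):
    """Minimum number of hits expected: min(k, documents matching the filter)."""
    return min(k, matching_docs)


def is_short(hits, k, matching_docs):
    """True when a search returned fewer hits than expected."""
    return hits < expected_hits(k, matching_docs)


MATCHING = 232  # documents matching the filter, as reported above
observed = {38: 6, 1000: 32, 10000: 232}  # k -> hits reported above

short = {k: is_short(hits, k, MATCHING) for k, hits in observed.items()}
```

For the reported numbers, k = 38 and k = 1000 fall short of the expectation, while k = 10000 meets it.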
What is your host/environment?
Do you have any additional context?
Here is the result of GET /_plugins/_knn/stats?pretty on the node:
Any idea what could be the issue here? Am I right to expect k hits for nested fields with an efficient filter?
Thanks for your help.