-
Notifications
You must be signed in to change notification settings - Fork 297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update lshforest.py #215
base: master
Are you sure you want to change the base?
Update lshforest.py #215
Conversation
In the "query" function, I changed the "results" variable type from set into dictionary. Since the order is important, set data structure does not preserve the order unlike dictionary. The use is only for the dictionary keys which is eventually returned as a list.
Thank you! Your observation indicates that we should order the results by increasing order of their first appearance. What people used to do typically after retrieving the top-k results is to recalculate the estimated Jaccard of the returned results using MinHash stored on the side, and then re-rank the results according to the estimated Jaccard. Does the new implementation remove the needs to re-rank the results? What is the performance implication of using dictionary keys over set? There is a benchmark suite in |
Thank you! I am not able to answer your questions for now, because I have not done the tests yet. Best! |
@IbraheemTaha this is interesting. Is there a specific guideline around what is the smallest prefix at which point we should stop probing the hash table? I guess at some point the accuracy will be quite bad. |
@ekzhu I cannot tell for now. However, the search proposed in the paper is done in two phases.
If your question is how to stop the search before reaching out the root, then there should be kind of a flag to stop based on the number of points in the results meanwhile searching and stop when the result set equals to k, more precisely in the function _query. I hope this answers your question. |
@IbraheemTaha Sorry for the extremely belated response. My question was how does your change compare to the approach of retrieving the top-k and then re-rank using the MinHash of the results? There is a recent PR #234 added new functionality to retrieve MinHash from LSHForest index. |
In the "query" function, I changed the "results" variable type from set into dictionary. Since the order is important, set data structure does not preserve the order unlike dictionary in python.
The use is only for the dictionary keys which is eventually returned as a list.
Example:
data_sets = [
[1, 2, 3, 4, 5],
[4, 5, 6, 7, 8],
[2, 3, 5, 6, 7]
]
query_set = [1, 2, 3, 4, 5, 6]
The actual Jaccard between query_set and each of the data sets as follows:
(0) 0.8333333333333334
(1) 0.375
(2) 0.5714285714285714
Therefore the returned order based on the actual Jaccard score is [0,2,1]
However, the original implementation with set data structure returns [0,1,2], although the search function is correct.
After my enhancement, the returned list is [0,2,1] (more accurate than before).
This will significantly improve the implementation of LSH-Forest.