
benchmark with nodesketch? #2

Open

jianshu93 opened this issue Jul 17, 2024 · 2 comments

@jianshu93 commented Jul 17, 2024

Dear DotHash team,

Thanks for a nice paper. I have two questions, related to MinHash and link prediction respectively:

  1. Which MinHash was used: one permutation hashing with optimal densification (http://proceedings.mlr.press/v70/shrivastava17a.html), or the original MinHash from Broder (1997)? The former is at least 10 times faster than the original, with exactly the same RMSE, which converges to 0 as the sketch size increases. DotHash can estimate set intersection without bias, but what about the cardinality of each separate set? We need it to derive Jaccard similarity via the inclusion-exclusion rule (see the sketch after this list). Cardinality estimation based on HyperLogLog and the like has a Cramér-Rao lower bound that is much worse than MinHash's for the final Jaccard similarity estimate. Do you assume the cardinalities of sets A and B are known?

  2. NodeSketch (https://dl.acm.org/doi/abs/10.1145/3292500.3330951) uses ProbMinHash (https://ieeexplore.ieee.org/abstract/document/9185081; they called it HistoSketch, but did not realize the MinHash they use approximates J_p rather than weighted Jaccard) to compare node similarity with weights taken into account. This is also very fast and efficient for link prediction tasks. We have a Rust implementation here: https://github.com/jean-pierreBoth/graphembed
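To make the inclusion-exclusion point in question 1 concrete, here is a minimal sketch: given any unbiased estimate of |A ∩ B| (the estimator itself is assumed here, e.g. a DotHash-style one) plus the two known cardinalities, the Jaccard similarity follows directly. The function name is illustrative, not from any library.

```python
def jaccard_from_intersection(inter_est: float, size_a: int, size_b: int) -> float:
    """Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|,
    so J(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|)."""
    union_est = size_a + size_b - inter_est
    return inter_est / union_est if union_est > 0 else 0.0

# Example: |A| = 100, |B| = 80, estimated intersection = 40
# => J ≈ 40 / (100 + 80 - 40) ≈ 0.286
print(jaccard_from_intersection(40.0, 100, 80))
```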

Thanks,

Jianshu

@mikeheddes (Owner) commented

Hi Jianshu,

Thank you for reaching out.

  1. For the MinHash implementation we used the original one by Broder. We later learned about the one permutation hashing method, which indeed gives a significant speed-up over the original MinHash. However, neither method can estimate metrics like Adamic-Adar and Resource Allocation, which our method can (see the sketch after this list). And yes, we assume that the cardinality of both sets is known: since our motivation is to reduce the cost of the many pairwise comparisons, any one-time cost per set is considered negligible.

  2. Thank you for bringing this work to our attention, it looks very interesting.
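For context, a toy sketch of the dot-product estimator idea as I read it from the DotHash paper: each element gets a random ±1/√d code, a set's sketch is the (optionally weighted) sum of its elements' codes, and the dot product of two sketches estimates the weighted intersection without bias. The dimension, seeding, and weighting below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

D = 1024  # sketch dimension (assumed); larger D lowers the estimator's variance

def code(element: int, dim: int = D) -> np.ndarray:
    # Deterministic pseudo-random ±1/sqrt(dim) code per element,
    # seeded by the element id so both sets agree on it.
    g = np.random.default_rng(element)
    return g.choice((-1.0, 1.0), size=dim) / np.sqrt(dim)

def sketch(elements, weight=lambda e: 1.0) -> np.ndarray:
    # E[sketch(A) @ sketch(B)] = sum of weight(e)**2 over e in A ∩ B,
    # because distinct codes are orthogonal in expectation.
    return sum(weight(e) * code(e) for e in elements)

A, B = {1, 2, 3, 4}, {3, 4, 5}
print(sketch(A) @ sketch(B))  # ≈ |A ∩ B| = 2

# Adamic-Adar flavour: weight(z) = 1/sqrt(log deg(z)), so the dot product
# estimates the sum of 1/log deg(z) over common neighbours z.
```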

Let me know if you have any other questions.

@jianshu93 (Author) commented

Hi @mikeheddes,

The beauty of one permutation hashing (with appropriate densification) is that matching hashes land at the same position in the sketch vectors of the two sets, so a Hamming similarity can be computed extremely fast via bitwise XOR and population count; see our BinDash 2 paper (it was for comparing genomes, but the idea is the same). Overall we saw a 10x to 50x speedup compared to the original scheme, where a priority queue is needed to sort the hashes and two for-loops to find the common hashes between sketches. A minimal sketch of the aligned comparison is below.
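A minimal sketch with illustrative assumptions: the per-position equality test is the estimator itself, and the packed variant keeps only one fingerprint bit per position to show the XOR/popcount mechanics (BinDash keeps several bits per bucket and corrects for residual bit collisions, which this sketch omits).

```python
import numpy as np

def oph_similarity(sk_a: np.ndarray, sk_b: np.ndarray) -> float:
    # Densified OPH aligns colliding hashes at the same index, so the
    # Jaccard estimate is just the fraction of matching positions:
    # no sorting or merging of hash sets is needed.
    return float(np.count_nonzero(sk_a == sk_b)) / sk_a.size

def packed_mismatches(words_a: np.ndarray, words_b: np.ndarray) -> int:
    # One fingerprint bit per position, 64 positions per uint64 word:
    # XOR marks differing positions, popcount tallies them per word.
    xor = np.bitwise_xor(words_a, words_b)
    return sum(int(w).bit_count() for w in xor)  # Python >= 3.10
```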

Jianshu
