You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Thanks for a nice paper, I have 2 questions related to both MinHash and link prediction:
What MinHash was used, on permutation hashing with optimal densification (http://proceedings.mlr.press/v70/shrivastava17a.html) or the old original MinHash from Broder 1997. The former is at least 10 times faster than the old one, with exactly the same RMSE or error, converged to 0 as sketch size increases. Dothash can estimate set intersection without bias, how about cardinality of each separate set (we need this for Jaccard according to inclusion and exclusion rule). Cardinality estimation based on HyperLogLog et.al., has a Camer-Rao lower bound which is much worse than Minhash for final Jaccard similarity estimation. Do you assume Cardinality of set A and B known?
For the MinHash implementation we used the original one by Broder. We later learned about the one permutation hashing method which indeed gives a significant speed up over the original MinHash. However, both methods are not able to estimate metrics like Adamic-Adar and Resource Allocation which can be estimated with our method. And yes, we assume that the cardinality of both sets is known. This is because our motivation is to reduce the cost of the many pair-wise comparisons, any one-time cost on each set is considered negligible.
Thank you for bringing this work to our attention, it looks very interesting.
The beauty of one permutation hashing (with appropriate densification) is that hashes that collide are at the same position of sketch vector from 2 sets, so a hamming similarity can be computed extremely fast via bit XOR and population count, see our BinDash 2 paper (it was for comparing genomes but the idea is the same). overall we saw a 10 to 50 speedup compare to the old one where a priority queue is needed to sort hashes and 2 for loops to find common hashes between sketches.
Dear DotHash team,
Thanks for a nice paper, I have 2 questions related to both MinHash and link prediction:
What MinHash was used, on permutation hashing with optimal densification (http://proceedings.mlr.press/v70/shrivastava17a.html) or the old original MinHash from Broder 1997. The former is at least 10 times faster than the old one, with exactly the same RMSE or error, converged to 0 as sketch size increases. Dothash can estimate set intersection without bias, how about cardinality of each separate set (we need this for Jaccard according to inclusion and exclusion rule). Cardinality estimation based on HyperLogLog et.al., has a Camer-Rao lower bound which is much worse than Minhash for final Jaccard similarity estimation. Do you assume Cardinality of set A and B known?
NodeSketch (https://dl.acm.org/doi/abs/10.1145/3292500.3330951), use probminhash (https://ieeexplore.ieee.org/abstract/document/9185081, they called it HistoSketch, but they did not realize the minhash they use is approximating the J_p but not weighted Jaccard) for comparing node similarity (considering weights), This is also very fast and efficient for link prediction tasks. We have a Rust implementation here: https://github.com/jean-pierreBoth/graphembed
Thanks,
Jianshu
The text was updated successfully, but these errors were encountered: