
how well it performs on high dimensional data, such as SIFT #39

Open
wlzhao22 opened this issue Aug 8, 2024 · 1 comment
wlzhao22 commented Aug 8, 2024

Thanks for your work! I am just wondering how well it works on high-dimensional data, such as SIFT or various deep features. Compared with the Hilbert curve, does it hold a similar property of preserving the locality of high-dimensional points? That is, when two points a, b \in R^d are neighbors, will they still be neighbors after they have been converted to one-dimensional values?

tzaeschke (Owner) commented

I assume you are interested in kNN query performance? It works "reasonably" well with high-dimensional data (I have tested the Java implementation with 1000 dimensions and the C++ implementation with 60 dimensions).

Short answer

You will have to measure it. I think there is a good chance the PH-tree works very well; it may be 20% slower than other structures, or equally likely considerably faster.

Long answer

There are several points to this:

  • How exactly it behaves depends on the dataset.
  • How well it works also depends on the operations you want to perform. The PH-tree excels at adding/moving/removing entries from the dataset. kNN queries are pretty fast, but I don't know exactly how they compare to other data structures (a minimal usage sketch follows after this list).
    • Until version 2.7.0, kNN queries were roughly on par with the other data structures I tested (KD-tree, quadtree, R*-tree, STR-tree), except for the CoverTree (the CoverTree has very fast kNN, but takes very long to build and is immutable).
    • With version 2.8.0, I reimplemented kNN queries. They seem to perform considerably better than before, but I haven't made a thorough comparison yet.
    • Other operations: The PH-tree tends to be much faster than other data structures when you want to update the data (i.e. move or remove entries).
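
For what it's worth, here is a minimal sketch of how a kNN query looks with the Java PH-tree. It uses the PhTreeF floating-point wrapper; the method names (PhTreeF.create, put, nearestNeighbour) are taken from the published API as I remember it and may differ between versions, so treat this as an illustration rather than copy-paste code.

```java
import java.util.Iterator;
import java.util.Random;

import ch.ethz.globis.phtree.PhTreeF;

public class PhTreeKnnSketch {
    public static void main(String[] args) {
        int dims = 128;                               // e.g. length of a SIFT descriptor
        PhTreeF<String> tree = PhTreeF.create(dims);  // PH-tree with double[] keys

        // Insert some random points; the value can be any payload.
        Random rnd = new Random(42);
        for (int i = 0; i < 1000; i++) {
            double[] p = new double[dims];
            for (int d = 0; d < dims; d++) {
                p[d] = rnd.nextDouble();
            }
            tree.put(p, "point-" + i);
        }

        // 10-NN query around a query point.
        double[] query = new double[dims];
        Iterator<String> knn = tree.nearestNeighbour(10, query);
        while (knn.hasNext()) {
            System.out.println(knn.next());           // payload of the next neighbor
        }
    }
}
```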

I have put some of my measurements (up to 40 dimensions) in a document (https://github.com/tzaeschke/TinSpin/blob/master/doc/benchmark-2017-01/Diagrams.pdf):

  • These measurements were made a long time ago (2016/2017), before the new kNN implementation, with Java 8, and on older hardware. Things may have changed a lot since then.
  • The measurements you are interested in are probably Fig. 32 - Fig. 35, which measure 1-NN and 10-NN queries on point datasets (ending with -P) with up to 40 dimensions. CU and CL are synthetic datasets as described in the document. There are two PH-trees in these measurements: PH and PHM.

If you are happy with approximate kNN, you can also have a look at https://github.com/erikbern/ann-benchmarks

Hilbert Curves

The PH-tree follows a Z-curve (Morton order), which has almost the same locality-preserving properties as Hilbert curves; however, Z-curves are much easier and faster to compute than Hilbert curves.
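
To illustrate why the Z-curve is cheap to compute, here is a small self-contained sketch (plain Java, not PH-tree internals) that interleaves the bits of d integer coordinates into a single Morton key. A Hilbert curve would additionally need state-dependent rotations/reflections at every level.

```java
import java.math.BigInteger;

public class MortonSketch {

    // Interleave the bits of d coordinates into one Morton (Z-order) key.
    // Bit b of coordinate i ends up at position b * d + i of the key.
    static BigInteger mortonKey(long[] coords, int bitsPerDim) {
        BigInteger key = BigInteger.ZERO;
        int d = coords.length;
        for (int bit = 0; bit < bitsPerDim; bit++) {
            for (int i = 0; i < d; i++) {
                if (((coords[i] >>> bit) & 1L) != 0) {
                    key = key.setBit(bit * d + i);
                }
            }
        }
        return key;
    }

    public static void main(String[] args) {
        // Points that differ only in a low-order bit of one coordinate get nearby keys;
        // larger jumps can still occur when a change crosses a bit boundary.
        System.out.println(mortonKey(new long[]{5, 3, 6}, 8)); // 371
        System.out.println(mortonKey(new long[]{5, 3, 7}, 8)); // 375
    }
}
```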
