
how well it performs on high dimensional data, such as SIFT #39

Open
wlzhao22 opened this issue Aug 8, 2024 · 1 comment
wlzhao22 commented Aug 8, 2024

Thanks for your work! I am just wondering how well it works on high-dimensional data, such as SIFT or various deep features. Compared with the Hilbert curve, does it hold a similar property of preserving the locality of high-dimensional points? That is, when two points a, b \in R^d are neighbors, will they still be neighbors after they have been converted to one-dimensional values?

tzaeschke (Owner) commented

I assume you are interested in kNN query performance? It works "reasonably" well with high-dimensional data (I have tested the Java implementation with 1000 dimensions and the C++ implementation with 60 dimensions).

Short answer

You will have to measure it. I think there is a good chance the PH-tree works very well; it may be 20% slower than other structures, or equally likely considerably faster.

Long answer

There are several points to this:

  • How exactly it behaves depends on the dataset.
  • How well it works also depends on the operations you want to perform. The PH-tree excels at adding/moving/removing entries from the dataset. kNN queries are pretty fast, but I don't know exactly how they compare to other data structures (a minimal usage sketch follows after this list).
    • Until version 2.7.0, kNN queries were roughly on par with the other data structures I tested (KD-tree, quadtree, R*-tree, STR-tree), except for the CoverTree (the CoverTree has very fast kNN, but takes very long to build and is immutable).
    • With version 2.8.0, I reimplemented kNN queries. They seem to perform considerably better than before, but I haven't made a thorough comparison yet.
    • Other operations: The PH-tree tends to be much faster than other data structures when you want to update the data (i.e. move or remove entries).
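
For what it's worth, here is a minimal sketch of how a kNN query looks with the Java PH-tree. It uses the PhTreeF floating-point wrapper; the method names (PhTreeF.create, put, nearestNeighbour) are taken from the published API as I remember it and may differ between versions, so treat this as an illustration rather than copy-paste code.

```java
import java.util.Iterator;
import java.util.Random;

import ch.ethz.globis.phtree.PhTreeF;

public class PhTreeKnnSketch {
    public static void main(String[] args) {
        int dims = 128;                               // e.g. length of a SIFT descriptor
        PhTreeF<String> tree = PhTreeF.create(dims);  // PH-tree with double[] keys

        // Insert some random points; the value can be any payload.
        Random rnd = new Random(42);
        for (int i = 0; i < 1000; i++) {
            double[] p = new double[dims];
            for (int d = 0; d < dims; d++) {
                p[d] = rnd.nextDouble();
            }
            tree.put(p, "point-" + i);
        }

        // 10-NN query around a query point.
        double[] query = new double[dims];
        Iterator<String> knn = tree.nearestNeighbour(10, query);
        while (knn.hasNext()) {
            System.out.println(knn.next());           // payload of the next neighbor
        }
    }
}
```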

I have put some of my measurements (up to 40 dimensions) in a document (https://github.com/tzaeschke/TinSpin/blob/master/doc/benchmark-2017-01/Diagrams.pdf):

  • These measurements were made a long time ago (2016/2017), before the new kNN implementation, with Java 8, and on older hardware. Things may have changed a lot since then.
  • The measurements you are interested in are probably Fig. 32 - Fig. 35, which measure 1-NN and 10-NN queries on point datasets (ending with -P) with up to 40 dimensions. CU and CL are synthetic datasets as described in the document. There are two PH-trees in these measurements: PH and PHM.

If you are happy with approximate kNN, you can also have a look at https://github.com/erikbern/ann-benchmarks

Hilbert Curves

The PH-tree follows a Z-curve (Morton order), which has almost the same locality-preserving properties as Hilbert curves; however, Z-curves are much easier and faster to compute than Hilbert curves.
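
To illustrate why the Z-curve is cheap to compute, here is a small self-contained sketch (plain Java, not PH-tree internals) that interleaves the bits of d integer coordinates into a single Morton key. A Hilbert curve would additionally need state-dependent rotations/reflections at every level.

```java
import java.math.BigInteger;

public class MortonSketch {

    // Interleave the bits of d coordinates into one Morton (Z-order) key.
    // Bit b of coordinate i ends up at position b * d + i of the key.
    static BigInteger mortonKey(long[] coords, int bitsPerDim) {
        BigInteger key = BigInteger.ZERO;
        int d = coords.length;
        for (int bit = 0; bit < bitsPerDim; bit++) {
            for (int i = 0; i < d; i++) {
                if (((coords[i] >>> bit) & 1L) != 0) {
                    key = key.setBit(bit * d + i);
                }
            }
        }
        return key;
    }

    public static void main(String[] args) {
        // Points that differ only in a low-order bit of one coordinate get nearby keys;
        // larger jumps can still occur when a change crosses a bit boundary.
        System.out.println(mortonKey(new long[]{5, 3, 6}, 8)); // 371
        System.out.println(mortonKey(new long[]{5, 3, 7}, 8)); // 375
    }
}
```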
