
[FEA] Strongly filtered CAGRA #480

Open
achirkin opened this issue Nov 20, 2024 · 4 comments
Labels
feature request New feature or request

Comments

@achirkin
Contributor

CAGRA has been observed to yield low recall when filtering is enabled, especially when the ratio of filtered-out values is high. This may be related in part to #208 and #472, but there may also be fundamental reasons for the lower recall.

This feature request tracks progress and suggestions for enabling high-recall, strongly filtered CAGRA.

As an experiment, I suggest trying the following tweaks, enabled by a boolean search parameter:

  • Disable the maximum search-iterations limit to allow longer searches.
  • Replace the hashmap used to track visited nodes with a dataset-long bitset. Replacing the small hashmap with a bitset eliminates hash collisions (and thus false positives) and prevents CAGRA from stopping early. (A minimal sketch follows this list.)
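
This is not the cuVS implementation, just a minimal CUDA sketch of what a dataset-long visited bitset could look like; all names here are hypothetical:

```cuda
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical sketch: one bit per dataset vector, so membership tracking is
// exact; there are no hash collisions and hence no false positives.
__device__ inline bool try_mark_visited(uint32_t* bitset, uint32_t node_id)
{
  const uint32_t word = node_id / 32u;
  const uint32_t mask = 1u << (node_id % 32u);
  // atomicOr returns the previous value of the word, so the caller can tell
  // whether this thread was the first to visit the node.
  return (atomicOr(bitset + word, mask) & mask) == 0;
}

// Host side: ceil(n_dataset / 32) words, zero-initialized, in global memory.
uint32_t* alloc_visited_bitset(size_t n_dataset)
{
  uint32_t* bitset  = nullptr;
  const size_t n_words = (n_dataset + 31) / 32;
  cudaMalloc(&bitset, n_words * sizeof(uint32_t));
  cudaMemset(bitset, 0, n_words * sizeof(uint32_t));
  return bitset;
}
```
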
@achirkin
Contributor Author

achirkin commented Nov 20, 2024

Related: BFKNN as a strongly-filtered CAGRA replacement #252

@anaruse
Contributor

anaruse commented Nov 25, 2024

In general, achieving good recall when the filtering rate is high requires traversing more nodes. For example, if the filtering rate is very high, say 99%, it will be difficult to achieve recall similar to that of unfiltered search unless you traverse roughly 100 times as many nodes. To do this, the size of the hash table would also need to be increased by a factor of 100, which is not very practical.
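
To make the scaling argument concrete, a tiny host-side illustration (the 8 KiB per-query hash table baseline is an assumed number, not a measured cuVS value):

```cuda
#include <cstdio>

int main()
{
  const double filter_rate   = 0.99;                      // 99% filtered out
  const double expansion     = 1.0 / (1.0 - filter_rate); // ~100x more nodes
  const double base_hash_kib = 8.0;  // assumed per-query hash table size
  std::printf("traversal expansion: ~%.0fx\n", expansion);
  std::printf("hash table needed:   ~%.0f KiB per query\n",
              base_hash_kib * expansion);
  return 0;
}
```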

We have just submitted a PR for a new multi-CTA algorithm that can shrink the hash table by a factor of 32 to 64, so I think it would be good to use this new multi-CTA algorithm to address this issue.

@achirkin
Contributor Author

@anaruse, what would you think about using a dataset-long bitset in place of the hashmap here, as an experiment? I guess the bitset would have to stay in global memory; do you think the performance would be unreasonably bad?

@anaruse
Contributor

anaruse commented Nov 26, 2024

Using a dataset-long bitset would also be a good approach. Performance is probably not an issue when using a bitset; it is the memory usage that should be the concern. I think it is a good idea to use the bitset whenever the memory usage of the hash table would exceed that of the bitset (a minimal sketch of that rule follows).
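
A minimal sketch of that crossover rule, assuming 32-bit hash-table entries (the entry width and the function name are assumptions):

```cuda
#include <cstddef>
#include <cstdint>

// Hypothetical selection heuristic: switch to the dataset-long bitset once
// the hash table would use more memory than the bitset itself
// (the bitset costs 1 bit per dataset vector).
bool prefer_bitset(size_t n_dataset, size_t hash_table_entries)
{
  const size_t bitset_bytes = (n_dataset + 7) / 8;
  const size_t hash_bytes   = hash_table_entries * sizeof(uint32_t);
  return hash_bytes > bitset_bytes;
}
```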
