
[FEA] Strongly filtered CAGRA #480

Open
achirkin opened this issue Nov 20, 2024 · 4 comments
Labels
feature request New feature or request

Comments

@achirkin
Contributor

CAGRA has been observed to yield low recall when filtering is enabled, especially when the ratio of filtered-out values is high. This may be related in part to #208 and #472, but there may also be fundamental reasons for the lower recall.

This feature request tracks progress and suggestions for enabling high-recall, strongly filtered CAGRA.

As an experiment, I suggest trying the following tweaks, enabled by a boolean search parameter:

  • Disable the maximum search-iterations limit to allow longer searches.
  • Replace the hashmap used to track visited nodes with a dataset-long bitset. Replacing the small hashmap with a bitset eliminates hash collisions (and thus false positives) and prevents CAGRA from stopping early. (A minimal sketch follows this list.)
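
This is not the cuVS implementation, just a minimal CUDA sketch of what a dataset-long visited bitset could look like; all names here are hypothetical:

```cuda
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical sketch: one bit per dataset vector, so membership tracking is
// exact; there are no hash collisions and hence no false positives.
__device__ inline bool try_mark_visited(uint32_t* bitset, uint32_t node_id)
{
  const uint32_t word = node_id / 32u;
  const uint32_t mask = 1u << (node_id % 32u);
  // atomicOr returns the previous value of the word, so the caller can tell
  // whether this thread was the first to visit the node.
  return (atomicOr(bitset + word, mask) & mask) == 0;
}

// Host side: ceil(n_dataset / 32) words, zero-initialized, in global memory.
uint32_t* alloc_visited_bitset(size_t n_dataset)
{
  uint32_t* bitset  = nullptr;
  const size_t n_words = (n_dataset + 31) / 32;
  cudaMalloc(&bitset, n_words * sizeof(uint32_t));
  cudaMemset(bitset, 0, n_words * sizeof(uint32_t));
  return bitset;
}
```
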
@achirkin
Contributor Author

achirkin commented Nov 20, 2024

Related: BFKNN as a strongly-filtered CAGRA replacement #252

@anaruse
Contributor

anaruse commented Nov 25, 2024

In general, achieving good recall when the filtering rate is high requires traversing more nodes. For example, if the filtering rate is very high, say 99%, it will be difficult to achieve recall similar to that of unfiltered search unless you traverse roughly 100 times as many nodes. To do this, the size of the hash table would also need to be increased by a factor of 100, which is not very practical.
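
To make the scaling argument concrete, a tiny host-side illustration (the 8 KiB per-query hash table baseline is an assumed number, not a measured cuVS value):

```cuda
#include <cstdio>

int main()
{
  const double filter_rate   = 0.99;                      // 99% filtered out
  const double expansion     = 1.0 / (1.0 - filter_rate); // ~100x more nodes
  const double base_hash_kib = 8.0;  // assumed per-query hash table size
  std::printf("traversal expansion: ~%.0fx\n", expansion);
  std::printf("hash table needed:   ~%.0f KiB per query\n",
              base_hash_kib * expansion);
  return 0;
}
```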

We have just submitted a PR for a new multi-CTA algorithm that can shrink the hash table by a factor of 32 to 64, so I think it would be good to use this new multi-CTA algorithm to address this issue.

@achirkin
Contributor Author

@anaruse, what would you think about using a dataset-long bitset in place of the hashmap here, as an experiment? I guess the bitset would have to stay in global memory; do you think the performance would be unreasonably bad?

@anaruse
Contributor

anaruse commented Nov 26, 2024

Using a dataset-long bitset would also be a good approach. Performance is probably not an issue when using a bitset; it is the memory usage that should be the concern. I think it is a good idea to use the bitset whenever the memory usage of the hash table would exceed that of the bitset (a minimal sketch of that rule follows).
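
A minimal sketch of that crossover rule, assuming 32-bit hash-table entries (the entry width and the function name are assumptions):

```cuda
#include <cstddef>
#include <cstdint>

// Hypothetical selection heuristic: switch to the dataset-long bitset once
// the hash table would use more memory than the bitset itself
// (the bitset costs 1 bit per dataset vector).
bool prefer_bitset(size_t n_dataset, size_t hash_table_entries)
{
  const size_t bitset_bytes = (n_dataset + 7) / 8;
  const size_t hash_bytes   = hash_table_entries * sizeof(uint32_t);
  return hash_bytes > bitset_bytes;
}
```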
