[QST] How to choose parameters for CAGRA? #571
Hey @cjnolet is the general idea that higher values for `graph_degree` and `intermediate_graph_degree` yield better recall?
@abs51295 correct, but at the expense of slower build time. Also, keep in mind that the search parameters can be tuned as well, so you could still get good recall with a model that was faster to build, at the expense of search latency.
Thanks @cjnolet. Although, I wonder if there's any way to choose those parameters depending on how many neighbors you want to query and the dataset size. For example, the pynndescent library aims for higher accuracy (80-90%) by default, and across all of the ann-benchmarks datasets they show that those parameters make sense. I was hoping for something similar for CAGRA, where we would have guidelines for reasonable defaults similar to pynndescent's.
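For reference, a minimal sketch of where those build and search parameters plug into the cuVS Python API, assuming the `cuvs.neighbors.cagra` module layout from the current docs (the parameter values below are illustrative, not recommended defaults):

```python
import cupy as cp
from cuvs.neighbors import cagra

# Random data stands in for a real dataset; cuVS expects float32 rows.
dataset = cp.random.random_sample((100_000, 128), dtype=cp.float32)
queries = cp.random.random_sample((1_000, 128), dtype=cp.float32)

# Build parameters: larger graph degrees generally raise recall at the
# cost of build time and index size.
build_params = cagra.IndexParams(
    intermediate_graph_degree=128,
    graph_degree=64,
)
index = cagra.build(build_params, dataset)

# Search parameters: itopk_size is the internal candidate-list size;
# raising it trades query throughput for recall.
search_params = cagra.SearchParams(itopk_size=128)
distances, neighbors = cagra.search(search_params, index, queries, k=50)
```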
Another follow-up question: I see that there's a CAGRA multi-GPU implementation, but I don't see an example of how to use it from the cuVS Python API. Is it only in C++?
Normally one would go through a tuning process with cross-validation (using recall as the metric), like any other machine learning model, in order to configure the index and search parameters to their needs. We touch on this in our getting started materials here. That is correct about our multi-GPU APIs. Currently they are very experimental, but we do plan to eventually expose them to the various language wrappers (such as C, Python, and even Java).
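As a concrete version of that tuning loop, here is a sketch of a recall metric one could use to score each parameter configuration against exact ground truth (the `recall` helper is hypothetical, not part of cuVS):

```python
import numpy as np

def recall(approx_neighbors: np.ndarray, true_neighbors: np.ndarray) -> float:
    """Average fraction of the true k nearest neighbor ids recovered.

    Both inputs are (n_queries, k) integer id matrices: one from the ANN
    index under test, one from an exact (brute-force) search.
    """
    n_queries, k = true_neighbors.shape
    hits = sum(
        len(set(approx_neighbors[i]) & set(true_neighbors[i]))
        for i in range(n_queries)
    )
    return hits / (n_queries * k)
```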
Thanks @cjnolet for the pointers. I ran 4 models on my dataset of size 12M. This is all compared to brute-force `kneighbors` from cuML.
Hey @abs51295, just FYI: cuML is actually using cuVS under the hood for brute force, so you are welcome to just use the cuVS Python API for that comparison. It'll at least clean up your code so you aren't having to call into two different libraries. It is true generally (for most ANN algorithms) that increasing the parameters (or decreasing, depending on the parameter) will yield higher recall. That's not the first time I've seen a large graph degree / intermediate graph degree cause a recall drop like that, and I suspect it's a bug (either in the recall computation or in the cuVS algorithm). I would suggest taking a look at cuVS bench if you want to compare ANN algorithms side by side. It'll run a full sweep of the parameter space and provide you the curve of best-performing parameters (in terms of query throughput for each achieved recall level). It's important to consider query latency/throughput when you tune models, because you could get a model that seems to give really great recall and find that its throughput is too low to be acceptable. Here are the cuVS bench tool docs if interested (we had to create our own tool because none of the existing tools cared to measure build time): https://docs.rapids.ai/api/cuvs/nightly/cuvs_bench/ Please also note that we just migrated this over from the RAFT library and have found some small bugs in the docs recently, so I apologize for those up front; we have a PR up to fix them.
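Picking up the suggestion to stay within one library, a sketch of producing exact ground truth with the cuVS brute-force module, assuming the `cuvs.neighbors.brute_force` API described in the docs:

```python
import cupy as cp
from cuvs.neighbors import brute_force

dataset = cp.random.random_sample((100_000, 128), dtype=cp.float32)
queries = cp.random.random_sample((1_000, 128), dtype=cp.float32)

# Exact k-NN over the whole dataset; feasible on a GPU at moderate scale
# and gives the ground truth that the recall metric above needs.
bf_index = brute_force.build(dataset)
distances, true_neighbors = brute_force.search(bf_index, queries, k=50)
```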
Thanks @cjnolet. I checked the recall calculation code and it seems fine, so the problem is likely in the cuVS implementation of CAGRA. I was just testing this on a small dataset of 12M vectors, but we have much bigger datasets of 120M vectors that we want to run CAGRA on, and if this issue persists I am worried that just building the largest model possible (for best recall) isn't going to fare well for us. I am fine with throughput being low. We can't do brute force as the dataset size increases, so for testing we reduced the size. Have you seen using larger models help with recall at these dataset sizes?
Please also note that itopk is a search parameter, and you can vary the recall (even for smaller models) using the search parameters. I notice in your tests you were providing only a single itopk. Have you tried varying that at all? Also note that if the default parameters are giving you acceptable recall, you may not need to go any further. Is 99.99% recall with default params providing good enough search throughput? The model size doesn't necessarily scale with dataset size.
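To make that suggested sweep concrete: keep the built index fixed and vary only `itopk_size`, recording both recall and search time at each setting. A sketch under the same assumptions as the earlier snippets (`index`, `queries`, `true_neighbors`, and `recall` carry over from those hypothetical examples):

```python
import time
import cupy as cp
from cuvs.neighbors import cagra

# Sweep only the search-time knob; the index is built once and reused.
for itopk in (64, 128, 256, 512):
    params = cagra.SearchParams(itopk_size=itopk)
    start = time.perf_counter()
    _, neighbors = cagra.search(params, index, queries, k=50)
    cp.cuda.Stream.null.synchronize()  # wait for the GPU before stopping the clock
    elapsed = time.perf_counter() - start
    r = recall(cp.asnumpy(cp.asarray(neighbors)),
               cp.asnumpy(cp.asarray(true_neighbors)))
    print(f"itopk={itopk:4d}  recall={r:.4f}  search_time={elapsed:.3f}s")
```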
I haven't tried varying `itopk_size` yet.
I want to select a good choice of parameters that would give me high recall (>95%) for two large datasets that I have. One of them has 120M rows and the other has 50M rows. I have access to an NVIDIA A100 GPU with 80GB VRAM. I want to compute 50 neighbors for every data point, and so far the only change I have made is to use `itopk_size=128` when searching. Since the datasets are too large, I am not able to perform a brute-force search and calculate recall. I was wondering if there are any guidelines on choosing parameters.
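One workaround worth noting when exact search over the full dataset is infeasible: estimate recall from a sample. Draw a modest number of query points, brute-force only those against the full dataset, and score CAGRA against that partial ground truth. A hedged sketch of the idea, assuming the same cuVS Python modules as above:

```python
import numpy as np
import cupy as cp
from cuvs.neighbors import brute_force, cagra

def sampled_recall(dataset, index, search_params, k=50, n_samples=1_000, seed=0):
    """Estimate CAGRA recall from a random sample of query points.

    `dataset` is an (n_rows, dim) float32 cupy array and `index` a built
    CAGRA index over it. Only the sampled queries get exact search, so
    the cost scales with n_samples rather than n_rows squared.
    """
    ids = np.random.default_rng(seed).choice(
        dataset.shape[0], size=n_samples, replace=False
    )
    queries = dataset[cp.asarray(ids)]

    # Exact ground truth for the sampled queries only. For very large
    # datasets this step may need to be batched over dataset chunks.
    bf_index = brute_force.build(dataset)
    _, true_nn = brute_force.search(bf_index, queries, k=k)

    _, approx_nn = cagra.search(search_params, index, queries, k=k)
    true_nn, approx_nn = cp.asarray(true_nn), cp.asarray(approx_nn)
    hits = sum(
        len(set(approx_nn[i].tolist()) & set(true_nn[i].tolist()))
        for i in range(n_samples)
    )
    return hits / (n_samples * k)
```

Because only the sampled queries are searched exactly, recall estimation stays tractable even at 120M rows, at the price of some sampling noise in the estimate.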