
[QST] How to choose parameters for CAGRA? #571

Open
abs51295 opened this issue Jan 14, 2025 · 11 comments
Labels
question Further information is requested

Comments

@abs51295

I want to select parameters that would give me high recall (>95%) on two large datasets: one has 120M rows and the other has 50M rows. I have access to an NVIDIA A100 GPU with 80 GB of VRAM. I want to compute 50 neighbors for every data point, and so far the only change I have made is to use itopk_size=128 when searching. Since the datasets are too large, I am not able to perform a brute-force search to calculate recall. I was wondering if there are any guidelines on choosing parameters.
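For reference, my current setup looks roughly like this (a sketch; the dataset here is a stand-in for my real data):

```python
import cupy as cp
from cuvs.neighbors import cagra

# Stand-in for my real dataset: float32 CuPy array of shape (n_rows, n_dims)
dataset = cp.random.random((1000, 10)).astype(cp.float32)

index = cagra.build(cagra.IndexParams(metric="sqeuclidean"), dataset)

# The only non-default so far: itopk_size=128 at search time, k=50,
# querying every data point against the index
search_params = cagra.SearchParams(itopk_size=128)
distances, neighbors = cagra.search(search_params, index, dataset, 50)
```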

@abs51295 abs51295 added the question Further information is requested label Jan 14, 2025
@cjnolet
Member

cjnolet commented Jan 14, 2025

Hi @abs51295,

Please take a look at our index documentation here and let us know if more information would help.

We also have a blog that explains some of these parameters.

@abs51295
Author

Hey @cjnolet, is the general idea that the higher the values for graph_degree, intermediate_graph_degree, and itopk_size, the better the model?

@cjnolet
Member

cjnolet commented Jan 15, 2025

@abs51295 correct, but at the expense of slower build time. Also, keep in mind that the search parameters can be tuned as well, so you could still get good recall with a model that was faster to build, at the expense of search latency.

@abs51295
Author

Thanks @cjnolet. I wonder, though, if there's any way to choose those parameters depending on how many neighbors you want to query and the dataset size. For example, the pynndescent library aims for high accuracy (80-90%) by default, and across the ANN benchmarks its default parameters hold up well. I was hoping for something similar for CAGRA, where we would have guidelines for reasonable defaults similar to pynndescent's.

@abs51295
Author

abs51295 commented Jan 15, 2025

Another follow-up question: I see that there's a CAGRA multi-GPU implementation, but I don't see an example of how to use it from the cuVS Python API. Is it only available in C++?

@cjnolet
Member

cjnolet commented Jan 16, 2025

> I wonder, though, if there's any way to choose those parameters depending on how many neighbors you want to query and the dataset size.

Normally one would go through a tuning process with cross-validation (using recall as the metric), just like any other machine learning model, in order to configure the index and search parameters for their needs. We touch on this in our getting started materials here.
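As a rough sketch of what that could look like with the Python API (the parameter grid and array names here are just illustrative, not recommendations):

```python
import cupy as cp
from cuvs.neighbors import brute_force, cagra
from cuvs.test.ann_utils import calc_recall

k = 50
queries = dataset[:10_000]  # validation sample; `dataset` is your float32 CuPy array

# Exact ground truth on the query sample via brute force
bf_index = brute_force.build(dataset, metric="sqeuclidean")
_, gt = brute_force.search(bf_index, queries, k)
gt = cp.asnumpy(cp.asarray(gt))

# Sweep build and search parameters, scoring each combination by recall
for graph_degree in (32, 64, 128):
    params = cagra.IndexParams(metric="sqeuclidean",
                               graph_degree=graph_degree,
                               intermediate_graph_degree=2 * graph_degree)
    index = cagra.build(params, dataset)
    for itopk in (64, 128, 256):
        _, nn = cagra.search(cagra.SearchParams(itopk_size=itopk), index, queries, k)
        print(graph_degree, itopk, calc_recall(cp.asnumpy(cp.asarray(nn)), gt))
```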

That is correct about our multi-GPU APIs. Currently they are very experimental, but we do plan to eventually expose them through the various language wrappers (such as C, Python, and even Java).

@abs51295
Author

abs51295 commented Jan 16, 2025

Thanks @cjnolet for the pointers. I ran 4 models on my dataset of shape (11852927, 10) for k=50. Here are the model specs, runtimes, and recalls:

  1. Default model (graph_degree=64, itopk_size=64, intermediate_graph_degree=128): recall of 0.9999833070768089 with runtime of 132.02 seconds
  2. Model with graph_degree=128, itopk_size=128, intermediate_graph_degree=256: recall of 0.7581069874133199 with runtime of 257.97 seconds
  3. Model with graph_degree=256, itopk_size=256, intermediate_graph_degree=512: recall of 0.9751363675824545 with runtime of 1149.91 seconds
  4. Model with graph_degree=512, itopk_size=512, intermediate_graph_degree=1024: recall of 0.22899715487997185 with runtime of 7212.99 seconds

This is all compared to brute-force kneighbors from cuML with metric='euclidean' and algorithm='brute'; the runtime for that was 822.18 seconds. My understanding was that the larger the model, the better, but that doesn't seem to be the case for recall here? Also, the runtime for the last two models is way more than the brute-force baseline, which I think is expected, but I'm not sure. Recall is measured with calc_recall from cuvs.test.ann_utils, and all the models use metric='sqeuclidean' and build_algo='nn_descent'.
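For completeness, the ground-truth run looked roughly like this (a sketch; note that squared vs. plain Euclidean gives the same neighbor ranking, so the recall comparison should still hold):

```python
from cuml.neighbors import NearestNeighbors

# Exact k-NN baseline on the full dataset (shape (11852927, 10))
nn = NearestNeighbors(n_neighbors=50, metric="euclidean", algorithm="brute")
nn.fit(dataset)
gt_distances, gt_neighbors = nn.kneighbors(dataset)
```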

@cjnolet
Member

cjnolet commented Jan 16, 2025

Hey @abs51295,

Just FYI, cuML is actually using cuVS under the hood for brute force, so you are welcome to just use the cuVS Python API for that comparison. It'll at least clean your code up so you aren't having to call into two different libraries.
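That would look roughly like this (a sketch; array names are placeholders):

```python
from cuvs.neighbors import brute_force

# Exact search directly through cuVS (the same engine cuML dispatches to)
index = brute_force.build(dataset, metric="sqeuclidean")
distances, neighbors = brute_force.search(index, queries, 50)
```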

It is true generally (for most ANN algorithms) that increasing the parameters (or decreasing, depending on the parameter) will yield higher recall. This isn't the first time I've seen a large graph_degree / intermediate_graph_degree cause a recall drop like that, and I suspect it's a bug (either in the recall computation or in the cuVS algorithm).

I would suggest taking a look at cuVS bench if you want to compare ANN algorithms side by side. It'll run a full sweep of the parameter space and give you the curve of best-performing parameters (in terms of query throughput at each achieved recall level). It's important to consider query latency/throughput when you tune models, because you could get a model that seems to give really great recall and then find that its throughput is too low to be acceptable.

Here are the cuVS bench tool docs if you're interested (we had to create our own tool because none of the existing tools care to measure build time): https://docs.rapids.ai/api/cuvs/nightly/cuvs_bench/

Please also note that we just migrated this over from the RAFT library and have found some small bugs in the docs recently, so I apologize for those up front; we have a PR up to fix them.

@abs51295
Author

abs51295 commented Jan 16, 2025

Thanks @cjnolet. I checked the recall calculation code and it seems fine, so the problem is likely in the cuVS implementation of CAGRA. I was just testing this on a small dataset of 12M vectors, but we have much bigger datasets of 120M vectors that we want to run CAGRA on, and if this issue persists, I'm worried that simply building the largest model possible (for best recall) isn't going to fare well for us. I am fine with throughput being low. We can't do brute force as the dataset size increases, so for testing we reduced the size. Have you seen build_algo='ivf_pq' fix this, or is there any other way to fix this issue?
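That is, building with something like:

```python
from cuvs.neighbors import cagra

# Same model config, but with the IVF-PQ graph-build path instead of NN-descent
params = cagra.IndexParams(
    metric="sqeuclidean",
    graph_degree=256,
    intermediate_graph_degree=512,
    build_algo="ivf_pq",
)
index = cagra.build(params, dataset)
```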

@cjnolet
Member

cjnolet commented Jan 16, 2025

Please also note that itopk is a search parameter, and you can vary the recall (even for smaller models) using the search parameters. I notice in your tests you provided only a single itopk per model. Have you tried varying that at all?

Also note that if the default parameters are giving you acceptable recall, you may not need to go any further. Is 99.99% recall with default params providing good enough search throughput? The model size doesn't necessarily scale with dataset size.
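For example, a quick sweep over itopk_size on an already-built index might look like this (a sketch; index and queries assumed from earlier):

```python
from cuvs.neighbors import cagra

# Reuse one built index; only the search-time parameter changes
for itopk in (64, 128, 256, 512):
    params = cagra.SearchParams(itopk_size=itopk)
    distances, neighbors = cagra.search(params, index, queries, 50)
    # ...measure recall per setting and pick the cheapest one that meets the target
```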

@abs51295
Author

abs51295 commented Jan 16, 2025

I haven't tried varying itopk at all for the same model. Yeah, the 99.99% recall is perfect, and the runtime I mentioned includes building the index plus searching, so it's looking very good in terms of throughput. I am fine with using the default model; this was just an exercise to understand the algorithm better. Do you think that for some value of itopk the other models would give the same recall (99.99%)? Also, do you think that choosing a model on a subset of the data (20M vectors out of 120M) makes sense?
