-
-
Notifications
You must be signed in to change notification settings - Fork 214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/indexing faissless #173
Conversation
@jlscheerer will merge if this looks good to you! |
ragatouille/models/index.py
Outdated
"If you're confident with FAISS working issue-free on your machine, pass use_faiss=True to revert to the FAISS-using behaviour." | ||
) | ||
print("--------------------") | ||
CollectionIndexer._original_train_kmeans = CollectionIndexer._train_kmeans |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use another variable to track this? Would avoid directly setting object attributes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could yeah! This was mostly for the convenience of checking on hasattr
later on, but it might be better practice to set it to a new object instead. I'll change it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mentioned on Discord but I'm actually thinking this is a relatively sane way of doing it, because we need to keep it alive for the entirety of the session -- we're monkey-patching the colbert-ai
indexer itself and we want to be able to revert anytime someone needs to use faiss
, so local variables wouldn't cut it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right yea, makes sense!
In this case, can we assign the faiss and non-faiss k-means functions as class attributes of PLAIDModelIndex
. Then, at build time, we can just toggle between them (to set CollectionIndexer._train_k_means) based on the monkey_patch flag. Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious to get your thoughts here too @jlscheerer !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of having it be a class attribute on PLAIDModelIndex
and toggling it on use (+ persisting the flag). This would perhaps provide more consistent behaviour when rebuilding/adding to an already persisted index (e.g., if we decide to rebuild as part of add_to_index
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion! Implemented now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks awesome! Besides some minor inconsistencies for low score documents everything works well for me locally (albeit on a small corpus). Really only have some very minor nits!
I'll be off now and probably away most of the weekend, but pointing it out here: @jlscheerer found out that we can get empty results for queries with no relevant docs. I'm not 100% sure what the cause is, but I think one potential reason is that |
Re-implemented the K-means to be closer to faiss' implementation. I was struggling to reproduce benchmark (JaColBERT on JSQuAD was losing 4pts recall@1, 3pts recall@5). New implementation gets virtually identical results (91.99R@1 vs 92.07 for FAISS, 97.659R@5 vs 97.658R@5 for FAISS). Running on a larger benchmark now to make sure as I apply the other fixes, but likely looking good enough to release today. @jlscheerer could you quickly try again on your dataset? There's now empty cluster handling implemented, so empty results shouldn't occur anymore. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥
Stepping back on my enthusiasm a tiny bit: it OOMs pretty easily, trying to improve batching/memory usage without impacting performance 😄 |
feat: rework kmeans to be closer to FAISS chore: store kmeans functions as class attributes fix: method assignment chore: more memory efficient lint chore: lower bsize, resultd unaffected feat: better batching, slower max doc count chore: batch size safe for 8gb GPUs chore: more elaborate warning chore: use external lib to support minibatching, revert to homebrew later
98f6acf
to
d89656b
Compare
Faiss is the source of a ridiculous amount of issues with this repository, and is becoming increasingly harder to play around in wide-userbase repositories as the versions compatible with recent CUDA drivers are only available via conda or by building from source.
This PR introduces:
colbert-ai
)use_faiss=True
to force faiss being used.faiss
as a dependency