Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/indexing faissless #173

Merged
merged 19 commits into from
Mar 18, 2024
Merged

Feat/indexing faissless #173

merged 19 commits into from
Mar 18, 2024

Conversation

bclavie
Copy link
Collaborator

@bclavie bclavie commented Mar 15, 2024

Faiss is the source of a ridiculous amount of issues with this repository, and is becoming increasingly harder to play around in wide-userbase repositories as the versions compatible with recent CUDA drivers are only available via conda or by building from source.

This PR introduces:

  • A slight fix for RAGatouille: Dedicated ModelIndex #158 to avoid force-reloading the searcher
  • An implementation of K-Means in raw pytorch
  • Indexing to default to this implementation for any collection size smaller than 500k documents (the assumption being that users with massive collections are likely to be attached to the canonical, research behaviour of colbert-ai)
  • Users to be presented with a warning message about the behaviour change.
  • The possibility to pass use_faiss=True to force faiss being used.
  • A fallback behaviour where any exception when indexing with the torch kmeans will default to the existing behaviour.
  • If this is successful, v0.9.0 will remove faiss as a dependency

@bclavie bclavie requested review from okhat and Anmol6 March 15, 2024 13:25
@bclavie bclavie requested a review from jlscheerer March 15, 2024 16:56
@bclavie
Copy link
Collaborator Author

bclavie commented Mar 15, 2024

@jlscheerer will merge if this looks good to you!

ragatouille/models/index.py Show resolved Hide resolved
"If you're confident with FAISS working issue-free on your machine, pass use_faiss=True to revert to the FAISS-using behaviour."
)
print("--------------------")
CollectionIndexer._original_train_kmeans = CollectionIndexer._train_kmeans
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use another variable to track this? Would avoid directly setting object attributes!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could yeah! This was mostly for the convenience of checking on hasattr later on, but it might be better practice to set it to a new object instead. I'll change it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mentioned on Discord but I'm actually thinking this is a relatively sane way of doing it, because we need to keep it alive for the entirety of the session -- we're monkey-patching the colbert-ai indexer itself and we want to be able to revert anytime someone needs to use faiss, so local variables wouldn't cut it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right yea, makes sense!

In this case, can we assign the faiss and non-faiss k-means functions as class attributes of PLAIDModelIndex. Then, at build time, we can just toggle between them (to set CollectionIndexer._train_k_means) based on the monkey_patch flag. Wdyt?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious to get your thoughts here too @jlscheerer !

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of having it be a class attribute on PLAIDModelIndex and toggling it on use (+ persisting the flag). This would perhaps provide more consistent behaviour when rebuilding/adding to an already persisted index (e.g., if we decide to rebuild as part of add_to_index).

Copy link
Collaborator Author

@bclavie bclavie Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion! Implemented now.

ragatouille/models/index.py Show resolved Hide resolved
ragatouille/models/torch_kmeans.py Show resolved Hide resolved
ragatouille/models/index.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@jlscheerer jlscheerer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks awesome! Besides some minor inconsistencies for low score documents everything works well for me locally (albeit on a small corpus). Really only have some very minor nits!

ragatouille/models/torch_kmeans.py Outdated Show resolved Hide resolved
@bclavie
Copy link
Collaborator Author

bclavie commented Mar 15, 2024

I'll be off now and probably away most of the weekend, but pointing it out here:

@jlscheerer found out that we can get empty results for queries with no relevant docs. I'm not 100% sure what the cause is, but I think one potential reason is that faiss doesn't really allow any empty clusters at the end of an iteration, whereas I'm pretty sure our approach isn't safe from empty clusters. Thankfully, it'll be pretty trivial to add empty cluster handling and see if it fixex it!

@bclavie
Copy link
Collaborator Author

bclavie commented Mar 18, 2024

Re-implemented the K-means to be closer to faiss' implementation. I was struggling to reproduce benchmark (JaColBERT on JSQuAD was losing 4pts recall@1, 3pts recall@5). New implementation gets virtually identical results (91.99R@1 vs 92.07 for FAISS, 97.659R@5 vs 97.658R@5 for FAISS). Running on a larger benchmark now to make sure as I apply the other fixes, but likely looking good enough to release today.

@jlscheerer could you quickly try again on your dataset? There's now empty cluster handling implemented, so empty results shouldn't occur anymore.

Copy link
Collaborator

@Anmol6 Anmol6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

@bclavie
Copy link
Collaborator Author

bclavie commented Mar 18, 2024

Stepping back on my enthusiasm a tiny bit: it OOMs pretty easily, trying to improve batching/memory usage without impacting performance 😄

feat: rework kmeans to be closer to FAISS

chore: store kmeans functions as class attributes

fix: method assignment

chore: more memory efficient

lint

chore: lower bsize, resultd unaffected

feat: better batching, slower max doc count

chore: batch size safe for 8gb GPUs

chore: more elaborate warning

chore: use external lib to support minibatching, revert to homebrew later
@bclavie bclavie force-pushed the feat/indexing_faissless branch from 98f6acf to d89656b Compare March 18, 2024 18:00
@bclavie bclavie enabled auto-merge (squash) March 18, 2024 19:20
@bclavie bclavie merged commit d27b693 into main Mar 18, 2024
2 checks passed
@bclavie bclavie deleted the feat/indexing_faissless branch March 18, 2024 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants