-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hybrid search example code (for docs) #30
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Morgan Gallant <[email protected]>
if doc_id in bm25_ranks: | ||
score += 1.0 / (k + bm25_ranks[doc_id]) | ||
if doc_id in vector_ranks: | ||
score += 1.0 / (k + vector_ranks[doc_id]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, is classic RRF only rank-based? Nothing about scores? Probably fine to begin with
@@ -0,0 +1,122 @@ | |||
import turbopuffer as tpuf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this run as part of the test suite?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No major qualms here; just make sure it runs as part of the test suite (I left some comments about the code that references this)
|
||
# Fuses two search result sets together into one | ||
# Uses reciprocal rank fusion | ||
def rank_fusion(bm25_results, vector_results, k=60): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we may want to inline/copy-paste this in a few places, so I think we need 1 top-level comments and then the function to be as small as possible, e.g. something alone the lines of this (not tested):
def results_to_ranks(results, reverse=False):
return {item.id: rank for rank, item in enumerate(sorted(results, key=lambda item: item.dist, reverse=reverse), start=1)}
def rank_fusion(bm25_results, vector_results, k=60):
bm25_ranks, vector_ranks = results_to_ranks(bm25_results), results_to_ranks(vector_results, reverse=True)
scores = {doc_id: (1 / (k + bm25_ranks[doc_id]) if doc_id in bm25_ranks else 0) +
(1 / (k + vector_ranks[doc_id]) if doc_id in vector_ranks else 0)
for doc_id in set(bm25_ranks) | set(vector_ranks)}
return [{"id": doc_id, "score": score} for doc_id, score in sorted(scores.items(), key=lambda item: item[1], reverse=True)]
The more code you have to paste, the more you'll question why we don't build this into this library (good reasons). The shorter it is, the more you'll want to tweak it (good).
I don't think any of the other examples run as part of tests, should we make an exception here? I'm not convinced, this is moreso documentation than anything else |
Signed-off-by: Morgan Gallant <[email protected]>
No description provided.