ParentJoin Benchmarks for KNN Search #296
Very exciting! I will try to review the code changes soon ... thanks @vigyasharma.
How do I get the source (vectors) file input to run this?
Looks great, thank you @vigyasharma! I'm very curious where/how I can get the parent/join meta file to try running this myself...
Thanks for the prompt review @mikemccand
We can generate it with `python src/python/infer_token_vectors_cohere.py -d <num_docs> -q <num_queries>`.
I saw one accidental code block dup -- then let's merge!
Resolved conflicts and merge duplication errors. I also like the new output from knnGraphTester with more graph details.

reindex takes 14.05 sec
Force merge index in knnIndices/cohere-wikipedia-docs-768d.vec-32-50-parentJoin.index
Force merge done in 12.76 sec
index has 1 segments
index disk uage is 295.02 MB
SUMMARY: 0.098 0.725 101323 10 6 32 50 no 9 14.05 12.76 1 295.02 1.00 post-filter
Leaf 0 has 4 layers
Leaf 0 has 101323 documents
Graph level=3 size=6, Fanout min=1, mean=2.67, max=4, meandelta=10062.31
% 0 10 20 30 40 50 60 70 80 90 100
0 1 1 1 2 3 3 3 3 3 4 4
Graph level=2 size=61, Fanout min=1, mean=7.54, max=16, meandelta=7024.34
% 0 10 20 30 40 50 60 70 80 90 100
0 3 5 6 7 7 8 9 10 11 16
Graph level=1 size=2994, Fanout min=1, mean=4.51, max=32, meandelta=5549.65
% 0 10 20 30 40 50 60 70 80 90 100
0 1 1 1 1 1 1 5 9 13 32
Graph level=0 size=100000, Fanout min=1, mean=3.81, max=64, meandelta=3386.53
% 0 10 20 30 40 50 60 70 80 90 100
0 1 1 1 3 3 3 3 3 3 64
Graph level=3 size=6, connectedness=1.00
Graph level=2 size=61, connectedness=1.00
Graph level=1 size=2994, connectedness=1.00
Graph level=0 size=100000, connectedness=0.96
Results:
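| recall | latency (ms) | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | force merge s | num segments | index size (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.098 | 0.725 | 101323 | 10 | 6 | 32 | 50 | no | 14.05 | 12.76 | 1 | 295.02 |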
Thanks @vigyasharma -- this is an exciting improvement to KNN benchmarking!
Adds parent join benchmarks for KNN Search. We use the passage search use-case with Cohere embeddings created from Wikipedia. Each parent document corresponds to a Wikipedia article, and child documents correspond to paragraphs (chunks) within the article. Embeddings are only present for child documents.
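As a rough illustration of the document layout described above, the sketch below groups paragraph chunks under their article and emits each article as a block with the child documents first and the parent last, which is the ordering Lucene's block join indexing expects. The helper `build_blocks` and the field names (`doc_type`, `wiki_id`, `vector`) are hypothetical and are not taken from this PR.

```python
from collections import OrderedDict


def build_blocks(chunks):
    """Group (article_id, vector) chunks into per-article blocks.

    Each block lists its child documents first and the parent last.
    Only child documents carry a vector. All field names are illustrative.
    """
    by_article = OrderedDict()
    for article_id, vector in chunks:
        by_article.setdefault(article_id, []).append(vector)

    blocks = []
    for article_id, vectors in by_article.items():
        block = [
            {"doc_type": "child", "wiki_id": article_id, "vector": v}
            for v in vectors
        ]
        # The parent closes the block and has no embedding of its own.
        block.append({"doc_type": "parent", "wiki_id": article_id})
        blocks.append(block)
    return blocks


# Toy input: two paragraphs for article "a1", one for article "a2".
chunks = [("a1", [0.1, 0.2]), ("a1", [0.3, 0.4]), ("a2", [0.5, 0.6])]
blocks = build_blocks(chunks)
```

With this layout, the total document count is the number of chunks plus the number of articles, since every article contributes one extra parent document.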
This change leverages Lucene's `DiversifyingChildrenFloatKnnVectorQuery`, using `exactSearch()` for baseline, and `approximateSearch()` for knn search. Recall is computed by calculating overlap between the two.

Note: We can use the `infer_token_vectors_cohere.py` script to generate the parentJoin metadata file for Cohere embeddings dataset.
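The overlap-based recall described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function name `recall_at_k` and the toy document ids are assumptions, and the real benchmark computes its result lists through Lucene rather than from hand-written ids.

```python
def recall_at_k(exact_hits, approx_hits):
    """Average, over queries, of the fraction of exact top-K ids
    that the approximate search also returned.

    exact_hits / approx_hits: one list of document ids per query.
    """
    if len(exact_hits) != len(approx_hits):
        raise ValueError("need one exact and one approximate result list per query")
    if not exact_hits:
        return 0.0

    total = 0.0
    for exact, approx in zip(exact_hits, approx_hits):
        if exact:  # skip queries with an empty baseline
            total += len(set(exact) & set(approx)) / len(exact)
    return total / len(exact_hits)


# Toy example with two queries and K = 4 (ids are made up).
exact = [[1, 2, 3, 4], [10, 11, 12, 13]]
approx = [[1, 2, 3, 9], [10, 11, 7, 8]]
score = recall_at_k(exact, approx)
print(score)  # 0.625
```

The first query recovers 3 of 4 exact hits and the second 2 of 4, so the average is (0.75 + 0.5) / 2 = 0.625.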
Sample Run Results