ParentJoin Benchmarks for KNN Search #296

Merged: 17 commits, Sep 23, 2024

@vigyasharma (Contributor) commented Sep 6, 2024

Adds parent join benchmarks for KNN search. We use the passage search use case with Cohere embeddings created from Wikipedia. Each parent document corresponds to a Wikipedia article, and child documents correspond to paragraphs (chunks) within the article. Embeddings are only present for child documents.
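As a rough illustration of the parent/child layout described above, here is a minimal sketch that groups metadata rows into blocks. It assumes rows arrive as (wiki_id, para_id) pairs with paragraphs of the same article adjacent to each other, and it places the parent last in each block, which is the ordering Lucene's block join expects. The `build_blocks` helper and the `type` field are hypothetical names for illustration, not code from this PR.

```python
from itertools import groupby
from operator import itemgetter


def build_blocks(rows):
    """Group consecutive (wiki_id, para_id) rows into parent/child blocks.

    Each block lists the child paragraphs first and the parent article last.
    Field names here are illustrative assumptions, not the benchmark's schema.
    """
    blocks = []
    for wiki_id, group in groupby(rows, key=itemgetter(0)):
        # One child entry per paragraph; only children would carry embeddings.
        children = [
            {"type": "child", "wiki_id": wiki_id, "para_id": para_id}
            for _, para_id in group
        ]
        # The parent entry closes the block.
        parent = {"type": "parent", "wiki_id": wiki_id}
        blocks.append(children + [parent])
    return blocks


rows = [("w1", 0), ("w1", 1), ("w2", 0)]
blocks = build_blocks(rows)
```

With the three sample rows, this yields two blocks: two paragraphs plus a parent for `w1`, and one paragraph plus a parent for `w2`.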

This change leverages Lucene's DiversifyingChildrenFloatKnnVectorQuery, using exactSearch() for the baseline and approximateSearch() for KNN search. Recall is computed as the overlap between the two result sets.
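The overlap-based recall can be sketched as follows. This assumes recall is the number of approximate hits that also appear in the exact results, summed over all queries and divided by the total number of exact results; the precise formula used by the benchmark is not shown in this conversation, and `overlap_recall` is a hypothetical helper.

```python
def overlap_recall(exact_results, approx_results):
    """Fraction of exact (baseline) hits that the approximate search also returned.

    Both arguments are lists with one list of document ids per query.
    """
    if len(exact_results) != len(approx_results):
        raise ValueError("expected one result list per query in both inputs")
    matched = 0
    expected = 0
    for exact, approx in zip(exact_results, approx_results):
        # Count ids present in both result sets for this query.
        matched += len(set(exact) & set(approx))
        expected += len(exact)
    return matched / expected if expected else 0.0


exact = [[1, 2, 3, 4], [5, 6, 7, 8]]
approx = [[1, 2, 9, 4], [5, 10, 11, 8]]
recall = overlap_recall(exact, approx)  # 5 of 8 baseline hits recovered
```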

Note: We can use the infer_token_vectors_cohere.py script to generate the parentJoin metadata file for the Cohere embeddings dataset.

python src/python/infer_token_vectors_cohere.py -d <num_docs> -q <num_queries>


Sample Run Results

# parent join with quantization
numDocs = 250000
maxConn = 32
beamWidth = 50
Vector Dimensions: 768
Index Path = knnIndices/cohere-wikipedia-docs-768d.vec-32-50-8-parentJoin.index
Sep 06, 2024 2:37:08 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=256
creating index in knnIndices/cohere-wikipedia-docs-768d.vec-32-50-8-parentJoin.index
parentJoin=true
Parent join metaFile columns: wiki_id | para_id
indexed 25000 child documents, with 276 parents
indexed 50000 child documents, with 592 parents
indexed 75000 child documents, with 949 parents
indexed 100000 child documents, with 1322 parents
indexed 125000 child documents, with 1725 parents
indexed 150000 child documents, with 2107 parents
indexed 175000 child documents, with 2527 parents
indexed 200000 child documents, with 2938 parents
indexed 225000 child documents, with 3379 parents
indexed 250000 child documents, with 3803 parents
Indexed 250000 documents with 3803 parent docs. now flush
Indexed 250000 docs in 167 seconds
reindex takes 167694 ms
running 1000 targets; topK=100, fanout=20
completed 1000 searches in 13631 ms: 73 QPS CPU time=13424ms
checking results
SUMMARY: 0.015  13.42   253804  20      32      50      8 bits  100     167694  1.00    post-filter

Results:
recall  latency (ms)    nDoc    fanout  maxConn beamWidth       quantized       visited index ms        selectivity     filterType
0.158    3.96   253804  20      32      50      4 bits  100     25213   1.00    post-filter
0.162    4.00   253804  20      32      50      7 bits  100     24277   1.00    post-filter
0.015   13.42   253804  20      32      50      8 bits  100     167694  1.00    post-filter
# parentJoin without quantization
numDocs = 250000
maxConn = 32
beamWidth = 50
Vector Dimensions: 768
Index Path = knnIndices/cohere-wikipedia-docs-768d.vec-32-50-parentJoin.index
Sep 06, 2024 2:43:01 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=256
creating index in knnIndices/cohere-wikipedia-docs-768d.vec-32-50-parentJoin.index
parentJoin=true
Parent join metaFile columns: wiki_id | para_id
indexed 25000 child documents, with 276 parents
indexed 50000 child documents, with 592 parents
indexed 75000 child documents, with 949 parents
indexed 100000 child documents, with 1322 parents
indexed 125000 child documents, with 1725 parents
indexed 150000 child documents, with 2107 parents
indexed 175000 child documents, with 2527 parents
indexed 200000 child documents, with 2938 parents
indexed 225000 child documents, with 3379 parents
indexed 250000 child documents, with 3803 parents
Indexed 250000 documents with 3803 parent docs. now flush
Indexed 250000 docs in 27 seconds
reindex takes 27412 ms
running 1000 targets; topK=100, fanout=20
completed 1000 searches in 6307 ms: 158 QPS CPU time=6224ms
checking results
SUMMARY: 0.167   6.22   253804  20      32      50      no      100     27412   1.00    post-filter

Results:
recall  latency (ms)    nDoc    fanout  maxConn beamWidth       quantized       visited index ms        selectivity     filterType
0.167    6.22   253804  20      32      50      no      100     27412   1.00    post-filter
# default run (no parentJoin)
numDocs = 250000
maxConn = 32
beamWidth = 50
Vector Dimensions: 768
Sep 06, 2024 2:49:42 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=256
Done indexing 25000 documents.
Done indexing 50000 documents.
Done indexing 75000 documents.
Done indexing 100000 documents.
Done indexing 125000 documents.
Done indexing 150000 documents.
Done indexing 175000 documents.
Done indexing 200000 documents.
Done indexing 225000 documents.
Done indexing 250000 documents.
reindex takes 86058 ms
SUMMARY: 0.004   3.49   250000  20      32      50      8 bits  9531    86058   1.00    post-filter

Results:
recall  latency (ms)    nDoc    fanout  maxConn beamWidth       quantized       visited index ms        selectivity     filterType
0.565    1.89   250000  20      32      50      4 bits  4610    43426   1.00    post-filter
0.820    1.62   250000  20      32      50      7 bits  4198    43593   1.00    post-filter
0.004    3.49   250000  20      32      50      8 bits  9531    86058   1.00    post-filter
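
For anyone reading the logs above, the reported QPS and latency figures appear consistent with simple arithmetic on the "completed ... searches" lines: QPS looks like searches divided by wall-clock seconds (truncated to an integer), and latency looks like CPU time divided by the number of searches. This is inferred from the numbers only, not from the tester's code, and the function names are hypothetical.

```python
def qps(num_searches, elapsed_ms):
    """Queries per second, truncated to an integer (assumed from the log output)."""
    return num_searches * 1000 // elapsed_ms


def mean_latency_ms(cpu_time_ms, num_searches):
    """Mean per-query latency in ms, rounded to two decimals (assumed)."""
    return round(cpu_time_ms / num_searches, 2)


# Values taken from the two parent join runs logged above.
quantized_qps = qps(1000, 13631)
quantized_latency = mean_latency_ms(13424, 1000)
unquantized_qps = qps(1000, 6307)
unquantized_latency = mean_latency_ms(6224, 1000)
```

These reproduce the logged 73 QPS / 13.42 ms and 158 QPS / 6.22 ms.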

@mikemccand (Owner) commented

Very exciting! I will try to review the code changes soon ... thanks @vigyasharma.

> We use the passage search use-case with cohere embeddings created from wikipedia. Each parent document corresponds to a wikipedia article, and child documents correspond to paragraphs (chunk) within the article. Embeddings are only present for child documents.

How do I get the source (vectors) file input to run this?

@mikemccand (Owner) left a comment

Looks great, thank you @vigyasharma! I'm very curious where/how I can get the parent/join meta file to try running this myself...

Review threads (all resolved):
src/main/knn/KnnGraphTester.java (5 threads, 2 outdated)
src/main/knn/KnnTesterUtils.java (2 threads)
src/main/knn/ParentJoinBenchmarkQuery.java (3 threads, 2 outdated)

@vigyasharma (Contributor, Author) commented

Thanks for the prompt review, @mikemccand.

> I'm very curious where/how I can get the parent/join meta file to try running this myself...

We can use the src/python/infer_token_vectors_cohere.py script. An earlier change (#283) updated the tool to create a metadata file as well.

python src/python/infer_token_vectors_cohere.py -d <num_docs> -q <num_queries>

@mikemccand (Owner) left a comment

I saw one accidental code block dup -- then let's merge!

Review thread (resolved, outdated): src/main/knn/KnnGraphTester.java

@vigyasharma (Contributor, Author) commented

Resolved conflicts and merge duplication errors. I also like the new output from KnnGraphTester, which includes more graph details:

reindex takes 14.05 sec
Force merge index in knnIndices/cohere-wikipedia-docs-768d.vec-32-50-parentJoin.index
Force merge done in 12.76 sec
index has 1 segments
index disk uage is 295.02 MB
SUMMARY: 0.098  0.725   101323  10      6       32      50      no      9       14.05   12.76   1       295.02  1.00    post-filter
Leaf 0 has 4 layers
Leaf 0 has 101323 documents
Graph level=3 size=6, Fanout min=1, mean=2.67, max=4, meandelta=10062.31
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   2   3   3   3   3   3   4   4
Graph level=2 size=61, Fanout min=1, mean=7.54, max=16, meandelta=7024.34
%   0  10  20  30  40  50  60  70  80  90 100
    0   3   5   6   7   7   8   9  10  11  16
Graph level=1 size=2994, Fanout min=1, mean=4.51, max=32, meandelta=5549.65
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   5   9  13  32
Graph level=0 size=100000, Fanout min=1, mean=3.81, max=64, meandelta=3386.53
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   3   3   3   3   3   3  64
Graph level=3 size=6, connectedness=1.00
Graph level=2 size=61, connectedness=1.00
Graph level=1 size=2994, connectedness=1.00
Graph level=0 size=100000, connectedness=0.96

Results:
recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.098         0.725  101323    10       6       32         50         no    14.05          12.76             1           295.02

@mikemccand merged commit fb80610 into mikemccand:main on Sep 23, 2024

@mikemccand (Owner) commented

Thanks @vigyasharma -- this is an exciting improvement to KNN benchmarking!
