ParentJoin Benchmarks for KNN Search #296

vigyasharma · 2024-09-06T21:58:33Z

Adds parent join benchmarks for KNN search. We use the passage search use case with Cohere embeddings created from Wikipedia. Each parent document corresponds to a Wikipedia article, and its child documents correspond to the paragraphs (chunks) within that article. Embeddings are only present for child documents.

This change leverages Lucene's DiversifyingChildrenFloatKnnVectorQuery, using exactSearch() for the baseline and approximateSearch() for the KNN search. Recall is computed as the overlap between the two result sets.
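
As a rough illustration of the recall computation described above, here is a minimal Python sketch. It is not the code in KnnGraphTester. The function names `diversify_by_parent` and `recall_overlap` are hypothetical, and the sketch assumes that diversification means keeping only the best-scoring child per parent, and that recall is the fraction of exact (baseline) hits that also show up in the approximate hits.

```python
from typing import Dict, List, Sequence, Tuple

Hit = Tuple[int, float]  # (child_doc_id, similarity_score); higher score is better


def diversify_by_parent(scored_children: Sequence[Hit],
                        parent_of: Dict[int, int],
                        k: int) -> List[Hit]:
    """Keep the best-scoring child per parent, then return the top k hits.

    Assumption: this mirrors the idea of a diversifying-children query,
    not Lucene's actual implementation.
    """
    best: Dict[int, Hit] = {}
    for child_id, score in scored_children:
        parent = parent_of[child_id]
        if parent not in best or score > best[parent][1]:
            best[parent] = (child_id, score)
    ranked = sorted(best.values(), key=lambda hit: hit[1], reverse=True)
    return ranked[:k]


def recall_overlap(exact_ids: Sequence[int], approx_ids: Sequence[int]) -> float:
    """Fraction of the exact (baseline) ids that also appear in the approximate ids."""
    if not exact_ids:
        return 0.0
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)
```

With this shape, a query whose exact top hits are children `[1, 3]` and whose approximate top hits are `[1, 2]` would score a recall of 0.5.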

Note: We can use the infer_token_vectors_cohere.py script to generate the parentJoin metadata file for the Cohere embeddings dataset.

python src/python/infer_token_vectors_cohere.py -d <num_docs> -q <num_queries>
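
To show how the metadata could drive indexing, here is a hedged Python sketch that groups child rows into per-article blocks. The sample run logs report the metaFile columns as `wiki_id | para_id`; everything else here is an assumption: the on-disk format of the file is not shown in this PR, so the sketch takes already-parsed `(wiki_id, para_id)` tuples, and it assumes the rows for one article are contiguous. The helper name `group_into_blocks` is hypothetical.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple


def group_into_blocks(rows: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (wiki_id, [para_id, ...]) for each run of consecutive rows sharing a wiki_id.

    Assumption: rows are ordered so that all paragraphs of an article are adjacent.
    """
    for wiki_id, group in groupby(rows, key=lambda row: row[0]):
        yield wiki_id, [para_id for _, para_id in group]
```

Each yielded group corresponds to one parent article and its child paragraphs. Lucene's block join expects the children of a parent to be indexed contiguously with the parent document last in the block, so a grouping step like this is a natural precursor to indexing.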

Sample Run Results

# parent join with quantization
numDocs = 250000
maxConn = 32
beamWidth = 50
Vector Dimensions: 768
Index Path = knnIndices/cohere-wikipedia-docs-768d.vec-32-50-8-parentJoin.index
Sep 06, 2024 2:37:08 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=256
creating index in knnIndices/cohere-wikipedia-docs-768d.vec-32-50-8-parentJoin.index
parentJoin=true
Parent join metaFile columns: wiki_id | para_id
indexed 25000 child documents, with 276 parents
indexed 50000 child documents, with 592 parents
indexed 75000 child documents, with 949 parents
indexed 100000 child documents, with 1322 parents
indexed 125000 child documents, with 1725 parents
indexed 150000 child documents, with 2107 parents
indexed 175000 child documents, with 2527 parents
indexed 200000 child documents, with 2938 parents
indexed 225000 child documents, with 3379 parents
indexed 250000 child documents, with 3803 parents
Indexed 250000 documents with 3803 parent docs. now flush
Indexed 250000 docs in 167 seconds
reindex takes 167694 ms
running 1000 targets; topK=100, fanout=20
completed 1000 searches in 13631 ms: 73 QPS CPU time=13424ms
checking results
SUMMARY: 0.015  13.42   253804  20      32      50      8 bits  100     167694  1.00    post-filter

Results:
recall  latency (ms)    nDoc  fanout  maxConn  beamWidth  quantized  visited  index ms  selectivity   filterType
 0.158          3.96  253804      20       32         50     4 bits      100     25213         1.00  post-filter
 0.162          4.00  253804      20       32         50     7 bits      100     24277         1.00  post-filter
 0.015         13.42  253804      20       32         50     8 bits      100    167694         1.00  post-filter

# parentJoin without quantization
numDocs = 250000
maxConn = 32
beamWidth = 50
Vector Dimensions: 768
Index Path = knnIndices/cohere-wikipedia-docs-768d.vec-32-50-parentJoin.index
Sep 06, 2024 2:43:01 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=256
creating index in knnIndices/cohere-wikipedia-docs-768d.vec-32-50-parentJoin.index
parentJoin=true
Parent join metaFile columns: wiki_id | para_id
indexed 25000 child documents, with 276 parents
indexed 50000 child documents, with 592 parents
indexed 75000 child documents, with 949 parents
indexed 100000 child documents, with 1322 parents
indexed 125000 child documents, with 1725 parents
indexed 150000 child documents, with 2107 parents
indexed 175000 child documents, with 2527 parents
indexed 200000 child documents, with 2938 parents
indexed 225000 child documents, with 3379 parents
indexed 250000 child documents, with 3803 parents
Indexed 250000 documents with 3803 parent docs. now flush
Indexed 250000 docs in 27 seconds
reindex takes 27412 ms
running 1000 targets; topK=100, fanout=20
completed 1000 searches in 6307 ms: 158 QPS CPU time=6224ms
checking results
SUMMARY: 0.167   6.22   253804  20      32      50      no      100     27412   1.00    post-filter

Results:
recall  latency (ms)    nDoc  fanout  maxConn  beamWidth  quantized  visited  index ms  selectivity   filterType
 0.167          6.22  253804      20       32         50         no      100     27412         1.00  post-filter

# default run (no parentJoin)
numDocs = 250000
maxConn = 32
beamWidth = 50
Vector Dimensions: 768
Sep 06, 2024 2:49:42 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=256
Done indexing 25000 documents.
Done indexing 50000 documents.
Done indexing 75000 documents.
Done indexing 100000 documents.
Done indexing 125000 documents.
Done indexing 150000 documents.
Done indexing 175000 documents.
Done indexing 200000 documents.
Done indexing 225000 documents.
Done indexing 250000 documents.
reindex takes 86058 ms
SUMMARY: 0.004   3.49   250000  20      32      50      8 bits  9531    86058   1.00    post-filter

Results:
recall  latency (ms)    nDoc  fanout  maxConn  beamWidth  quantized  visited  index ms  selectivity   filterType
 0.565          1.89  250000      20       32         50     4 bits     4610     43426         1.00  post-filter
 0.820          1.62  250000      20       32         50     7 bits     4198     43593         1.00  post-filter
 0.004          3.49  250000      20       32         50     8 bits     9531     86058         1.00  post-filter
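
For readers cross-checking the logs, the QPS and latency figures in the parent join runs appear consistent with simple arithmetic on the reported timings. The Python sketch below is an inference from those numbers, not the tester's code, and the function names are hypothetical: QPS is taken as searches divided by wall-clock seconds (truncated when printed), and latency as CPU time divided by the number of searches.

```python
def qps(num_searches: int, elapsed_ms: float) -> float:
    """Queries per second from a search count and wall-clock time in milliseconds."""
    return num_searches * 1000.0 / elapsed_ms


def mean_latency_ms(cpu_time_ms: float, num_searches: int) -> float:
    """Average per-query latency in milliseconds, based on total CPU time."""
    return cpu_time_ms / num_searches
```

For the quantized parent join run, 1000 searches in 13631 ms gives about 73.4 QPS (logged as 73), and 13424 ms of CPU time gives 13.42 ms per query, matching the latency column. The unquantized run works out the same way: about 158.6 QPS (logged as 158) and 6.22 ms per query.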

mikemccand · 2024-09-09T14:01:42Z

Very exciting! I will try to review the code changes soon ... thanks @vigyasharma.

> We use the passage search use-case with cohere embeddings created from wikipedia. Each parent document corresponds to a wikipedia article, and child documents correspond to paragraphs (chunk) within the article. Embeddings are only present for child documents.

How do I get the source (vectors) file input to run this?

mikemccand

Looks great, thank you @vigyasharma! I'm very curious where/how I can get the parent/join meta file to try running this myself...

src/main/knn/KnnGraphTester.java

src/main/knn/KnnTesterUtils.java

src/main/knn/ParentJoinBenchmarkQuery.java

vigyasharma · 2024-09-09T19:35:09Z

Thanks for the prompt review, @mikemccand.

> I'm very curious where/how I can get the parent/join meta file to try running this myself...

We can use the src/python/infer_token_vectors_cohere.py script. An earlier change (#283) updated the tool to also create a metadata file.

python src/python/infer_token_vectors_cohere.py -d <num_docs> -q <num_queries>

mikemccand

I saw one accidental code block dup -- then let's merge!

src/main/knn/KnnGraphTester.java

vigyasharma · 2024-09-17T18:58:03Z

Resolved conflicts and merge duplication errors. I also like the new output from KnnGraphTester with more graph details.

reindex takes 14.05 sec
Force merge index in knnIndices/cohere-wikipedia-docs-768d.vec-32-50-parentJoin.index
Force merge done in 12.76 sec
index has 1 segments
index disk uage is 295.02 MB
SUMMARY: 0.098  0.725   101323  10      6       32      50      no      9       14.05   12.76   1       295.02  1.00    post-filter
Leaf 0 has 4 layers
Leaf 0 has 101323 documents
Graph level=3 size=6, Fanout min=1, mean=2.67, max=4, meandelta=10062.31
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   2   3   3   3   3   3   4   4
Graph level=2 size=61, Fanout min=1, mean=7.54, max=16, meandelta=7024.34
%   0  10  20  30  40  50  60  70  80  90 100
    0   3   5   6   7   7   8   9  10  11  16
Graph level=1 size=2994, Fanout min=1, mean=4.51, max=32, meandelta=5549.65
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   5   9  13  32
Graph level=0 size=100000, Fanout min=1, mean=3.81, max=64, meandelta=3386.53
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   3   3   3   3   3   3  64
Graph level=3 size=6, connectedness=1.00
Graph level=2 size=61, connectedness=1.00
Graph level=1 size=2994, connectedness=1.00
Graph level=0 size=100000, connectedness=0.96

Results:
recall  latency (ms)    nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.098         0.725  101323    10       6       32         50         no    14.05          12.76             1           295.02
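
To make the per-level fanout lines easier to read, here is a small Python sketch of how such a summary could be derived from per-node neighbor counts. This is an assumption-laden illustration, not the KnnGraphTester implementation: the function names are hypothetical, the percentile rule (nearest rank) is a guess, and the `meandelta` and `connectedness` figures are not modeled.

```python
import math
from statistics import mean
from typing import Dict, List, Sequence


def fanout_summary(neighbor_counts: Sequence[int]) -> Dict[str, float]:
    """Min, mean (rounded to 2 places), and max of per-node neighbor counts."""
    counts = sorted(neighbor_counts)
    return {"min": counts[0], "mean": round(mean(counts), 2), "max": counts[-1]}


def fanout_percentiles(neighbor_counts: Sequence[int]) -> List[int]:
    """Neighbor counts at the 0th, 10th, ..., 100th percentiles (nearest-rank rule)."""
    counts = sorted(neighbor_counts)
    n = len(counts)
    result = []
    for p in range(0, 101, 10):
        # Nearest rank: the smallest value with at least p% of nodes at or below it.
        rank = max(math.ceil(p / 100 * n) - 1, 0)
        result.append(counts[rank])
    return result
```

With made-up counts `[1, 2, 2, 3, 4, 4]` for a six-node level, `fanout_summary` returns a min of 1, a mean of 2.67, and a max of 4.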

mikemccand · 2024-09-23T18:47:27Z

Thanks @vigyasharma -- this is an exciting improvement to KNN benchmarking!

vigyasharma added 5 commits September 6, 2024 10:29

support for parentJoins in benchmarks

5811271

clean up debug log lines

8b57eee

use labels to constants

2c15f58

parent join working

63543a2

restore default configs

c149e8f

vigyasharma mentioned this pull request Sep 6, 2024

[WIP] Multi-Vector support for HNSW search apache/lucene#13525

Open

vigyasharma and others added 3 commits September 7, 2024 10:03

Merge branch 'main' into pj2

4a49eb4

merge main into pj2

cfe8125

Merge branch 'main' into pj2

20256f6

mikemccand reviewed Sep 9, 2024

vigyasharma added 6 commits September 10, 2024 11:33

merge in changes from main

55e19c4

remove indexreader from parentJoin query

e828a40

docstring for ParentJoinBenchmarkQuery

399442a

fix condition styling

f7e6900

Merge branch 'main' into pj2

2fd7448

Merge branch 'main' into pj2

90f94f8

mikemccand approved these changes Sep 13, 2024

src/main/knn/KnnGraphTester.java (outdated)

vigyasharma added 3 commits September 17, 2024 11:34

remove dups from merges

93fa53b

Use TotalHits record

3706592

update TotalHits access to use record type in java

8d37090

mikemccand merged commit fb80610 into mikemccand:main Sep 23, 2024
