
[WIP] Multi-Vector support for HNSW search #13525

Open · wants to merge 135 commits into base: main
Conversation

@vigyasharma (Contributor) commented Jun 26, 2024

Adds support for multi-valued vectors to Lucene.

In addition to max-similarity aggregations (as with parent-block joins), this change supports ColBERT-style distance functions that compute interactions across all query and document vector values. Documents can have a variable number of vector values, but to support distance function computations, we require all values to have the same dimension.

This is a big change and I still need to work on tests (existing and new), backward compatibility, benchmarks, and some code refactoring/cleanup. I'm raising this early version to get feedback on the overall approach. I marked the PR with no commit tags.

Addresses #12313 .

.

Approach

We define a new "Tensor" field that comprises multiple vector values, and a new TensorSimilarityFunction to compute distance across multiple vectors (it currently uses SumMax()). A node ordinal is assigned to the whole tensor value, giving us one ordinal per document. All vector values of a tensor field are processed together during writing, reading, and scoring. They are passed around as a packed float[] or byte[] array with all vector values concatenated. Consumers (like the TensorSimilarityFunction) slice this array by dimension to get the individual vector values.
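To make the packed layout concrete, here is a minimal sketch of a SumMax-style similarity over two packed arrays, using dot product as the per-vector similarity for illustration (the method name and the choice of dot product are assumptions, not necessarily what the PR's TensorSimilarityFunction does):

// For each query vector, take the max dot product against any document vector,
// then sum those maxima. Both arrays hold concatenated vectors of equal dimension.
static float sumMaxSimilarity(float[] queryTensor, float[] docTensor, int dimension) {
  float sum = 0f;
  for (int q = 0; q < queryTensor.length; q += dimension) {
    float best = Float.NEGATIVE_INFINITY;
    for (int d = 0; d < docTensor.length; d += dimension) {
      float dot = 0f;
      for (int i = 0; i < dimension; i++) {
        dot += queryTensor[q + i] * docTensor[d + i];
      }
      best = Math.max(best, dot);
    }
    sum += best;
  }
  return sum;
}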

Tensors are stored using a new FlatVectorStorage that supports writing/reading variable-length values per field (allowing a different number of vectors per tensor). We reuse the existing HNSW readers and writers. Each graph node is a tensor and maps to a single document. I also added a new codec tensor format, to allow both tensors and vectors to coexist. I'm not yet sure how to integrate with the quantization changes (that will be a separate, later change) and didn't want to force everything into a single format. Tensors continue to work with KnnVectorWriter/Reader and extend the FlatVectorWriter/Reader classes.

Finally, I named the field and format "Tensors", though technically these are only rank-2 tensors. The thought was that we might extend this field and format if we ever add support for higher-rank tensors. I'm open to renaming based on community feedback.

.

Major Changes

The PR touches a lot of files, which is not practical to review in one pass. Here are the files with the key changes. If we align on the approach, I'm happy to re-raise this as separate, split PRs.

  1. New fields and similarity function for tensors.
    1. lucene/core/src/java/org/apache/lucene/document/FieldType.java
    2. lucene/core/src/java/org/apache/lucene/document/KnnByteTensorField.java
    3. lucene/core/src/java/org/apache/lucene/util/ByteTensorValue.java
    4. lucene/core/src/java/org/apache/lucene/document/KnnFloatTensorField.java
    5. lucene/core/src/java/org/apache/lucene/util/FloatTensorValue.java
    6. lucene/core/src/java/org/apache/lucene/index/TensorSimilarityFunction.java
    7. lucene/core/src/java/org/apache/lucene/index/FieldInfo.java
    8. lucene/core/src/java/org/apache/lucene/index/FieldInfos.java
  2. Indexing chain changes
    1. lucene/core/src/java/org/apache/lucene/index/IndexingChain.java
    2. lucene/core/src/java/org/apache/lucene/index/VectorValuesConsumer.java
  3. Reader side changes to return a tensor reader for tensor fields
    1. lucene/core/src/java/org/apache/lucene/index/SegmentCoreReaders.java
    2. lucene/core/src/java/org/apache/lucene/index/SegmentReader.java
  4. A new tensor format in the codec
    1. lucene/core/src/java/org/apache/lucene/codecs/KnnTensorsFormat.java
    2. lucene/core/src/java/org/apache/lucene/index/CodecReader.java
  5. A new tensor scorer to work with multiple vector values
    1. lucene/core/src/java/org/apache/lucene/codecs/hnsw/FlatTensorsScorer.java
    2. lucene/core/src/java/org/apache/lucene/codecs/hnsw/DefaultFlatTensorScorer.java
  6. A Lucene99FlatTensorsWriter for writing in the new flat tensor format - lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatTensorsWriter.java
  7. A Lucene99FlatTensorsReader for reading the flat tensor format - lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatTensorsReader.java
  8. An HnswTensorFormat that uses FlatTensorFormat to initialize the flat storage readers/writers underlying HNSW reader/writer.
    1. lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswTensorsFormat.java
  9. Hnsw reader and writer changes to support tensor fields and similarity function
    1. lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java
    2. lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java
  10. Off Heap Byte and FloatTensorValues for use by scorers
    1. lucene/core/src/java/org/apache/lucene/codecs/lucene99/OffHeapByteTensorValues.java
    2. lucene/core/src/java/org/apache/lucene/codecs/lucene99/OffHeapFloatTensorValues.java
  11. Setup to read and write tensor data value offsets to support a variable vector count per tensor. This uses a DirectMonotonicReader/Writer (a minimal sketch follows this list).
    1. lucene/core/src/java/org/apache/lucene/codecs/lucene99/TensorDataOffsetsReaderConfiguration.java
  12. Syntax sugar for tensor queries
    1. lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java
    2. lucene/core/src/java/org/apache/lucene/search/KnnByteTensorQuery.java
    3. lucene/core/src/java/org/apache/lucene/search/KnnFloatTensorQuery.java
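Regarding item 11 above, here is a minimal sketch of how per-tensor data offsets could be written and later looked up with DirectMonotonicWriter/Reader; the variable names and blockShift value are illustrative, not the PR's actual code:

import java.io.IOException;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.util.packed.DirectMonotonicReader;
import org.apache.lucene.util.packed.DirectMonotonicWriter;

// Write one monotonically increasing start offset per ordinal (plus a final end offset),
// pointing into the flat data file where that tensor's packed vectors begin.
static void writeOffsets(IndexOutput metaOut, IndexOutput dataOut, long[] offsets) throws IOException {
  DirectMonotonicWriter writer =
      DirectMonotonicWriter.getInstance(metaOut, dataOut, offsets.length, /* blockShift= */ 16);
  for (long offset : offsets) {
    writer.add(offset);
  }
  writer.finish();
}

// At read time (after DirectMonotonicReader.loadMeta / getInstance), the vectors for
// ordinal ord occupy the byte range [offsets.get(ord), offsets.get(ord + 1)).
static long[] byteRangeFor(DirectMonotonicReader offsets, int ord) {
  return new long[] {offsets.get(ord), offsets.get(ord + 1)};
}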

.

Open Questions

  1. I like the type safety of a separate field and similarity function. It avoids user traps like passing a single-valued VectorSimilarityFunction for tensors. But it does add a bunch of extra code, and constructor changes across a lot of files. Some options to simplify could be:
    • Reuse the vectorEncoding and vectorDimension attributes in FieldInfo instead of a separate tensor encoding and dimension.
    • Also reuse VectorSimilarityFunction, but create a separate "tensor aggregator" that corresponds to SumMax.

@vigyasharma (Contributor Author)

Is "default run" from this PR?

No. "default run" is knn search where each embedding is a separate document with no relationship between them. I'm still wiring things up to see benchmark results for this PR.


This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Sep 27, 2024
@benwtrent (Member)

Hey @vigyasharma there is a lot of good work here.

I am going to shift my focus and see about how I can help here more fully. What are the next steps?

I am guessing the next step is handling all the merging from main; I can take care of that sometime next week.

Just wondering where I can help.

@github-actions github-actions bot removed the Stale label Oct 26, 2024
@vigyasharma (Contributor Author)

Thanks @benwtrent. I've been working on getting a multi-vector benchmark running to wire this end to end. Found some pesky bugs and oversights. I'm planning to split this feature into multiple smaller PRs. This PR was mainly to get inputs on the approach. It's too big to test and review. I'll share a plan of the split PRs soon.

re: the multi-vector benchmark for the passage search use-case, I've been stuck on a bug where I run into an EOFException when reading the last multi-vector document through DenseOffHeapMultiVectorValues. I could definitely use some help here. If you plan to take a look, you can use the code in this PR (I'll push my fixes) and the multi-vector benchmark code from here.

Exception in thread "main" java.lang.RuntimeException: java.io.EOFException: read past EOF: MemorySegmentIndexInput(path="/Users/vigyas/forks/bench/util/knnIndices/cohere-wikipedia-docs-768d.vec-32-50-multiVector.index/_0_Lucene99HnswMultiVectorsFormat_0.vecmv") [slice=multi-vector-data]
        at knn.KnnGraphTester$ComputeBaselineNNFloatTask.call(KnnGraphTester.java:1115)
        at knn.KnnGraphTester.computeNN(KnnGraphTester.java:967)
        at knn.KnnGraphTester.getNN(KnnGraphTester.java:812)
        at knn.KnnGraphTester.run(KnnGraphTester.java:438)
        at knn.KnnGraphTester.runWithCleanUp(KnnGraphTester.java:177)
        at knn.KnnGraphTester.main(KnnGraphTester.java:172)
Caused by: java.io.EOFException: read past EOF: MemorySegmentIndexInput(path="/Users/vigyas/forks/bench/util/knnIndices/cohere-wikipedia-docs-768d.vec-32-50-multiVector.index/_0_Lucene99HnswMultiVectorsFormat_0.vecmv") [slice=multi-vector-data]
        at org.apache.lucene.store.MemorySegmentIndexInput.readByte(MemorySegmentIndexInput.java:146)
        at org.apache.lucene.store.DataInput.readInt(DataInput.java:95)
        at org.apache.lucene.store.MemorySegmentIndexInput.readInt(MemorySegmentIndexInput.java:261)
        at org.apache.lucene.store.DataInput.readFloats(DataInput.java:202)
        at org.apache.lucene.store.MemorySegmentIndexInput.readFloats(MemorySegmentIndexInput.java:231)
        at org.apache.lucene.codecs.lucene99.OffHeapFloatMultiVectorValues.vectorValue(OffHeapFloatMultiVectorValues.java:111)
        at org.apache.lucene.codecs.lucene99.OffHeapFloatMultiVectorValues.vectorValue(OffHeapFloatMultiVectorValues.java:130)
        at org.apache.lucene.codecs.hnsw.DefaultFlatMultiVectorScorer$FloatMultiVectorScorer.score(DefaultFlatMultiVectorScorer.java:185)
        at org.apache.lucene.codecs.lucene99.OffHeapFloatMultiVectorValues$DenseOffHeapMultiVectorValues$1.score(OffHeapFloatMultiVectorValues.java:248)
        at org.apache.lucene.search.AbstractKnnVectorQuery.exactSearch(AbstractKnnVectorQuery.java:220)
        at knn.KnnFloatVectorBenchmarkQuery.exactSearch(KnnFloatVectorBenchmarkQuery.java:33)
        at knn.KnnFloatVectorBenchmarkQuery.runExactSearch(KnnFloatVectorBenchmarkQuery.java:50)
        at knn.KnnGraphTester$ComputeBaselineNNFloatTask.call(KnnGraphTester.java:1111)
        ... 5 more

@vigyasharma (Contributor Author)

it seems like single vector is a special form of multi-vector

re: single vs. multi-vectors, I think it makes sense to not force users to choose multi-valued fields upfront. There's value in being able to go from single to multiple values when the need arises (and in treating single vectors as a storage optimization).

However, I do think that we should not support changing the aggregation function once it has been set. Allowing different aggregate functions per segment will make merging and general debugging overly complicated.

As such, I'm thinking of keeping the Aggregation as part of FieldInfo. There won't be a separate multi-vector field. When we create a vector field without specifying the aggregation, the default gets set to NONE and only single values are allowed. Lucene validation will support going from NONE to a different value, at which point the field is treated as a multi-vector field. However, changing the Aggregation once it is set to anything other than NONE will not be supported.
How does this sound?
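To illustrate, a minimal sketch of that validation rule, assuming a hypothetical MultiVectorAggregation attribute on FieldInfo (the enum and method names are mine, not from the PR):

enum MultiVectorAggregation { NONE, SUM_MAX }

// Hypothetical check run when a field's existing schema is compared with an update.
static MultiVectorAggregation verifyAggregationChange(
    MultiVectorAggregation current, MultiVectorAggregation requested) {
  if (current == requested) {
    return current; // no change
  }
  if (current == MultiVectorAggregation.NONE) {
    return requested; // going from single-valued (NONE) to multi-valued is allowed
  }
  // once set to anything other than NONE, the aggregation is frozen
  throw new IllegalArgumentException(
      "cannot change multi-vector aggregation from " + current + " to " + requested);
}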

@jimczi (Contributor) commented Oct 28, 2024

it seems like single vector is a special form of multi-vector

The solution really depends on the semantics. In its current form, the way multi-vectors are incorporated in this PR doesn’t quite extend the single-vector case. With max similarity, we assume that each similarity score results from a full comparison, which works well when the operations are limited (such as in re-ranking scenarios). However, for ColBERT, where the average number of vectors per document is large (in the hundreds or thousands), using HNSW with max similarity layered on top may not be the optimal approach. This is likely why other vector libraries don’t expose this setup.

If our aim is to introduce max similarity in Lucene, we might need to explore a more effective strategy. Although nested vectors could be promising, they’re currently constrained by the 2B vector limit, which isn’t ideal for ColBERT, given that each input token is represented as a dense vector. The primary limitation with HNSW and the knn codec today seems to be this 2B cap on vectors.

Given these factors, we may want to reconsider HNSW for this purpose. A scalable solution would likely involve running multiple queries (one per query vector) rather than relying on an aggregation strategy. Maybe the first goal should be to incorporate max-sim for re-ranking use cases, using a flat format?

@vigyasharma (Contributor Author)

Hi @jimczi , The main change in this PR is support for multi-vectors in flat readers and writers, along with a similarity spec for multiple vector values.

It is possible that HNSW is not the ideal data structure to expose multi-vector ANN. We don't really change much in the hnsw impl, except using multi-vector similarity for comparisons (graph build and search). Users can use the PerFieldKnnVectorsFormat to wire different data structures on top of the flat multi-vector format. We can also provide something out of the box in a subsequent change. I think the aggregation fn. interface is also flexible enough for different types of similarity implementations?

Notably, this change maps all vector values for a document to a single ordinal. This gets us past the 2B vector limit (which I like), but also reads all vector values for the document whenever fetched. I can't think of a case where we'd only like partial values, but if we do, perhaps we can handle it in the similarity/aggregate functions.

@vigyasharma (Contributor Author)

Maybe the first goal should be to incorporate max sim for re-ranking use cases first using a flat format

This could be set up using 1) a single-vector field for hnsw matching, and 2) a separate field with multi-vector values to directly access the flat format for the subset of matched docs. Basically a RandomAccessVectorValues on the format. I think this PR will allow us to support such a setup; it may need a small change to turn off hnsw graph creation for specific fields (via PerFieldKnnVectorsFormat), as sketched below.
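A rough sketch of that per-field wiring; Lucene99FlatMultiVectorsFormat is a placeholder name for a flat-only multi-vector format (not an existing class), while PerFieldKnnVectorsFormat and Lucene99HnswVectorsFormat are existing Lucene classes:

import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;
import org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat;

KnnVectorsFormat perFieldFormat =
    new PerFieldKnnVectorsFormat() {
      @Override
      public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
        if ("passage_multi_vectors".equals(field)) {
          // hypothetical flat-only format: stores multi-vector values, builds no HNSW graph
          return new Lucene99FlatMultiVectorsFormat();
        }
        // regular single-vector HNSW field used for first-pass ANN matching
        return new Lucene99HnswVectorsFormat();
      }
    };

This is the same pattern Lucene99Codec already uses to pick a KnnVectorsFormat per field.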

@vigyasharma (Contributor Author)

As mentioned earlier, here is my rough plan for splitting this change into smaller PRs. Some of these steps could be merged if the impl. warrants it:

  1. Multi-Vector similarity and aggregation classes.
  2. FieldInfo changes to add a new attribute for "aggregation". This will be set to NONE for single-valued vectors, by default, and in formats prior to this change.
  3. Multi-vector support to flat vectors writer.
  4. Random access vector values for multi-vectors.
  5. Multi-vector support to flat vectors reader.
  6. Hnsw writer/reader changes to work with multi-vectors if configured.
  7. Support to index and query multi-vector values (may need to add this with the flat writer/reader PRs).

@jimczi (Contributor) commented Oct 29, 2024

The more I think about it, the less I feel like the knn codec is the best choice for this feature (assuming that this issue is focused on late interaction models).

It is possible that HNSW is not the ideal data structure to expose multi-vector ANN. We don't really change much in the hnsw impl, except using multi-vector similarity for comparisons (graph build and search). Users can use the PerFieldKnnVectorsFormat to wire different data structures on top of the flat multi-vector format. We can also provide something out of the box in a subsequent change. I think the aggregation fn. interface is also flexible enough for different types of similarity implementations?

Using the knn codec to handle multi-vectors seems limiting, especially since it treats multi-vectors as a single unit for scoring. This works well for late interaction models, where we’re dealing with a collection of embeddings, but it’s restrictive if we want to index each vector separately.
Using the original max similarity for HNSW is just not practical, it doesn’t scale, and I don’t think it’s something we’d actually want to support.

It could be helpful to explore other options instead of relying on the knn codec alone. Along those lines, I created a quick draft of a LateInteractionField using binary doc values, which keeps things simple and avoids major changes to the knn codec. I don’t think the flat vector format really offers any advantages over using binary doc values. In both cases, we’re able to store plain dense vectors as bytes, so there doesn’t seem to be a clear benefit to using the flat format here.

What do you think of this approach? It feels like we could skip the full knn framework if our main goal is just to score a bag of embeddings. This would keep things simpler and allow us to focus specifically on max similarity scoring without the added weight of the full knn codec.

My main worry is that adding multi-vectors to the knn codec as a late interaction model might add complexity later. It’s really two different approaches, and it seems valuable to keep the option for indexing each vector separately. We could expose this flexibility through the aggregation function, but that might complicate things across all codecs, as they’d need to handle both aggregate and independent cases efficiently.

@vigyasharma (Contributor Author)

One use-case for multi-vectors is indexing product aspects as separate embeddings for e-commerce search. At Amazon Product Search (where I work), we'd like to experiment with separate embeddings to represent product attributes, user product opinions, and product images. Such e-commerce use-cases would have a limited set of embeddings, but leverage similarity computations across all of them.

I see your point about scaling challenges with very high cardinality multi-vectors like token-level ColBERT embeddings. Keeping them in a BinaryDocValues field is a good idea for scoring-only applications. I like the LateInteractionField wrapper you shared; we should bring it into Lucene for such use cases.

However, I do think there is space for both solutions. It's not obvious to me how the knn codec gets polluted with future complexity. We would still support single vectors as-is. My mental model is: if you want to use multi-vectors in nearest neighbor search (hnsw, or newer algorithms later), index them in the knn field. Otherwise, index them separately as doc values used only for re-ranking top results.

@benwtrent (Member)

One use-case for multi-vectors is indexing product aspects as separate embeddings for e-commerce search. At Amazon Product Search (where I work), we'd like to experiment with separate embeddings to represent product attributes, user product opinions, and product images. Such e-commerce use-cases would have a limited set of embeddings, but leverage similarity computations across all of them.

This seems like just more than one knn field, or the existing nested field support.

But I understand the desire to add multi-vector support to the flat codecs. I am honestly torn about what's the best path forward for the majority of users in Lucene.

@vigyasharma (Contributor Author)

I tried to find some blogs and benchmarks on other library implementations. Astra DB, Vespa, faiss, and nmslib all seem to support multi-vectors in some form.

From what I can tell, Astra DB and Vespa have ColBERT-style multi-vector support in ANN [1] [2]. Benchmarks indicate ColBERT outperforms other techniques in quality, but full ColBERT on ANN has higher latency [3]. For large-scale applications, users seem to over-query on ANN with single-vector representations, and re-rank the results with ColBERT token vectors [4]. However, there's also ongoing work/research on reducing the number of embeddings in ColBERT, like PLAID, which replaces a bunch of vectors with their centroids [5].

...

I am honestly torn around whats the best path forward for the majority of users in Lucene.

I hear you! And I don't want to add complexity only because we have some body of work in this PR. Thanks for raising the concern, Jim; it led me to some interesting reading.

...

My current thinking is that this is a rapidly evolving field, and it's early to lean one way or another. Adding this support unlocks experimentation. We might add different, scalable, ANN algos going forward, and our flat storage format should work with most of them. Meanwhile, there's research on different ways to run late interaction with multiple but fewer vectors. This change will help users experiment with what works at their scale, for their cost/performance/quality requirements.

I'm happy to change my perspective, and would like to hear more opinions. One reason to not add this would be if it makes the single vector setup hard to evolve. I'd like to understand if (and how) this is happening, and think on how we can address those concerns.
...

1: https://docs.datastax.com/en/ragstack/examples/colbert.html
2: https://blog.vespa.ai/semantic-search-with-multi-vector-indexing/
3: https://thenewstack.io/overcoming-the-limits-of-rag-with-colbert/
4: https://blog.vespa.ai/announcing-long-context-colbert-in-vespa/
5: PLAID - https://arxiv.org/abs/2205.09707

@krickert commented Nov 9, 2024

My current thinking is that this is a rapidly evolving field, and it's early to lean one way or another. Adding this support unlocks experimentation.

Amen!

This ends up being so domain-specific. Multi-embeddings become key when you deal with domain voids in the LLMs used to create the embeddings, which is the case for most big corpora. So at least being able to experiment would get you far more feedback.

I would be ok with writing some tests if that helps.

@jimczi (Contributor) commented Nov 15, 2024

One reason to not add this would be if it makes the single vector setup hard to evolve. I'd like to understand if (and how) this is happening, and think on how we can address those concerns.

I believe we should carefully consider the approach to adding multi-vector support through an aggregate function. From the outset, we assume that multi-vectors should be scored together, which is an important principle. Moreover, the default aggregate function proposed in the PR relies on brute force, which is not practical for any indexing setup.

My concern is that this proposal doesn’t truly add support for independent multi-vectors. Instead, it introduces a block of vectors that must be scored together, which feels like a workaround rather than a comprehensive solution. This approach doesn’t address the key challenges of implementing true multi-vector support in the codec.

The root issue is that the current KNN codec assumes the number of vectors is bounded by a single integer, a limitation that needs to be addressed first. Removing this constraint is a complex task but essential for properly supporting multi-vectors. Once that foundation is in place, adding support for setups like ColBERT should become relatively straightforward.

Finally, while the max-sim function proposed in this PR may work as a ranking function, it isn't suitable as an indexing strategy. A true solution should allow independent multi-vectors to be queried and scored flexibly, without these constraints.

@vigyasharma (Contributor Author)

My concern is that this proposal doesn’t truly add support for independent multi-vectors.

That's a valid concern. I've been thinking about a more comprehensive multi-vector solution. Sharing some raw thoughts below, would love to get feedback.

We support a default aggregation value of NONE, which builds the graph with independent multi-vectors. Each node will be a separate vector value. As a first change, we can just support this without creating an aggregation enum. (Adding a plan for indexing this in a follow-up comment).

Once this is in place, we can add support for "dependent" multi-vector values like ColBERT. They'll take an aggregation function. Each graph node will represent all vectors for a document and use aggregated similarity (like in this PR). This will let us experiment with full ANN on ColBERT style multi-vectors.

@vigyasharma (Contributor Author)

...contd. from above – thoughts on supporting independent multi-vectors specified via NONE multi-vector aggregation...
__

The Knn{Float|Byte}Vector fields will accept multiple vector values for documents. Each vector value will be uniquely identifiable by a nodeId. Vectors for a doc will be stored adjacent to each other in flat storage. KnnVectorValues will support APIs for 1) getting docId for a given nodeId (existing), 2) getting vector value for a specific nodeId (existing), 3) getting all vector values for the document corresponding to a nodeId (new).

Our codec today has a single, unique, sequentially increasing vector ordinal per doc, which we can store and fetch with the DirectMonotonicWriter. For multi-vectors, we need to handle multiple nodeIds mapping to a single document.

I'm thinking of using "ordinals" and "sub-ordinals" to identify each vector value. The 'ordinal' is incremented when the docId changes. 'Sub-ordinals' start at 0 for each new doc and are incremented for subsequent vector values in the doc. A nodeId in the graph is a "long" with the ordinal packed into the most-significant 32 bits and the sub-ordinal into the least-significant 32 bits.
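A minimal sketch of that bit packing (the helper names are illustrative):

// Pack the ordinal into the top 32 bits and the sub-ordinal into the bottom 32 bits.
static long toNodeId(int ordinal, int subOrdinal) {
  return (((long) ordinal) << 32) | (subOrdinal & 0xFFFFFFFFL);
}

static int ordinal(long nodeId) {
  return (int) (nodeId >>> 32); // most-significant 32 bits
}

static int subOrdinal(long nodeId) {
  return (int) nodeId; // least-significant 32 bits
}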

For flat storage, we can continue to use the technique in this PR; i.e. have one DirectMonotonicWriter object for docIds indexed by "ordinals", and another that stores start offsets for each docId, again indexed by ordinals. The sub-ordinal bits help us seek to exact vector values from this metadata.

int ordToDoc(long nodeId) {
  // get the int ordinal from the most-significant 32 bits
  // get the docId for that ordinal from the DirectMonotonicReader
}

float[] vectorValue(long nodeId) {
  // get the int ordinal from the most-significant 32 bits
  // get the "startOffset" for that ordinal
  // get the subOrdinal from the least-significant 32 bits
  // read the vector value at startOffset + (subOrdinal * dimension * byteSize)
}

float[] getAllVectorValues(long nodeId) {
  // get the int ordinal from the most-significant 32 bits
  // get the "startOffset" for that ordinal
  // get the "endOffset" from the offset value for ordinal + 1
  // return values from [startOffset, endOffset)
}

With this setup, we won't need parent-block join queries for multiple vector values. And we can use getAllVectorValues() for scoring with max or avg of all vectors in the doc at query time.

I'm skeptical that this will give a visible performance boost. It should at least be similar to the block-join setup we have today, but hopefully more convenient to use. And it sets us up for "dependent" multi-vector values like ColBERT.

We'll need to code this up to iron out any wrinkles. I can work on a draft PR if the idea makes sense.
__

Note that this still doesn't allow >2B vector values. While the "long" nodeId can support it, our ANN impl. returns arrays containing all nodeIds in various places, and Java arrays can't exceed ~2B elements. But we can address this limitation separately, perhaps with a different ANN algo for such high-cardinality graphs.

@krickert

And we can use getAllVectorValues() for scoring with max or avg of all vectors in the doc at query time.

Your proposal to implement getAllVectorValues() for scoring documents by aggregating their vectors (using methods like max or average) at query time has a lot of use cases, and I think it's a great idea. In my domain-specific data, though, this approach hasn't enhanced search results. Still, providing a default implementation with the option for customization, as you suggested, could be beneficial.

(sidenote: if you are doing max/average, you can do that during index time though, right?)

I'm currently conducting A/B tests on three methods to retrieve and rank documents with multiple vectors:

  1. Aggregate Scoring: Computing a single relevance score per document by aggregating all its vectors. Flexibility in the aggregation method would help me a lot.
  2. Chunk-Based Highlighting: Treating each vector as a distinct document chunk to facilitate highlighting. This involves returning the top N documents, which makes K more dynamic: since we want the top N documents and each document may contain multiple relevant sections, K represents the chunks that back those documents. Implementing per-doc thresholds can help manage performance.
  3. Custom Aggregation with Embedding Tags: Associating vectors with specific tags, such as user access levels or n-gram embeddings, to enable dynamic aggregation strategies. This allows for personalized and context-sensitive relevance scoring and would require the ability to override/customize.

The third approach is particularly promising for domain-specific applications, where standard aggregation methods may not suffice. For instance, embedding tags could be linked to user access controls, unlocking certain vectors at query time, or to specific n-grams, activating them based on query content.

Incorporating a mechanism to override the default aggregation method would facilitate experimentation with these strategies.

@vigyasharma (Contributor Author)

Thank you for sharing these use-cases @krickert !

  1. Aggregate Scoring – I think we can do this today by joining the child doc hits with their parents and calculating the score over all children in the ToParentBlockJoinQuery. The getAllVectorValues() API should make this easier by avoiding the two-phase query. We could also use the aggregate query scores during approximate-search graph traversal itself (use the aggregate query similarity with all vector values for the doc)?

  2. Chunk-Based Highlighting – Interesting. With getAllVectorValues(), we can find all vector values with similarity above a separate sim-threshold for highlights?

  3. Custom Aggregation with Embedding Tags – I think this one plays better with a separate child doc per vector value. We can store these tags and access related data as separate fields in child docs and filter on them during search.

Honestly, I think the existing parent-block join (sketched below) can achieve most use-cases for independent multi-vectors (the passage vector use case). But the approach above might make it easier to use? We also need it for dependent multi-vectors like ColBERT, though it's a separate question whether ANN is even viable for ColBERT (vs. only for re-ranking).

I'd like to know what issues or limitations people face with the existing parent-child support for multiple vector values, so we can address them here.
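For context, a minimal sketch of the existing parent-block join setup for independent per-passage vectors (field names and parameters are illustrative); this is the baseline that the getAllVectorValues() proposal above aims to simplify:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.BitSetProducer;
import org.apache.lucene.search.join.DiversifyingChildrenFloatKnnVectorQuery;
import org.apache.lucene.search.join.QueryBitSetProducer;

// Each passage is a child document with a single-valued KnnFloatVectorField("passage_vector", ...);
// parent documents are marked with docType=parent.
BitSetProducer parentsFilter =
    new QueryBitSetProducer(new TermQuery(new Term("docType", "parent")));

float[] queryVector = new float[768]; // the query embedding (768-dim here for illustration)

// kNN over child vectors, diversified so each parent is represented by its best-scoring child.
Query knnQuery =
    new DiversifyingChildrenFloatKnnVectorQuery(
        "passage_vector", queryVector, /* childFilter= */ null, /* k= */ 10, parentsFilter);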

@krickert

Chunk-Based Highlighting – Interesting. With getAllVectorValues(), we can find all vector values with similarity above a separate sim-threshold for highlights?

Not sure. But it is frustrating for me: we only calculate K chunks and not N documents. I want to always return N documents, and keep running K until N is reached. Since it runs K on the chunks, I'd rather it return all the chunks it can until it reaches N documents. Then we can return the matching chunks, which can be used for highlighting.

I think this one plays better with a separate child doc per vector value. We can store these tags and access related data as separate fields in child docs and filter on them during search.

Indexing the child docs requires making more docs. We just care about the resulting embedding, so why not treat it like a tensor instead of an entire document? It's frustrating to always make a child doc for multiple vectors when I could just use a keyword-value style instead. Also, there are definitely some limitations with how you can use it with scoring, and the query ends up looking like a mess. If we can simplify the query syntax, that would help a lot.

If you can get a unit test going for your PR, I'd be glad to expand on it and play with it a bit.


github-actions bot commented Dec 5, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Dec 5, 2024