Remote Vector Index Build Component — Object Store Upload/Download #2392

Open
Tracked by #2391
jed326 opened this issue Jan 14, 2025 · 2 comments
Assignees
Labels
Features Introduces a new unit of functionality that satisfies a requirement indexing-improvements This label should be attached to all the github issues which will help improving the indexing time. Roadmap:Vector Database/GenAI Project-wide roadmap label

Comments

@jed326
Contributor

jed326 commented Jan 14, 2025

See #2391 for background information

Overview

Following up on the RFCs, this is the first part of the low-level design for the Vector Index Build Component. The Vector Index Build Component is a logical component that we further split into 2 subcomponents with the following responsibilities:

  1. Object Store I/O Component
    1. Upload flat vectors to Object Store
    2. Download graph file from object store
  2. Remote Vector Service Client Component
    1. Signal to Remote Vector Index Build Service to begin graph construction after vector files have been uploaded
    2. Receive a signal from Remote Vector Index Build Service to begin graph file download after graph construction is completed

This document contains the low level design for [1] Object Store I/O Component, covering how we can upload vectors and download graph files from a remote object store, as well as how we can configure the object store. The low level design for the remote vector service client is in a separate issue.

Alternatives Considered

The specific problem we are addressing in this design is [1] how to upload vectors to a remote object store from the vector engine and [2] how to download a graph file from a remote object store to the vector engine.

For discussion on high level architectural alternatives see: #2293

1. [Recommended] Integrate Repository Service with Vector Engine

This approach involves consuming the RepositoriesService in the k-NN plugin, which will then use the BlobContainer interface to read/write blobs to the remote repository.

Pros:

  • Uses existing repository interface to interact with remote object store, which is a well tested interface that has been around for a long time
  • Out of the box support for all existing supported repositories, so no specific implementation for S3, GCP, Azure is needed
  • This use case is well tested by remote store, where we write data to repositories not in snapshot format
  • Encryption comes “for free” as it already exists for snapshot and remote store repositories

Cons:

  • Will be harder to make any vector upload performance improvements specific to a certain object store implementation
  • Added complexity in bootstrapping vector repository

2. Custom object store upload/download client

Instead of using the existing interfaces, we build our own custom blob interface for uploading vectors and downloading graph files. We will also need to implement all of the object store clients.

Pros:

  • Complete control over client implementation / usage making it easy to make any needed performance improvements

Cons:

  • We will need to keep separate object store clients in the k-NN plugin, and we would be responsible for implementing all of the object store clients.
  • Encryption at rest will require special handling
  • We are duplicating much of the same functionality the repository interface already provides within the k-NN plugin

Repository Service Integration

Since the OpenSearch Plugin interface makes the RepositoriesService, the service responsible for maintaining and providing access to snapshot repositories, available to plugins, the k-NN plugin can consume this and use it to read/write to any supported remote object store.

[Image: High Level Class Diagram]
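As a rough illustration of the read/write contract the plugin would rely on, the sketch below mirrors the two BlobContainer methods used in this design with a hypothetical stand-in interface and an in-memory implementation. `SimpleBlobContainer` and `InMemoryBlobContainer` are illustrative names only and are not part of OpenSearch; in the plugin the container would presumably come from the repository returned by the RepositoriesService.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.NoSuchFileException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in mirroring the two BlobContainer methods this design uses. */
interface SimpleBlobContainer {
    void writeBlob(String blobName, InputStream in, long blobSize, boolean failIfAlreadyExists) throws IOException;

    InputStream readBlob(String blobName) throws IOException;
}

/** In-memory implementation, for illustration only (a real repository talks to an object store). */
final class InMemoryBlobContainer implements SimpleBlobContainer {
    private final Map<String, byte[]> blobs = new HashMap<>();

    @Override
    public void writeBlob(String blobName, InputStream in, long blobSize, boolean failIfAlreadyExists)
        throws IOException {
        if (failIfAlreadyExists && blobs.containsKey(blobName)) {
            throw new FileAlreadyExistsException(blobName);
        }
        // blobSize is ignored here; a real object store client would use it to plan the upload.
        blobs.put(blobName, in.readAllBytes());
    }

    @Override
    public InputStream readBlob(String blobName) throws IOException {
        byte[] data = blobs.get(blobName);
        if (data == null) {
            throw new NoSuchFileException(blobName);
        }
        return new ByteArrayInputStream(data);
    }
}
```

The upload path would call `writeBlob` with the vector stream, and the download path would call `readBlob` with the graph blob name once the remote build has finished.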

Repository Configuration, Creation, & Validation

In order to use the RepositoriesService, we need to configure and create our own vector repository. For comparison, at a high level remote store does this by exposing node attribute settings (see docs) that are used to create and register a repository on the boot up of OpenSearch nodes and on the formation of the OpenSearch cluster. The key difference between our use case and remote store is that the vector repository does not need to be registered before system indices are created, meaning the vector repository creation can technically happen post node startup.

More specifically, remote store does the following (see Remote Store Repository Creation Design):

  1. Remote store repository configurations are consumed via node attributes set in opensearch.yml
  2. When TransportService is started, the repositories are created on each node
  3. Repositories are put into a map in RepositoriesService. Repositories are not yet registered as ClusterService is not started at this point
  4. ClusterService is started after transport service
  5. On cluster state change event the repository is registered after validating repository settings. Repository registration will only happen once during cluster formation, subsequent node boots just perform validation.

For our use case the problem here is that the map in [3] is not extensible for plugins, as there is an explicit isEmpty check on it. Therefore, if we use the same logic from RemoteStoreNodeService#createAndVerifyRepositories to register our vector repository, this would interfere with remote store registering its own repositories.

Given this problem, the following are the possible ways we can configure, create, and validate a vector repository from the k-NN plugin.

  1. [Recommended] Do not bootstrap the vector repository from k-NN Plugin at all. Instead, users will configure a vector repository on their own and use a cluster setting to indicate the name of the repository they have set up.
    1. Pros
      1. We do not need to handle the above complexities of repository creation and validation from the k-NN Plugin
      2. This is not a one way door, and we can in parallel work with remote store team to support a node attributes method of registering the repository to be rolled out at a later time
      3. Conceptually, it should not be up to the plugin to figure out how to wire up a repository and instead the notion of “System Repositories” should be pluggable in the same way “System Indices” are pluggable
    2. Cons
      1. Added complexity for open source users to configure/use the feature
  2. Configure repository with node attributes similar to remote store, then create/validate repository in ClusterPlugin#onNodeStarted
    1. Pros
      1. This is the very last callback for plugins, so this guarantees the RepositoriesService and ClusterService are started beforehand
      2. Repository registration would only occur once during cluster formation at which time there should not be any vector index merge/flush operations.
      3. As mentioned before, unlike remote store we do not have the requirement of the vector repository being registered before system index creation.
      4. Regardless of approach, we will need failure handling for a missing vector repository, so any graph build that encounters a missing repository exception will already fall back gracefully to the local CPU path
    2. Cons
      1. This is called after the node starts accepting traffic
      2. Repository validation may still happen after the node already begins uploading vectors for a flush/merge, which makes the failed repository validation scenario difficult to handle.
  3. Configure repository with node attributes similar to remote store, then create/validate repository in Plugin#getBootstrapChecks
    1. Pros
      1. Same as [2]
      2. If the vector repository validation/creation fails, this will fail the node bootup as well
    2. Cons
      1. Creating the Vector repository is overloading the bootstrap check as it’s not intended to perform any cluster bootstrapping, only validate if the bootstrapping is completed successfully and correctly

Consume RepositoriesService in NativeEnginesKnnVectorsWriter

Since we want to perform blob upload/download during the merge/flush operations, the KnnVectorsWriter class needs to have a reference to the RepositoriesService in order to perform the upload/downloads. The following refactoring will be required:

  1. KNNCodecService signature will need to be updated to consume repo service
  2. KNN9120PerFieldKnnVectorsFormat signature will need to be updated to consume repo service

Vector Input Stream Conversion

BlobContainer#writeBlob takes the data to be written as an InputStream, and BlobContainer#readBlob returns the blob contents as an InputStream, so we will need to implement logic to buffer KNNVectorValues into an InputStream. Depending on object upload performance analysis and benchmarking, this may require a follow-up deep dive to do this process more efficiently.

In the POC where vectors are buffered 1 by 1, the transfer of ~1.6m 768 dimension vectors only takes ~1 minute to complete, so we can revisit the performance aspect here as needed.
See: 10M 768D dataset without source and recovery source(best time): GPU based Vector Index Build : POC with OpenSearch Vector Engine
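A minimal sketch of the "buffer vectors 1 by 1" idea, under stated assumptions: `VectorInputStream` is a hypothetical class, a plain `Iterator<float[]>` stands in for KNNVectorValues, every vector is assumed to have exactly `dimension` components, and little-endian byte order is an assumption rather than a decision from this design.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Iterator;
import java.util.Objects;

/** Hypothetical adapter: exposes float vectors as a stream of raw bytes, one vector buffered at a time. */
final class VectorInputStream extends InputStream {
    private final Iterator<float[]> vectors; // stand-in for iterating KNNVectorValues
    private final ByteBuffer buffer;         // holds exactly one serialized vector

    VectorInputStream(Iterator<float[]> vectors, int dimension) {
        this.vectors = vectors;
        // Byte order is an assumption; the real contract with the build service may differ.
        this.buffer = ByteBuffer.allocate(dimension * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        this.buffer.limit(0); // start empty so the first read triggers a refill
    }

    /** Serializes the next vector into the buffer; returns false when no vectors remain. */
    private boolean refill() {
        if (!vectors.hasNext()) {
            return false;
        }
        buffer.clear();
        for (float component : vectors.next()) {
            buffer.putFloat(component);
        }
        buffer.flip();
        return true;
    }

    @Override
    public int read() throws IOException {
        while (!buffer.hasRemaining()) {
            if (!refill()) {
                return -1; // end of stream
            }
        }
        return buffer.get() & 0xFF;
    }

    @Override
    public int read(byte[] dst, int off, int len) throws IOException {
        Objects.checkFromIndexSize(off, len, dst.length);
        if (len == 0) {
            return 0;
        }
        while (!buffer.hasRemaining()) {
            if (!refill()) {
                return -1; // end of stream
            }
        }
        int n = Math.min(len, buffer.remaining());
        buffer.get(dst, off, n);
        return n;
    }
}
```

Only one vector is held in memory at a time, which keeps the heap footprint small at the cost of many small copies. If benchmarks show this is a bottleneck, buffering several vectors per refill would be one option to try.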

Vector File Format

One of the key design choices will be how we format the vector file being written to the object store. Moreover, one of the key decisions from the RFC is we want the remote vector build service to be OpenSearch/Lucene version agnostic (see: link)
The following are possible ways we can format the uploaded vector file:

  1. [Recommended] Upload only raw vector binaries to the remote object store. All other graph build metadata is sent in the build API request, including vector type (byte, float, binary) and vector dimensions
    1. Pros: This is the simplest approach and allows maximum compatibility, as we will not need to define (or enforce) any file format contracts.
    2. Cons: The amount of metadata may be large if a large number of parameters are needed for graph construction. Vector build request body will handle all request metadata.
  2. Upload a separate metadata file to the remote object store.
    1. Pros: We can simplify the graph build request to contain only bucket related information and not include any graph build related metadata.
    2. Cons: We need to define and enforce the file format to be used. May not be fully compatible with all future use cases and is basically taking a soft dependency on faiss.
  3. Upload the metadata to the same blob as the raw vector files.
    1. Pros: Same as [2], and we do not need to upload and track 2 separate files.
    2. Cons: Same as [2]
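To illustrate option [1], the sketch below shows how a consumer could interpret a header-less blob of raw vectors using only metadata supplied out of band (vector type and dimension). `RawVectorBlob` and its methods are hypothetical, and float components in little-endian order are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helpers for a raw, header-less vector blob; all layout details come from request metadata. */
final class RawVectorBlob {
    /** Expected blob size in bytes: vectorCount * dimension * bytesPerComponent. */
    static long expectedSizeBytes(long vectorCount, int dimension, int bytesPerComponent) {
        return vectorCount * dimension * bytesPerComponent;
    }

    /** Decodes the vector at the given index, assuming float components in little-endian order. */
    static float[] readFloatVector(byte[] blob, int dimension, int index) {
        ByteBuffer buf = ByteBuffer.wrap(blob).order(ByteOrder.LITTLE_ENDIAN);
        buf.position(index * dimension * Float.BYTES);
        float[] vector = new float[dimension];
        for (int i = 0; i < dimension; i++) {
            vector[i] = buf.getFloat();
        }
        return vector;
    }
}
```

Under the same assumption of 4-byte float components, `expectedSizeBytes(1_600_000L, 768, 4)` gives 4,915,200,000 bytes, or roughly 4.9 GB for a dataset the size of the POC transfer mentioned earlier.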

Lucene .vec format for reference

Blob Name & Blob Path

With the repository interface a single bucket will be used for all of the vector blobs on a given domain. Therefore we need to design the blob name and blob path in such a way to prevent key collision resulting in the same vector blob being concurrently written. For snapshots (and remote store), only the primary shard data is uploaded to the repository, so the segments of a snapshot are uploaded to indices//. For more details see: OpenSearch Blog on snapshot structure. However, since we want to support both segment replication and document replication, we also have to account for both the primary and replica shard of a given index performing graph builds at the same time. In other words adding to the file path is not sufficient for collision prevention.

For the blob name, the same shard may be performing flush/merge for multiple segments at the same time, so we can deduplicate on the segment name:

blobVectorName = segmentName + "_" + fieldname + ".blobvec"

Segment names are never re-used by Lucene, so we do not have to worry about a future segment having the same name (ref: Lucene docs)

For the blob path, we have a few options to choose from:

  1. Do not allow using the remote vector build service when replicas are configured for a document replication based index. Then, we can use a similar file path as snapshots:
blobVectorFilePath = <BASE_PATH>/vectors/<indexUUID>/<shardNumber>/blobVectorName

Pros
1. This is the simplest solution and we do not have to worry about any replica deduplication logic
2. From a cost perspective it doesn’t make sense for users to use the remote vector build service when replicas are enabled anyways
Cons
1. Not allowing the remote vector build service when replicas are configured is not a very intuitive user experience

  2. Add the node ID into the file path. For existing OpenSearch APIs which require potentially referencing multiple replicas, for example the _cluster/allocation/explain API, the node which the shard is currently assigned to is used as an identifier, as multiple replicas for the same index cannot be assigned to the same node.
blobVectorFilePath = <BASE_PATH>/vectors/<indexUUID>/<shardNumber>/<nodeId>/blobVectorName

Pros
1. This gracefully handles the replica key collision without randomly generating any new information
Cons
1. If the invariant of 1 replica per node ever changes then that will break this implementation. However, it is unlikely this would ever change.
2. In case of merge/flush operation failure and retry, there may be an edge case where the same segment name is re-created.

  3. [Recommended] Generate a uuid for each shard copy in the file name. Instead of using a nodeID like in [2], we can add a uuid to the blobName to distinguish between shard copies. Since the graph build is being done in a synchronous manner there’s no need to keep a map of uuid to shard copy anywhere for later lookup.
blobVectorFilePath = <BASE_PATH>/vectors/<indexUUID>/<shardNumber>/uuid + blobVectorName

Pros
1. In case multiple replicas per node are supported in the future, this will still work
2. Handles any edge cases where the same segment name may get recreated
Cons
1. The blob path/name becomes non-deterministic which may make it more difficult to debug issues and handle retries, especially if we want to move to an async graph build architecture later we would need to keep a mapping between the uuid and the shard.

The constructed graph file can use the same blobVectorFilePath and the same base blob name with a different file extension, as there is a 1:1 mapping between the uploaded vector blob and the downloaded graph.

blobVectorName = segmentName + "_" + fieldname + ".blobgraph"
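The naming scheme for the recommended uuid option can be sketched as follows. `VectorBlobNames` and its method names are hypothetical, and the "_" separator between the uuid and the blob name is an assumption added for readability (the formula above concatenates them directly).

```java
import java.util.UUID;

/** Hypothetical helpers that build blob names and paths following the recommended uuid option. */
final class VectorBlobNames {
    static String vectorBlobName(String segmentName, String fieldName) {
        return segmentName + "_" + fieldName + ".blobvec";
    }

    static String graphBlobName(String segmentName, String fieldName) {
        return segmentName + "_" + fieldName + ".blobgraph";
    }

    /** Directory portion of the key: <BASE_PATH>/vectors/<indexUUID>/<shardNumber>/ */
    static String shardPath(String basePath, String indexUuid, int shardNumber) {
        return basePath + "/vectors/" + indexUuid + "/" + shardNumber + "/";
    }

    /** Full key; the uuid prefix distinguishes shard copies that build the same segment concurrently. */
    static String blobKey(String basePath, String indexUuid, int shardNumber, UUID copyId, String blobName) {
        // The "_" separator is an assumption, not part of the original formula.
        return shardPath(basePath, indexUuid, shardNumber) + copyId + "_" + blobName;
    }
}
```

Because the same `copyId` is reused for the vector blob and the graph blob of one build, the graph key can be derived from the vector key without any lookup table, which matches the synchronous build flow described above.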

Feature Controls

This section covers the implementation details of how we can [1] enable and disable the feature and [2] configure thresholds with which we decide whether or not to use the remote vector build GPU fleet or use the local CPU build path.

Feature Enablement:
Given the recommended solution in Repository Configuration, Creation, & Validation, we will not use any node attributes for this feature.

  1. Dynamic cluster setting with the repository name to be used to upload vectors to.
  2. Dynamic cluster setting to enable/disable remote vector build feature
    1. This is for customers to stop using the remote vector build service if they do not want to use it anymore
  3. Dynamic Index setting to enable/disable the feature per field. This would override [2]
    1. This is to allow customers to only use GPU builds on a subset of the vector fields in a given index
    2. We can provide NONE and ALL values to allow customers to disable remote GPU builds on entire indices

We also want to provide intelligent logic to automatically decide for the customer whether or not to use the remote GPU build feature. This will come in the form of some dynamic cluster and index settings for which we will provide smart default values based on benchmarking analysis.

  1. Cluster Settings:
    1. Minimum document count per segment
    2. Minimum Vector Dimension
  2. Index Settings, which would override cluster settings on a per-index basis
    1. Minimum document count per segment
    2. Minimum Vector Dimension
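One possible shape for the threshold logic is sketched below. `RemoteBuildDecision` is a hypothetical class, and requiring both thresholds to be met (rather than either one) is an assumption; the design does not yet specify how the two thresholds combine.

```java
import java.util.OptionalInt;

/** Hypothetical decision helper: index-level thresholds override cluster-level ones when present. */
final class RemoteBuildDecision {
    /** Returns the index-level override if set, otherwise the cluster-level value. */
    static int effectiveThreshold(int clusterValue, OptionalInt indexOverride) {
        return indexOverride.orElse(clusterValue);
    }

    /**
     * Assumes the remote build is used only when the feature is enabled and BOTH thresholds are met.
     * The AND combination is an assumption made for this sketch.
     */
    static boolean shouldUseRemoteBuild(
        boolean featureEnabled,
        int segmentDocCount,
        int dimension,
        int minDocCount,
        int minDimension
    ) {
        return featureEnabled && segmentDocCount >= minDocCount && dimension >= minDimension;
    }
}
```

For example, with a cluster-level minimum of 10,000 documents and an index-level override of 1,000, a 5,000-document segment in that index would qualify while the same segment in an index without the override would stay on the local CPU path.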

Metrics

This section will cover metrics specific to the vector blob upload and graph blob download. Other metrics related to triggering the vector build will be covered in Vector Index Build Component — Remote Vector Service Client. As we are dealing with only blob upload/download here, we can scope down the metrics to the following:

  1. Data Upload/Download Volume
  2. Data Upload/Download Time
  3. Upload/Download Success Count
  4. Upload/Download Failure Count
  5. Upload/Download Success Rate, computed from [3], [4]
  6. Specific repository implementations also provide their own metrics, for example for repository-s3

Today the k-NN stats API only supports cluster and node level stats, so we can gather these metrics on a cluster/node level and expose them via the k-NN stats API.
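A node-level collector for the metrics listed above could look roughly like this. `BlobTransferStats` is a hypothetical class and not part of the existing k-NN stats framework; reporting a rate of 0.0 when nothing has been attempted is an arbitrary choice made for the sketch.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical node-level counters for blob upload or download; one instance per direction. */
final class BlobTransferStats {
    private final LongAdder successCount = new LongAdder();
    private final LongAdder failureCount = new LongAdder();
    private final LongAdder totalBytes = new LongAdder();
    private final LongAdder totalMillis = new LongAdder();

    void recordSuccess(long bytesTransferred, long elapsedMillis) {
        successCount.increment();
        totalBytes.add(bytesTransferred);
        totalMillis.add(elapsedMillis);
    }

    void recordFailure(long elapsedMillis) {
        failureCount.increment();
        totalMillis.add(elapsedMillis);
    }

    long totalBytes() {
        return totalBytes.sum();
    }

    long totalMillis() {
        return totalMillis.sum();
    }

    /** Success rate = successes / (successes + failures); 0.0 when no transfers were attempted. */
    double successRate() {
        long successes = successCount.sum();
        long attempts = successes + failureCount.sum();
        return attempts == 0 ? 0.0 : (double) successes / attempts;
    }
}
```

`LongAdder` is used because flush and merge threads may record transfers concurrently, and the counters are read far less often than they are written.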

As a separate item we should explore supporting index/shard level k-NN stats, as it would be valuable to see specifically which indices are using and benefiting the most from the remote vector build service.

Failure Scenarios

Similar to Metrics, this section will also specifically focus on the failure scenarios for blob upload/download. At a high level, we need to gracefully handle all failures to fall back to the CPU graph build path as we cannot leave the segment without a graph file.
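The fallback requirement can be sketched as a small wrapper. `GraphBuildFallback`, `RemoteBuild`, and `LocalBuild` are hypothetical names, and treating every IOException or RuntimeException from the remote path as a reason to fall back is an assumption; the real implementation may want to distinguish failure types.

```java
import java.io.IOException;

/** Hypothetical wrapper: try the remote build path first, fall back to the local CPU build on failure. */
final class GraphBuildFallback {
    /** Remote path: upload vectors, trigger the build, download the graph. May fail with I/O errors. */
    interface RemoteBuild {
        byte[] build() throws IOException;
    }

    /** Local CPU path, used whenever the remote path fails. */
    interface LocalBuild {
        byte[] build();
    }

    static byte[] buildWithFallback(RemoteBuild remote, LocalBuild local) {
        try {
            return remote.build();
        } catch (IOException | RuntimeException e) {
            // Assumption: any remote failure (missing repository, upload or download error) triggers fallback,
            // so the segment is never left without a graph file.
            return local.build();
        }
    }
}
```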

Since we are integrating with the existing BlobContainer interface, both retries and exceptions are already well defined by the interface:

We will explore the additional failure scenarios related to vector graph build in the client design: Vector Index Build Component — Remote Vector Service Client

Performance Benchmarks

We will perform additional performance benchmarks related to blob upload/download, following up on the initial POC numbers in Vector Input Stream Conversion, and based on benchmark results we can adjust vector input stream conversions as needed.

End to end and remote vector build client benchmarking will be covered in a separate document.

Future Optimizations

Below are some future optimizations we can look into based on performance analysis:

  • Cache vector graph when force merging to 1 segment
  • Serialization optimizations for blob upload/download
@jed326 jed326 changed the title HLD - Remote Object Store Upload/Download Remote Vector Index Build Component — Object Store Upload/Download Jan 14, 2025
@jed326 jed326 self-assigned this Jan 14, 2025
@jed326 jed326 added Features Introduces a new unit of functionality that satisfies a requirement indexing-improvements This label should be attached to all the github issues which will help improving the indexing time. Roadmap:Vector Database/GenAI Project-wide roadmap label labels Jan 14, 2025
@andrross
Member

Thanks @jed326. Using the existing repository framework for this use case makes sense to me, compared to the alternative of implementing another object store abstraction. I also agree that the remote store use case is different, which led to the choice to define and register the remote store repository at node bootstrap time. This use case seems more similar to how snapshot functionality uses repositories, which allows them to be created and used after cluster creation.

Perhaps I'm missing it, but have we defined the high-level user flow on how these concepts will be used? How does a user configure a k-NN index to use the remote vector index build feature, and how do they tell that index to use a specific repository? How will a user create a repository, is it the same as today when a user creates a snapshot repository? For the repository being used here, can it also be used for snapshots? Will all the existing APIs (e.g. /_cat/repositories) work the same with these repositories?

@jed326
Contributor Author

jed326 commented Jan 16, 2025

Thanks @andrross! The user flow on how to set up and configure the specific repository, as well as which specific settings they can use to do so, will be finalized in a low level design in the very near future. Currently I'm working on some POCs to verify that we actually can wire the RepositoriesService into the KNN codec to read/write data, after which I will publish these specific details.

Until then, here are my current thoughts at a high level:

How does a user configure a k-NN index to use the remote vector index build feature, and how do they tell that index to use a specific repository?

The vector repository will be shared across indices and we will use a cluster setting to indicate which repository is the vector repository. There will also be index/cluster settings to enable/disable the remote build feature.

How will a user create a repository, is it the same as today when a user creates a snapshot repository?

Yep, it's a "bring your own repository" model like snapshots/searchable snapshots.

For the repository being used here, can it also be used for snapshots?

I did not see a way to enforce the repository not being able to take snapshots, however we will keep the knn index data outside of the indices path so as to not interfere with snapshots.

Will all the existing APIs (e.g. /_cat/repositories) work the same with these repositories?

In short, yes. Today these same APIs still work for the remote store repositories; however, _cat/snapshots doesn't return anything on those repositories as the data is not written in snapshot format.

Projects
Status: New
Status: Backlog
Development

No branches or pull requests

3 participants