[RFC] Remote Vector Index Build Feature with OpenSearch Vector Engine #2294

Open
navneet1v opened this issue Nov 28, 2024 · 0 comments
Labels
enhancement indexing indexing-improvements This label should be attached to all the github issues which will help improving the indexing time.

Introduction

As part of RFC : Boosting OpenSearch Vector Engine Performance using GPUs, we proposed using GPUs to reduce vector index build time. This RFC proposes the high-level design of a generic Remote Vector Index Build capability in Vector Engine, which will be used to connect to the remote GPU-based index build fleet.

Requirements

  1. OpenSearch Vector Engine should be able to use a remote index build service to create the k-NN vector index.
  2. A user, operator, or distribution provider should be able to configure the remote vector index build endpoint, along with the other details required for vector index build, at the cluster level or index level.
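
To make the second requirement concrete, here is a minimal sketch of how cluster-level and index-level configuration might be combined. The setting names (`knn.remote_index_build.enabled`, `knn.remote_index_build.endpoint`) and the rule that index-level values override cluster-level values are assumptions for illustration only; the RFC does not define them.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical setting names; the RFC does not define them.
ENABLED = "knn.remote_index_build.enabled"
ENDPOINT = "knn.remote_index_build.endpoint"


@dataclass(frozen=True)
class RemoteBuildConfig:
    enabled: bool
    endpoint: Optional[str]


def resolve_config(cluster: Mapping[str, object],
                   index: Mapping[str, object]) -> RemoteBuildConfig:
    """Merge settings, letting index-level values override cluster-level ones."""
    merged = {**cluster, **index}
    endpoint = merged.get(ENDPOINT)
    # Remote build is only usable when it is enabled and an endpoint is known.
    enabled = bool(merged.get(ENABLED, False)) and endpoint is not None
    return RemoteBuildConfig(enabled=enabled, endpoint=endpoint)
```

With this shape, an operator could set the endpoint once at the cluster level while individual indices opt in or out.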

Proposed Vector Engine Architecture to integrate with Remote Vector Index Build Service

Assumptions

Below are some of the assumptions made while designing the integration of Vector Engine with the remote Index Build Service.

  1. This design assumes that there is an IndexBuildService hosted at an endpoint where Vector Engine can submit create-index requests.
  2. Vector Engine has all the details on how to connect to the object store to stream vectors and download the index.
  3. TODO: Add what will be provided by the customer.
  4. Building an index via a remote endpoint will be supported for indices that use NativeKNNVectorsFormat and for old indices that use DocValuesFormat (the old index format, due to the limitations on accessing different index/mapping-related attributes at the codec level).

Roles and Responsibilities

  1. Given the details of the object store, upload vector data to the store and download the index once it is created.
  2. Given an endpoint for the index build service, call the remote endpoint to create the vector index per segment.
  3. Have intelligent logic to decide when to use the index build service endpoint versus local compute to build the index.
  4. Once the index is built, download it from the remote store and place it alongside the other segment-related files.

Overall Flow/High Level Design

Local Index Build Flow (Old Flow)

  1. Users will continue to create the index with the knn_vector field mapping as usual, with one change: users will give Vector Engine a hint about whether they want to use an externally hosted index build service, along with other required details such as the external endpoint, to build the vector index. This choice will be dynamic and can be updated after index creation. The exact customer experience is not yet defined and will be added in upcoming proposals.
  2. Users will still use the same indexing API to ingest documents with vectors into OpenSearch. The document will follow the same process until it reaches the KNNVectorsWriter in KNNCodec. The KNNVectorsWriter will keep accumulating the vectors in RAM buffers and will call the NativeIndexBuilder component during Lucene flush.
    1. Flush/Merge: During flush/merge, the NativeIndexBuilder component is called to build the vector search data structures. In the new flow, the NativeIndexBuilder component will still build the params for the index build, but at this point Vector Engine will decide where the index should be built.
  3. For a CPU-based index build, the LocalIndexBuild component is called, which will use the native libraries installed on the data nodes to build the index.
  4. The locally built index will then be serialized and stored as a Lucene segment file and will be tracked by Lucene just like any other file.
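
The decision point in step 2.1 could look something like the following sketch. The size-based rule and the `MIN_REMOTE_BYTES` threshold are purely hypothetical; the RFC leaves the actual decision logic to later proposals.

```python
from dataclasses import dataclass

# Hypothetical cut-off: below this size a remote round trip is assumed
# not to be worth the transfer cost. The RFC does not specify a value.
MIN_REMOTE_BYTES = 50 * 1024 * 1024


@dataclass(frozen=True)
class SegmentBuildRequest:
    num_vectors: int
    dimension: int
    remote_enabled: bool  # the user's hint from the index configuration


def choose_build_target(req: SegmentBuildRequest,
                        min_remote_bytes: int = MIN_REMOTE_BYTES) -> str:
    """Return "remote" or "local" for a single segment's index build."""
    if not req.remote_enabled:
        return "local"
    # Assumes float32 vectors (4 bytes per dimension).
    size_bytes = req.num_vectors * req.dimension * 4
    return "remote" if size_bytes >= min_remote_bytes else "local"
```

Small flush-time segments would stay on the data node under this rule, while large merged segments would go to the remote service.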

Remote Index Build Flow:

  1. Steps 1 and 2 remain the same as described above.
  2. If Vector Engine decides to use the externally hosted IndexBuildService to build the index, the Remote Index Build component will stream the vectors to the object store (details on how to speed up this process and other optimizations will be added in upcoming proposals) and call the createIndex API of IndexBuildService.
    1. Once the index is built, IndexBuildService will upload the index to a specific destination and notify Vector Engine, which will then download the index onto the data node.
  3. The index built by either the new flow or the current flow will then be stored as a Lucene segment file and will be tracked by Lucene just like any other file.
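
The remote flow above can be sketched end to end with in-memory stand-ins. Everything here is illustrative: `InMemoryObjectStore`, `FakeIndexBuildService`, the key naming scheme, and the synchronous `create_index` call are assumptions, and the real notification mechanism is still an open question in this RFC.

```python
from typing import Dict


class InMemoryObjectStore:
    """Stand-in for the object store; keeps blobs in a dict."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class FakeIndexBuildService:
    """Stand-in for IndexBuildService; 'builds' by tagging the input bytes."""

    def __init__(self, store: InMemoryObjectStore) -> None:
        self._store = store

    def create_index(self, vectors_key: str, index_key: str) -> None:
        vectors = self._store.get(vectors_key)
        # A real service would build the vector index here (e.g. on GPUs).
        self._store.put(index_key, b"INDEX:" + vectors)


def build_segment_index_remotely(segment: str, vector_bytes: bytes,
                                 store: InMemoryObjectStore,
                                 service: FakeIndexBuildService) -> bytes:
    vectors_key = f"{segment}/vectors"
    index_key = f"{segment}/index"
    store.put(vectors_key, vector_bytes)          # stream vectors to the store
    service.create_index(vectors_key, index_key)  # call createIndex
    return store.get(index_key)                   # download the built index
```

The returned bytes are what Vector Engine would then write out as a segment file, the same as for a locally built index.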

New Components Definition

  1. Remote Index Build Component: This component will be responsible for handling the full flow of building the index remotely. It will not hold any business logic on when to use remote index build; it will only be a manager handling the flow. Lower-level details of the component (such as how many streaming threads to use, the upload chunk size, and which object store to interact with) will be added in upcoming proposals.
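
As a rough illustration of the knobs mentioned above (streaming threads and upload chunk size), the sketch below splits a vector blob into fixed-size parts and uploads them in parallel. The function names and the `upload_part` callback are hypothetical; the actual design is deferred to the low-level proposal.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def split_into_chunks(data: bytes, chunk_size: int) -> List[Tuple[int, bytes]]:
    """Split data into (part_number, chunk) pairs of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(offset // chunk_size, data[offset:offset + chunk_size])
            for offset in range(0, len(data), chunk_size)]


def upload_in_chunks(data: bytes, chunk_size: int,
                     upload_part: Callable[[int, bytes], None],
                     threads: int = 4) -> int:
    """Upload all parts using a thread pool; returns the number of parts."""
    parts = split_into_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() waits for every upload and re-raises the first failure.
        list(pool.map(lambda part: upload_part(*part), parts))
    return len(parts)
```

Part numbers let the receiving side reassemble the chunks in order regardless of which thread finishes first.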

Alternatives Considered

Use flat vector files stored in the segment (for remote store cases) rather than streaming vectors to and fro using the object store

This is an interesting alternative and would potentially avoid the heavy lifting of transferring vectors to the object store, but the approach has some feasibility challenges:

  1. Segment files of a Lucene directory do not get persisted to directory storage until the finish and close functions are called on the IndexInput. Ref1, ref2.
  2. Lucene serializes the flat vectors file in its own internal format, which can change. Taking a dependency on the Lucene flat vectors format in IndexBuildService would therefore create a version dependency between IndexBuildService and OpenSearch Vector Engine, which we really want to avoid so that IndexBuildService can evolve and release independently of OpenSearch.
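
To illustrate the second point, a transfer format that is independent of Lucene can be very simple. The layout below (a little-endian count and dimension header followed by raw float32 values) is a hypothetical example, not a format proposed by this RFC.

```python
import struct
from typing import List


def encode_vectors(vectors: List[List[float]]) -> bytes:
    """Encode as: uint32 count, uint32 dimension, then count*dimension float32."""
    dim = len(vectors[0]) if vectors else 0
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    header = struct.pack("<II", len(vectors), dim)
    body = b"".join(struct.pack(f"<{dim}f", *v) for v in vectors)
    return header + body


def decode_vectors(blob: bytes) -> List[List[float]]:
    """Inverse of encode_vectors."""
    count, dim = struct.unpack_from("<II", blob, 0)
    offset = 8  # size of the two uint32 header fields
    vectors = []
    for _ in range(count):
        vectors.append(list(struct.unpack_from(f"<{dim}f", blob, offset)))
        offset += 4 * dim
    return vectors
```

Because both sides agree only on this small layout, a change in Lucene's internal flat vectors format would not force a matching IndexBuildService release.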

Next Steps

  1. Low Level Design: A new RFC focusing on the LLD of the Remote Index Build Component proposed in this RFC.
  2. Remote Index Build Client: To integrate with the remote Vector Index Build Component/service and the object store, we will add a simple client abstraction with standard, defined APIs for integration.
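
A minimal sketch of what such a client abstraction might look like is shown here. The interface names, method names, and the job-id/status shape are assumptions made for illustration; the actual APIs will be defined in follow-up work.

```python
from abc import ABC, abstractmethod


class ObjectStoreClient(ABC):
    """Hypothetical interface for moving vectors and indices through a store."""

    @abstractmethod
    def upload(self, key: str, data: bytes) -> None:
        """Store data under key."""

    @abstractmethod
    def download(self, key: str) -> bytes:
        """Return the data stored under key."""


class IndexBuildClient(ABC):
    """Hypothetical interface for talking to a remote index build service."""

    @abstractmethod
    def create_index(self, vectors_key: str, index_key: str) -> str:
        """Submit a build request and return an opaque job id."""

    @abstractmethod
    def is_ready(self, job_id: str) -> bool:
        """Report whether the built index is available for download."""
```

Keeping the two interfaces separate would let a distribution swap the object store or the build service implementation independently.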

Open Questions

  1. In case of document replication, will replicas build their own vector indices? Since the number of segments can differ between primaries and replicas under document replication, replicas need to trigger their own index builds. But if replicas also use the index build service, the cost of the overall solution will increase. Hence I think index build acceleration should ideally be used with segment replication, though users will be free to use it with document replication too.
  2. What will be the notification mechanism by which OpenSearch Vector Engine learns that the index is ready for download? More details will be added as part of an upcoming RFC.

Appendix

  1. Meta Issue: [META]: Boosting Opensearch Vector Engine Performance using GPUs #2295
  2. Main RFC: RFC : Boosting OpenSearch Vector Engine Performance using GPUs
  3. Similar functionality request in k-NN from a user: [FEATURE] Is possible to add new KNNs using microservices? That broadens the KNNs that can be used as backends #1328
@navneet1v navneet1v added untriaged enhancement indexing indexing-improvements This label should be attached to all the github issues which will help improving the indexing time. and removed untriaged labels Nov 28, 2024
@navneet1v navneet1v moved this from Backlog to Backlog (Hot) in Vector Search RoadMap Nov 28, 2024
@navneet1v navneet1v self-assigned this Nov 28, 2024