
[RFC] Separation of compute & storage #14637

Open
bryanlb opened this issue Jul 3, 2024 · 7 comments
Labels: enhancement, Other, RFC, Roadmap:Modular Architecture, Storage

Comments

@bryanlb
Contributor

bryanlb commented Jul 3, 2024

Introduction

Modern distributed search engines like OpenSearch, Elasticsearch, and Apache Solr were not designed from the ground up to take full advantage of the cloud's elasticity, nor were they built to leverage building blocks that have become foundational in public cloud provider offerings, such as object storage.

Our proposal is to modify OpenSearch to adopt a cloud native architecture, separating compute from durability and storage. The durability of unindexed data would be provided by a persistent queue like Apache Kafka, and the storage for indexed data would be provided by object storage like Amazon S3.

This also enables alternate architectures, such as a deployment that keeps no hot tier of data nodes and instead uses a cold tier that streams results directly from object storage.

Goals

  • Cluster elasticity - using object storage and removing the need to replicate data between OpenSearch nodes makes it easy to scale up and scale down.
  • Selective resource allocation - compute resources can be selectively allocated depending on the nature of the workload, enabling additional query throughput, ingest capacity, or both as needed.
  • Increased scale - enable OpenSearch deployments to scale easily into the tens or hundreds of petabytes or more.
  • Increased cluster resilience - nodes can be quickly replaced without impacting the availability of the cluster.

Non-Goals

  • Reduced operating cost - while beneficial, this should not be seen as a primary motivation for adopting this architecture.

Proposed architecture

We propose moving from the existing cluster model architecture to a stateless node architecture, using an event stream / write ahead log for unindexed data, and using object storage as the durable storage.

Ingest nodes - accept bulk ingest requests and submit them to an event stream
Event stream - Apache Kafka, Apache Pulsar, etc., used as a write-ahead log
Indexing nodes - consume from the event stream to create indexes and upload them to object storage
Data nodes - fetch indexes from object storage and make them available for query. Optionally, they can stream indexes directly from object storage.
Coordinating nodes - perform scatter / gather across data nodes
Metadata store - Apache ZooKeeper, etc., used as a centralized store for node and index discovery
Manager node - manages operation of the cluster
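The write path these roles describe can be sketched end to end with in-memory stand-ins. All class and function names below are illustrative, not OpenSearch APIs; a real deployment would use Kafka/Pulsar consumers and an S3 client in place of the toy `EventStream` and `ObjectStore`:

```python
# Hypothetical, in-memory sketch of the proposed write path:
# ingest node -> event stream (WAL) -> indexing node -> object store.
from collections import deque

class EventStream:
    """Stands in for Kafka/Pulsar: a durable, ordered log of raw documents."""
    def __init__(self):
        self.log = deque()

    def append(self, doc):            # ingest nodes only append
        self.log.append(doc)

    def poll(self, max_records=100):  # indexing nodes consume in order
        batch = []
        while self.log and len(batch) < max_records:
            batch.append(self.log.popleft())
        return batch

class ObjectStore:
    """Stands in for S3: immutable segments keyed by name."""
    def __init__(self):
        self.objects = {}

    def put(self, key, segment):
        self.objects[key] = segment

def ingest(stream, docs):
    """Ingest node: accept a bulk request, append to the WAL, ack immediately.
    Durability comes from the event stream, not from peer replication."""
    for doc in docs:
        stream.append(doc)

def index_once(stream, store, segment_no):
    """Indexing node: drain a batch from the WAL, build an immutable
    segment, and upload it to object storage."""
    batch = stream.poll()
    if batch:
        store.put(f"segment-{segment_no}", batch)
    return len(batch)

stream, store = EventStream(), ObjectStore()
ingest(stream, [{"id": 1, "msg": "hello"}, {"id": 2, "msg": "world"}])
indexed = index_once(stream, store, segment_no=0)
```

Note that no node holds state another node depends on: if the indexing node dies mid-batch, a replacement can resume from the stream, which is what makes fast node replacement in the resilience goal possible.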

flowchart LR
  ingest(ingest node) -- event stream --> index 
  index(index node) --> objectstore
  objectstore{{object store}} --> datanode

  coord(coordinating node) <--> datanode
  datanode(data node)

Indexers and data nodes all communicate via a cluster manager and do not replicate any data between themselves.
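The corresponding read path can be sketched the same way, again with illustrative in-memory stand-ins rather than real OpenSearch interfaces: data nodes hydrate their assigned segments from object storage, and a coordinating node scatters the query and gathers the partial results:

```python
# Hypothetical read-path sketch. The object store is modeled as a plain
# dict of immutable segments; all names are illustrative.

object_store = {  # stands in for S3: key -> immutable segment
    "segment-0": [{"id": 1, "level": "error"}, {"id": 2, "level": "info"}],
    "segment-1": [{"id": 3, "level": "error"}],
}

class DataNode:
    def __init__(self, store, assigned_keys):
        # Fetch assigned segments from object storage up front; a cold-tier
        # node could instead stream them lazily per query.
        self.docs = [d for k in assigned_keys for d in store[k]]

    def search(self, predicate):
        return [d for d in self.docs if predicate(d)]

def coordinate(nodes, predicate):
    """Coordinating node: scatter the query to every data node, then
    gather and merge the partial results (here: a sort by doc id)."""
    hits = []
    for node in nodes:  # in practice these calls fan out in parallel
        hits.extend(node.search(predicate))
    return sorted(hits, key=lambda d: d["id"])

nodes = [DataNode(object_store, ["segment-0"]),
         DataNode(object_store, ["segment-1"])]
errors = coordinate(nodes, lambda d: d["level"] == "error")
```

Because segment assignment lives in the metadata store rather than on the nodes, a failed data node can be replaced by any fresh node that re-fetches the same segments.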

Discussion

  • Should ingest nodes be queryable? If they are not queryable this may necessitate the introduction of a real-time node that can make results available quicker, potentially also bypassing the event stream.
  • Should data nodes perform both hot and frozen searches? Should this responsibility be split into separate node types dedicated to their respective functions?
  • In this approach some mapping issues may not be discovered at ingest, and only caught during indexing. Would this be a problem for most users, and how should this be handled?

Summary

We believe moving towards a stateless node architecture will enable operators of OpenSearch deployments to more quickly adapt to changing workload requirements, improve cluster resource utilization, and enable scaling to larger deployments.

References

Slack Astra Search Engine - https://slackhq.github.io/astra/architecture.html#system-overview
The Snowflake Elastic Data Warehouse - https://dl.acm.org/doi/10.1145/2882903.2903741

Proposal co-authored by @vthacker and @bryanlb for @slackhq

@Pallavi-AWS
Member

Thanks @bryanlb for creating this RFC on stateless node architecture. We will join hands with the work going on for reader/writer separation under #14596 (cc: @sohami @andrross @mch2 @getsaurabh02 @msfroh)

@getsaurabh02
Member

getsaurabh02 commented Jul 3, 2024

Thanks @bryanlb for starting this RFC. It's a well-structured proposal, highlighting the significant benefits of separating compute and storage. It aligns with the Reader and Writer Separation RFC, which also advocates for dedicated node roles, moving us in the same direction.

The high-level goals, such as traffic segregation, separation of concerns for resilience, and independent scalability, are substantial. The ability to scale independently adds significant value from an infrastructure perspective, allowing the use of heterogeneous instance types for different node roles. Additionally, this architecture enables us to tackle more complex problems going forward, such as implementing independent sharding schemes for readers and writers based on traffic patterns (or shard heat). Post-processing tasks like creating rollups or high-level pre-computed caches/indices for improved read performance could also be performed in better isolation.

The use of object storage for indexed data and a persistent queue like Apache Kafka for unindexed data ensures durability and scalability. It also addresses today's indexing-scale problems. With a pull-based indexing approach, we can dynamically allocate resources based on workload characteristics, which will help handle varying query loads and ingest rates.

Furthermore, revamping the metadata store should be broadly considered in both proposals. It's also an opportunity to segregate the cluster state with more concise and relevant information based on node roles.

One thing to consider is the potential increase in read-after-write latency, especially when fetching indexes from object storage. It might be worth thinking about what strategies we can employ to optimize the performance of real-time queries in this new architecture.

@kogent

kogent commented Jul 3, 2024

One thing to consider is the potential increase in read-after-write latency, especially when fetching indexes from object storage. It might be worth thinking about what strategies we can employ to optimize the performance of real-time queries in this new architecture.

I think that is called out in this question from the proposal:

Should ingest nodes be queryable? If they are not queryable this may necessitate the introduction of a real-time node that can make results available quicker, potentially also bypassing the event stream.

@getsaurabh02
Member

Adding @msfroh @andrross @sohami @mch2 for their feedback/comments.

@peternied
Member

[Triage - attendees 1 2 3 4 5]
@bryanlb Thanks for creating this RFC, looking forward to seeing how this resolves.

@reta
Collaborator

reta commented Jul 10, 2024

With #9065 (currently in progress), the OpenSearch core would provide request / response streaming out of the box (it is already available as an experimental feature). That said, it is entirely feasible today to build a plugin (deployed on an index node) that streams documents to the object store (or essentially anywhere).

@linuxpi
Collaborator

linuxpi commented Jul 25, 2024

[Storage Triage - attendees 1 2 3 4 5 6 7 8]

@bryanlb Thanks for creating this issue. Please feel free to add more details and reach out to folks to collaborate, and we'll see how this unfolds.
