
[RFC] Separation of compute & storage #14637

Open
bryanlb opened this issue Jul 3, 2024 · 7 comments
Labels: enhancement, Other, RFC, Roadmap:Modular Architecture, Storage

Comments

@bryanlb
Contributor

bryanlb commented Jul 3, 2024

Introduction

Modern distributed search engines like OpenSearch, Elasticsearch, and Apache Solr were not designed from the ground up to take full advantage of the cloud's elasticity, nor were they built to leverage building blocks that have become foundational in public cloud provider offerings, such as object storage.

Our proposal is to modify OpenSearch to adopt a cloud native architecture, separating compute from durability and storage. The durability of unindexed data would be provided by a persistent queue like Apache Kafka, and the storage for indexed data would be provided by object storage like Amazon S3.

This also enables alternate architectures, such as a deployment that keeps no hot tier of data nodes and instead uses a cold tier that streams results directly from object storage.

Goals

  • Cluster elasticity - using object storage and removing the need to replicate data between OpenSearch nodes makes it easy to scale up and scale down.
  • Selective resource allocation - compute resources can be selectively allocated depending on the nature of the workload, enabling additional query throughput, ingest capacity, or both as needed.
  • Increased scale - enable OpenSearch deployments to scale easily into the tens or hundreds of petabytes or more.
  • Increased cluster resilience - nodes can be quickly replaced without impacting the availability of the cluster.

Non-Goals

  • Reduced operating cost - while beneficial, this should not be seen as a primary motivation for adopting this architecture.

Proposed architecture

We propose moving from the existing cluster model architecture to a stateless node architecture, using an event stream / write ahead log for unindexed data, and using object storage as the durable storage.

Ingest nodes - accept bulk ingest requests and submit them to an event stream
Event stream - Apache Kafka, Apache Pulsar, etc., used as a write-ahead log
Indexing nodes - consume from the event stream to create indexes and upload them to object storage
Data nodes - fetch indexes from object storage and make them available for query. Optionally, they can stream indexes directly from object storage.
Coordinating nodes - perform scatter / gather across data nodes
Metadata store - Apache ZooKeeper, etc., used as a centralized store for node and index discovery
Manager node - manages operation of the cluster
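The write path these roles describe can be sketched end to end with in-memory stand-ins. All class and function names below are illustrative, not OpenSearch APIs; a real deployment would use Kafka/Pulsar consumers and an S3 client in place of the toy `EventStream` and `ObjectStore`:

```python
# Hypothetical, in-memory sketch of the proposed write path:
# ingest node -> event stream (WAL) -> indexing node -> object store.
from collections import deque

class EventStream:
    """Stands in for Kafka/Pulsar: a durable, ordered log of raw documents."""
    def __init__(self):
        self.log = deque()

    def append(self, doc):            # ingest nodes only append
        self.log.append(doc)

    def poll(self, max_records=100):  # indexing nodes consume in order
        batch = []
        while self.log and len(batch) < max_records:
            batch.append(self.log.popleft())
        return batch

class ObjectStore:
    """Stands in for S3: immutable segments keyed by name."""
    def __init__(self):
        self.objects = {}

    def put(self, key, segment):
        self.objects[key] = segment

def ingest(stream, docs):
    """Ingest node: accept a bulk request, append to the WAL, ack immediately.
    Durability comes from the event stream, not from peer replication."""
    for doc in docs:
        stream.append(doc)

def index_once(stream, store, segment_no):
    """Indexing node: drain a batch from the WAL, build an immutable
    segment, and upload it to object storage."""
    batch = stream.poll()
    if batch:
        store.put(f"segment-{segment_no}", batch)
    return len(batch)

stream, store = EventStream(), ObjectStore()
ingest(stream, [{"id": 1, "msg": "hello"}, {"id": 2, "msg": "world"}])
indexed = index_once(stream, store, segment_no=0)
```

Note that no node holds state another node depends on: if the indexing node dies mid-batch, a replacement can resume from the stream, which is what makes fast node replacement in the resilience goal possible.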

flowchart LR
  ingest(ingest node) -- event stream --> index 
  index(index node) --> objectstore
  objectstore{{object store}} --> datanode

  coord(coordinating node) <--> datanode
  datanode(data node)

Indexers and data nodes all communicate via a cluster manager and do not replicate any data between themselves.
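The corresponding read path can be sketched the same way, again with illustrative in-memory stand-ins rather than real OpenSearch interfaces: data nodes hydrate their assigned segments from object storage, and a coordinating node scatters the query and gathers the partial results:

```python
# Hypothetical read-path sketch. The object store is modeled as a plain
# dict of immutable segments; all names are illustrative.

object_store = {  # stands in for S3: key -> immutable segment
    "segment-0": [{"id": 1, "level": "error"}, {"id": 2, "level": "info"}],
    "segment-1": [{"id": 3, "level": "error"}],
}

class DataNode:
    def __init__(self, store, assigned_keys):
        # Fetch assigned segments from object storage up front; a cold-tier
        # node could instead stream them lazily per query.
        self.docs = [d for k in assigned_keys for d in store[k]]

    def search(self, predicate):
        return [d for d in self.docs if predicate(d)]

def coordinate(nodes, predicate):
    """Coordinating node: scatter the query to every data node, then
    gather and merge the partial results (here: a sort by doc id)."""
    hits = []
    for node in nodes:  # in practice these calls fan out in parallel
        hits.extend(node.search(predicate))
    return sorted(hits, key=lambda d: d["id"])

nodes = [DataNode(object_store, ["segment-0"]),
         DataNode(object_store, ["segment-1"])]
errors = coordinate(nodes, lambda d: d["level"] == "error")
```

Because segment assignment lives in the metadata store rather than on the nodes, a failed data node can be replaced by any fresh node that re-fetches the same segments.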

Discussion

  • Should ingest nodes be queryable? If they are not queryable this may necessitate the introduction of a real-time node that can make results available quicker, potentially also bypassing the event stream.
  • Should data nodes perform both hot and frozen searches? Should this responsibility be split into separate node types dedicated to their respective functions?
  • In this approach some mapping issues may not be discovered at ingest, and only caught during indexing. Would this be a problem for most users, and how should this be handled?

Summary

We believe moving towards a stateless node architecture will enable operators of OpenSearch deployments to more quickly adapt to changing workload requirements, improve cluster resource utilization, and enable scaling to larger deployments.

References

Slack Astra Search Engine - https://slackhq.github.io/astra/architecture.html#system-overview
The Snowflake Elastic Data Warehouse - https://dl.acm.org/doi/10.1145/2882903.2903741

Proposal co-authored by @vthacker and @bryanlb for @slackhq

@Pallavi-AWS
Member

Thanks @bryanlb for creating this RFC on stateless node architecture. We will join hands with the work going on for reader/writer separation under #14596 (cc: @sohami @andrross @mch2 @getsaurabh02 @msfroh)

@getsaurabh02
Member

getsaurabh02 commented Jul 3, 2024

Thanks @bryanlb for starting this RFC. It's a well-structured proposal, highlighting the significant benefits of separating compute and storage. It aligns with the Reader and Writer Separation RFC, which also advocates for dedicated node roles, moving us in the same direction.

The high-level goals, such as traffic segregation, separation of concerns for resilience, and independent scalability, are substantial. The ability to scale independently adds significant value from an infrastructure perspective, allowing the use of heterogeneous instance types for different node roles. Additionally, this architecture enables us to tackle more complex problems going forward, such as implementing independent sharding schemes for readers and writers based on traffic patterns (or shard heat). Post-processing tasks like creating rollups or high-level pre-computed caches/indices for improved read performance could also be performed in better isolation.

The use of object storage for indexed data and a persistent queue like Apache Kafka for unindexed data ensures durability and scalability. It also addresses today's indexing-scale problems. With a pull-based indexing approach, we can dynamically allocate resources based on workload characteristics, which will help handle varying query loads and ingest rates.

Furthermore, revamping the metadata store should be broadly considered in both proposals. It's also an opportunity to segregate the cluster state with more concise and relevant information based on node roles.

One thing to consider is the potential increase in read-after-write latency, especially when fetching indexes from object storage. It might be worth thinking about what strategies we can employ to optimize the performance of real-time queries in this new architecture.

@kogent

kogent commented Jul 3, 2024

One thing to consider is the potential increase in read-after-write latency, especially when fetching indexes from object storage. It might be worth thinking about what strategies we can employ to optimize the performance of real-time queries in this new architecture.

I think that is called out in this question from the proposal:

Should ingest nodes be queryable? If they are not queryable this may necessitate the introduction of a real-time node that can make results available quicker, potentially also bypassing the event stream.

@getsaurabh02
Member

Adding @msfroh @andrross @sohami @mch2 for their feedback/comments.

@peternied
Member

[Triage - attendees 1 2 3 4 5]
@bryanlb Thanks for creating this RFC, looking forward to seeing how this resolves.

@reta
Collaborator

reta commented Jul 10, 2024

With #9065 (currently in progress), the OpenSearch core would provide request / response streaming out of the box (it is already available as an experimental feature). That said, it is entirely feasible today to build a plugin (deployed on an index node) that streams documents to the object store (or essentially anywhere).

@linuxpi
Collaborator

linuxpi commented Jul 25, 2024

[Storage Triage - attendees 1 2 3 4 5 6 7 8]

@bryanlb Thanks for creating this issue. Please feel free to add more details and reach out to folks to collaborate, and we'll see how this unfolds.
