-
Notifications
You must be signed in to change notification settings - Fork 0
Pravega Multi Region Support
The goal is to have the data in streams created in one region to be read (typically via batch client) in another region.
A region can be viewed as a logical representation of a geographic locality with stable high bandwidth low latency connectivity within a region. A region should not be confused with a datacenter. We will define a region in Pravega as specific logical grouping such that we can have both possible scenarios:
- Multiple regions can be created within the same datacenter.
- One region can span multiple datacenters.
Following are the properties that uniquely describe a logical Pravega region:
- Each region should be identified by a unique region identifier.
- Each segment store cluster will belong to exactly one region.
- A region may contain one or more segment store clusters.
- Each region may have exactly one RW segment store cluster.
- Each region may host multiple RO segment store clusters.
It is out of scope for this discussion how the stream data in tier-2 is made available across different regions. For the scope of this discussion we will assume that the data persisted in tier-2 in one region is available across different regions. There could be multiple ways in which it could have been achieved – for example, shared tier-2, and tier-2 native geo-replication etc. to name a few.
In this document we will focus on Pravega components and how they could be set up to take advantage of tier-2 data availability across multiple regions.
Another constraint under which we will discuss proposals is with respect to stable (including low bandwidth, high latency) network connectivity between two regions. It will be assumed that a break in connectivity shall impact availability of Pravega services for cross region access to streams.
- Enable access to stream data in different regions such that clients access stream's data via a Pravega cluster in their proximity.
- Provide applications seamless ability to work with data from multiple regions including writing to streams within the local region.
- Writing data into different region is not in scope for this design. Applications may choose to perform remote writes but it is not our objective to facilitate it.
To achieve these we will describe how we can set up a combination of Read-Write and Read-Only Pravega clusters in different datacenters where tier-2 data is available such that readers can optimally access stream data for both batch and stream processing and the result of processing can be seamlessly ingested into Pravega.
Before we proceed with the stated objectives, we shall first look at different types of Pravega clusters (Read Write and Read Only).
There are two types of Pravega deployments that could enable data access – Read Write (RW) Pravega clusters and Read Only (RO) Pravega clusters.
Each Pravega deployment requires a mandatory segment store cluster and a mandatory controller cluster. Segment Store cluster is used to manage data in tier-2. Controller cluster is used to manage metadata in zookeeper.
As the name suggests, read-write clusters allow applications to create streams and write data into them. Whereas read-only clusters strictly allow only reads of metadata in zookeeper and data in tier-2 via Pravega data access semantics. Note: The read only Pravega clusters can be used to access the stream data for both batch clients and streaming clients. From Pravega's perspective there should be no distinction in its ability to serve data for either batch or stream clients.
We assume that the reader is familiar with regular Pravega clusters, which will be referred to as RW Pravega clusters as they allow both read and write access to metadata and data. We will present a quick recap of Pravega Services that comprise a Pravega cluster and illustrate differences between RO and RW clusters:
- Segment store cluster : comprises of segment store nodes that have ownership of segment containers distributed amongst themselves. Segment store is configured to access a tier-2 where all data is stored.
- Controller cluster : comprises of stateless controller nodes and a zookeeper where all state is stored.
A segment store cluster can be read only or read write cluster:
- The RW SSS cluster has access to tier-1 and is allowed to write data to tier-2.
- The RO SSS cluster do not require tier-1 and should be configured to only read data from tier-2.
A controller cluster can be read only or read write cluster:
- A RW controller cluster has complete ownership of stream metadata stored in zookeeper and can perform metadata updates.
- A RO controller cluster is only allowed to read metadata from zookeeper.
As the stated objective, the scenario we want to support is that our reader clients (batch and stream) should be able to access the stream data from a local cluster if the data is available locally. To achieve this, we can deploy RO Pravega clusters in regions where the data in tier-2 is available (either via shared tier-2 or replicated) and redirect clients to the cluster closest to them.
The mechanism, as it exists presently, is that clients rely on Controller service to query about stream segments and get redirected to the correct segment store that can access the underlying data in tier-2.
Segment store essentially performs these broad tasks for applications:
- Manage lifecycle of segments
- Segments are mapped to segment containers and each segment can be accessed by only one segment container per segment store cluster.
- Segment containers manage lifecycle of segments and allow creation or deletion of segments.
- Manage data per segment
- At a logical level all data is persisted in tier-2 and can be read back from tier-2 (writing to tier-1 and reading from in-memory cache is engineering optimization).
The difference between a RW segment store and a RO segment store is their ability to manage lifecycle and data within segments. These modifications to segments are persisted in tier-2. So logically RW segment store has the ability to modify content in tier-2 whereas RO segment store only has ability to access what is present in tier-2. Note: There could be multiple RO segment store clusters pointing to same data within tier-2 and each can independently read the data with impacting others. However, there should only be one RW segment store cluster for any data in tier-2, otherwise we can end up with inconsistent data.
So if tier-2 data is available across different regions, it is fairly straightforward to set up two different segment store clusters in these regions and configure them to use the same data (or replicated copy of the same data) in their respective tier-2.
The only constraint we should impose is to restrict exactly one RW segment store cluster for each independent set of streams that we want to manage in our Pravega deployment.
This will allow applications to write data into a stream by only using the RW segment store cluster. However, for reading data within segments, we would allow use of both RW and RO segment store clusters and applications could be redirected to the cluster in their proximity to provide data locality for improved performance.
There are two fundamental functions performed by the Controller Service that are necessary for the client's functionality. The Controller Service:
- manages information about stream segments
- Controller manages all stream specific metadata via Stream-Metadata-Store abstraction
- Zookeeper is used as the store.
- Clients query controller to get
Segments
for the stream.
- Controller manages all stream specific metadata via Stream-Metadata-Store abstraction
- manages information about segment containers ownership
- Controller manages all segment container ownership via HostStore abstraction.
- Controller uses zookeeper to manage segment store cluster and listens for cluster changes and redistributes segment container ownership accordingly.
- Clients request controller to get
Segment URI
for any segment and are redirected to appropriate segment store node for accessing the said segment.
- Controller manages all segment container ownership via HostStore abstraction.
So far, the only difference between an RO controller and an RW controller cluster is that one is allowed to update the metadata while other is not. We can bring up two controller clusters, each sharing the same root node in zookeeper for Stream-Metadata-Store and different roots for Host-Store. This will allow them to share the metadata while managing independent segment store nodes for allowing access to data.
From here on, extending this set up to multiple regions is straightforward. The stream metadata store (zookeeper) is the source of truth about streams and there should be exactly one true copy of this metadata with only the RW cluster being allowed to modify it. This metadata should be accessible for RO purposes from different regions. For our discussion, we will assume single true copy of this metadata owned by RW Pravega cluster and accessed in different regions over a stable network link. Even with low bandwidth, high latency, we can have observer zookeeper nodes that cache the zookeeper state locally for quick RO access.
The other part of controller's responsibility, abstracted out of stream metadata and presented as HostStore abstraction, is to redirect client to appropriate segment store node. The host-store performs the cluster management using zookeeper. This is used to manage segment store nodes which register themselves with the said zookeeper and controller registers for cluster change notifications for segment store nodes. As nodes are added or removed from the cluster, controller updates the segment containers across available nodes. Clients are redirected to the correct segment store node for a given segment based on the owner segment container for the requested segment.
At a broad level we have two choices to make while designing this. Do we deal with a single Pravega deployment that is spread across multiple regions or do we deal with multiple independent Pravega deployments spread across regions with RO Pravega clusters set up for local access within regions.
In first approach, we will have to redefine what constitutes a Pravega deployment. It will essentially mean single controller cluster shared between multiple segment store clusters across different regions.
In second, we will have independent self-contained Pravega deployments per region each with their own set of streams and segment store and controller clusters. This will require minimal changes to our existing controller and segment store implementations. We will need to maintain an external index to allow applications access to streams from multiple such clusters by doing index lookup to get directed to appropriate controller cluster.
Let us take an example of two regions DC1 in region1 in and DC2 in region2.
Administrator has set up his deployment such that tier-2 in DC1 is replicated/available in DC2.
- Application creates Stream S11, S12 in DC1 and S2 in DC2.
- Stream of data is ingested into S11 in DC1.
- Stream of data is ingested into S12 in DC1.
- Application reads data from S11 in DC2 for batch processing and writes its result into S2 in DC2.
- Application has a reader-group for real time processing of data from stream S12 and S2 in DC2.
This translates into following requirements:
- Ability to support multiple RW segment store clusters in different regions.
- Ability to support RO segment store cluster in remote regions.
- Ability for applications to write back result of processing into stream.
- We want to support reader groups with streams created in different regions.
Single Pravega deployment spread across multiple regions.
In this approach, we shall have a single controller service that is shared by all segment store clusters deployed across multiple regions. A common controller cluster will be shared between and will manage all these segment store clusters. Note: controller instances are stateless and the state is stored in zookeeper. So when we are speaking of a shared controller cluster, we are essentially referring to shared zookeeper.
A Pravega deployment may span across multiple such regions with each region containing one or more segment store clusters while sharing a single common controller + zookeeper deployment.
The client and controller will need to include "region-id" as first class primitive in their interactions henceforth.
Controller metadata changes:
- All segment store clusters will be registered in zookeeper under the region-id of the region where they are hosted.
- Each registered segment store cluster will also have a type (RO or RW).
- Each RO segment store cluster's registration will also have a reference to its corresponding RW cluster (potentially in a different region).
- RW segment store clusters: /host-region/RW/
- RO segment store clusters: /host-region/RO/reference-region/
- Each stream will have a "region" associated with it. This region identifies region-id of the region that hosts RW segment store cluster where the stream is created.
- Stream's fully qualified identifier will be "region/scope/stream" triple.
Controller API changes:
- Client requests to controller will require additional information like region (application's region) and purpose (read or write).
- For example
controller.getSegments(application-region, "region/scope/stream", read)
.
- For example
Controller will look up the stream metadata information using Stream-Metadata-Store and then based on application's region and intent, redirect client to either RW segment store for writes and RO segment store cluster in the application's region for reads.
It is important to note that in this approach, all streams created across all data-centers will have a single store for truth. This means applications just need to know about a single controller service which will allow them ability to read from or write to any stream in the multi-region Pravega deployment.
Let us revisit the example and see how it will work out in this approach.
Administrator has set up his deployment such that tier-2 in DC1 is replicated/available in DC2.
- Application creates Stream S11, S12 in DC1 and S2 in DC2.
- Administrator sets up RO segment store cluster in DC2 that points to data from DC1.
- Controller metadata store:
- Stream id: DC1/S11
- Stream id: DC1/S12
- Stream id: DC2/S2
- Controller host store:
- DC1/RW/segment-stores
- DC2/RW/segment-stores
- DC2/RO/DC1/segment-stores
- Application reads data from S11 in DC2 for batch processing and writes its result into S2 in DC2.
- Application -> controller.getSegments(DC2, DC1/S11, read)
- Returns URI under znode DC2/RO/DC1/segment-store // choose cluster in application's region
- Application -> controller.getSegments(DC2, DC2/S2, write)
- Returns URI under znode DC2/RW/segment-store // ignore application's region.Choose cluster with RW for region
- Application has a reader-group for real time processing of data from stream S12 and S2 in DC2.
- Readergroup stream id: DC2/_rg
- Readergroup -> controller.getSegments(DC2, DC1/S12, read)
- Returns URI under znode DC2/RO/DC1/segment-store
- Readergroup -> controller.getSegments(DC2, DC2/S2, read)
- Returns URI under znode DC2/RW/segment-store
Advantages:
- Single controller URI that clients can continue to use for seamless access to all stream across regions.
- Simple topology to deploy.
- RO segment store simply registers against same controller cluster.
- Minor changes in controller to update HostStore abstraction to support multiple regions.
Disadvantages:
- This approach works well only for high bandwidth and low latency stable network connectivity between multiple regions.
- Segment store cluster lifecycle management for clusters from remote regions.
- Remote writes to zookeeper metadata will be costly.
- Significant changes to controller service:
- Metadata schema and APIs need to be updated to include region information.
- Client needs to be aware of its local region.
- Controller client to query controller using region specific APIs.
- Backward incompatible changes
- Imposes artificial restrictions like at-most
one
RW segment store cluster per region (we can improve on this by adding slight complexity to our metadata schema).
Dedicated Controller (+ Zookeeper service) per region
In this approach each Pravega deployment will be a full-fledged deployment of Pravega services (controller cluster and segment store cluster) with a dedicated zookeeper. This effectively makes each cluster completely independent of other clusters in same or different regions with only implicit relationship of RO Pravega clusters on their corresponding RW cluster.
So in each region, there can be multiple RW Pravega clusters and RO Pravega clusters referencing data from different regions.
Each RW Pravega cluster will comprise of:
- a RW segment store cluster
- configured with a tier-1 and tier-2
- configured to use the cluster local zookeeper deployment
- a RW controller cluster is configured with
- Stream-Metadata-Store configured to use cluster local zookeeper deployment
- HostStore configured to use cluster local zookeeper deployment
Each RO Pravega cluster will comprise of:
- a RO segment store cluster configured to read from tier-2
- a RO controller cluster is configured with
- Stream-Metadata-Store configured with observer Zookeeper node (part of zookeeper cluster in corresponding RW cluster in different region)
- HostStore configured with a local zookeeper cluster
So an application can access all streams managed by each independent cluster as long as it knows the controller URI. Depending on the cluster it is accessing, it will either have RO or RW access to streams in the cluster.
However, an application needs an ability to know which controller cluster to access based on stream identifier. For this we may create an index which stores a mapping of region-id to controller URI. We will have a separate index per region.
Index mapping for region to controller-URI will be a key value pair: <region-id, controller-URI>
The index has to be set up statically by the administrator. Each time a new Pravega cluster is created within a region, the administrator should configure the cluster correctly and update the index appropriately.
Index can be a separate logical service endpoint which can use zookeeper as its key value store. Index Endpoint can be hosted as a new service endpoint within controller service from RW Pravega cluster in the region.
Salient features of such a deployment:
- Applications can write into Pravega using region local RW Pravega cluster.
- Applications can read from Pravega using region local RO Pravega cluster for reading data in streams from remote regions.
So following will be typical flow of events:
- Application will extract region from stream-id and perform index look up to get controller URI for the region-id.
- If region-id is same as local region, then access will be to RW Pravega cluster.
- If region-id is for a remote region, then access will be to RO Pravega cluster.
- If the application wishes to access a remote region for write access OR access or remote region for read for which RO cluster is not set up in its local region, then it is out of scope for Pravega to supply remote controller URIs.
- Access controller to get stream metadata and segment store URI.
- Controller will always redirect clients to segment store nodes from its local cluster.
Let us revisit the example and see how it will work out in this approach.
Administrator has set up his deployment such that tier-2 in DC1 is replicated/available in DC2.
- Administrator sets up RO segment store cluster in DC2 that points to data from DC1.
- Controller hosted at RO-controller-URI
- Administrator adds a new index in DC1
- <DC1, RW-controller-URI>
- Administrator adds a new index in DC2
- <DC2, RW-controller-URI>
- <DC1, RO-controller-URI>
- Application creates Stream S11, S12 in DC1 and S2 in DC2.
- Application reads data from S11 in DC2 for batch processing and writes its result into S2 in DC2.
- Application -> getSegments(DC1/S11)
- Lookup index (DC1).getSegements(DC1/S11)
- Application -> controller.getSegments(DC2/S2)
- Lookup index (DC2).getSegements(DC2/S2)
- Application has a reader-group for real time processing of data from stream S12 and S2 in DC2.
- Readergroup stream id: DC2/_rg
- Readergroup -> lookup index(DC1).getSegments(DC1/S12)
- Readergroup -> lookup index(DC2).getSegments(DC2/S2)
Advantages:
- Cluster lifecycle management is done using local zookeeper.
- Zookeeper reads and writes are typically local.
- Each Pravega cluster is independent. Scoped Stream names can be duplicated.
- Minor changes in controller to support RO segment store and separate ZK URI configurations for streamStore and hostStore.
- We can use controller and segment store and client with minimal changes to API or flows.
Disadvantages:
- Introduction of a new concept of index.
- Additional steps involved in setting up RO Pravega clusters against remote regions:
- Each time a new RO Pravega cluster is set up, it will need to include one or more zk observer nodes pointing to zk cluster in original region.
- An entry needs to be added to the index for this region.
- Client code changes:
- needs to support multiple controller URIs
Pravega - Streaming as a new software defined storage primitive
- Contributing
- Guidelines for committers
- Testing
-
Pravega Design Documents (PDPs)
- PDP-19: Retention
- PDP-20: Txn Timeouts
- PDP-21: Protocol Revisioning
- PDP-22: Bookkeeper Based Tier-2
- PDP-23: Pravega Security
- PDP-24: Rolling Transactions
- PDP-25: Read-Only Segment Store
- PDP-26: Ingestion Watermarks
- PDP-27: Admin Tools
- PDP-28: Cross Routing Key Ordering
- PDP-29: Tables
- PDP-30: Byte Stream API
- PDP-31: End-to-End Request Tags
- PDP-32: Controller Metadata Scalability
- PDP-33: Watermarking
- PDP-34: Simplified-Tier-2
- PDP-35: Move Controller Metadata to KVS
- PDP-36: Connection Pooling
- PDP-37: Server-Side Compression
- PDP-38: Schema Registry
- PDP-39: Key-Value Tables Beta 1
- PDP-40: Consistent Order Guarantees for Storage Flushes
- PDP-41: Enabling Transport Layer Security (TLS) for External Clients
- PDP-42: New Resource String Format for Authorization
- PDP-43: Large Events
- PDP-44: Lightweight Transactions
- PDP-45: Health Check
- PDP-46: Read Only Permissions For Reading Data
- PDP-47: Pravega Message Queues
- PDP-48: Key-Value Tables Beta 2
- PDP-49: Segment Store Admin Gateway
- PDP-50: Stream Tags
- PDP-51: Segment Container Event Processor
- PDP-53: Robust Garbage Collection for SLTS