PDP 43 (Large Events)
Status: POSTPONED.
See PDP-43A for an alternative.
Discussion and comments in Issue 5056.
Table of Contents:
- Motivation
- Requirements
- Other Considered Approaches
- Design
- Client
- Segment Store Changes
- Client and Wire Protocol Changes
The current maximum Stream Event size is 8MB. This is adequate for the vast majority of Streaming use cases that Pravega encounters, but a few require more. Video streaming devices, for example, may need more than 8MB per atomic write: an uncompressed 4K frame has just under 8.3M pixels, each taking more than 4 bytes, and a video compressor may write multiple frames at once, with the total write size easily exceeding 100MB. There are other use cases out there, but this one is the most relatable to the world of Event Streaming.
After PDP 43 is implemented, Pravega should:
- Increase the maximum Event size from 8MB to 1024MB.
- Future increases may be allowed in subsequent releases, but 1024MB is a good starting point.
- It is OK to add new APIs for such Events, as long as backwards compatibility is preserved for existing code (and Events less than 8MB).
- Large Events should match the ingress/egress performance of writing/reading the same amount of data using the classic (8MB max) API.
- Example. Appending a 1GB Event should have comparable throughput to appending 128 8MB Events in short sequence from the same Client Application/Writer. Similarly on the read side.
- There should not be any added memory pressure on the client. The Pravega Client should not buffer the whole event in the host application memory nor force the calling code to do so.
- This is not possible with the current Client API, so it is acceptable to provide a slightly modified API to support large events. The current API must remain backwards compatible, with no breaking contract changes.
Out of scope:
- Large appends will not be supported across scaling boundaries. If an Append is initiated on Segment A, and Segment A is then sealed (as a result of a split/merge), then the Append should be aborted and retried by the calling code (the user).
- Cross-scaling support may be added in a future Pravega release.
- Large appends will not be supported on Transactions. This may be done in a future release.
One considered approach would take the entire Large Event from the user code and send it (as one Append) to the Server. Seemingly simple, this approach is actually infeasible for a number of reasons:
- We cannot buffer the whole Large Append on the Client side (this breaks Requirement 3).
- While we may be able to buffer one such Large Append on the Server side, this approach will not scale - a few concurrent Large Appends will quickly overrun the memory capacity of the Segment Store.
- The Segment Store buffers an Append until all of its components are received (the Client may choose to split that Append into smaller pieces for transmission).
- Transmission problems. The maximum frame size we can send over the Wire Protocol is 16MB. We are therefore forced to split the Append into smaller pieces and send them individually. Any disconnect will force us to restart the whole process.
Another considered approach, based on Stream Transactions, would work in the following way:
- When a Large Event is about to be appended, a Transaction is opened (using the Controller).
- We split the Event into smaller chunks (8MB max each) and append them to the Transaction using the same Routing Key as the Large Append.
- When we are finished, we commit the Transaction (a sketch of this workflow follows the pros and cons below).
Pros:
- Uses existing infrastructure.
- Low cost of development.
- Is not affected by scaling events.
- Transactions are atomic and their contents (per Routing Key) are added as a contiguous block to the target Segment (so the Large Event could be reconstructed).
Cons:
- Each individual append is an Event in itself, with Envelope Header included. Readers will need special handling to differentiate this from regular Events and compose the Large Event.
- The application code itself would be responsible for breaking up the Large Event into smaller chunks and then recomposing it.
- Stream Transactions incur noticeable costs in terms of resources and latency (due to their distributed nature). They are not appropriate for the high frequency required for Large Events and would significantly drag down system performance if used.
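For reference, a minimal sketch of what this Transaction-based chunking could look like from application code, using the existing `TransactionalEventStreamWriter` API (how the writer itself is created is omitted); the 8MB chunk size and the chunk/reassembly convention are left entirely to the application, which is exactly the drawback called out above:

```java
import io.pravega.client.stream.Transaction;
import io.pravega.client.stream.TransactionalEventStreamWriter;
import io.pravega.client.stream.TxnFailedException;
import java.nio.ByteBuffer;

// Splits a Large Event into <=8MB chunks and writes them in a single Stream Transaction, all with
// the same Routing Key so they land contiguously on the target Segment. Reassembly on the read
// side is still the application's problem, which is one of the cons listed above.
final class TxnChunkingSketch {
    private static final int CHUNK_SIZE = 8 * 1024 * 1024;   // current 8MB Event limit

    static void writeLargeEvent(TransactionalEventStreamWriter<ByteBuffer> writer,
                                String routingKey, ByteBuffer largeEvent) throws TxnFailedException {
        Transaction<ByteBuffer> txn = writer.beginTxn();
        try {
            while (largeEvent.hasRemaining()) {
                ByteBuffer chunk = largeEvent.slice();
                chunk.limit(Math.min(chunk.remaining(), CHUNK_SIZE));
                largeEvent.position(largeEvent.position() + chunk.remaining());
                txn.writeEvent(routingKey, chunk);   // each chunk becomes its own Event (with Envelope)
            }
            txn.commit();
        } catch (TxnFailedException e) {
            txn.abort();
            throw e;
        }
    }
}
```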
We propose using a new concept called Micro Transactions.
This idea is inspired by the regular Stream Transaction workflow, however we can make certain improvements to eliminate the major pain points. We build upon the fact that a Large Append will go to a single Segment (and we know that Segment since we know the Routing Key). As such, we do not need to create an entire Stream Transaction using the Controller - the transaction can be localized at the Stream Segment level. In addition, by making certain tweaks to the internals of the Client, we can ensure that, even if we split the Large Event into smaller chunks, those chunks will not appear as individual Events within the Stream.
This approach would work as follows.
- These are per-segment transactions. The Segment Store creates and manages them at the request of the Client. Each Micro Transaction is implemented by a Transient Segment.
- Workflow
- Create Transient Segment for Segment S (request received via Wire Protocol).
- Segment Store generates name TS as a function of S (e.g., by appending some UUID to the name). It registers TS in the in-memory metadata (not in the Container Segment Metadata).
- Keeping it in-memory only is sufficient (a record of it still exists in Tier 1). Since we do not expect this Segment to be around for long, there is no need to store it in the Container Segment Metadata (avoiding that overhead).
- TS behaves like any other Segment: it can be written to, read from, have attributes, and be merged into other Segments.
- TS will have a maximum of 4 Extended Attributes (down from unlimited for regular Segments). Since a single Writer Id is expected to write to it, this should be sufficient.
- TS will be tiered down to LTS similarly to any other Segment.
- TS has a short lifespan. Any Transient Segment has an Idle Expiration Time. If no append has been made to a Transient Segment within this system-wide elapsed time, the Transient Segment TS will be automatically deleted by the Segment Store. This can be piggybacked onto the Metadata Cleanup job that runs periodically. The idle time tracking will be reset upon a Segment Container failover - to give writers enough time to reconnect and resume writing.
- When the Writer is done writing data to TS, it can do a (new) Merge Segment with Header (MSH) operation on TS->S.
- A MSH is just like a Merge Segment operation, but it allows the calling code to atomically insert a header immediately before the start of the merged Segment (we'll see how this is used later on).
- Ex. Segment S has contents `01234`, and TS has contents `789`. We do a MSH(TS, S, `56`) and after it is complete, S will be `0123456789`.
- TS can be deleted at the request of the Writer (if the Large Append is aborted) or automatically if it becomes idle.
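The following is a purely illustrative, in-memory model of the Transient Segment lifecycle and the Merge-with-Header semantics described above; the `#transient.<uuid>` naming scheme and all field names are assumptions made for the sketch, not the actual Segment Store implementation:

```java
import java.util.UUID;

// Purely illustrative, in-memory model of the Transient Segment lifecycle described above. The
// real Segment Store keeps this state in its in-memory metadata (backed by Tier 1); the naming
// scheme and field names here are assumptions made for this sketch only.
final class TransientSegmentModel {
    final String name;                              // e.g. "<parent>#transient.<uuid>" (assumed scheme)
    final StringBuilder contents = new StringBuilder();
    long lastAppendNanos = System.nanoTime();       // reset on every append (and on container failover)

    TransientSegmentModel(String parentSegment) {
        this.name = parentSegment + "#transient." + UUID.randomUUID();
    }

    void append(String data) {
        contents.append(data);
        lastAppendNanos = System.nanoTime();
    }

    boolean isIdle(long idleTimeoutNanos) {         // checked periodically by the Metadata Cleaner
        return System.nanoTime() - lastAppendNanos > idleTimeoutNanos;
    }

    // Merge-with-Header: header + transient contents are appended to the target atomically,
    // e.g. mergeWithHeader("01234", <TS with "789">, "56") == "0123456789".
    static String mergeWithHeader(String target, TransientSegmentModel ts, String header) {
        return target + header + ts.contents;
    }
}
```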
We need a new API to support this. The current Serializer-based API requires us to have all the Event's data at hand, which violates Requirement 3. Let's name this `EventStreamWriter.beginAppend(routingKey)`; this will return a `SegmentedEventStreamWriter` (SESW) object. The SESW is a `ByteStreamWriter` that has explicit `abort` and `commit` methods. The SESW has the added benefit of not adding Event Envelopes (so whatever the application writes, it writes as-is). This is a behavior inherited from the `ByteStreamWriter`.
- The `EventStreamWriter` receives a Large Append to process for Segment S.
- A Transient Segment TS is created for Segment S and a SESW is returned that writes solely to TS.
- The application code uses the SESW to upload the Large Append to TS. The SESW exposes an `OutputStream` API, which should make this a familiar programming interface for anyone used to writing large amounts of data.
- If the application decides to abort the Large Append, it can do one of the following:
  - `SESW.abort()`
  - Do nothing (the Transient Segment will be auto-deleted once idle).
- When the application is done with the Large Append, it can invoke `SESW.commit` (TBD: can we do this with just `flush`/`close`?). This will cause the following to happen:
  - The accumulated length of the Large Append is L (the SESW can keep track of that).
  - A `MergeSegmentWithHeader` request is made to the Segment Store, instructing it to merge the Transient Segment that we have just uploaded to, with the Header being the regular Event Header corresponding to L (i.e., add the Event Envelope).
- The Writer will need to handle disconnects/reconnects. It should use offset-conditional appends or attribute-conditional appends on the Transient Segment to keep track of what has been written so far and what has not. It should buffer all unacknowledged data for seamless retransmission.
  - The `ByteStreamWriter` may already do that, but for clarity, this needs to be done here.
- The Writer will abort the Large Append if the target Segment is Sealed (for whatever reason). This situation is out of scope for this proposal.
- TBD: How do we handle ordering guarantees for Writers (i.e., if we have a number of other regular appends on the same EventStreamWriter while we perform this Large Append, how do we guarantee order)?
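A minimal usage sketch of the proposed API, assuming `beginAppend`, `SegmentedEventStreamWriter`, `commit` and `abort` are shaped as described above (none of these exist in the current client):

```java
import io.pravega.client.stream.EventStreamWriter;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch only: beginAppend() and SegmentedEventStreamWriter (SESW) are the APIs proposed in this
// PDP; they do not exist in the current client. Error handling is kept to a minimum.
final class LargeAppendWriteSketch {
    static void writeLargeEvent(EventStreamWriter<byte[]> writer, String routingKey, Path largeFile)
            throws Exception {
        SegmentedEventStreamWriter sesw = writer.beginAppend(routingKey);   // proposed API
        try (InputStream in = Files.newInputStream(largeFile)) {
            byte[] buffer = new byte[1024 * 1024];
            int read;
            while ((read = in.read(buffer)) != -1) {
                sesw.write(buffer, 0, read);   // plain OutputStream writes; no Event Envelope added here
            }
            sesw.commit();                     // triggers the Merge-with-Header onto the target Segment
        } catch (Exception e) {
            sesw.abort();                      // or do nothing and let the Transient Segment expire
            throw e;
        }
    }
}
```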
We need a modification on the Reader side as well. The EventStreamReader has a method called `EventRead<T> readNextEvent(Timeout)` which produces the next Event to be consumed by this reader. We want to keep this consumption pattern, however the `EventRead` currently encapsulates the whole Event data (it requires it to be stored in memory and run through the Serializer to create a Java object for the application). This is not ideal, as it would break Requirement 3.
We suggest the following reading workflow:
- `EventStreamReader.readNextEvent()` works as before: it returns an `EventRead` object.
- While fetching the Event, the Reader will look at the Event header and inspect the `Length` of the Event. If the `Length` is less than or equal to 8MB, this is a regular Event and nothing will work differently. Otherwise this is a Large Event and we'll need to go through a modified workflow.
- If a Small Event, then `EventRead` will work exactly as today. No changes.
- If a Large Event:
  - The Reader may pre-buffer the first 8MB (since it may have already read them as part of reading the header).
  - `EventRead.getEvent` will throw an `EventTooLargeException` when invoked. This exception will instruct the caller to use the API below.
  - `EventRead.beginGetEvent` will return a `ByteStreamReader` object that is targeting the Event's Segment and is bounded to the length of the Event (it may not read beyond the Event's boundary).
    - This call may also be used to read regular Events.
  - Care must be taken such that, if `getEvent()` prefetches some of the Event's data, `beginGetEvent` still returns the whole data. Also, once an Event has been fully served (either via `getEvent` or `beginGetEvent`), that Event should no longer be available via either API.
  - `getEvent` should not allow advancing the reader while `beginGetEvent` has an open `ByteStreamReader` (i.e., no concurrent reads from the same reader).
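A minimal consumption sketch under the assumption that `EventRead.beginGetEvent()` and `EventTooLargeException` are shaped as described above:

```java
import io.pravega.client.byteStream.ByteStreamReader;
import io.pravega.client.stream.EventRead;
import io.pravega.client.stream.EventStreamReader;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch only: EventRead.beginGetEvent() and EventTooLargeException are proposed in this PDP.
final class LargeEventReadSketch {
    static void readNext(EventStreamReader<byte[]> reader, Path spillFile) throws Exception {
        EventRead<byte[]> event = reader.readNextEvent(10_000);
        try {
            byte[] small = event.getEvent();          // regular (<=8MB) Events work exactly as today
            // ... process the small event ...
        } catch (EventTooLargeException tooLarge) {   // proposed exception type
            // Large Event: stream it out of the Segment Store without buffering it all in memory.
            try (ByteStreamReader in = event.beginGetEvent();   // proposed API; bounded to this Event
                 OutputStream out = Files.newOutputStream(spillFile)) {
                in.transferTo(out);
            }
        }
    }
}
```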
Using the approach above will deviate from our pattern of accepting a Java Object and using a `Serializer` to convert it to a ByteBuffer (on the Writer side) and then using the same `Serializer` to convert it back into a Java Object on the Reader side. Attempting to use the `Serializer` in its current form would cause us to violate Requirement 3 (buffering data at the Writer/Reader side), which is why we suggest integrating the `ByteStreamWriter`/`ByteStreamReader` into the `EventStreamWriter` and `EventStreamReader`.
Given the sheer size of the Events we are dealing with, it is hard to imagine how a plain Java Object could be serialized to such a large value. However, there might be cases where each such Large Event has some (relatively) tiny metadata associated with it. In this case, we recommend encoding this data along with the rest of the object using the `ByteStreamWriter`. Since this essentially extends `OutputStream`, the user can easily wrap it in a `DataOutputStream`, which allows encoding typed data.
From an API perspective, on the Client we will create a separate API for writing Large Events and reuse existing APIs (with certain changes) for reading. The Client API will be fully backwards compatible, with no changes required to any applications that are currently written against it.
On the Writer side:
- `EventStreamWriter`. No changes; this will work exactly as it does now. It will continue to require a `Serializer` and accept up to 8MB Events.
- `LargeEventStreamWriter`. A sibling to `EventStreamWriter`, this can be used to write Events of any size without the need for a `Serializer`.
  - `EventWriter streamEvent(String routingKey)` - returns an `EventWriter` object that can be used to "stream" the contents of an Event in.
  - Multiple `EventWriter` objects can be opened concurrently.
- `class EventWriter extends OutputStream`
  - `void commit()` - commits the Transient Segment and closes the instance.
  - `void close()` - if `commit()` has not been invoked, deletes the Transient Segment and closes the instance.
  - `void flush()` - flushes any buffered data onto the wire to the Transient Segment.
  - `void write(ByteBuffer)` - similar to `write(byte[], int, int)` but accepts a `ByteBuffer`.
  - Uses `ByteStreamWriter` underneath. We don't want to inherit from it since it exposes additional methods that we do not want the application to use.
  - Optimization: Only create the Transient Segment on the first invocation of `flush()`. If the total amount of accumulated data is less than 8MB (buffered on the client), invoking `commit` will convert this to a regular Append on the target segment (and thus preclude the need for a Transient Segment in the first place).
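A possible usage sketch of the proposed `LargeEventStreamWriter`/`EventWriter` pair; how the `LargeEventStreamWriter` instance is obtained from the client factory is not specified in this PDP, so it is simply taken as a parameter here:

```java
import java.io.InputStream;

// Sketch only: LargeEventStreamWriter, streamEvent() and EventWriter are the APIs proposed above;
// how the LargeEventStreamWriter instance is obtained from the client factory is not specified here.
final class LargeEventStreamWriterSketch {
    static void uploadLargeEvent(LargeEventStreamWriter largeWriter, String routingKey,
                                 InputStream payload) throws Exception {
        try (EventWriter out = largeWriter.streamEvent(routingKey)) {
            payload.transferTo(out);   // plain OutputStream writes; no Event Envelope added by the app
            out.commit();              // Merge-with-Header (or a plain Append if <8MB was accumulated)
        }                              // close() without a prior commit() deletes the Transient Segment
    }
}
```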
On the Reader side:
- Define `EventTooLargeException extends RuntimeException`.
  - We will allow `Serializer.deserialize` to throw this if it has been configured with a maximum event size. We will not throw it directly from the Client code (see below).
- `EventRead<>` (returned as part of `EventStreamReader.readNextEvent`):
  - `T getEvent()` will work as before and attempt to fetch the whole Event and deserialize it. Since this does not pose any OOM danger to the Segment Store, we will allow Events of any size to be deserialized with this method. If the Event is too large to fit in the Client Application's memory, an OOM/HeapOOM will be thrown by the runtime. In addition, the `Serializer` can be configured with a maximum deserialization limit, in which case it may choose to throw an `EventTooLargeException` if the Event data was buffered in memory but cannot be deserialized.
  - It is worth noting that `EventRead.getEventPointer().getLength()` can be used to get the length of an Event prior to invoking `getEvent`.
  - `EventReader streamEvent()` - will return an `EventReader` instance that can be used to read the raw Event data.
  - Note: some changes may be required in the `EventStreamReaderImpl` in case it pre-fetches the whole Event when returning an `EventRead`. We may choose to buffer it out of the Segment Store in 8MB increments.
- `EventStreamReader.fetchEvent(EventPointer p)`
  - This method will not support Large Events, and we are not planning to add equivalent support for them. See `EventRead.getEvent` above.
- `class EventReader extends InputStream`
  - Uses a `ByteStreamReader` underneath that begins at `EventPointer.getSegmentOffset` and is bounded to at most `EventPointer.getLength()` bytes. We don't want to inherit from it since it exposes additional methods that we do not want the application to use.
Support for Transient Segments:
- `CompletableFuture<String> SegmentStore.createTransientSegment(String parentSegmentName)`
  - Same as CreateSegment, but with fewer arguments allowed (no initial attributes and a Segment-Store-assigned LTS rolling policy).
  - Returns the name of the newly created Transient Segment.
  - The Transient Segment can be used with the following APIs (inherited from regular Segments): `append`, `read`, `mergeSegment`.
  - The following APIs will be disallowed: `seal`, `truncate`, `updateAttributes`.
- `CompletableFuture<Void> SegmentStore.mergeSegment(String sourceSegmentName, String parentSegmentName, BufferView header)`
  - Performs a Merge-with-Header. See the Design section above for details.
- Internal Segment Metadata
  - Add a boolean field `isTransient`.
  - Keep track of the last used time (wall clock). This is reset with every append or when a container failover happens.
- Metadata Cleaner
  - Delete Transient Segments that have been idle.
- Segment Store Configuration
  - Define the Transient Segment idle timeout.
- `MergeWithHeaderOperation`
  - New operation that allows a Merge Segment with Header.
  - This will be a hybrid of the `StreamSegmentAppendOperation` and `MergeSegmentOperation`, but we need a new one to ensure atomicity.
  - Internally it may be split in two: a `CachedStreamSegmentAppendOperation` (for the header) and a `MergeSegmentOperation` (for the merge) - as long as the two are atomically accepted into the system.
(TBD on final wire command naming)
- `CreateSegmentTransaction` / `CreateTransientSegment`
  - Requests the creation of a Transient Segment (wired up into `SegmentStore.createTransientSegment`).
- `MergeSegmentTransaction`
  - Requests the merger of a Transient Segment with a specified header (wired up into `SegmentStore.mergeSegment(Source, Target, Header)`).
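For illustration only, one possible shape of the two new wire commands; the field layout below is an assumption and the final naming is TBD as noted above:

```java
// Hypothetical field layout only; these are not the actual WireCommands definitions and the final
// command names are TBD, as noted above.
final class CreateTransientSegment {
    long requestId;
    String parentSegmentName;   // the Segment the Transient Segment will eventually be merged into
    String delegationToken;
}

final class MergeSegmentWithHeader {
    long requestId;
    String sourceSegmentName;   // the Transient Segment holding the uploaded Large Event data
    String targetSegmentName;   // the parent Segment
    byte[] header;              // the Event Envelope to insert atomically before the merged data
    String delegationToken;
}
```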