
PDP 48 (Key Value Tables Beta 2)

co-jo edited this page Jul 9, 2021 · 2 revisions

Status: IMPLEMENTED (Release 0.10.0). Discussion and comments in Issue 5917.

Important: This refers to Beta 2 of Key-Value Tables. This builds upon the work that has already been done for Beta 1 (PDP-39).

Table of Contents:

  1. Motivation
  2. Requirements
  3. Design
  4. Assumptions
  5. Limitations
  6. Server-Side Changes
  7. Client API

Motivation

The work done in Beta 1 has given us the initial version of Key-Value Tables. Although a good addition to the set of primitives that Pravega exposes, it had a few shortcomings:

  • Performance was not up to par in certain cases. This was due to the introduction of "Sorted Table Segments" - a layer on top of existing Hash Table Segments storing the sorted index within those segments themselves; these could easily be thrashed in certain scenarios (where a large number of distinct keys were inserted in arbitrary order).
  • The client-side API was forcing users to use a String Key Family if they wanted to affinitize keys to partitions (and do multi-key atomic updates or iterators). This was overly restrictive.
  • Listing keys/entries within the same Key Family was good, but it was forcing users to put all those keys in the same Key Family (which meant same partition and, implicitly, same segment). This couldn't scale well. A solution for a global iterator was needed.
  • No ability to do range/prefix iterators.

Requirements

  1. Stable performance that does not cause the server-side (or client) to crash or become unresponsive, even under heavy loads (for inserts, updates, removals or retrievals).
  2. Either remove or replace Key Family with another concept that is more flexible.
  3. Key Ordering (i.e., sorted Key-Value Table) that allows us to do range queries (list keys from A to B), or prefix queries (list all keys beginning with prefix P).
  4. Preserve as much of the public Client API as possible from Beta 1 (i.e., if changes are required, every effort should be made to minimize disruptions).

Design

In order to satisfy requirement #2, we are suggesting the following change to the Key-Value Table Entry Structure. We will define an Entry to be an association of a Key to a Value (with an eTag-like Version). The differences from Beta 1 are:

  • A Key no longer has a Key Family.
  • A Key is made up of a Primary Key and a Secondary Key
    • Primary Key is mandatory for all Keys. It is used for hashing the Key to a partition (similarly to how a Key Family was used before). All Keys with the same Primary Key are going to be hashed to the same partition (segment). A Primary Key need not be a String (it can be whatever the user wants it to be).
    • Secondary Key is optional.
    • Multi-key atomic operations can only be made across Keys sharing the same Primary Key (differing only in their Secondary Keys). Because all such keys are hashed to the same Table Segment, these operations can be guaranteed to be atomic.
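The key structure above can be sketched as follows. This is an illustrative model (not Pravega's actual client code): a composite key is the concatenation of a fixed-length Primary Key and an optional fixed-length Secondary Key, and only the Primary Key bytes participate in partition selection, so all keys sharing a Primary Key land on the same Table Segment. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch of the Beta 2 key structure: Primary Key + optional
// Secondary Key, with partition selection driven by the Primary Key only.
public final class CompositeKey {
    // Concatenate the two key components into the full (fixed-length) key.
    static byte[] serialize(byte[] primaryKey, byte[] secondaryKey, int pkLen, int skLen) {
        if (primaryKey.length != pkLen || secondaryKey.length != skLen) {
            throw new IllegalArgumentException("Key length mismatch");
        }
        byte[] full = new byte[pkLen + skLen];
        System.arraycopy(primaryKey, 0, full, 0, pkLen);
        System.arraycopy(secondaryKey, 0, full, pkLen, skLen);
        return full;
    }

    // Partition (segment) selection hashes ONLY the Primary Key bytes, so all
    // keys with the same Primary Key map to the same partition.
    static int partitionOf(byte[] primaryKey, int partitionCount) {
        return Math.floorMod(Arrays.hashCode(primaryKey), partitionCount);
    }
}
```

Because `partitionOf` ignores the Secondary Key, two entries `(pk, sk1)` and `(pk, sk2)` always live in the same segment, which is what makes multi-key atomic updates over them possible.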

To satisfy requirements #1 and #3, we are suggesting the following changes on the server-side:

  • Completely remove the Sorted Table Segment layer sitting on top of select Hash Table Segments. This was introduced with Beta 1 and was only used for KVTs. Since this is the major source of our performance problems, we either have to redesign it or scrap it altogether.
    • Various solutions have been considered for improving it (including some LSM Tree-like structure on top of Segments) but all such solutions would either 1) not solve our performance bottleneck or 2) be prohibitively complex (hard to develop, maintain and debug).
  • Create a new type of Table Segment. Call this a Fixed-Key-Length Table Segment.
    • While similar to the existing Hash Table Segments and providing the same API, Fixed-Key-Length Table Segments have one major difference (which will affect downstream design):
      • All keys must be the same length (declared when the segment is created) and the length should be at most 256 bytes.
    • These segments will store their keys (in their entirety) in the Segment Attribute Index, which is implemented by a sorted data structure (B+Tree index). As such, we will get natural key order for free. The values associated with these Attributes are simply the offsets within the segment where those keys reside.

Fixed-Key-Length Table Segments:

  • All Keys must be the same length (declared at creation time); Key+Value length must be less than 1MB (same as Hash Table Segments)
    • Key Length is stored in a (new) core segment attribute (and is validated with every operation).
  • Every Entry (Key, Value, Version) is stored as follows:
    • Serialized to main segment (Header, Version, Key, Value) - Same as Hash Table Segments (data format is the same).
    • The Key is inserted into the Segment's Attribute as a pair {Key, Offset}, where Offset is the offset within the main segment where the serialization was appended to.
  • Unconditional updates and insertions are executed as:
    • A Composite Append (Append + Attribute Update); the payload of the Append is the serialization and the Attribute Update maps the Key to the append Offset in the Segment's Attributes.
  • Conditional updates and insertions are performed the same as their Unconditional counterparts, except that:
    • The AttributeUpdate is conditioned (i.e., ReplaceIfEquals) that the value of the Attribute associated with the Key matches what the user provided (the Offset is the version).
    • NOTE: Just like for Hash Table Segments, a bit more work will need to be done to account for compaction - that work is the same as for Hash Table Segments so we won't detail it here.
  • Unconditional removals are executed as:
    • An Attribute Update to remove the Key from the Segment's Attributes.
    • NOTE: we need not serialize the removal itself, even though we could. We may choose to do so in case we have to rebuild the index after some problem (TBD).
  • Conditional removals are performed the same as their Unconditional counterparts, except that:
    • The AttributeUpdate is conditioned that the value of the Attribute associated with the Key matches what the user provided.
    • Same notes regarding compaction apply here.
  • Key Retrievals are performed by looking up the Key in the Segment Attributes and then reading the entry pointed to by that Attribute. The absence of key collisions simplifies a lot of the infrastructure required to support these.
  • Key/Entry Iterators are performed by executing an Entry iterator over the Segment Attributes (already built in) and then retrieving the Entries (if required) from the main segment.
    • Range Iterators are a simple delegation to the Segment Attribute Iterator (already supports this).
    • Prefix Iterators can be transformed into a Range Iterator and are done similarly to those.
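The storage and iterator mechanics above can be sketched with a simplified in-memory model. This is not the actual Segment Store code: a `TreeMap` stands in for the B+Tree Attribute Index, the "main segment" is just a running length counter, and all names are illustrative. It shows the offset-as-version scheme for conditional updates and the prefix-to-range iterator rewrite.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified model of a Fixed-Key-Length Table Segment: entries append to a
// log; a sorted index (standing in for the Attribute Index) maps each full
// key to the offset of its latest serialization.
public final class FixedKeySegmentModel {
    static final Comparator<byte[]> BITWISE = Arrays::compareUnsigned;

    final int keyLength;
    long segmentLength = 0; // simulated main-segment length
    final TreeMap<byte[], Long> index = new TreeMap<>(BITWISE);

    FixedKeySegmentModel(int keyLength) { this.keyLength = keyLength; }

    // Unconditional update: composite append (append the serialization, then
    // map the key to the append offset in the segment's attributes).
    long put(byte[] key, int valueLength) {
        if (key.length != keyLength) throw new IllegalArgumentException("bad key length");
        long offset = segmentLength;
        segmentLength += keyLength + valueLength; // entry header omitted for brevity
        index.put(key, offset);
        return offset; // the offset doubles as the entry's version
    }

    // Conditional update: ReplaceIfEquals on the key's current offset (version).
    boolean putIfVersion(byte[] key, int valueLength, long expectedVersion) {
        Long current = index.get(key);
        if (current == null || current != expectedVersion) return false;
        put(key, valueLength);
        return true;
    }

    // Range iterator: a direct delegation to the sorted index.
    SortedMap<byte[], Long> range(byte[] from, byte[] to) {
        return index.subMap(from, true, to, true);
    }

    // Prefix iterator: rewritten as the range [prefix padded with 0x00,
    // prefix padded with 0xFF].
    SortedMap<byte[], Long> prefix(byte[] p) {
        byte[] from = Arrays.copyOf(p, keyLength);
        byte[] to = Arrays.copyOf(p, keyLength);
        Arrays.fill(to, p.length, keyLength, (byte) 0xFF);
        return index.subMap(from, true, to, true);
    }
}
```

Note how a failed conditional update leaves both the index and the simulated segment untouched, while a successful one appends a fresh serialization and repoints the key, mirroring the composite Append + AttributeUpdate described above.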

Important things to note about Fixed-Key-Length Table Segments:

  • There are no key collisions! Since keys are stored in full rather than hashed to fewer bytes, two keys can never collide in the index.
    • This is one of the major advantages of Fixed-Key-Length Table Segments over their Hash counterparts. The lack of collisions enables us to avoid having a background indexer to resolve them, and also avoids having to resolve such collisions during conditional operations or key retrievals.

Assumptions

  • In the new form, there will no longer be a parametrized API for a Key-Value Table. The API will accept ByteBuffers for Keys and Values and return ByteBuffers for the same. As such:
    • All serialization and deserialization of whatever the Keys and Values are is the sole responsibility of the application that uses the API.
    • All Primary Keys must have the same length throughout the KVT, and same with Secondary Keys. The application using the API must provide ByteBuffers for each of these that have matching lengths for what their intent is (mismatches will be rejected). It is the responsibility of the application to perform any padding (if needed) for keys - Pravega does not understand the contents of those Keys so it cannot do padding for the application.
  • Keys will be ordered using a Bitwise Lexicographic Comparator (refer to that class's Javadoc for documentation).
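A minimal sketch of what bitwise (unsigned) lexicographic ordering means in practice, since it differs from Java's signed `byte` comparison: each byte is compared as an unsigned value, so `0x80` sorts after `0x7F` even though it is negative as a signed byte. This also illustrates why the application must do its own padding, as padding position changes the ordering.

```java
import java.util.Arrays;

// Bitwise lexicographic (unsigned) comparison, the ordering a comparator like
// Pravega's is expected to produce. Not the actual Pravega class.
public final class BitwiseOrder {
    static int compare(byte[] a, byte[] b) {
        // Compares byte-by-byte as unsigned values; shorter arrays that are a
        // prefix of longer ones sort first.
        return Arrays.compareUnsigned(a, b);
    }
}
```

With signed comparison, `(byte) 0x80` (= -128) would sort before `0x7F` (= 127); the bitwise comparator gives the opposite, and generally matches how the keys' raw bytes would sort on disk.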

Limitations

  • Length of Keys (Primary and Secondary) must be declared when the Key-Value Table is created and cannot be changed afterwards. All Primary Keys and Secondary Keys used on that Key-Value Table must have lengths matching what was declared at creation time.
  • Upgrading from Beta 1 Key-Value Tables will not be possible as Beta 2 replaces the Table Segment format used previously.
  • New functionality (i.e., creating Key-Value Tables) should not be used during a cluster upgrade (if this feature is released in version X, then while upgrading from version X-1 to X) or before it has been determined that a downgrade/rollback will not be needed. Key-Value Tables may introduce new operations into the Segment Store's Durable Log that are not supported by previous versions.
  • There is no backward compatibility with Key-Value Tables that may have been created with Beta 1 code. Any Key-Value Tables created prior to this release will have to be deleted (recommended prior to upgrade).

Server-Side Changes

Segment Store

  • Support Segment Attributes with Ids that are not 16 bytes.
    • Currently all attributes are UUIDs (16 bytes).
    • Define AttributeId that mimics UUID (for now). This can be subclassed into AttributeId.UUID (for Stream Segments and Hash Table Segments) and AttributeId.Variable (for Fixed-Key-Length Table Segments). All places within the Segment Store that use UUIDs (for this purpose) need to be updated.
    • Implement an elegant method to serialize them without affecting stream segment ingestion performance (i.e., do not add extra data to the serialization of appends for those segments).
  • Support Updating Attributes ByReference
    • We need to set the value of an Attribute to something other than just a user-provided number. In this (initial) case, we need to set that value to the current Length of the Segment (+/- a delta offset).
  • Remove existing Sorted Table Segments
    • Search and destroy. Nothing complex here. No backward compatibility to worry about as these aren't used anywhere else.
  • Table Segment Layouts
    • We will now be supporting two types of Table Segments. For better maintenance, we should consider factoring out each format (and their quirks) into their own, separate layouts.
    • Factor out existing Hash Table Segment into its own layout (HashTableLayout)
    • Create Fixed-Key-Length Table Segment Layout and implement it (FixedKeyLengthLayout).
  • Update Wire Commands
    • For Table Segment Creation (needs new, optional field for key length).
    • For Iterator Arguments (prefix, range, etc.)

Controller

  • Add additional fields to the KVT Config
    • PrimaryKeyLength(int, positive)
    • SecondaryKeyLength(int, zero or positive)
  • These fields must be set via the Create KVT API call
  • In Segment Helper, when creating a Segment
    • If this is for a KVT, then pass sum(PrimaryKeyLength, SecondaryKeyLength) as an arg to the wire command to create the table segment.
    • If not for a KVT, pass 0 for that argument.
  • Create a new client-accessible API to retrieve the configuration for a KVT (by scope+name). This should contain, at a minimum, the number of partitions and the newly added PrimaryKeyLength and SecondaryKeyLength.
    • This should be implemented in such a way that, if we add new fields in the future, including them here in this call should be easy to do.
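The Controller-side configuration changes can be sketched as below. This is a hypothetical shape (the real field and class names in Pravega's Controller may differ): it captures the validation rules listed above and the total key length that would be passed to the table-segment creation wire command.

```java
// Hypothetical sketch of the extended KVT configuration; not the actual
// Controller class.
public final class KeyValueTableConfigSketch {
    final int partitionCount;
    final int primaryKeyLength;   // int, must be positive
    final int secondaryKeyLength; // int, zero or positive

    KeyValueTableConfigSketch(int partitionCount, int primaryKeyLength, int secondaryKeyLength) {
        if (primaryKeyLength <= 0) {
            throw new IllegalArgumentException("primaryKeyLength must be positive");
        }
        if (secondaryKeyLength < 0) {
            throw new IllegalArgumentException("secondaryKeyLength must be zero or positive");
        }
        this.partitionCount = partitionCount;
        this.primaryKeyLength = primaryKeyLength;
        this.secondaryKeyLength = secondaryKeyLength;
    }

    // The value the Segment Helper would pass to the create-table-segment wire
    // command for a KVT segment; non-KVT segments would pass 0 instead.
    int totalKeyLength() {
        return primaryKeyLength + secondaryKeyLength;
    }
}
```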

Client API

  • TableSegment/TableSegmentImpl
    • No major changes anticipated. Since the wire format stays the same, this class should remain largely unchanged.
    • The only changes anticipated should be in the area of iterators (to support prefix and range iterators).
  • KeyValueTable, TableKey, TableEntry, IteratorState/Args (contracts)
  • KeyValueTableImpl
    • Will have to be rewritten from scratch to implement the new API (without Key Families and without parametrized types)
    • A LOT OF CARE should be taken handling ByteBuffers. They are notoriously difficult to deal with and very easy to mismanage.
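One concrete instance of the ByteBuffer hazard mentioned above (illustrative helper names, not Pravega code): reading a caller-supplied buffer advances its position, so a second read sees no data. Working through `duplicate()`, which shares content but has an independent position, avoids the surprise.

```java
import java.nio.ByteBuffer;

// Demonstrates why ByteBuffers are easy to mismanage: relative reads mutate
// the buffer's position as a side effect.
public final class ByteBufferCare {
    // Unsafe: consumes the caller's buffer (position ends up at the limit).
    static byte[] copyConsuming(ByteBuffer b) {
        byte[] out = new byte[b.remaining()];
        b.get(out);
        return out;
    }

    // Safe: reads through a duplicate, leaving the caller's position intact.
    static byte[] copySafely(ByteBuffer b) {
        ByteBuffer dup = b.duplicate();
        byte[] out = new byte[dup.remaining()];
        dup.get(out);
        return out;
    }
}
```

Calling `copyConsuming` twice on the same buffer returns an empty array the second time; `copySafely` is repeatable. Defensive `duplicate()`/`asReadOnlyBuffer()` at API boundaries is the usual discipline.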