Maintain availability metadata for RC-coded symbols #96

btmorr · 2020-07-11T03:21:31Z

For (k,m) RS-coded data, k+m=N where N is the size of the cluster at the time of write. The RS code algorithm guarantees that it will be possible to recover data as long as it is possible to read at least k unique fragments.

This does not compromise a fault tolerance of F nodes as long as k+F<=N. In Raft, a cluster of N nodes can tolerate F failures as long as F is less than a majority of nodes, thus F increases by 1 for every 2 nodes added to the cluster.

At the time a piece of data is initially encoded and symbols are written to member nodes, k will be chosen based on N at the time of write. For a set of symbols S = {s1, s2, ..., sn} and corresponding replication factors R = {r1, r2, ..., rn}, as long as max(R)-min(R)<=1, it will still be possible to read k unique symbols with F nodes unavailable, even as N and F grow as described above.

In order to guarantee that max(R)-min(R)<=1, when a node is added to the cluster, whichever symbol is least available should be duplicated onto the new node, with any symbol chosen in case of a tie. In order to guarantee that the leader will know which symbol must be written, R must be written as metadata along with each RS coded fragment. This metadata should also make it possible to know which symbols are written to which nodes, so that a leader can guarantee that the above restrictions are maintained during a change in configuration that includes both removing and adding nodes.

Proposed data structure:

Given a configuration list with an agreed-upon order C = [c1, c2, ..., cn], RS-code parameters k and m, and a list of integers R the same length as C, where each integer in R is an integer in the range of 0 until k+m indicating which symbol is stored on the corresponding index of C. As an example, let's say we write a piece of data to a node with 3 members at the time of write. 3 unique symbols will be created from the data (lets say k is chosen as 2--for more info on choosing k, see #93), and one will be written to each server. At this time, the metadata for this entry will be:

C = ["localhost:16990", "localhost:16991", "localhost:16992"]
k = 2
m = 1
R = [0, 1, 2]

Then, later, 2 nodes are added to the cluster. The leader would choose any 2 different symbols (since they're all tied) to write to the 2 nodes. After the write is completed, the metadata might look like this:

C = ["localhost:16990", "localhost:16991", "localhost:16992", "localhost:16993", "localhost:16994"]
k = 2
m = 1
R = [0, 1, 2, 2, 1]

At this point N is 5, therefore F is 2. No matter which 2 nodes are unavailable at a given time, at least 2 unique symbols will still be available, so it will be possible to reconstruct the original message.

Later again, if 2 more nodes were added, this time the leader would be required to pick symbol 0 (since it has the least replication), and then could choose any symbol (including 0) for the other write. After this, the metadata could look like:

C = ["localhost:16990", "localhost:16991", "localhost:16992", "localhost:16993", "localhost:16994", "localhost:16995", "localhost:16996"]
k = 2
m = 1
R = [0, 1, 2, 2, 1, 0, 0]

Now, N is 7 and thus F is 3. If any 3 nodes have failed, there will still be a minimum of 2 unique symbols available and the original data will be able to be reconstructed.

Additional notes:

The C record above is not duplicated for each log--this is part of the configuration message described in Section 6 of the extended Raft paper (see Add ability to dynamically change cluster membership #81 and corresponding implementation). Only k, m, and R are added to each log.
To clarify: logs should never be mutated, so when updated metadata is written this should be a new log, as if the data was overwritten with the identical value. Details of how this is carried out is out of scope for this ticket, but might involve reading and re-writing symbols, or sending a message that instructs each existing node to write a log with a copy of the earlier record and updated metadata (transmitting only the metadata and the index and term of the record to be updated). In either case, this data structure is unaffected.
Availability metadata only applies to voting members--if a node is a permanently non-voting member (read-only replica,per Add support for non-voting members #82) then it does not matter which fragment is replicated to it, so it's not required to maintain metadata for these nodes. If metadata is kept for these node (for instance, to make it easier to convert non-voting members to voting members, or to make distribution of symbols even though not required, it should be kept separate from log entries so that logs don't have to be updated via a vote to add a non-voting member.
It is necessary to know k and m, so that RS-coded data can be correctly reconstructed from symbols on read

The text was updated successfully, but these errors were encountered:

btmorr added the enhancement New feature or request label Jul 11, 2020

btmorr added this to the Coded Raft (dynamic membership) milestone Jul 11, 2020

btmorr mentioned this issue Jul 12, 2020

Send least-replicated symbol to new nodes in log append #106

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Maintain availability metadata for RC-coded symbols #96

Maintain availability metadata for RC-coded symbols #96

btmorr commented Jul 11, 2020 •

edited

Loading

Maintain availability metadata for RC-coded symbols #96

Maintain availability metadata for RC-coded symbols #96

Comments

btmorr commented Jul 11, 2020 • edited Loading

btmorr commented Jul 11, 2020 •

edited

Loading