Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maintain availability metadata for RC-coded symbols #96

Open
btmorr opened this issue Jul 11, 2020 · 0 comments
Open

Maintain availability metadata for RC-coded symbols #96

btmorr opened this issue Jul 11, 2020 · 0 comments
Labels
enhancement New feature or request

Comments

@btmorr
Copy link
Owner

btmorr commented Jul 11, 2020

For (k,m) RS-coded data, k+m=N where N is the size of the cluster at the time of write. The RS code algorithm guarantees that it will be possible to recover data as long as it is possible to read at least k unique fragments.

This does not compromise a fault tolerance of F nodes as long as k+F<=N. In Raft, a cluster of N nodes can tolerate F failures as long as F is less than a majority of nodes, thus F increases by 1 for every 2 nodes added to the cluster.

At the time a piece of data is initially encoded and symbols are written to member nodes, k will be chosen based on N at the time of write. For a set of symbols S = {s1, s2, ..., sn} and corresponding replication factors R = {r1, r2, ..., rn}, as long as max(R)-min(R)<=1, it will still be possible to read k unique symbols with F nodes unavailable, even as N and F grow as described above.

In order to guarantee that max(R)-min(R)<=1, when a node is added to the cluster, whichever symbol is least available should be duplicated onto the new node, with any symbol chosen in case of a tie. In order to guarantee that the leader will know which symbol must be written, R must be written as metadata along with each RS coded fragment. This metadata should also make it possible to know which symbols are written to which nodes, so that a leader can guarantee that the above restrictions are maintained during a change in configuration that includes both removing and adding nodes.

Proposed data structure:

Given a configuration list with an agreed-upon order C = [c1, c2, ..., cn], RS-code parameters k and m, and a list of integers R the same length as C, where each integer in R is an integer in the range of 0 until k+m indicating which symbol is stored on the corresponding index of C. As an example, let's say we write a piece of data to a node with 3 members at the time of write. 3 unique symbols will be created from the data (lets say k is chosen as 2--for more info on choosing k, see #93), and one will be written to each server. At this time, the metadata for this entry will be:

C = ["localhost:16990", "localhost:16991", "localhost:16992"]
k = 2
m = 1
R = [0, 1, 2]

Then, later, 2 nodes are added to the cluster. The leader would choose any 2 different symbols (since they're all tied) to write to the 2 nodes. After the write is completed, the metadata might look like this:

C = ["localhost:16990", "localhost:16991", "localhost:16992", "localhost:16993", "localhost:16994"]
k = 2
m = 1
R = [0, 1, 2, 2, 1]

At this point N is 5, therefore F is 2. No matter which 2 nodes are unavailable at a given time, at least 2 unique symbols will still be available, so it will be possible to reconstruct the original message.

Later again, if 2 more nodes were added, this time the leader would be required to pick symbol 0 (since it has the least replication), and then could choose any symbol (including 0) for the other write. After this, the metadata could look like:

C = ["localhost:16990", "localhost:16991", "localhost:16992", "localhost:16993", "localhost:16994", "localhost:16995", "localhost:16996"]
k = 2
m = 1
R = [0, 1, 2, 2, 1, 0, 0]

Now, N is 7 and thus F is 3. If any 3 nodes have failed, there will still be a minimum of 2 unique symbols available and the original data will be able to be reconstructed.

Additional notes:

  • The C record above is not duplicated for each log--this is part of the configuration message described in Section 6 of the extended Raft paper (see Add ability to dynamically change cluster membership #81 and corresponding implementation). Only k, m, and R are added to each log.
  • To clarify: logs should never be mutated, so when updated metadata is written this should be a new log, as if the data was overwritten with the identical value. Details of how this is carried out is out of scope for this ticket, but might involve reading and re-writing symbols, or sending a message that instructs each existing node to write a log with a copy of the earlier record and updated metadata (transmitting only the metadata and the index and term of the record to be updated). In either case, this data structure is unaffected.
  • Availability metadata only applies to voting members--if a node is a permanently non-voting member (read-only replica,per Add support for non-voting members #82) then it does not matter which fragment is replicated to it, so it's not required to maintain metadata for these nodes. If metadata is kept for these node (for instance, to make it easier to convert non-voting members to voting members, or to make distribution of symbols even though not required, it should be kept separate from log entries so that logs don't have to be updated via a vote to add a non-voting member.
  • It is necessary to know k and m, so that RS-coded data can be correctly reconstructed from symbols on read
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant