
Recipe: Rollback During Startup

Paul Nowoczynski edited this page Jan 7, 2021 · 2 revisions

Objective

This recipe shows that a former leader rolls back its stale entries when it crashed with uncommitted entries and had a more advanced log than all of its followers. On restart, this leader is the last peer to start, so it cannot win the election. Once it is able to join, the cluster’s new term value must force the old leader to roll back the uncommitted entries from the (old) term in which it was the leader.

This recipe shares its initial steps with Completing an Uncommitted Write Following a Reboot.

Recipe Preparation

This recipe will start off by using the preparation and first 3 steps from the recipe Completing an Uncommitted Write Following a Reboot.

Be sure to capture the leader UUID from “Completing an Uncommitted Write Following a Reboot” and the default client request timeout (which should have been set to 1 second).

Steps 1 -> 3: Refer to “Completing an Uncommitted Write Following a Reboot”.

4. Poll for the Client Request to Time Out

      "pumice_db_test_client" : {
                "pmdb-test-apps" : [
                        {
                                "app-user-id" : "0771f672-0748-11eb-a0df-90324b2d1e89:0:0:0:0",
                                "status" : "Connection timed out",
                                "pmdb-seqno" : 0,
                                "pmdb-write-pending" : false,
                                "last-request" : "Tue Oct 06 21:40:36 UTC 2020",
                                "last-request-duration-ms" : 3011,
                                "last-request-tag" : 1539832552,
                                "app-sync" : false,
                                "app-seqno" : 1,
                                "app-value" : 1983817102,
                                "app-validated-seqno" : 0
                        }
                ],
                "pmdb-request-history" : [
                        {
                                "app-user-id" : "0771f672-0748-11eb-a0df-90324b2d1e89:0:0:0:0",
                                "op" : "write",
                                "status" : "Connection timed out",
                                "pmdb-req-seqno" : 0,
                                "pmdb-seqno" : 0,
                                "pmdb-write-pending" : false,
                                "submitted-time" : "Tue Oct 06 21:40:36 UTC 2020",
                                "duration-ms" : 1011,
                                "last-request-tag" : 1539832552,
                                "app-seqno" : 1,
                                "app-value" : 1983817102
                        }
                ],
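The timeout check above can be sketched in Python, assuming the client JSON has already been parsed into a dict. The helper name `write_timed_out` is hypothetical and not part of the test framework; only the key names and the status string are taken from the output shown in this step.

```python
TIMED_OUT = "Connection timed out"


def write_timed_out(client_state, rncui):
    """Return True if the request history holds a timed-out write for `rncui`.

    `client_state` is the parsed client JSON (hypothetical input; how it is
    fetched is outside this sketch).
    """
    client = client_state["pumice_db_test_client"]
    for entry in client.get("pmdb-request-history", []):
        if (entry.get("app-user-id") == rncui
                and entry.get("op") == "write"
                and entry.get("status") == TIMED_OUT):
            return True
    return False
```

A polling loop would call this helper repeatedly until it returns True.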

5. Start every Peer except the leader from Step 1

Poll waiting for the completion of the election.

Note the new leader UUID and Term.

5a. Verifications for the running peers:

Verify that the cluster is in a sane state where the running followers and leader agree on these /raft_root_entry/ KVs:

  • "term" :
  • "commit-idx" : 1
  • "last-applied" : 1
  • "last-applied-cumulative-crc" :
  • "newest-entry-idx" :
  • "newest-entry-term" :
  • "newest-entry-data-size" :
  • "newest-entry-crc" :
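As a minimal sketch of the agreement check above, the following Python compares the listed keys across the `raft_root_entry` dicts of all running peers. The function names and the way the entries are collected are assumptions for illustration; only the key names and the expected values of 1 come from the list.

```python
# Keys that every running peer must report identically (from the list above).
AGREED_KEYS = (
    "term",
    "commit-idx",
    "last-applied",
    "last-applied-cumulative-crc",
    "newest-entry-idx",
    "newest-entry-term",
    "newest-entry-data-size",
    "newest-entry-crc",
)

_MISSING = object()  # sentinel so that an absent key counts as a disagreement


def disagreeing_keys(peer_entries):
    """Return the keys whose values differ (or are absent) across peers."""
    bad = []
    for key in AGREED_KEYS:
        values = {entry.get(key, _MISSING) for entry in peer_entries}
        if len(values) != 1 or _MISSING in values:
            bad.append(key)
    return bad


def cluster_is_sane(peer_entries):
    """True when all peers agree and commit-idx and last-applied are both 1."""
    return (bool(peer_entries)
            and not disagreeing_keys(peer_entries)
            and peer_entries[0]["commit-idx"] == 1
            and peer_entries[0]["last-applied"] == 1)
```

Returning the list of disagreeing keys, rather than a bare boolean, makes a failed verification easier to diagnose.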

6. Start the Last Peer (which was the leader from Step #1)

Poll waiting for it to become a follower and for its commit-idx to become 1.

   "raft_root_entry" : [
                {
                        "raft-uuid" : "2b310920-081c-11eb-811b-90324b2d1e89",
                        "peer-uuid" : "2b31df8a-081c-11eb-bc78-90324b2d1e89",
                        "voted-for-uuid" : "00000000-0000-0000-0000-000000000000",
                        "leader-uuid" : "2b3271b6-081c-11eb-890a-90324b2d1e89",
                        "state" : "follower",
                        "follower-reason" : "leader-already-present",
                        "client-requests" : "redirect-to-leader",
                        "term" : 18,
                        "commit-idx" : 1,
                        "last-applied" : 1,
                        "last-applied-cumulative-crc" : 2733010441,
                        "newest-entry-idx" : 1,
                        "newest-entry-term" : 18,
                        "newest-entry-data-size" : 0,
                        "newest-entry-crc" : 3109780162,
                        "dev-read-latency-usec" : {},
                        "dev-write-latency-usec" : {
                                "1024" : 1
                        }
                }
        ],
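The polling in this step can be sketched as a generic wait loop. The `fetch` callable is a placeholder for however the peer's JSON is obtained, and `poll_until` and `is_caught_up_follower` are hypothetical names; the timeout and interval defaults are arbitrary choices, not values from the recipe.

```python
import time


def poll_until(fetch, predicate, timeout_s=30.0, interval_s=0.5):
    """Call fetch() until predicate(state) holds; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch()
        if predicate(state):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met before timeout")
        time.sleep(interval_s)


def is_caught_up_follower(state):
    """True once the peer reports state "follower" and commit-idx 1."""
    entry = state["raft_root_entry"][0]
    return entry.get("state") == "follower" and entry.get("commit-idx") == 1
```

For example, `poll_until(fetch_peer_state, is_caught_up_follower)` would return the first matching state, where `fetch_peer_state` is a caller-supplied function.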

6a. Redo verifications from 5a to include this peer

6b. Ensure that the leader and term from Step #5 have not changed

7. Issue a Read on the RNCUI Used in the Timed-Out Write Operation

No new object should have been written in this recipe since the request had timed out before the cluster could commit the write. Therefore, issuing a read operation for the object should result in an error of “No such file or directory”.

Execute Step #8 from “Completing an Uncommitted Write Following a Reboot” but use these verifications instead:

7a. Verifications:

  • "pmdb-seqno" : 0
  • "app-user-id" : RNCUI used in write request
  • "op" : "read"
  • "status" : "No such file or directory"
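The read verifications above could be checked with a small helper like the one below. This is a sketch that assumes the read's request-history entry has already been extracted as a dict; `read_mismatches` is a hypothetical name.

```python
def read_mismatches(entry, rncui):
    """Return {key: actual_value} for every verification that fails."""
    expected = {
        "pmdb-seqno": 0,
        "app-user-id": rncui,
        "op": "read",
        "status": "No such file or directory",
    }
    return {key: entry.get(key)
            for key, want in expected.items()
            if entry.get(key) != want}
```

An empty result means the read failed in the expected way, that is, the timed-out write was rolled back and never applied.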