Make `loosen-follower-log-revert` a runtime functionality instead of a feature #1251

schreter · 2024-10-08T09:45:47Z

Currently, we have a feature called loosen-follower-log-revert to allow follower's log to be cut. This is normally undesirable, since it shows a problem in the follower implementation and should not be used.

However, there are use cases when a node is being recovered in failure case and the log is (partially) reinitialized. This is orchestrated from the outside. In this case, it might be desirable to temporarily allow follower's log to go back.

There are currently a few options how to do this that I can think of:

Create a completely new node with a new node ID and add it to the consensus, then remove the old node. This would work, but it is cumbersome and won't allow easily recovering of a node in-place.
Allow log revert in general, which is undesirable, since it covers up potential real problems.
Add a functionality to explicitly reset the match index of a particular follower to a specific value to tell the leader that the follower was recovered from a backup and the like.

Opinions? Am I missing something?

The text was updated successfully, but these errors were encountered:

github-actions · 2024-10-08T09:46:00Z

👋 Thanks for opening this issue!

Get help or engage by:

/help : to print help messages.
/assignme : to assign this issue to you.

drmingdrmer · 2024-10-09T08:16:08Z

The third approach seems feasible to me. This is also a scenario I've encountered, unfortunately :(

…ation for next detected follower log revert This method requests the RaftCore to allow to reset replication for a specific node when log revert is detected. - `allow=true`: This method instructs the RaftCore to allow the target node's log to revert to a previous state for one time. - `allow=false`: This method instructs the RaftCore to panic if the target node's log revert ### Behavior - If this node is the Leader, it will attempt to replicate logs to the target node from the beginning. - If this node is not the Leader, the request is ignored. - If the target node is not found, the request is ignored. ### Automatic Replication Reset When the `loosen-follower-log-revert` feature flag is enabled, the Leader automatically reset replication if it detects that the target node's log has reverted. This feature is primarily useful in testing environments. ### Production Considerations In production environments, state reversion is a critical issue that should not be automatically handled. However, there may be scenarios where a Follower's data is intentionally removed and needs to rejoin the cluster(without membership changes). In such cases, the Leader should reinitialize replication for that node with the following steps: - Shut down the target node. - call [`Self::allow_next_revert`] on the Leader. - Clear the target node's data directory. - Restart the target node. - Fix: databendlabs#1251

…ation for next detected follower log revert This method requests the RaftCore to allow to reset replication for a specific node when log revert is detected. - `allow=true`: This method instructs the RaftCore to allow the target node's log to revert to a previous state for one time. - `allow=false`: This method instructs the RaftCore to panic if the target node's log revert This method returns `Fatal` error if failed to send the request to RaftCore, e.g. when RaftCore is shut down. Otherwise, it returns a `Ok(Result<_,_>)`, the inner result is: - `Ok(())` if the request is successfully processed, - or `Err(AllowNextRevertError)` explaining why the request is rejected. ### Behavior - If this node is the Leader, it will attempt to replicate logs to the target node from the beginning. - If this node is not the Leader, the request is ignored. - If the target node is not found, the request is ignored. ### Automatic Replication Reset When the `loosen-follower-log-revert` feature flag is enabled, the Leader automatically reset replication if it detects that the target node's log has reverted. This feature is primarily useful in testing environments. ### Production Considerations In production environments, state reversion is a critical issue that should not be automatically handled. However, there may be scenarios where a Follower's data is intentionally removed and needs to rejoin the cluster(without membership changes). In such cases, the Leader should reinitialize replication for that node with the following steps: - Shut down the target node. - call [`Self::allow_next_revert`] on the Leader. - Clear the target node's data directory. - Restart the target node. - Fix: databendlabs#1251

drmingdrmer self-assigned this Oct 14, 2024

drmingdrmer added the A-replication Area: replication protocol label Oct 14, 2024

drmingdrmer mentioned this issue Oct 15, 2024

Feature: Raft::trigger()::allow_next_revert() allow to reset replication for next detected follower log revert #1259

Merged

drmingdrmer closed this as completed in #1259 Oct 31, 2024

drmingdrmer closed this as completed in 47e0220 Oct 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make `loosen-follower-log-revert` a runtime functionality instead of a feature #1251

Make `loosen-follower-log-revert` a runtime functionality instead of a feature #1251

schreter commented Oct 8, 2024

github-actions bot commented Oct 8, 2024

drmingdrmer commented Oct 9, 2024

Make loosen-follower-log-revert a runtime functionality instead of a feature #1251

Make loosen-follower-log-revert a runtime functionality instead of a feature #1251

Comments

schreter commented Oct 8, 2024

github-actions bot commented Oct 8, 2024

drmingdrmer commented Oct 9, 2024

Make `loosen-follower-log-revert` a runtime functionality instead of a feature #1251

Make `loosen-follower-log-revert` a runtime functionality instead of a feature #1251