Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make loosen-follower-log-revert a runtime functionality instead of a feature #1251

Closed
schreter opened this issue Oct 8, 2024 · 2 comments · Fixed by #1259
Closed

Make loosen-follower-log-revert a runtime functionality instead of a feature #1251

schreter opened this issue Oct 8, 2024 · 2 comments · Fixed by #1259
Assignees
Labels
A-replication Area: replication protocol

Comments

@schreter
Copy link
Collaborator

schreter commented Oct 8, 2024

Currently, we have a feature called loosen-follower-log-revert to allow follower's log to be cut. This is normally undesirable, since it shows a problem in the follower implementation and should not be used.

However, there are use cases when a node is being recovered in failure case and the log is (partially) reinitialized. This is orchestrated from the outside. In this case, it might be desirable to temporarily allow follower's log to go back.

There are currently a few options how to do this that I can think of:

  • Create a completely new node with a new node ID and add it to the consensus, then remove the old node. This would work, but it is cumbersome and won't allow easily recovering of a node in-place.
  • Allow log revert in general, which is undesirable, since it covers up potential real problems.
  • Add a functionality to explicitly reset the match index of a particular follower to a specific value to tell the leader that the follower was recovered from a backup and the like.

Opinions? Am I missing something?

Copy link

github-actions bot commented Oct 8, 2024

👋 Thanks for opening this issue!

Get help or engage by:

  • /help : to print help messages.
  • /assignme : to assign this issue to you.

@drmingdrmer
Copy link
Member

The third approach seems feasible to me. This is also a scenario I've encountered, unfortunately :(

@drmingdrmer drmingdrmer self-assigned this Oct 14, 2024
@drmingdrmer drmingdrmer added the A-replication Area: replication protocol label Oct 14, 2024
drmingdrmer added a commit to drmingdrmer/openraft that referenced this issue Oct 15, 2024
…ation for next detected follower log revert

This method requests the RaftCore to allow to reset replication for a
specific node when log revert is detected.

- `allow=true`: This method instructs the RaftCore to allow the target
  node's log to revert to a previous state for one time.

- `allow=false`: This method instructs the RaftCore to panic if the
  target node's log revert

### Behavior

- If this node is the Leader, it will attempt to replicate logs to the
  target node from the beginning.
- If this node is not the Leader, the request is ignored.
- If the target node is not found, the request is ignored.

### Automatic Replication Reset

When the `loosen-follower-log-revert` feature flag is enabled, the
Leader automatically reset replication if it detects that the target
node's log has reverted. This feature is primarily useful in testing
environments.

### Production Considerations

In production environments, state reversion is a critical issue that
should not be automatically handled. However, there may be scenarios
where a Follower's data is intentionally removed and needs to rejoin the
cluster(without membership changes). In such cases, the Leader should
reinitialize replication for that node with the following steps:

- Shut down the target node.
- call [`Self::allow_next_revert`] on the Leader.
- Clear the target node's data directory.
- Restart the target node.

- Fix: databendlabs#1251
drmingdrmer added a commit to drmingdrmer/openraft that referenced this issue Oct 15, 2024
…ation for next detected follower log revert

This method requests the RaftCore to allow to reset replication for a
specific node when log revert is detected.

- `allow=true`: This method instructs the RaftCore to allow the target
  node's log to revert to a previous state for one time.

- `allow=false`: This method instructs the RaftCore to panic if the
  target node's log revert

### Behavior

- If this node is the Leader, it will attempt to replicate logs to the
  target node from the beginning.
- If this node is not the Leader, the request is ignored.
- If the target node is not found, the request is ignored.

### Automatic Replication Reset

When the `loosen-follower-log-revert` feature flag is enabled, the
Leader automatically reset replication if it detects that the target
node's log has reverted. This feature is primarily useful in testing
environments.

### Production Considerations

In production environments, state reversion is a critical issue that
should not be automatically handled. However, there may be scenarios
where a Follower's data is intentionally removed and needs to rejoin the
cluster(without membership changes). In such cases, the Leader should
reinitialize replication for that node with the following steps:

- Shut down the target node.
- call [`Self::allow_next_revert`] on the Leader.
- Clear the target node's data directory.
- Restart the target node.

- Fix: databendlabs#1251
drmingdrmer added a commit to drmingdrmer/openraft that referenced this issue Oct 15, 2024
…ation for next detected follower log revert

This method requests the RaftCore to allow to reset replication for a
specific node when log revert is detected.

- `allow=true`: This method instructs the RaftCore to allow the target
  node's log to revert to a previous state for one time.

- `allow=false`: This method instructs the RaftCore to panic if the
  target node's log revert

### Behavior

- If this node is the Leader, it will attempt to replicate logs to the
  target node from the beginning.
- If this node is not the Leader, the request is ignored.
- If the target node is not found, the request is ignored.

### Automatic Replication Reset

When the `loosen-follower-log-revert` feature flag is enabled, the
Leader automatically reset replication if it detects that the target
node's log has reverted. This feature is primarily useful in testing
environments.

### Production Considerations

In production environments, state reversion is a critical issue that
should not be automatically handled. However, there may be scenarios
where a Follower's data is intentionally removed and needs to rejoin the
cluster(without membership changes). In such cases, the Leader should
reinitialize replication for that node with the following steps:

- Shut down the target node.
- call [`Self::allow_next_revert`] on the Leader.
- Clear the target node's data directory.
- Restart the target node.

- Fix: databendlabs#1251
drmingdrmer added a commit to drmingdrmer/openraft that referenced this issue Oct 24, 2024
…ation for next detected follower log revert

This method requests the RaftCore to allow to reset replication for a
specific node when log revert is detected.

- `allow=true`: This method instructs the RaftCore to allow the target
  node's log to revert to a previous state for one time.

- `allow=false`: This method instructs the RaftCore to panic if the
  target node's log revert

This method returns `Fatal` error if failed to send the request to
RaftCore, e.g. when RaftCore is shut down.
Otherwise, it returns a `Ok(Result<_,_>)`, the inner result is:
- `Ok(())` if the request is successfully processed,
- or `Err(AllowNextRevertError)` explaining why the request is rejected.

### Behavior

- If this node is the Leader, it will attempt to replicate logs to the
  target node from the beginning.
- If this node is not the Leader, the request is ignored.
- If the target node is not found, the request is ignored.

### Automatic Replication Reset

When the `loosen-follower-log-revert` feature flag is enabled, the
Leader automatically reset replication if it detects that the target
node's log has reverted. This feature is primarily useful in testing
environments.

### Production Considerations

In production environments, state reversion is a critical issue that
should not be automatically handled. However, there may be scenarios
where a Follower's data is intentionally removed and needs to rejoin the
cluster(without membership changes). In such cases, the Leader should
reinitialize replication for that node with the following steps:

- Shut down the target node.
- call [`Self::allow_next_revert`] on the Leader.
- Clear the target node's data directory.
- Restart the target node.

- Fix: databendlabs#1251
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-replication Area: replication protocol
Projects
None yet
2 participants