Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

121 last quorum acked #93

Closed
wants to merge 62 commits into from
Closed

121 last quorum acked #93

wants to merge 62 commits into from

Conversation

drmingdrmer
Copy link
Owner

Checklist

  • Updated guide with pertinent info (may not always apply).
  • Squash down commits to one or two logical commits which clearly describe the work you've done.
  • Unittest is a friend:)

drmingdrmer and others added 30 commits May 1, 2024 10:20
…is received

When a `SnapshotMismatch` is received, the sending end should re-send
all snapshot data from the beginning.
When `Engine::handle_install_full_snapshot()` is called and the provided
snapshot is not up-to-date, the snapshot should not be installed, and
the response should be sent back immediately. Previously, the method
might delay the response unnecessarily, waiting for an installation
process that would not proceed.

This commit adjusts the logic so that if the snapshot is recognized as
outdated, it immediately returns a `None` `Condition`, ensuring the
caller is informed straightaway that no installation will occur.
- Remove `generic-snapshot-data`. The snapshot data trait is now
  consistently bound to `OptionalSend + 'static`.

- Introduction of `RaftNetworkV2::full_snapshot()` for transmitting
  snapshots in one piece.

- Retention of the original chunk-based snapshot transmission API in
  `RaftNetwork::install_snapshot()`.

From this commit onwards:

- To use the one-piece snapshot transmission, implement `RaftNetworkV2`
  when creating a `Raft` instance.

- To continue using the chunk-based transmission, implement
  `RaftNetwork`.

Compatibility:

- `RaftNetworkV2` is automatically implemented for any `RaftNetwork` if
  `RaftTypeConfig::SnapshotData` implements `tokio::io::AsyncRead`,
  `tokio::io::AsyncWrite`, `tokio::io::AsyncSeek`, and `Unpin`.

Upgrade tip:

- Remove the `generic-snapshot-data` feature flag.

- If you prefer using the original chunk-based method, no further action
  is required.

  Otherwise, to adopt the new one-piece snapshot transmission:

  - Replace `RaftNetwork` and `RaftNetworkFactory` with `RaftNetworkV2`
    and `RaftNetworkFactoryV2`.
…kv-memstore` example (databendlabs#1121)

- Avoid race conditions in `raft-kv-memstore` example

- Add Storage tests for `raft-kv-memstore`

- Test: add snapshotting test to Suite
`LogIOId` is utilized to uniquely identify an append-entries IO
globally. It plays a crucial role in the log IO callback, informing
`RaftCore` about the latest log that has been successfully flushed to
disk.

In this commit, `LogIOId` is defined as a tuple comprising `leader_id`,
which submits the log IO, and `last_log_id`, which represents the final
log ID in the append-entries IO operation.

Given that the `leader_id` is monotonically increasing and a leader
consistently submits log entries in a sequential order, this design
ensures that `LogIOId` is also monotonically increasing. This feature
allows for effective tracking of IO progress.
This commit addresses a critical issue where if a new leader does not
flush the blank log to disk upon becoming established and then restarts
immediately, there is a possibility that previously committed data
becomes invisible to readers.

Before the blank log is flushed, the leader (identified by vote `v3`)
assumes it will be flushed and commits this log once (|quorum|-1)
replication responses are received. If the blank log is lost and the
server is restarted, data committed by a new leader (vote `v2`) may
not be visible.

This issue is addressed by utilizing `LeaderHandler::leader_append_entries()`
instead of `ReplicationHandler::append_blank_log()`, where the former
does not wait for the blank log to flush.

Changes:

- When assigning log IDs to log entries, the `Leading.last_log_id`,
  which represents the state of the log proposer (equivalent term in
  Paxos is Proposer), should be used instead of `RaftState.last_log_id`,
  which represents the state of the log receiver (equivalent term in
  Paxos is Acceptor).

- Consequently, the method `assign_log_ids()` has been moved from
  `RaftState` to `Leading`.

- Avoid manual implementation of duplicated logic:

  - During `initialize()`, reuse `FollowingHandler::do_append_entries()`
    to submit the very first log to storage.

  - In `establish_leader()`, reuse
    `LeaderHandler::leader_append_entries()` to submit log to storage
    and remove `ReplicationHandler::append_blank_log()`.

  - Remove `Command::AppendEntry`.
In this commit, `LeaderId` has been replaced with `CommittedLeaderId`
for identifying log IO requests. This change is made because
`CommittedLeaderId` is shorter and sufficiently identifies the necessary
attributes since only an established leader (a committed leader) can
write or replicate logs. Thus, using `CommittedLeaderId` streamlines the
identification process for log IO requests.
This commit adds the `RaftLogReader::limited_get_log_entries()` method,
which enables applications to fetch log entries that are equal to or
smaller than a specified range. This functionality is particularly
useful for customizing the size of AppendEntries requests at the storage
API level.

- Applications can now decide the number of log entries to return based
  on the input range. If the application determines that the requested
  log entries range is too large for a single RPC, it can opt to return
  only the first several requested log entries instead of the full
  range.

- The method provides a default implementation that delegates the
  operation to `RaftLogReader::try_get_log_entries`.

This enhancement allows for more flexible and efficient handling of log
entries, particularly in scenarios where network constraints or
performance considerations require smaller data transfers.
This commit updates `RaftNetworkV2` to treat any remote errors occurring
during RPC, like a `Fatal` error, as an `Unreachable` error. This is due
to Openraft's current inability to distinguish between an unreachable
node and a broken node.

Other changes:

- **Helper trait `DecomposeResult`:** Introduced to simplify handling
  composite errors. It converts a result of the
  form `Result<R, ErrorAOrB>`
  into a nested result `Result<Result<R, ErrorA>, ErrorB>`.

Upgrade tip:

- Implementations of `RaftNetworkV2` or `RaftNetwork` should now convert
  remote errors such as `Raft::Fatal` to `Unreachable` errors to
  maintain consistency in error handling.
- **Election Update Delay:**
  - The `Voting` state is now updated only after `SaveVote` is
    successfully persisted on disk. This anticipates future
    modifications where `SaveVote` will be callback-based, ensuring
    state consistency.

- **Deprecation of `VoteResponse.vote_granted`:**
  - Granting a `RequestVote` is now determined solely by whether
    `VoteResponse.vote` matches the candidate's vote. This simplifies
    the logic and reduces ambiguity in vote handling.

- **Replication Stream Reorganization:**
  - Replication stream handles have been moved from
    `RaftCore::LeaderData` to `RaftCore`. This reorganization prepares
    for supporting `RequestVote` RPCs in the replication mechanism,
    although it currently continues to support only log and snapshot
    replication.

- **Vote Comparison Enhancement:**
  - When filtering out stale messages (e.g., when a new Leader receives
    an `AppendEntries` response intended for a previous Leader), the
    responded vote is now compared with the vote in the `Leading` state.
    Previously, it was compared with `RaftCore.engine.raft_state.vote`,
    which might differ from the `Leading` state's vote in future
    scenarios.

- **Election State Management:**
  - During election, `Engine.candidate` is `Some` and `Engine.leader` is
    `None`. Upon election completion, `Engine.candidate` is cleared, and
    a new `Leader` is assigned to `Engine.leader`.
  - Renamed `Engine.internal_server_state` to `leader`, as it now only
    stores a `Leader` instance. The candidate state is moved to
    `Engine.candidate`.
  - Moved the process of establishing a Leader to `EstablishHandler`.
  - A Leader can now be established in two ways: by election or by
    re-creating the Leader upon startup.
  - Replace `InternalServerState` with `Option<Box<Leader<...>>>`

- **Logging Enhancements:**
  - Added `DisplaySliceExt` to create a displayable instance for slices
    of `impl Display`.
  - Added `DisplayInstant` to display `Instant` in a human-readable
    format, improving logging readability.
…` response

This commit addresses an issue in the vote updating mechanism during the
handling of `RequestVote` responses. Previously, encountering a higher
`vote` in a response incorrectly led to an update of the local
`state.vote`, which could break Raft consistency rules.

**Issue Description:**
- A higher `vote` seen in a `RequestVote` response does not necessarily
  mean that the vote is granted. The local `state.vote` should only be
  updated if the vote is actually granted, i.e., the responding node has
  a higher `vote` and a `last_log_id` that allows it to become a leader.

**Resolution:**
- The local `state.vote` will no longer be updated upon merely seeing a
  higher `vote` in the `RequestVote` response. Instead, this higher vote
  will be recorded in `last_seen_vote` for consideration in the next
  election cycle, without updating the current `state.vote`.

This bug is introduced in `0.10`: 36d2c11
This commit introduces a new trait, `TypeConfigExt`, which extends
`RaftTypeConfig`. The purpose of this trait is to simplify the access to
various functionalities provided by the `RaftTypeConfig` trait,
enhancing code readability and reducing complexity.

**Methods Added to `TypeConfigExt`:**
- `now()`
- `sleep()`
- `sleep_until()`
- `timeout()`
- `timeout_at()`
- `oneshot()`
- `spawn()`

**Usage Improvement:**
- Instead of using the
  `<<C as RaftTypeConfig>::AsyncRuntime as AsyncRuntime>::Instant::now()`,
  you can now simply call `C::now()`.
- `ReplicationSessionId<NID>` to `ReplicationSessionId<C>`
- `LeaderQuorumSet<NID>` to `LeaderQuorumSet<C>`
- `Condition<NID>` to `Condition<C>`
- `ProgressEntry<NID>` to `ProgressEntry<C>`
- `StorageError<NID>` to `StorageError<C>`
- `StorageIOError<NID>` to `StorageIOError<C>`
- `ErrorSubject<NID>` to `ErrorSubject<C>`
- `DefensiveError<NID>` to `DefensiveError<C>`
- `Violation<NID>` to `Violation<C>`
- `SnapshotSignature<NID>` to `SnapshotSignature<C>`
- `ChangeMembers<NID, N>` to `ChangeMembers<C>`
- `LogIdList<NID>` to `LogIdList<C>`

Upgrade tip:

Replace `StorageError<NID>` with `StorageError<C>`, such as:
```diff
-  async fn apply<I>(&mut self, entries: I) -> Result<Vec<Response>, StorageError<NodeId>>
+  async fn apply<I>(&mut self, entries: I) -> Result<Vec<Response>, StorageError<TypeConfig>>
```

Types to replace:

- `StorageError<NID>` to `StorageError<C>`
- `StorageIOError<NID>` to `StorageIOError<C>`
- `ErrorSubject<NID>` to `ErrorSubject<C>`
- `DefensiveError<NID>` to `DefensiveError<C>`
- `Violation<NID>` to `Violation<C>`
- `SnapshotSignature<NID>` to `SnapshotSignature<C>`
- `ChangeMembers<NID, N>` to `ChangeMembers<C>`
- `LogIdList<NID>` to `LogIdList<C>`
This commit introduces a new type, `CommittedVote`, which represents a
committed vote. This enhancement is aimed at ensuring that only
committed votes are considered valid for the leader's status.

By using the `CommittedVote` type, we can eliminate several runtime
assertion checks previously required to verify the commitment status of
votes.
LogIdRange
Inflight
LogStateReader
Accepted
drmingdrmer and others added 27 commits July 8, 2024 14:59
When a replication response is handled, RaftCore should initiate next
replication if there are more logs to to replicate.

Before this commit this is done in response-handler.
In this commit it is moved to `RaftCore::runtime_loop`:
replications are re-initiated when all of the events are processed.

This way the replication response handler is simplified.
This commit enhances the initialization process of a `Leader` instance
by automating the setup of progress information and replication streams.
Previously, these tasks were manually performed by the caller after the
leader's creation, which could lead to inconsistencies or additional
complexity.
* Refactor: Abstract the watch channel into a traits
Since this commit, `RaftCore` returns at once upon submitting
ApendEntries IO request to `RaftLogStorage`, without waiting for the IO
to be flushed to disk. When flushed, the result is responded to
`RaftCore` via a `Notify` channel.

This way `RaftCore` won't be blocked by AppendEntries IO operation:
while entries being flushing to disk, `RaftCore` is still able to deal
with other operations.

Upgrade(non-breaking) tip:

- Deprecated `LogFlushed`, use `IOFlushed` instead.
- Deprecated `LogFlushed::log_io_completed()`, use `IOId::io_completed()` instead.
This commit refines our logging strategy by utilizing the `Display`
trait instead of `Debug` for significant events. This change is aimed at
producing logs that are easier to read and understand.

Three kinds of significant events are logged at DEBUG level:

- `input`: the `RaftMsg`s received by `RaftCore`, such as client-write
  or AppendEntries request from the Leader.

- `cmd`: the `Command` outputted by `Engine` to execute by storage or
  network layer, such as `AppendInputEntries` or `ReplicateCommitted`.

- `notify`: the `Notification`s received by `RaftCore` from storage or
  network.

Example significant event logs:
```
RAFT_event id=0     cmd: Commit: seq: 5, (T1-N0.4, T1-N0.5]
RAFT_event id=1   input: AppendEntries: vote=<T1-N0:Q>, prev_log_id=T1-N0.5, leader_commit=T1-N0.5, entries=[]
RAFT_event id=0  notify: sm::Result(command_seq:5, Ok(ApplyResult([5, 6), last_applied=T1-N0.5, entries=[T1-N0.5])))
RAFT_event id=0   input: ClientWriteRequest
```
This commit addresses the premature sending of `Respond` messages in
response to `AppendEntries` RPCs and similar requests. Previously,
responses could be sent before the associated IO operations were
confirmed as complete, potentially leading to inconsistencies.

Changes:

- The `Respond` now blocks until the corresponding IO operation
  is completed and properly recorded in `IOProgress`.

- `RequestVote` and `AppendEntries` RPC handlers now include a
  `Condition` that remains unsatisfied until operations like `SaveVote`
  or `AppendInputEntries` are flushed to disk.

- Add `IOProgress`:
    - **`accepted`:** Tracks the ID of the last IO operation that was accepted but not yet submitted.
    - **`submitted`:** The ID of the last IO operation submitted to either `RaftLogStorage` or `RaftStateMachine`.
    - **`flushed`:** The ID of the last IO operation successfully flushed to storage.

- Remove `command_seq`: Instead of using `command_seq` to track the
  completion of state machine commands, this update uses
  `IOState.applied` and `IOState.snapshot`. This change accounts for the
  non-sequential execution of some SM commands, such as `BuildSnapshot`,
  which runs in a separate task.

- Remove `IOState.vote`: The `IOState.vote` field has been
  removed. The system now utilizes
  `IOState.io_progress.flushed().voted_ref()` for tracking vote
  operations.
This commit refines the error handling within Openraft by eliminating
obsolete error variants previously used in defensive assertions. The
transition to using `validit::Valid` has made these variants redundant.

- **Removal of `Violation` and `DefensiveError`:** These components are
  no longer necessary and have been removed from the codebase.

- **Refactoring of `StorageError`:** The `enum StorageError {
  Defensive:.., IO:... }` has been replaced with `StorageIOError`.
  Consequently, `StorageError` is now a type alias of `StorageIOError`.

- **Deletion of `RaftLogReaderExt::get_log_entries()`:** This method,
  which relied on the removed `DefensiveError`, has been removed.

Upgrade tips:

- Replace any usage of `StorageError::IO{ source: StorageIOError }` with
  the simplified `StorageError`. Stop using `StorageIOError`.

- Substitute calls to `RaftLogReaderExt::get_log_entries()` with
  `RaftLogReader::try_get_log_entries()`.
Whne a `StorageError` occurs locally during replication,
the `RaftCore` has to be shutdown at once. in this case, just send a
generic notification `Notification::StorageError`. There is no need for
`replication::Response` to contain `StorageError`.
`#[test_harness::test(harness = func)]` builds the harness for test
function in a more structural way.
The unaffected Learner won't be removed during membership change.

- Discussion: databendlabs#1191
`RaftMetrics::last_quorum_acked` is the absolute timestamp of the most
recent time point that is accepted by a quorum via `AppendEntries` RPC.
This field is a wrapped `Instant` type: `SerdeInstant` which support
serde for `Instant`. This field is added as a replacement of
`millis_since_quorum_ack`, which is a relative time.

`SerdeInstant` serialize `Instant` into a string formatted as
"%Y-%m-%dT%H:%M:%S%.9f%z", e.g., "2024-07-24T04:07:32.567025000+0000".

Note: Serialization and deserialization are not perfectly accurate and
can be indeterministic, resulting in minor variations each time. These
deviations(could be smaller or greater) are typically less than a
microsecond (10^-6 seconds).
Copy link

coderabbitai bot commented Jul 24, 2024

Important

Review skipped

More than 25% of the files skipped due to max files limit. The review is being skipped to prevent a low-quality review.

201 files out of 300 files are above the max files limit of 50. Please upgrade to Pro plan to get higher limits.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

mergify bot commented Jul 24, 2024

⚠️ The sha of the head commit of this PR conflicts with #94. Mergify cannot evaluate rules on this PR. ⚠️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants