replication: fetch replication log in a loop #350
Conversation
Force-pushed from 957e42f to 474be62.
@psarna can you review this? |
The issue here is more complicated than switching to the `log_entries` gRPC stream. The `log_entries` stream is designed for long polls and long-standing connections: ones that continuously fetch new frames in the background and apply them. That's how sqld replicas fetch data. Because of that, `log_entries` blocks once there are no more frames to replicate, and waits indefinitely for new frames, assuming that they will come. In contrast, we added `batch_log_entries` for applications that want to actively poll for new frames and never block. This non-blocking mode is especially important for single-threaded apps, e.g. ones that run on WebAssembly: we just ask for new frames from time to time, but never actually block.

The changes in this PR switch to `log_entries`, but that leaves the blocking problem. Also, note that the too-large-batches problem was recently fixed on the sqld side, via libsql/sqld#648. There is now a limit on how large a batch of entries can be, and users (e.g. this code) are simply expected to ask for more, which effectively implements pagination.

To sum up, I think the proper way to improve the libsql code is to modify `fn sync_from_http` so that instead of calling `fetch_log_entries` once, it does so in a loop, until it hits the point where 0 pages were returned, which indicates there are no more frames we can replicate in a non-blocking way.
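The loop shape described above can be sketched as follows. This is illustrative only: `fetch_page` stands in for the real `fetch_log_entries` call and is simulated here with an in-memory iterator of pages, and the frame-application step is elided.

```rust
// Sketch of the fetch loop: keep requesting pages of frames until the
// server returns an empty page. `fetch_page` is a placeholder for the
// real fetch_log_entries call.
fn sync_in_loop(mut fetch_page: impl FnMut() -> Vec<u64>) -> usize {
    let mut applied_frames = 0;
    loop {
        let frames = fetch_page();
        if frames.is_empty() {
            // Zero frames returned: nothing more to replicate right now,
            // so we stop instead of blocking like log_entries would.
            break;
        }
        applied_frames += frames.len();
        // ... apply the frames via the injector here ...
    }
    applied_frames
}

fn main() {
    // Simulate a server that serves two pages (1024 + 500 frames),
    // then an empty page that terminates the loop.
    let mut pages = vec![vec![0u64; 1024], vec![0u64; 500], vec![]].into_iter();
    let total = sync_in_loop(|| pages.next().unwrap_or_default());
    assert_eq!(total, 1524);
}
```

Note that the loop never waits: an empty page ends the sync, which preserves the non-blocking property that `batch_log_entries` exists to provide.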
The `BatchLogEntries` replication operation on sqld returns up to 1024 frames, so the client needs to fetch pages until the server returns an empty page.
Force-pushed from 474be62 to 2014861.
@psarna Sorry, I didn't realize that.
```rust
    applied_frames += len;
}
```
The error path gets more complicated with a loop approach: what if we successfully applied 2048 frames, but then failed to replicate the next batch? The current approach just returns an error, but `next_offset` still points past the frames we partially applied. Maybe we should retry a few times, and only return an error if all the attempts failed. And if we decide not to retry, which makes the code simpler, there's another issue: `sync_from_http` already returns a `Result<usize>`, which raises the question of what to return if we applied some frames but failed to apply more recent ones. We could either return the number of partially applied frames and report "success", because some frames were applied after all, or create a new specialized error case, like `PartialUpdate(usize)`. I think it's reasonable to return success on partial application, but let's discuss. /cc @penberg @LucioFranco @MarinPostma

Side note: there's a preexisting error here, because we bump `next_offset` in line 235 before sending frames. If sending fails, we end up with `next_offset` already updated, and the frames won't be fetched anymore. I'll fix that asap as a separate patch, but meanwhile, the new code should also only update `next_offset` after frames are successfully sent.
Back to error handling, I think we need a few things:

- Retry if getting frames failed but some previous frames were already applied in this call, to prevent partial updates.
- Clearly comment in the code that it's now possible to get a partial result.
- Decide whether we return an `Ok(usize)` on partial application of frames, or some special error code. I think `Ok(usize)` is fair.
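The retry idea from the first bullet can be sketched as a small helper. The `MAX_RETRIES` constant and the closure-based fetch are illustrative assumptions, not part of the actual libsql code:

```rust
// Sketch: retry a failing fetch a few times before surfacing the error,
// so a transient failure mid-sync is less likely to leave the caller
// with a partial update. MAX_RETRIES is an arbitrary illustrative value.
const MAX_RETRIES: usize = 3;

fn fetch_with_retry<T>(
    mut fetch: impl FnMut() -> Result<T, String>,
) -> Result<T, String> {
    let mut last_err = String::new();
    for _ in 0..MAX_RETRIES {
        match fetch() {
            Ok(v) => return Ok(v),
            Err(e) => last_err = e, // remember the error, try again
        }
    }
    Err(last_err)
}

fn main() {
    // Simulate a fetch that fails twice, then succeeds on the third try.
    let mut attempts = 0;
    let result = fetch_with_retry(|| {
        attempts += 1;
        if attempts < 3 { Err("transient".to_string()) } else { Ok(42) }
    });
    assert_eq!(result, Ok(42));
    assert_eq!(attempts, 3);
}
```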
I moved the `next_offset` update after `injector.step()`, as in the patch you pushed.

The major downside I can see with returning `Ok(usize)` is that the user of the library would not be able to tell whether it was a partial update. That's fine for `sync` because it runs in a loop anyway, but for `sync_oneshot` it would be nice to know if the sync was partial, so the user can retry.

What do you think about returning something like an `Ok(ReplicatedFrames)` for `sync_oneshot` instead of a specialized error?
```rust
// Could be a struct with a boolean flag, or an enum
enum ReplicatedFrames {
    Partial(usize),
    Full(usize),
}

impl ReplicatedFrames {
    /// Number of frames that were synced successfully
    pub fn frames_synced(&self) -> usize {
        match self {
            ReplicatedFrames::Partial(frames) => *frames,
            ReplicatedFrames::Full(frames) => *frames,
        }
    }

    /// A partial sync indicates that more frames _might_ be available to fetch
    pub fn sync_partially(&self) -> bool {
        matches!(self, ReplicatedFrames::Partial(_))
    }
}
```
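To make the proposal concrete, here is a self-contained sketch (repeating the enum above so it compiles on its own) of how a caller might branch on the result. The `sync_oneshot` call is not real here; its return value is simulated:

```rust
// The ReplicatedFrames type proposed above, plus a simulated
// sync_oneshot result showing how a caller could branch on it.
enum ReplicatedFrames {
    Partial(usize),
    Full(usize),
}

impl ReplicatedFrames {
    /// Number of frames that were synced successfully
    pub fn frames_synced(&self) -> usize {
        match self {
            ReplicatedFrames::Partial(frames) => *frames,
            ReplicatedFrames::Full(frames) => *frames,
        }
    }

    /// A partial sync indicates that more frames _might_ be available to fetch
    pub fn sync_partially(&self) -> bool {
        matches!(self, ReplicatedFrames::Partial(_))
    }
}

fn main() {
    // Pretend sync_oneshot returned after applying 2048 frames but
    // before the log was exhausted.
    let result = ReplicatedFrames::Partial(2048);
    assert_eq!(result.frames_synced(), 2048);
    if result.sync_partially() {
        // The caller now knows it can retry to fetch the remaining frames.
        println!("partial sync: {} frames applied", result.frames_synced());
    }
}
```

The design point is that partial application stays on the `Ok` path, while still being distinguishable from a full sync without a specialized error variant.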
The previous code bumped `next_offset` too early: if we failed to send and apply frames, it remained bumped. That means we'd never try to apply those frames again and would end up with some of them ignored, and that's data corruption. Not a likely scenario, since it happens after the frames were fetched from the network and only their local application can fail, but still technically risky. Refs tursodatabase#350
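The ordering fix described in this commit can be sketched as follows. `apply_frames` here is a stand-in for the real send-and-inject step; the point is purely the order of operations:

```rust
// Sketch of the ordering fix: apply the frames first, and only advance
// next_offset on success, so a failed application leaves the offset
// pointing at the frames we still need to fetch.
fn apply_batch(
    next_offset: &mut u64,
    frames: &[u64],
    apply_frames: impl Fn(&[u64]) -> Result<(), String>,
) -> Result<(), String> {
    apply_frames(frames)?; // may fail; offset untouched in that case
    *next_offset += frames.len() as u64; // bump only after success
    Ok(())
}

fn main() {
    let mut next_offset = 0u64;
    // A failed application must leave next_offset untouched,
    // so the same frames will be fetched and retried later.
    let _ = apply_batch(&mut next_offset, &[1, 2, 3], |_| Err("io".to_string()));
    assert_eq!(next_offset, 0);
    // A successful application advances it past the applied frames.
    apply_batch(&mut next_offset, &[1, 2, 3], |_| Ok(())).unwrap();
    assert_eq!(next_offset, 3);
}
```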
The pull request has conflicts and has not been updated for a while. Please feel free to open a new one if you wish to submit this for merging.
Since `batch_log_entries` sends up to 1024 frames, the sync operation should fetch all frames until it finds an empty page.