Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tracker/snapshot_file_content: avoid useless uploads after desync #30

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

pkhuong
Copy link
Collaborator

@pkhuong pkhuong commented Oct 5, 2024

When the replicating (writer) Verneuil VFS notices that there is spooled metadata, but that it doesn't match the sqlite file, we currently always copy everything from scratch. This makes sense for long-lived processes, but isn't great (still correct, just wasteful!) for short-lived ones that run between regular writes to the sqlite db.

This commit notices when we have a valid spooled manifest, and it doesn't match the file on disk. We then force a full scan of the sqlite file, but still avoid uploading any chunk that happens to match what was in the manifest.

@pkhuong pkhuong force-pushed the pkhuong/warm-start branch from d4380b6 to 880fc1d Compare October 5, 2024 19:31
@pkhuong
Copy link
Collaborator Author

pkhuong commented Oct 5, 2024

Should note that the only assumption is that chunks in the manifest were already uploaded or are currently staged for upload. Apart from that, we assume no relationship between the manifest and the db file wrt correctness (except that they might happen to have the same bytes in the same place, a lot of the time).

@pkhuong pkhuong force-pushed the pkhuong/warm-start branch 2 times, most recently from 96beeb7 to 1e0040e Compare October 6, 2024 22:25
When the replicating (writer) Verneuil VFS notices that there is
spooled metadata, but that it doesn't match the sqlite file, we
currently always copy everything from scratch.  This makes sense
for long-lived processes, but isn't great (still correct, just
wasteful!) for short-lived ones that run between regular writes
to the sqlite db.

This commit adds logic to notice when we have a valid spooled manifest,
but it doesn't match the file on disk.  We then force a full scan of
the sqlite file, but still avoid uploading any chunk that happens
to match what was in the manifest.

This new feature does not assume any relationship between the file
on disk and the manifest we happen to find in the spooling directory.
The only assumption is that chunks in that manifest have been
uploaded, or are staged for upload.  If any chunk in the manifest
happens to match the corresponding page in the sqlite file, that's
great, if not, we just upload the new data, with content addressing.

The old behaviour (upload everything) is preserved when we don't
find a manifest, e.g., because the spooling directory is empty, or
because the manifest predates the most recent boot.

Independently, it might be interesting to treat the previous manifest
as a set of chunks that are known to be already uploaded or staged
for upload, and not stage them again.  In practice, we mostly expect
matches at identical page offset, but who knows with large blobs?
Clippy is right to point out that it's pretty big, and we just pass
it around for a while.
@pkhuong pkhuong force-pushed the pkhuong/warm-start branch from 1e0040e to b5f782c Compare October 6, 2024 22:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant