[WIP] File deduplication #6332

Open. Wants to merge 22 commits into base: main.

Hocuri (Collaborator) commented Dec 11, 2024

Instead of deduplicating by default, this PR adds a new function, set_file_and_deduplicate(). When receiving messages, blobs are already deduplicated by default.

This is for #6265; read the issue description there for more details.
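
To illustrate the general idea, here is a minimal sketch of content-addressed storage, not the code in this PR. The helper name `store_deduplicated` is hypothetical, and `DefaultHasher` is only a stand-in for the real hash: it is not cryptographic and produces 16 hex characters rather than 64. The sketch also shows two of the TODO items (skipping the write for identical content, and marking files read-only).

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not the PR's API): stores `data` in `blobdir` under a
/// name derived from its content and returns the resulting path.
fn store_deduplicated(blobdir: &Path, data: &[u8]) -> io::Result<PathBuf> {
    // Stand-in hash: DefaultHasher is NOT cryptographic and only 64 bits wide.
    // A real implementation would use a cryptographic hash of the content.
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    let name = format!("{:016x}", hasher.finish());
    let path = blobdir.join(name);

    match fs::read(&path) {
        // Identical content is already stored: skip the write entirely.
        Ok(existing) if existing == data => return Ok(path),
        // Same name, different content: report it instead of overwriting.
        Ok(_) => {
            return Err(io::Error::new(
                io::ErrorKind::AlreadyExists,
                "hash name collision",
            ))
        }
        // No file yet: fall through and write it.
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }

    fs::write(&path, data)?;

    // Mark the blob read-only, since several messages may now share it.
    let mut perms = fs::metadata(&path)?.permissions();
    perms.set_readonly(true);
    fs::set_permissions(&path, perms)?;
    Ok(path)
}
```

Calling `store_deduplicated` twice with the same bytes returns the same path and leaves a single file on disk; different bytes get a different name.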

TODO:

  • Set files as read-only
  • Don't do a write when the file is already identical
  • The first 32 chars or so of the 64-character hash are enough. I calculated that if 10 billion people (i.e. all of humanity) use DC, each of them has 200k distinct blob files (I have 4k in my day-to-day account), and we use 20 chars, then the expected number of name collisions would be ~0.0002 (and the probability that there is at least one name collision is lower than that) [1]. I added 12 more characters to be on the super safe side, but this isn't necessary and I could also make it 20 instead of 32.
    • Not 100% sure whether shortening is necessary at all. It would mainly matter if we might hit a path length limit on some file systems (the blobdir is usually something like accounts/2ff9fc096d2f46b6832b24a1ed99c0d6/dc.db-blobs (53 chars); adding 64 chars for the filename would make 117).
  • "touch" the files to prevent them from being deleted
  • TODOs in the code
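
The collision estimate in the list above can be reproduced with a short calculation. This is a sketch under two assumptions that the description does not state explicitly: the hash characters are hexadecimal digits (16 possible values each), and a collision only matters between two files in the same user's blob directory. The function name `expected_collisions` is made up for this example.

```rust
/// Expected number of colliding name pairs, summed over all users.
/// Each user has `files_per_user` distinct files, and each file name is
/// `hex_chars` hexadecimal digits taken from a uniformly distributed hash.
fn expected_collisions(users: f64, files_per_user: f64, hex_chars: i32) -> f64 {
    // Number of distinct file pairs inside one blob directory.
    let pairs_per_user = files_per_user * (files_per_user - 1.0) / 2.0;
    // Number of possible names: 16^hex_chars.
    let name_space = 16f64.powi(hex_chars);
    // Each pair collides with probability 1 / name_space.
    users * pairs_per_user / name_space
}
```

With 10 billion users, 200k files each and 20 characters, `expected_collisions(1e10, 200_000.0, 20)` gives roughly 0.00017, in line with the ~0.0002 quoted above; with 32 characters the result drops below 1e-18. Since the probability of at least one collision is at most the expected number of collisions, these values are upper bounds on that probability.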

For later PRs:

  • Replace BlobObject::create(…) with BlobObject::create_and_deduplicate(…) in order to deduplicate every time core creates a file
  • Modify JsonRPC to deduplicate blob files

Footnotes

  1. Calculated with both https://printfn.github.io/fend/ and https://www.geogebra.org/calculator, both of which came to the same result (1, 2).

@Hocuri Hocuri force-pushed the hoc/file-deduplication branch from abbaa2e to 3cb9a66 Compare December 12, 2024 21:57