Delete files from backup that don't match any inventory items #33

Open
jwodder opened this issue Nov 19, 2024 · 1 comment

@jwodder
Member

jwodder commented Nov 19, 2024

No description provided.

@yarikoptic
Member

If we want to avoid holding the full list of files in memory just to figure out which ones aren't in the inventory, here are some raw ideas to see if anything sticks:

"treeification"/sorting of inventory paths

I think s5cmd relies on ordering as well, and sorts its listings on disk to accommodate large ones.
If the inventory listing is sorted, it could be "zipped" with a (sorted) traversal of the local tree, removing each local file as soon as it is not observed in the sorted inventory listing.

cons:

  • would require n*log(n) time to sort first if the inventory is not already sorted.
  • would require "smart" sorting to operate within reasonable time and available memory
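
The "zip" of two sorted listings could be sketched roughly as follows. This is only an illustration, not the project's implementation: `orphaned_paths` and its parameter names are hypothetical, and it assumes both iterables yield path strings sorted under the same ordering.

```python
from typing import Iterable, Iterator


def orphaned_paths(
    local_sorted: Iterable[str], inventory_sorted: Iterable[str]
) -> Iterator[str]:
    """Yield local paths that are absent from the inventory.

    Hypothetical helper: both inputs must be sorted under the same ordering.
    Only one item from each stream is held in memory at a time.
    """
    inv = iter(inventory_sorted)
    current = next(inv, None)
    for path in local_sorted:
        # Advance the inventory cursor until it catches up with the local path.
        while current is not None and current < path:
            current = next(inv, None)
        # If the cursor did not land on this path, the inventory lacks it.
        if current != path:
            yield path


if __name__ == "__main__":
    local = ["a", "b", "c/d", "e"]
    inventory = ["a", "c/d", "f"]
    for stale in orphaned_paths(local, inventory):
        print("would remove:", stale)
```

One caveat for the ordering assumption: a directory walk that sorts names per directory does not produce plain string order, since `-` sorts before `/` (so `a-b` precedes `a/b` as strings, but `a/b` is visited first when walking). Both sides would need to agree on one ordering.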

Move in place

Abuse the filesystem: upon initiating a backup, do mv * .moveback/, and upon encountering a new key, move all of its prior versions back into place (this could be a full folder if it holds a single file, as is likely in our dandi keystore for blobs/). At the end, .moveback/ would contain the files which are no longer in the inventory. Then we could use mtime to enforce a retention policy, keeping them only up to some age.

pros:

  • no extra memory involvement at all

cons:

  • the initial mv would be quick, but the later moves would be slow due to the need to recreate all subfolders
  • recovery from an interrupted operation would also be tricky
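
A minimal sketch of the move-aside idea, under stated assumptions: `prune_by_moving` is a hypothetical function, inventory keys are treated as `/`-separated paths relative to the backup root, and keys that are not found in the stash are skipped here (a real backup would download them instead). It does nothing about interrupted runs.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def prune_by_moving(
    root: Path, inventory_keys: Iterable[str], stash_name: str = ".moveback"
) -> List[str]:
    """Move everything aside, restore what the inventory lists, report the rest."""
    stash = root / stash_name
    stash.mkdir()
    # Step 1: the equivalent of `mv * .moveback/`.
    for entry in list(root.iterdir()):
        if entry.name != stash_name:
            shutil.move(str(entry), str(stash / entry.name))
    # Step 2: for each inventory key, move the stashed file back into place.
    for key in inventory_keys:
        src = stash / key
        if src.is_file():
            dest = root / key
            # This per-key directory recreation is the cost noted in the cons.
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dest))
    # Step 3: whatever is left in the stash is no longer in the inventory.
    return sorted(
        p.relative_to(stash).as_posix() for p in stash.rglob("*") if p.is_file()
    )
```

The returned leftovers are only reported, not deleted, so a policy based on mtime could be applied to them afterwards.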

store only "hash" of the path

At 500 million keys, if we use e.g. an MD5 digest, we need only 16 bytes per file, so the raw digests for all paths in the inventory should require only ~8 GB of RAM. We would need to hold them as a "set" though, so it might take more, but still not prohibitively so. Then for each path on the filesystem we could check whether it is in the set and remove it if not.

cons:

  • not sure how fast lookups in a set of that size would be, or how likely false positives/collisions are
  • would still be n*log(n) overall if each lookup is log(n), so similar to "sort first" anyway
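
To make the digest-set idea and its arithmetic concrete, here is a small sketch. The helper names (`path_digest`, `build_digest_set`, `paths_to_remove`) are hypothetical, and MD5 is used only because it is the example named above.

```python
import hashlib
from typing import Iterable, List, Set


def path_digest(path: str) -> bytes:
    """Return the 16-byte MD5 digest of a path."""
    return hashlib.md5(path.encode("utf-8")).digest()


def build_digest_set(inventory_paths: Iterable[str]) -> Set[bytes]:
    """Keep only digests, never the path strings themselves."""
    return {path_digest(p) for p in inventory_paths}


def paths_to_remove(local_paths: Iterable[str], digests: Set[bytes]) -> List[str]:
    """Local paths whose digest is not in the inventory digest set."""
    return [p for p in local_paths if path_digest(p) not in digests]


# Raw digest storage for 500 million keys at 16 bytes each.
raw_bytes = 500_000_000 * 16  # 8_000_000_000 bytes, i.e. ~8 GB

if __name__ == "__main__":
    digests = build_digest_set(["a", "c"])
    print(paths_to_remove(["a", "b", "c"], digests))
    print(raw_bytes)
```

Two notes on the cons, offered tentatively. A hash-based set (such as Python's `set`) has average constant-time lookups, so the log(n) concern applies mainly to tree-based or sorted structures; on the other hand, per-entry overhead in a general-purpose set is well above the 16 raw bytes. And a digest collision would cause a stale local file to be kept, not an inventoried file to be deleted, because removal only happens when the digest is absent.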
