If we want to avoid storing the full list of files in memory while figuring out which local files aren't in the inventory, here are some raw ideas to see if anything sticks:
"treeification"/sorting of inventory paths
I think s5cmd relies on order as well, and does sorting of lists on disk to accommodate large listings.
If the inventory listing is sorted, it could be "zipped" with a (sorted) traversal of the local tree, removing a local file as soon as it is not observed in the sorted inventory listing.
cons:
would require O(n log n) time to sort first if the inventory is not already sorted
would require "smart" (external, on-disk) sorting to operate within reasonable time and the memory available
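The "zipping" step above is essentially a merge-join of two sorted streams. A minimal sketch, assuming both iterators yield relative paths in the same (lexicographic) order — note that S3 key order and `os.walk` order would need to be reconciled in practice:

```python
from typing import Iterator


def local_paths_not_in_inventory(
    inventory: Iterator[str], local: Iterator[str]
) -> Iterator[str]:
    """Yield local paths absent from the sorted inventory listing,
    holding only one entry from each stream in memory at a time."""
    inv = next(inventory, None)
    for path in local:
        # Advance the inventory cursor past keys that sort before `path`
        while inv is not None and inv < path:
            inv = next(inventory, None)
        if inv != path:
            yield path  # not in inventory -> candidate for removal


# Toy example with two small sorted listings:
stale = list(
    local_paths_not_in_inventory(
        iter(["a/1", "a/2", "b/1"]),
        iter(["a/1", "a/3", "b/1", "c/1"]),
    )
)
print(stale)  # -> ['a/3', 'c/1']
```

The function names and listing sources here are illustrative only; the point is that memory use stays O(1) once both sides are sorted.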
Move in place
abuse the filesystem: upon initiating a backup, do mv * .moveback/, and upon encountering a key in the inventory, move all prior versions back in place (could be a full folder, if there is a single file per folder as is likely in our dandi keystore for blobs/). At the end, .moveback/ would contain the files which are no longer in the inventory. Then we could use mtime to enforce some policy to retain them only up until some age.
pros:
no extra memory involvement at all
cons:
the initial mv would be quick, but moving files back would be slow due to the need to recreate all subfolders
recovery from an interrupted operation would also be tricky
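A minimal sketch of the move-aside scheme, with hypothetical helper names and relative inventory keys assumed (no interruption recovery, which as noted is the tricky part):

```python
import shutil
from pathlib import Path

MOVEBACK_NAME = ".moveback"  # hypothetical holding-area name


def move_all_aside(root: Path) -> Path:
    """Move every top-level entry under `root` into root/.moveback/.
    Quick: one rename per top-level entry."""
    moveback = root / MOVEBACK_NAME
    moveback.mkdir(exist_ok=True)
    for entry in root.iterdir():
        if entry.name != MOVEBACK_NAME:
            shutil.move(str(entry), str(moveback / entry.name))
    return moveback


def restore(root: Path, key: str) -> None:
    """On seeing `key` in the inventory, move it back in place,
    recreating intermediate subfolders (the slow part)."""
    src = root / MOVEBACK_NAME / key
    if src.exists():
        dst = root / key
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
```

Whatever remains under `.moveback/` after the inventory pass is the stale set, eligible for the mtime-based retention policy.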
store only "hash" of the path
At 500 million keys, if we use e.g. md5sum we need only 16 bytes per file, so storing the listing of all paths in the inventory should require only ~8GB of RAM. We would need it as a "set" though, so it might take more, but still not prohibitively so. Then for each path on the filesystem we could check whether it is in the set and remove it if not.
cons:
not sure how fast lookup in a set of that size would be, or how likely false positives/collisions are (though with a 128-bit digest, accidental collisions among 500 million keys should be astronomically unlikely)
would still be O(n log n) overall if each lookup costs O(log n), so similar to "sort first" anyway
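A sketch of the digest-set idea; `inventory_keys` and `local_paths` are hypothetical iterables of relative paths. Caveat: a plain Python set of bytes objects costs far more than 16 bytes per entry due to object overhead, so approaching the ~8GB estimate would need a compact structure (e.g. one big sorted bytearray of digests queried via bisect):

```python
import hashlib
from typing import Iterable, Iterator


def digest(path: str) -> bytes:
    """16-byte md5 of the path string itself (not of file contents)."""
    return hashlib.md5(path.encode("utf-8")).digest()


def stale_local_paths(
    inventory_keys: Iterable[str], local_paths: Iterable[str]
) -> Iterator[str]:
    # NOTE: shown as a plain set for clarity; real per-entry memory
    # would be ~100+ bytes, not 16, without a more compact layout.
    known = {digest(key) for key in inventory_keys}
    for path in local_paths:
        if digest(path) not in known:
            yield path  # not in inventory -> candidate for removal


print(list(stale_local_paths(["a/1", "b/2"], ["a/1", "b/2", "c/3"])))
# -> ['c/3']
```

Unlike the sort-based idea, average lookup in a hash set is O(1), so the O(n log n) concern above applies only if a tree-like compact structure is used instead.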