Add dedupe feature. #5

Open
feluxe opened this issue Oct 12, 2017 · 0 comments

Dupe Replacement Feature

Vhpi already creates snapshots using hard-links, but it does not notice when files are moved around within a source directory. Moved files are backed up into a new snapshot as if they were new files. The dedupe feature lets vhpi search for duplicate files among the snapshots of each backup source and replace them with hard-links, to keep the backup as slim as possible. I don't know (yet) whether the Pi can handle the overhead, though. I think it would make the most sense to add config options that limit the amount of work needed to find dupes. E.g. filter out all files smaller than 10MB, then search for dupes. Or only search for dupes among snapshots that are kept for a long time, like 'monthly' and 'yearly' snapshots.

dedupe_min_file_size: xxx   # Files smaller than this will be excluded from the dedupe process.
dedupe_snaps: ['monthly', 'yearly', ..]  # Dedupe will only run on the snapshots listed here.
dedupe_interval: 'weekly'    # Define the dedupe interval.

This feature should be totally optional.

Brainstorming

Search for duplicate files across all snapshots via fdupes for each backup source.
    Only absolutely identical files with the same permissions, timestamps, etc. are dupes.
Delete all duplicates and replace them with hard-links. Keep only one file for each dupe-group.
Add a config option to let the user set a custom interval for dupe removal.
Add a config option to define a minimum file size; only files bigger than the set value are included in dupe removal. (Dupe removal does not make sense for small files.)
Add a config option to define which types of snapshots are included in dupe removal. (Dupe removal makes the most sense for snapshots that are kept for a long time.)
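
The replace-with-hard-links step could look roughly like the sketch below. Note that it swaps fdupes for a plain Python comparison (metadata first, then a SHA-256 digest of the content), just to show the idea. All function names and the temporary file suffix are made up for illustration, and it assumes a POSIX filesystem where all paths live on the same device.

```python
import hashlib
import os
import stat
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def meta_key(path):
    """Metadata that must match, since hard-links share one inode."""
    st = os.lstat(path)
    return (st.st_size, stat.S_IMODE(st.st_mode),
            st.st_uid, st.st_gid, st.st_mtime_ns)


def replace_dupes_with_hardlinks(paths):
    """Hard-link identical files together, return the number replaced."""
    by_meta = defaultdict(list)
    for path in paths:
        if stat.S_ISREG(os.lstat(path).st_mode):
            by_meta[meta_key(path)].append(path)

    replaced = 0
    for same_meta in by_meta.values():
        if len(same_meta) < 2:
            continue  # no possible dupe, so skip the expensive hashing
        by_content = defaultdict(list)
        for path in same_meta:
            by_content[file_digest(path)].append(path)
        for group in by_content.values():
            keeper = group[0]  # keep only one file per dupe-group
            for dupe in group[1:]:
                if os.path.samefile(keeper, dupe):
                    continue  # already hard-linked
                tmp = dupe + ".dedupe-tmp"  # hypothetical temp name
                os.link(keeper, tmp)
                os.replace(tmp, dupe)  # atomically swap in the link
                replaced += 1
    return replaced
```

Hashing only happens for files whose size, permissions, owner and mtime already match, so most files should never be read at all. Linking to a temporary name first and renaming it over the dupe means the dupe is never missing if the run gets interrupted.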