
Deduping large amount of images (1mil+) #191

Open
FluffyDiscord opened this issue Apr 14, 2023 · 8 comments
Labels: research topic (Everything for researching and experimenting.)

Comments

@FluffyDiscord

AFAIK we can encode single or multiple images at once, collect all the encodings, and then pass the whole encoding dictionary to find duplicates.

Would it be possible to add an option to save encodings to a file in a format such as HDF5, then reference this file as the dictionary for deduping, so that the whole file is not loaded into memory but streamed and processed in batches/loops? For example: load the first 1000 encodings, look for duplicates within that batch, hold onto only the closest scores, then repeat for the next 1000 encodings, and so on (a rough sketch of this batched approach follows below).

I already have around half a million images, which will grow to ~4 million, and I need to dedupe them all. I also need to be able to check a single image for duplicates when it is added later on. Running the whole encoding/dedupe process for every new image, or loading all saved encodings into memory, is not an option under these conditions.
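For illustration, a minimal sketch of the batched comparison described above, assuming the encodings are stored as an `(n_images, dim)` float dataset named `encodings` in the HDF5 file (the dataset name and threshold are assumptions, not part of the imagededup API):

```python
import h5py
import numpy as np

def find_near_duplicates(encodings_h5, batch_size=1000, threshold=0.9):
    """Stream encodings off disk in batches; keep only the closest match per image."""
    best = {}                                    # image index -> (best score, index of closest image)
    with h5py.File(encodings_h5, "r") as f:
        enc = f["encodings"]                     # assumed dataset of shape (n_images, dim)
        n = enc.shape[0]
        for i in range(0, n, batch_size):
            a = enc[i:i + batch_size].astype("float32")
            a /= np.linalg.norm(a, axis=1, keepdims=True)
            for j in range(0, n, batch_size):
                b = enc[j:j + batch_size].astype("float32")
                b /= np.linalg.norm(b, axis=1, keepdims=True)
                sims = a @ b.T                   # cosine similarities for this pair of batches
                for qi in range(sims.shape[0]):
                    gi = i + qi                  # global index of the query image
                    if j <= gi < j + sims.shape[1]:
                        sims[qi, gi - j] = -1.0  # mask the self-match
                    ci = int(np.argmax(sims[qi]))
                    score = float(sims[qi, ci])
                    if score >= threshold and score > best.get(gi, (-1.0, None))[0]:
                        best[gi] = (score, j + ci)
    return best
```

Only two batches of encodings are ever held in memory at once, at the cost of an O(n²) number of batch-pair comparisons.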

@tanujjain (Collaborator)

We're currently experimenting with some large-scale similarity frameworks that should be able to handle approximate deduplication of 1 million+ images, and we hope to make a release in 2-3 months. Some of these frameworks already have the ability to handle memory constraints.

Streaming is a good idea for reducing memory usage, but it will most likely also come with a reduction in deduplication quality. From a feature-planning point of view, we'd prefer to finish experimenting with the large-scale similarity frameworks before looking into streaming.

I'll leave the issue open to keep track of the request.
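To illustrate the kind of approximate large-scale similarity framework mentioned here, a minimal sketch using faiss with an IVF index over L2-normalised CNN encodings (the choice of faiss, the index type, and the threshold are assumptions; the maintainers do not name a specific library):

```python
import faiss
import numpy as np

def build_index(encodings: np.ndarray, nlist: int = 1024) -> faiss.Index:
    """Build an approximate inner-product index over L2-normalised encodings."""
    enc = np.ascontiguousarray(encodings, dtype="float32")
    faiss.normalize_L2(enc)                           # cosine similarity via inner product
    dim = enc.shape[1]
    quantizer = faiss.IndexFlatIP(dim)
    index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(enc)                                  # learn the coarse clusters
    index.add(enc)
    return index

def query_duplicates(index: faiss.Index, new_encoding: np.ndarray, threshold: float = 0.9):
    """Check a single newly added image against the indexed collection."""
    q = np.ascontiguousarray(new_encoding.reshape(1, -1), dtype="float32")
    faiss.normalize_L2(q)
    scores, ids = index.search(q, 5)                  # 5 nearest neighbours
    return [(int(i), float(s)) for i, s in zip(ids[0], scores[0]) if s >= threshold]
```

An index like this can also be persisted to disk and queried per image, which would cover the single-new-image use case raised above.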

@tanujjain added the "research topic" label on Apr 21, 2023
@FluffyDiscord (Author)

I am available for testing if needed, as I already have a huge collection of images ready to be deduplicated. My PC setup: RTX 4090, 32 GB RAM, and Windows or Linux (PopOS).

Thank you for your time.

@Joshfindit commented Apr 21, 2023

One technique I currently use for deduplicating bit-for-bit identical files is hardlinking on-drive. It works excellently for large datasets as long as you architect it with that in mind.

To take an example from git: the filename is hash + file size, and the files are stored in subfolders named after the start of the filename (this avoids OS issues when a single folder has "too many files"). So a 200 KB file with the SHA hash cd611130182d1b9bd84955e07ca5270df9a09640 becomes cd/61/11/30/18/cd611130182d1b9bd84955e07ca5270df9a09640.200000.

Lookups run at drive speed when comparing a file that has just been hashed.

This does not cover images that share a perceptual hash or are perceptually the same, but a script using the same concepts could be written in a way that uses very little memory, as a short-term tool until imagededup can handle pools that large.
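For illustration, a minimal sketch of that content-addressed hardlink layout in Python (the hash choice, sharding depth, and store layout here are assumptions, not the exact script described above):

```python
import hashlib
import os

def content_address(path, chunk_size=1 << 20):
    """SHA-1 + file size, sharded into subfolders: cd/61/11/30/18/<hash>.<size>."""
    h = hashlib.sha1()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
            size += len(chunk)
    digest = h.hexdigest()
    shards = [digest[k:k + 2] for k in range(0, 10, 2)]   # first five byte-pairs of the hash
    return os.path.join(*shards, f"{digest}.{size}")

def dedupe_into_store(src, store_root):
    """Hardlink src into the store; if the address already exists, src is a duplicate."""
    dst = os.path.join(store_root, content_address(src))
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    if os.path.exists(dst):
        os.remove(src)          # duplicate: replace src with a hardlink to the stored copy
        os.link(dst, src)
        return True
    os.link(src, dst)           # first occurrence: link it into the store
    return False
```

Checking a new file then costs one hash plus one path lookup, with no in-memory index at all.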

@juhonkang

@Joshfindit could we connect? I have the same questions about large datasets and would like to ask you :)

@Joshfindit

@juhonkang Sure. Emailed your gmail.

@ming076 commented Aug 22, 2023

@tanujjain Excuse me, I wonder if the release for deduping a large amount of images is available now?

@jzx-gooner

@tanujjain Cool work! Looking forward to the new release, and I can help to test!

@sezan92 commented Sep 30, 2024

Is this feature released for large datasets?
