
Deduping large amount of images (1mil+) #191

Open
FluffyDiscord opened this issue Apr 14, 2023 · 8 comments
Labels: research topic (Everything for researching and experimenting.)

Comments

@FluffyDiscord

AFAIK we can encode single or multiple images at once, collect all the encodings, and then pass the whole encoding dictionary to find duplicates.

Would it be possible to add an option to save encodings to a file in a format such as HDF5, then reference this file as the dictionary for deduping, so that the whole file is not loaded into memory but streamed and processed in batches/loops? For example: load the first 1000 encodings, look for duplicates within that batch, hold onto only the closest scores, then repeat for the next 1000 encodings, and so on (a rough sketch of this batched approach follows below).

I already have around half a million images, which will grow to ~4 million, and I need to dedupe them all. I also need to be able to check a single image for duplicates when it is added later on. Running the whole encoding/dedupe process for every new image, or loading all saved encodings into memory, is not an option under these conditions.
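For illustration, a minimal sketch of the batched comparison described above, assuming the encodings are stored as an `(n_images, dim)` float dataset named `encodings` in the HDF5 file (the dataset name and threshold are assumptions, not part of the imagededup API):

```python
import h5py
import numpy as np

def find_near_duplicates(encodings_h5, batch_size=1000, threshold=0.9):
    """Stream encodings off disk in batches; keep only the closest match per image."""
    best = {}                                    # image index -> (best score, index of closest image)
    with h5py.File(encodings_h5, "r") as f:
        enc = f["encodings"]                     # assumed dataset of shape (n_images, dim)
        n = enc.shape[0]
        for i in range(0, n, batch_size):
            a = enc[i:i + batch_size].astype("float32")
            a /= np.linalg.norm(a, axis=1, keepdims=True)
            for j in range(0, n, batch_size):
                b = enc[j:j + batch_size].astype("float32")
                b /= np.linalg.norm(b, axis=1, keepdims=True)
                sims = a @ b.T                   # cosine similarities for this pair of batches
                for qi in range(sims.shape[0]):
                    gi = i + qi                  # global index of the query image
                    if j <= gi < j + sims.shape[1]:
                        sims[qi, gi - j] = -1.0  # mask the self-match
                    ci = int(np.argmax(sims[qi]))
                    score = float(sims[qi, ci])
                    if score >= threshold and score > best.get(gi, (-1.0, None))[0]:
                        best[gi] = (score, j + ci)
    return best
```

Only two batches of encodings are ever held in memory at once, at the cost of an O(n²) number of batch-pair comparisons.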

@tanujjain (Collaborator)

We're currently experimenting with some large-scale similarity frameworks that should be able to handle approximate deduplication of 1 million+ images, and we hope to make a release in 2-3 months. Some of these frameworks already have the ability to handle memory constraints.

Streaming is a good idea for reducing memory usage, but it will most likely also come with a reduction in deduplication quality. From a feature-planning point of view, we'd prefer to finish experimenting with the large-scale similarity frameworks before looking into streaming.

I'll leave the issue open to keep track of the request.
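To illustrate the kind of approximate large-scale similarity framework mentioned here, a minimal sketch using faiss with an IVF index over L2-normalised CNN encodings (the choice of faiss, the index type, and the threshold are assumptions; the maintainers do not name a specific library):

```python
import faiss
import numpy as np

def build_index(encodings: np.ndarray, nlist: int = 1024) -> faiss.Index:
    """Build an approximate inner-product index over L2-normalised encodings."""
    enc = np.ascontiguousarray(encodings, dtype="float32")
    faiss.normalize_L2(enc)                           # cosine similarity via inner product
    dim = enc.shape[1]
    quantizer = faiss.IndexFlatIP(dim)
    index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(enc)                                  # learn the coarse clusters
    index.add(enc)
    return index

def query_duplicates(index: faiss.Index, new_encoding: np.ndarray, threshold: float = 0.9):
    """Check a single newly added image against the indexed collection."""
    q = np.ascontiguousarray(new_encoding.reshape(1, -1), dtype="float32")
    faiss.normalize_L2(q)
    scores, ids = index.search(q, 5)                  # 5 nearest neighbours
    return [(int(i), float(s)) for i, s in zip(ids[0], scores[0]) if s >= threshold]
```

An index like this can also be persisted to disk and queried per image, which would cover the single-new-image use case raised above.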

@tanujjain added the "research topic" label on Apr 21, 2023
@FluffyDiscord (Author)

I am available for testing if needed, as I already have a huge collection of images ready to be deduplicated. My PC setup: RTX 4090, 32 GB RAM, and Windows or Linux (PopOS).

Thank you for your time.

@Joshfindit commented Apr 21, 2023

One technique I currently use for deduplicating bit-for-bit identical files is hardlinking on-drive. It works excellently for large datasets as long as you architect it with that in mind.

To take an example from git: the filename is hash + file size, and the files are stored in subfolders named after the start of the filename (this avoids OS issues when a single folder has "too many files"). So a 200 KB file with the SHA hash cd611130182d1b9bd84955e07ca5270df9a09640 becomes cd/61/11/30/18/cd611130182d1b9bd84955e07ca5270df9a09640.200000.

Lookups run at drive speed when comparing a file that has just been hashed.

This does not cover images that share a perceptual hash or are perceptually the same, but a script using the same concepts could be written in a way that uses very little memory, as a short-term tool until imagededup can handle pools that large.
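For illustration, a minimal sketch of that content-addressed hardlink layout in Python (the hash choice, sharding depth, and store layout here are assumptions, not the exact script described above):

```python
import hashlib
import os

def content_address(path, chunk_size=1 << 20):
    """SHA-1 + file size, sharded into subfolders: cd/61/11/30/18/<hash>.<size>."""
    h = hashlib.sha1()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
            size += len(chunk)
    digest = h.hexdigest()
    shards = [digest[k:k + 2] for k in range(0, 10, 2)]   # first five byte-pairs of the hash
    return os.path.join(*shards, f"{digest}.{size}")

def dedupe_into_store(src, store_root):
    """Hardlink src into the store; if the address already exists, src is a duplicate."""
    dst = os.path.join(store_root, content_address(src))
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    if os.path.exists(dst):
        os.remove(src)          # duplicate: replace src with a hardlink to the stored copy
        os.link(dst, src)
        return True
    os.link(src, dst)           # first occurrence: link it into the store
    return False
```

Checking a new file then costs one hash plus one path lookup, with no in-memory index at all.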

@juhonkang

@Joshfindit could we connect? I have the same questions about large datasets and would like to ask you :)

@Joshfindit

@juhonkang Sure. Emailed your gmail.

@ming076 commented Aug 22, 2023

@tanujjain Excuse me, I wonder if the release for deduping a large amount of images is available now?

@jzx-gooner

@tanujjain Cool work! Looking forward to the new release, and I can help to test!

@sezan92 commented Sep 30, 2024

Is this feature released for large datasets?
