Data Leakage Detection and De-duplication in Large Scale Image Datasets

Yeshwanth Kumar Adimoolam, Bodhiswatta Chatterjee, Charalambos Poullis and Melinos Averkiou

Updates

April 13, 2023 - We release a data inspection web interface to manually inspect the extent of data leakage and duplication in the CrowdAI Mapping Challenge dataset. The web interface can be found at datainspector.app.

Highlights

We propose an easy-to-adopt de-duplication and leakage detection pipeline for large-scale image datasets that utilizes collision detection of perceptual hashes of images.
We employ the proposed de-duplication pipeline to identify and eliminate instances of data duplication and leakage in the CrowdAI Mapping Challenge dataset. Approximately 250k of the 280k training images were either exact or augmented duplicates.
We demonstrate cases of significant overfitting in recent state-of-the-art methods, potentially invalidating a number of prior works that report results on this dataset for the task of building footprint extraction.
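
The collision-detection idea named in the highlights can be illustrated with a short sketch. This is not the repository's code: it assumes every image has already been reduced to a hash string, and the function name `find_collisions`, the dictionary layout and the sample values are all hypothetical.

```python
from collections import defaultdict


def find_collisions(hashes):
    """Group file names that share the same hash string.

    `hashes` maps file name -> hash string (an assumed layout).
    Only groups with more than one member are returned.
    """
    buckets = defaultdict(list)
    for name, digest in hashes.items():
        buckets[digest].append(name)
    return {d: sorted(names) for d, names in buckets.items() if len(names) > 1}


# Made-up hash values, for illustration only.
example = {
    "a.jpg": "ffd8e0c0a0908884",
    "b.jpg": "ffd8e0c0a0908884",
    "c.jpg": "0123456789abcdef",
}
print(find_collisions(example))
```

Bucketing by hash value takes a single pass over the dataset, so no pairwise image comparison is needed.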

Installation

```
conda create -n hash_and_search python=3.10
conda activate hash_and_search
pip install -r requirements.txt
```

Alternatively, the following requirements can be installed manually:

ImageHash
numpy
Pillow
PyWavelets
scipy
tqdm

Compute Hashes

To compute p-hashes for images in a folder, run:

```
python compute_hashes.py <input_images_directory> <output_directory> <output_hashtable_filename>
```
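
For orientation, hashing a folder could look roughly like the sketch below. It is a minimal sketch and not the contents of `compute_hashes.py`: the function names, the list of image suffixes and the JSON output format are assumptions. It uses `imagehash.phash` from ImageHash together with Pillow, both listed in the requirements.

```python
import json
from pathlib import Path

# Assumed set of image extensions; adjust to the dataset at hand.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def phash_file(path):
    """Perceptual hash of one image as a hex string (needs ImageHash and Pillow)."""
    # Imported here so that build_hashtable also works with other hash functions.
    import imagehash
    from PIL import Image

    with Image.open(path) as img:
        return str(imagehash.phash(img))


def build_hashtable(image_dir, hash_fn=phash_file):
    """Map every image file name in `image_dir` to its hash string."""
    table = {}
    for path in sorted(Path(image_dir).iterdir()):
        if path.suffix.lower() in IMAGE_SUFFIXES:
            table[path.name] = hash_fn(path)
    return table


def save_hashtable(table, output_path):
    """Write the table as JSON (an assumed on-disk format)."""
    Path(output_path).write_text(json.dumps(table, indent=2))
```

Calling `save_hashtable(build_hashtable("images/"), "hashes.json")` would write one entry per image; both paths are placeholders.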

To compute p-hashes of augmented images in the dataset, run:

```
python compute_hashes_augmented.py <input_images_directory> <output_directory> <output_hashtable_filename>
```
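
Hashing augmented copies lets rotated or flipped duplicates be matched as well. The sketch below shows one way this might work. The set of augmentations (90-degree rotations and the two flips) is an assumption, as are the function names, and `compute_hashes_augmented.py` may do this differently.

```python
def default_augmentations():
    """Rotations and flips; an assumed set, not necessarily the repository's."""
    from PIL import ImageOps  # requires Pillow

    return {
        "identity": lambda im: im,
        "rot90": lambda im: im.rotate(90, expand=True),
        "rot180": lambda im: im.rotate(180),
        "rot270": lambda im: im.rotate(270, expand=True),
        "mirror": ImageOps.mirror,  # left-right flip
        "flip": ImageOps.flip,  # top-bottom flip
    }


def hash_variants(image, hash_fn, augmentations):
    """Return {augmentation name: hash string} for a single image."""
    return {name: str(hash_fn(aug(image))) for name, aug in augmentations.items()}
```

With Pillow and ImageHash installed, `hash_variants(img, imagehash.phash, default_augmentations())` would return six hash strings for an opened image `img`.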

Compare Hashes

Once hashtables have been constructed for two image datasets, compare them to detect duplicates with the following command:

```
python compare_hashes.py <needles_hashtable> <haystack_hashtable> <output_filename>
```

The above command writes a .json file that lists, for each image in the needles set, all of its duplicates in the haystack set.
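
The needles-versus-haystack comparison can be sketched as a lookup in an inverted index. This is a minimal sketch that assumes each hashtable is a dictionary mapping file name to hex hash string; the function names and sample values are hypothetical, and the actual format used by `compare_hashes.py` may differ. A Hamming-distance helper is included for cases where near matches rather than exact collisions are of interest.

```python
from collections import defaultdict


def find_duplicates(needles, haystack):
    """For each needle image, list the haystack images with an identical hash."""
    by_hash = defaultdict(list)
    for name, digest in haystack.items():
        by_hash[digest].append(name)
    return {
        name: sorted(by_hash[digest])
        for name, digest in needles.items()
        if digest in by_hash
    }


def hamming_distance(hex_a, hex_b):
    """Number of differing bits between two equal-length hex hash strings."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Made-up hash values, for illustration only.
val = {"v1.jpg": "ff00", "v2.jpg": "0f0f"}
train = {"t1.jpg": "ff00", "t2.jpg": "ff00", "t3.jpg": "1234"}
print(find_duplicates(val, train))
```

Needles without any match are omitted from the result, which keeps the output small when most images are unique.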

Visualise Duplicates

To inspect and visualise these duplicates between the needles and haystack sets, run:

```
python inspect_hashes.py
python json_to_html.py
```

These commands generate an HTML file that can be opened in any standard web browser. To view the HTML file:

Download the CrowdAI dataset train split images from here.
Place the train images in the same folder as the HTML file in the following directory structure: ./data/train/images/<place_images_here>.
```
└───data
    └───train
        └───images
            └───<place_images_here>
```
Open the HTML file in a standard web browser (e.g., Google Chrome).
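
To show why the images must sit under ./data/train/images, here is a rough sketch of turning a duplicates mapping into an HTML page with relative image paths. It is not the code of `json_to_html.py`: the function name, the table layout and the input format (needle file name mapped to a list of matching train file names) are assumptions.

```python
import html


def duplicates_to_html(duplicates, image_root="./data/train/images"):
    """Render {needle name: [matching train file names]} as an HTML table."""
    rows = []
    for needle, matches in sorted(duplicates.items()):
        cells = []
        for match in sorted(matches):
            # Relative path, so the page only works next to the data folder.
            src = html.escape(f"{image_root}/{match}", quote=True)
            cells.append(f'<img src="{src}" width="150">')
        rows.append(f"<tr><td>{html.escape(needle)}</td><td>{''.join(cells)}</td></tr>")
    return "<html><body><table>\n" + "\n".join(rows) + "\n</table></body></html>"
```

Because the `src` attributes are relative, the browser resolves them against the location of the HTML file, which is why the directory structure above matters.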

Dataset

Download link coming soon...

Acknowledgement

This repository benefits from