Image Deduplicator (imagededup)
imagededup is a python package that simplifies the task of finding exact and near duplicates in an image collection.
This package provides functionality to make use of hashing algorithms that are particularly good at finding exact duplicates as well as convolutional neural networks which are also adept at finding near duplicates. An evaluation framework is also provided to judge the quality of deduplication for a given dataset.
The following details the functionality provided by the package:
Finding duplicates in a directory using one of the following algorithms:
Convolutional Neural Network (CNN) - Select from several prepackaged models or provide your own custom model.
Perceptual hashing (PHash)
Difference hashing (DHash)
Wavelet hashing (WHash)
Average hashing (AHash)
Generation of encodings for images using one of the above stated algorithms.
Framework to evaluate effectiveness of deduplication given a ground truth mapping.
Plotting duplicates found for a given image file.
Detailed documentation for the package can be found at: https://idealo.github.io/imagededup/
imagededup is compatible with Python 3.8+ and runs on Linux, MacOS X and Windows. It is distributed under the Apache 2.0 license.
Plot duplicates obtained for a given file (e.g., 'ukbench00120.jpg') using the duplicates dictionary
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')
The output looks as below:
The complete code for the workflow is:
from imagededup.methods import PHash
phasher = PHash()

# Generate encodings for all images in an image directory
encodings = phasher.encode_images(image_dir='path/to/image/directory')

# Find duplicates using the generated encodings
duplicates = phasher.find_duplicates(encoding_map=encodings)

# Plot duplicates obtained for a given file using the duplicates dictionary
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')
It is also possible to use your own custom models for finding duplicates using the CNN method.
For examples, refer to this part of the repository.
For more detailed usage of the package functionality, refer to: https://idealo.github.io/imagededup/
Update: The provided benchmarks are only valid up to imagededup v0.2.2. Later releases introduce significant changes to all methods, so the current benchmarks may not hold.
Detailed benchmarks on speed and classification metrics for different methods have been provided in the documentation. Generally speaking, the following conclusions can be drawn:
CNN works best for near duplicates and datasets containing transformations.
All deduplication methods fare well on datasets containing exact duplicates, but Difference hashing is the fastest.
Please cite Imagededup in your publications if this is useful for your research. Here is an example BibTeX entry:
@misc{idealods2019imagededup,
  title={Imagededup},
  author={Tanuj Jain and Christopher Lennan and Zubin John and Dat Tran},
  year={2019},
  howpublished={\url{https://github.com/idealo/imagededup}},
}
Feedback should be submitted by creating an issue on GitHub. Select the related template (bug report, feature request, or custom) and add the corresponding labels.
Copyright 2019 idealo internet GmbH. All rights reserved.
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
Definitions.
"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
(a) You must give any other recipients of the Work or Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Given a ground truth map and a duplicate map retrieved from a deduplication algorithm, get metrics to evaluate the effectiveness of the applied deduplication algorithm.
ground_truth_map: A dictionary representing ground truth with filenames as key and a list of duplicate filenames as value.
retrieved_map: A dictionary representing retrieved duplicates with filenames as key and a list of retrieved duplicate filenames as value.
metric: Name of the metric to be evaluated and returned. Accepted values are: 'map', 'ndcg', 'jaccard', 'classification', 'all' (default, returns every metric).
dictionary: A dictionary with metric names as keys and the corresponding calculated metrics as values. 'map', 'ndcg' and 'jaccard' each return a single number denoting the corresponding information retrieval metric. The 'classification' metrics 'precision', 'recall' and 'f1-score' are returned as individual entries in the dictionary. The value of each classification metric is a numpy array whose first entry is the score for non-duplicate file pairs (class 0) and whose second entry is the score for duplicate file pairs (class 1). Additionally, 'support' is returned as another key, with the first entry denoting the number of non-duplicate file pairs and the second entry the number of duplicate file pairs.

CIFAR10 deduplication example

Install imagededup via PyPI
!pip install imagededup

Download CIFAR10 dataset and untar

Create working directory and move all images into this directory

Find duplicates in the entire dataset with CNN
from imagededup.methods import CNN

cnn = CNN()
encodings = cnn.encode_images(image_dir=image_dir)
duplicates = cnn.find_duplicates(encoding_map=encodings)

Do some imports for plotting
from pathlib import Path
from imagededup.utils import plot_duplicates
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (15, 10)

Find and plot duplicates in the test set with CNN
# test images are stored under '/content/cifar/test'
filenames_test = set([i.name for i in Path('/content/cifar/test').glob('*.png')])

duplicates_test = {}
for k, v in duplicates.items():
    if k in filenames_test:
        tmp = [i for i in v if i in filenames_test]
        duplicates_test[k] = tmp

# sort in descending order of duplicates
duplicates_test = {k: v for k, v in sorted(duplicates_test.items(), key=lambda x: len(x[1]), reverse=True)}

# plot duplicates found for some file
plot_duplicates(image_dir=image_dir, duplicate_map=duplicates_test, filename=list(duplicates_test.keys())[0])

Find and plot duplicates in the train set with CNN
# train images are stored under '/content/cifar/train'
filenames_train = set([i.name for i in Path('/content/cifar/train').glob('*.png')])

duplicates_train = {}
for k, v in duplicates.items():
    if k in filenames_train:
        tmp = [i for i in v if i in filenames_train]
        duplicates_train[k] = tmp

# sort in descending order of duplicates
duplicates_train = {k: v for k, v in sorted(duplicates_train.items(), key=lambda x: len(x[1]), reverse=True)}

# plot duplicates found for some file
plot_duplicates(image_dir=image_dir, duplicate_map=duplicates_train, filename=list(duplicates_train.keys())[0])

Examples from test set with duplicates in train set
# keep only filenames that are in the test set and have duplicates in the train set
duplicates_test_train = {}
for k, v in duplicates.items():
    if k in filenames_test:
        tmp = [i for i in v if i in filenames_train]
        duplicates_test_train[k] = tmp

# sort in descending order of duplicates
duplicates_test_train = {k: v for k, v in sorted(duplicates_test_train.items(), key=lambda x: len(x[1]), reverse=True)}

# plot duplicates found for some file
plot_duplicates(image_dir=image_dir, duplicate_map=duplicates_test_train, filename=list(duplicates_test_train.keys())[0])
Given ground truth dictionary and retrieved dictionary, return per class precision, recall and f1 score. Class 1 is assigned to duplicate file pairs while class 0 is for non-duplicate file pairs.
Initialize a dictionary for mapping file names and corresponding hashes and a distance function to be used for getting distance between two hash strings.
List of tuples of the form [(valid_retrieval_filename1, distance), (valid_retrieval_filename2, distance)]

Brute force cython

class BruteForceCython
Initialize a dictionary for mapping file names and corresponding hashes and a distance function to be used for getting distance between two hash strings.
Initialize a HashEval object which offers an interface to control hashing and search methods for desired dataset. Compute a map of duplicate images in the document space given certain input control parameters.
Find duplicates using CNN and/or generate CNN encodings given a single image or a directory of images.
The module can be used for 2 purposes: Encoding generation and duplicate detection.
Encodings generation: To propagate an image through a Convolutional Neural Network architecture and generate encodings. The generated encodings can be used at a later time for deduplication. Using the method 'encode_image', the CNN encodings for a single image can be obtained while the 'encode_images' method can be used to get encodings for all images in a directory.
Duplicate detection: Find duplicates either using the encoding mapping generated previously using 'encode_images' or using a Path to the directory that contains the images that need to be deduplicated. 'find_duplicates' and 'find_duplicates_to_remove' methods are provided to accomplish these tasks.
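A minimal sketch of this two-step workflow (the directory path is a placeholder):

from imagededup.methods import CNN

cnn = CNN()

# Encoding generation for all images in a directory
encodings = cnn.encode_images(image_dir='path/to/image/directory')

# Duplicate detection using the generated encodings
duplicates = cnn.find_duplicates(encoding_map=encodings, min_similarity_threshold=0.9)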
recursive: Optional, find images recursively in a nested image directory structure, set to False by default.
num_enc_workers: Optional, number of cpu cores to use for multiprocessing encoding generation (supported only on linux platform), set to 0 by default. 0 disables multiprocessing.
Find duplicates for each file. Takes in a path to the directory or an encoding dictionary in which duplicates are to be detected above the given threshold. Returns a dictionary with each filename as key and a list of duplicate file names as value. Optionally, the cosine similarity scores can be returned along with the duplicate filenames for each query file.
image_dir: Path to the directory containing all the images or dictionary with keys as file names and values as numpy arrays which represent the CNN encoding for the key image file.
encoding_map: Optional, used instead of image_dir, a dictionary containing mapping of filenames and corresponding CNN encodings.
min_similarity_threshold: Optional, threshold value (must be float between -1.0 and 1.0). Default is 0.9
scores: Optional, boolean indicating whether similarity scores are to be returned along with retrieved duplicates.
outfile: Optional, name of the file to save the results, must be a json. Default is None.
recursive: Optional, find images recursively in a nested image directory structure, set to False by default.
num_enc_workers: Optional, number of cpu cores to use for multiprocessing encoding generation (supported only on linux platform), set to 0 by default. 0 disables multiprocessing.
num_sim_workers: Optional, number of cpu cores to use for multiprocessing similarity computation, set to number of CPUs in the system by default. 0 disables multiprocessing.
dictionary: if scores is True, then a dictionary of the form {'image1.jpg': [('image1_duplicate1.jpg', score), ('image1_duplicate2.jpg', score)], 'image2.jpg': [] ..}. if scores is False, then a dictionary of the form {'image1.jpg': ['image1_duplicate1.jpg', 'image1_duplicate2.jpg'], 'image2.jpg':['image1_duplicate1.jpg',..], ..}
image_dir: Path to the directory containing all the images or dictionary with keys as file names and values as numpy arrays which represent the CNN encoding for the key image file.
encoding_map: Optional, used instead of image_dir, a dictionary containing mapping of filenames and corresponding CNN encodings.
min_similarity_threshold: Optional, threshold value (must be float between -1.0 and 1.0). Default is 0.9
outfile: Optional, name of the file to save the results, must be a json. Default is None.
recursive: Optional, find images recursively in a nested image directory structure, set to False by default.
num_enc_workers: Optional, number of cpu cores to use for multiprocessing encoding generation (supported only on linux platform), set to 0 by default. 0 disables multiprocessing.
num_sim_workers: Optional, number of cpu cores to use for multiprocessing similarity computation, set to number of CPUs in the system by default. 0 disables multiprocessing.
Find duplicates using hashing algorithms and/or generate hashes given a single image or a directory of images.
The module can be used for 2 purposes: Encoding generation and duplicate detection.
Encoding generation: To generate hashes using specific hashing method. The generated hashes can be used at a later time for deduplication. Using the method 'encode_image' from the specific hashing method object, the hash for a single image can be obtained while the 'encode_images' method can be used to get hashes for all images in a directory.
Duplicate detection: Find duplicates either using the encoding mapping generated previously using 'encode_images' or using a Path to the directory that contains the images that need to be deduplicated. 'find_duplicates' and 'find_duplicates_to_remove' methods are provided to accomplish these tasks.
Calculate the hamming distance between two hashes. If the length of the hashes is not 64 bits, each hash is padded to 64 bits before the hamming distance is calculated.
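For intuition, a standalone sketch of a hamming distance computation between two 64-bit hex hash strings (illustrative only, not the package's internal implementation):

def hamming_distance(hash1: str, hash2: str) -> int:
    # Convert each hexadecimal hash string to a 64-bit binary string and count differing bits
    b1 = bin(int(hash1, 16))[2:].zfill(64)
    b2 = bin(int(hash2, 16))[2:].zfill(64)
    return sum(c1 != c2 for c1, c2 in zip(b1, b2))

print(hamming_distance('9fee256239984d71', '9fee256239984d70'))  # -> 1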
recursive: Optional, find images recursively in a nested image directory structure, set to False by default.
num_enc_workers: Optional, number of cpu cores to use for multiprocessing encoding generation, set to number of CPUs in the system by default. 0 disables multiprocessing.
dictionary: A dictionary that contains a mapping of filenames and corresponding 64 character hash string such as {'Image1.jpg': 'hash_string1', 'Image2.jpg': 'hash_string2', ...}
Find duplicates for each file. Takes in a path to the directory or an encoding dictionary in which duplicates are to be detected. All images with a hamming distance less than or equal to the max_distance_threshold are regarded as duplicates. Returns a dictionary with each filename as key and a list of duplicate file names as value. Optionally, the hamming distances can be returned along with the duplicate filenames for each query file.
image_dir: Path to the directory containing all the images or dictionary with keys as file names and values as hash strings for the key image file.
encoding_map: Optional, used instead of image_dir, a dictionary containing mapping of filenames and corresponding hashes.
max_distance_threshold: Optional, hamming distance between two images below which retrieved duplicates are valid. (must be an int between 0 and 64). Default is 10.
scores: Optional, boolean indicating whether Hamming distances are to be returned along with retrieved duplicates.
outfile: Optional, name of the file to save the results, must be a json. Default is None.
search_method: Algorithm used to retrieve duplicates. Default is brute_force_cython for Unix else bktree.
recursive: Optional, find images recursively in a nested image directory structure, set to False by default.
num_enc_workers: Optional, number of cpu cores to use for multiprocessing encoding generation, set to number of CPUs in the system by default. 0 disables multiprocessing.
num_dist_workers: Optional, number of cpu cores to use for multiprocessing distance computation, set to number of CPUs in the system by default. 0 disables multiprocessing.
duplicates dictionary: if scores is True, then a dictionary of the form {'image1.jpg': [('image1_duplicate1.jpg', score), ('image1_duplicate2.jpg', score)], 'image2.jpg': [] ..}. if scores is False, then a dictionary of the form {'image1.jpg': ['image1_duplicate1.jpg', 'image1_duplicate2.jpg'], 'image2.jpg':['image1_duplicate1.jpg',..], ..}
image_dir: Path to the directory containing all the images or dictionary with keys as file names and values as hash strings for the key image file.
encoding_map: Optional, used instead of image_dir, a dictionary containing mapping of filenames and corresponding hashes.
max_distance_threshold: Optional, hamming distance between two images below which retrieved duplicates are valid. (must be an int between 0 and 64). Default is 10.
outfile: Optional, name of the file to save the results, must be a json. Default is None.
recursive: Optional, find images recursively in a nested image directory structure, set to False by default.
num_enc_workers: Optional, number of cpu cores to use for multiprocessing encoding generation, set to number of CPUs in the system by default. 0 disables multiprocessing.
num_dist_workers: Optional, number of cpu cores to use for multiprocessing distance computation, set to number of CPUs in the system by default. 0 disables multiprocessing.
Inherits from Hashing base class and implements perceptual hashing (Implementation reference: http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html).
Offers all the functionality mentioned in hashing class.
Inherits from Hashing base class and implements average hashing. (Implementation reference: http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html)
Offers all the functionality mentioned in hashing class.
Inherits from Hashing base class and implements difference hashing. (Implementation reference: http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html)
Offers all the functionality mentioned in hashing class.
Inherits from Hashing base class and implements wavelet hashing. (Implementation reference: https://fullstackml.com/wavelet-image-hash-in-python-3504fdd282b5)
Offers all the functionality mentioned in hashing class.
To give an idea of the speed and accuracy of the implemented algorithms, benchmarks have been provided on the UKBench dataset (zip file titled 'UKBench image collection', size ~1.5 GB) and some variations derived from it.
Near duplicate dataset (UKBench dataset): This dataset has near duplicates that are arranged in groups of 4. There are 2550 such groups, amounting to a total of 10200 RGB images. The size of each image is 640 x 480 pixels with a jpg extension. The image below depicts 3 example groups from the UKBench dataset. Each row represents a group with the corresponding 4 images from the group.
Transformed dataset derived from UKBench dataset: One image was taken from each of several different groups of the UKBench dataset, and the following 5 transformations were applied to the original image:
Random crop preserving the original aspect ratio (new size - 560 x 420)
Horizontal flip
Vertical flip
25 degree rotation
Resizing with change in aspect ratio (new aspect ratio - 1:1)
Thus, each group has a total of 6 images (original + transformed). A total of 1800 such groups were created totalling 10800 images in the dataset.
Exact duplicate dataset: An image from each of the 2550 image groups of the UKBench dataset was taken and an exact duplicate was created. The number of images totalled 5100.
The benchmarks were performed on an AWS EC2 r5.xlarge instance with 4 vCPUs and 32 GB of memory. The instance does not have a GPU, so all runs were done on CPU.
The times are reported in seconds and comprise the time taken to generate encodings and find duplicates. The time taken to perform the evaluation task is NOT reported.
The cnn method with a threshold between 0.5 and 0.9 would work best for finding near duplicates. This is indicated by the extreme values that class-1 precision and recall take at these two thresholds.
Hashing methods do not perform well for finding near duplicates.
The cnn method with threshold 0.9 seems to work best for finding transformed duplicates. A slightly lower min_similarity_threshold value could lead to a higher class-1 recall.
Hashing methods do not perform well for finding transformed duplicates. Resized images are found easily, but all other transformations lead to poor performance for hashing methods.
* The value is low as opposed to the expected 1.0 because of the cosine_similarity function from scikit-learn (used within the package), which sometimes calculates the similarity to be slightly less than 1.0 even when the vectors are the same.
Difference hashing is the fastest (max_distance_threshold 0).
When using hashing methods for exact duplicates, keep max_distance_threshold to a low value. The value of 0 is good, but a slightly higher value should also work fine.
When using cnn method, keep min_similarity_threshold to a high value. The default value of 0.9 seems to work well. A slightly higher value can also be used.
Near duplicate dataset: use cnn with an appropriate min_similarity_threshold.
Transformed dataset: use cnn with min_similarity_threshold of around 0.9 (default).
Exact duplicates dataset: use Difference hashing with 0 max_distance_threshold.
A higher max_distance_threshold (for hashing methods) leads to a higher execution time. The cnn method does not seem to be much affected by the min_similarity_threshold (though a lower value adds a few seconds to the execution time, as can be seen in the runs above).
Generally speaking, the cnn method takes longer to run compared to hashing methods for all datasets. If a GPU is available, the cnn method should be much faster.

Using custom models for CNN
To allow users to use custom models for encoding generation, we provide a CustomModel construct which serves as a wrapper for a user-defined feature extractor. The CustomModel consists of the following attributes:
name: The name of the custom model. Can be set to any string.
model: A PyTorch model object, which is a subclass of torch.nn.Module and implements the forward method. The output of the forward method should be a tensor of shape (batch_size x features). Alternatively, a __call__ method is also accepted.
transform: A function that transforms a PIL.Image object into a PyTorch tensor. Should correspond to the preprocessing logic of the supplied model.
CustomModel is provided while initializing the cnn object and can be used in the following 2 scenarios:
1. Using the models provided with the imagededup package. There are 3 models provided currently:
MobileNetV3 (MobileNetV3 Small)- This is the default.
ViT (Vision Transformer- B16 IMAGENET1K_SWAG_E2E_V1)
EfficientNet (EfficientNet B4- IMAGENET1K_V1)
from imagededup.methods import CNN

# Get CustomModel construct
from imagededup.utils import CustomModel

# Get the prepackaged models from imagededup
from imagededup.utils.models import ViT, MobilenetV3, EfficientNet


# Declare a custom config with CustomModel, the prepackaged models come with a name and transform function
custom_config = CustomModel(name=EfficientNet.name,
                            model=EfficientNet(),
                            transform=EfficientNet.transform)

# Use model_config argument to pass the custom config
cnn = CNN(model_config=custom_config)

# Use the model as usual
...
2. Using a user-defined custom model.
from imagededup.methods import CNN

# Get CustomModel construct
from imagededup.utils import CustomModel

# Import necessary pytorch constructs for initializing a custom feature extractor
import torch
from torchvision.transforms import transforms

# Declare custom feature extractor class
class MyModel(torch.nn.Module):
    transform = transforms.Compose(
        [
            transforms.ToTensor()
        ]
    )
    name = 'my_custom_model'

    def __init__(self):
        super().__init__()
        # Define the layers of the model here

    def forward(self, x):
        # Do something with x
        return x

custom_config = CustomModel(name=MyModel.name,
                            model=MyModel(),
                            transform=MyModel.transform)

cnn = CNN(model_config=custom_config)

# Use the model as usual
...
It is not necessary to bundle name and transform functions with the model class. They can be passed separately as well.
Examples for both scenarios can be found in the examples section.
It might be desirable to only generate the hashes/cnn encodings for a given image or all images in a directory instead of directly deduplicating using find_duplicates method. Encodings can be generated for a directory of images or for a single image:
Encoding generation for all images in a directory
Encoding generation for a single image

Encoding generation for all images in a directory
To generate encodings for all images in an image directory, the encode_images function can be used. The general API for using encode_images is:
from imagededup.methods import <method-name>
method_object = <method-name>()
encodings = method_object.encode_images(image_dir='path/to/image/directory')
where the returned variable encodings is a dictionary mapping image file names to the corresponding encodings:
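For example (file names and encodings are placeholders):

{
    'image1.jpg': <encoding-of-image1>,
    'image2.jpg': <encoding-of-image2>,
    ..
}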
If an image in the image directory can't be loaded, no encodings are generated for the image. Hence, there is no entry for the image in the returned encodings dictionary.
Encoding generation for a single image

To generate the encoding for a single image, the encode_image function can be used, for example with difference hashing:

from imagededup.methods import DHash
dhasher = DHash()
encoding = dhasher.encode_image(image_file='path/to/image/file')

Evaluation of deduplication quality
To determine the quality of a deduplication algorithm and the corresponding threshold, an evaluation framework is provided.
Given a ground truth mapping consisting of file names and a list of duplicates for each file along with a retrieved mapping from the deduplication algorithm for the same files, the following metrics can be obtained using the framework:
Mean Average Precision (MAP)
Mean Normalized Discounted Cumulative Gain (NDCG)
Jaccard Index
Per class Precision (class 0 = non-duplicate image pairs, class 1 = duplicate image pairs)
Per class Recall (class 0 = non-duplicate image pairs, class 1 = duplicate image pairs)
Per class f1-score (class 0 = non-duplicate image pairs, class 1 = duplicate image pairs)
The API for obtaining these metrics is as below:
from imagededup.evaluation import evaluate
metrics = evaluate(ground_truth_map, retrieved_map, metric='<metric-name>')
where the returned variable metrics is a dictionary containing the following content:
{
    'map': <map>,
    'ndcg': <mean ndcg>,
    'jaccard': <mean jaccard index>,
    'precision': <numpy array having per class precision>,
    'recall': <numpy array having per class recall>,
    'f1-score': <numpy array having per class f1-score>,
    'support': <numpy array having per class support>
}
Presently, the ground truth map should be prepared manually by the user. Symmetric relations between duplicates must be represented in the ground truth map. If an image i is a duplicate of image j, then j must also be represented as a duplicate of i. Absence of symmetric relations will lead to an exception.
Both the ground_truth_map and retrieved_map must have the same keys.
There is a difference between the way information retrieval metrics (map, ndcg, jaccard index) and classification metrics (precision, recall, f1-score) treat the symmetric relationships in duplicates. Consider the following ground_truth_map and retrieved_map:
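For illustration, assume the following maps (hypothetical file names), where '1.jpg' and '4.jpg' are duplicates in the ground truth but this relationship is missed in the retrieved map:

ground_truth_map = {
    '1.jpg': ['2.jpg', '4.jpg'],
    '2.jpg': ['1.jpg'],
    '3.jpg': [],
    '4.jpg': ['1.jpg']
}

retrieved_map = {
    '1.jpg': ['2.jpg'],
    '2.jpg': ['1.jpg'],
    '3.jpg': [],
    '4.jpg': []
}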
From the above, it can be seen that images '1.jpg' and '4.jpg' are not found to be duplicates of each other by the deduplication algorithm.
For calculating information retrieval metrics, each key in the maps is considered as an independent 'query'. In the ground truth, '4.jpg' is a duplicate of the key '1.jpg'. When it is not retrieved, it is considered a miss for query '1.jpg'. Similarly, '1.jpg' is a duplicate of the key '4.jpg' in the ground truth. When this is not retrieved, it is considered a miss for query '4.jpg'. Thus, the missing relationship is accounted for twice instead of just once.
Classification metrics, on the other hand, consider the relationships only once by forming unique pairs of images and labelling each pair as a 0 (non-duplicate image pair) and 1 (duplicate image pair).
Using the ground_truth_map, the ground truth pairs with the corresponding labels are:
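Continuing with the illustrative ground_truth_map above, the unique pairs and labels would be:

('1.jpg', '2.jpg'): 1
('1.jpg', '3.jpg'): 0
('1.jpg', '4.jpg'): 1
('2.jpg', '3.jpg'): 0
('2.jpg', '4.jpg'): 0
('3.jpg', '4.jpg'): 0

Similarly, using the retrieved_map, the retrieved pairs with the corresponding labels would be:

('1.jpg', '2.jpg'): 1
('1.jpg', '3.jpg'): 0
('1.jpg', '4.jpg'): 0
('2.jpg', '3.jpg'): 0
('2.jpg', '4.jpg'): 0
('3.jpg', '4.jpg'): 0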
These two sets of pairs are then used to calculate metrics such as precision/recall/f1-score. It can be seen that the missing relationship between the pair ('1.jpg', '4.jpg') is accounted for only once.
Each key in the duplicates dictionary corresponds to a file in the image directory passed to the image_dir parameter of the find_duplicates function. The value is a list of all file names in the image directory that were found to be duplicates for the key file. The 'method-name' corresponds to one of the available deduplication methods and can be set to PHash, AHash, DHash, WHash or CNN.
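A minimal sketch of the general find_duplicates call (the method name and directory path are placeholders):

from imagededup.methods import <method-name>
method_object = <method-name>()
duplicates = method_object.find_duplicates(image_dir='path/to/image/directory')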
image_dir: Optional, directory where all image files are present.
encoding_map: Optional, used instead of image_dir attribute. Set it equal to the dictionary of file names and corresponding encodings (hashes/cnn encodings). The mentioned dictionary can be generated using the corresponding encode_images method.
scores: Setting it to True returns the scores representing the hamming distance (for hashing) or cosine similarity (for cnn) of each of the duplicate file names from the key file. In this case, the returned 'duplicates' dictionary has the following content:
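For example (file names and scores are placeholders):

{
    'image1.jpg': [('image1_duplicate1.jpg', score), ('image1_duplicate2.jpg', score)],
    'image2.jpg': [..]
}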
Each key in the duplicates dictionary corresponds to a file in the image directory passed to the image_dir parameter of the find_duplicates function. The value is a list of tuples representing the file names and corresponding scores in the image directory that were found to be duplicates of the key file.
outfile: Name of file to which the returned duplicates dictionary is to be written, must be a json. None by default.
threshold parameter:
min_similarity_threshold for cnn method indicating the minimum amount of cosine similarity that should exist between the key image and a candidate image so that the candidate image can be considered as a duplicate of the key image. Should be a float between -1.0 and 1.0. Default value is 0.9.
max_distance_threshold for hashing methods indicating the maximum amount of hamming distance that can exist between the key image and a candidate image so that the candidate image can be considered as a duplicate of the key image. Should be an int between 0 and 64. Default value is 10.
recursive: finding images recursively in a nested directory structure, set to False by default.
The returned duplicates dictionary contains symmetric relationships i.e., if an image i is a duplicate of image j, then image j must also be a duplicate of image i. Let's say that the image directory only consists of images i and j, then the duplicates dictionary would have the following content:
{
    'i': ['j'],
    'j': ['i']
}
If an image in the image directory can't be loaded, no encodings are generated for the image. Hence, the image is disregarded for deduplication and has no entry in the returned duplicates dictionary.
To deduplicate an image directory using perceptual hashing, with a maximum allowed hamming distance of 12, scores returned along with duplicate filenames and the returned dictionary saved to file 'my_duplicates.json', use the following:
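A sketch of this call (the directory path is a placeholder):

from imagededup.methods import PHash
phasher = PHash()
duplicates = phasher.find_duplicates(image_dir='path/to/image/directory',
                                     max_distance_threshold=12,
                                     scores=True,
                                     outfile='my_duplicates.json')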
To deduplicate an image directory using cnn, with a minimum cosine similarity of 0.85, no scores returned and the returned dictionary saved to file 'my_duplicates.json', use the following:
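A corresponding sketch for the cnn method, assuming the CNN class in imagededup.methods:
from imagededup.methods import CNN

# cnn method with a minimum cosine similarity of 0.85, no scores returned
cnn_encoder = CNN()
duplicates = cnn_encoder.find_duplicates(image_dir='path/to/image/directory',
                                         min_similarity_threshold=0.85,
                                         scores=False,
                                         outfile='my_duplicates.json')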
Returns a list of files in the image directory that are considered as duplicates. Does NOT remove the said files.
The api is similar to the find_duplicates function, except that find_duplicates_to_remove does not accept the scores attribute. This function returns a single list of file names in the directory that are found to be duplicates. The general api for the method is sketched below:
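A minimal sketch of the call, shown here with the PHash class (any of the other methods can be substituted; the class name and file paths are illustrative):
from imagededup.methods import PHash

phasher = PHash()
duplicates = phasher.find_duplicates_to_remove(image_dir='path/to/image/directory',
                                               max_distance_threshold=10,
                                               outfile='my_duplicates.json')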
In this case, the returned variable duplicates is a list containing the name of image files that are found to be duplicates of some file in the directory:
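For example, with hypothetical file names, the returned list might look like:
['image1_duplicate1.jpg', 'image1_duplicate2.jpg', 'image2_duplicate1.jpg']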
image_dir: Optional, directory where all image files are present.
encoding_map: Optional, used instead of image_dir attribute. Set it equal to the dictionary of file names and corresponding encodings (hashes/cnn encodings). The mentioned dictionary can be generated using the corresponding encode_images method.
outfile: Name of file to which the returned duplicates dictionary is to be written, must be a json. None by default.
threshold parameter:
min_similarity_threshold for cnn method indicating the minimum amount of cosine similarity that should exist between the key image and a candidate image so that the candidate image can be considered as a duplicate for the key image. Should be a float between -1.0 and 1.0. Default value is 0.9.
max_distance_threshold for hashing methods indicating the maximum amount of hamming distance that can exist between the key image and a candidate image so that the candidate image can be considered as a duplicate for the key image. Should be an int between 0 and 64. Default value is 10.
recursive: finding images recursively in a nested directory structure, set to False by default.
This method must be used with caution. The symmetric nature of duplicates imposes an issue of marking one image as duplicate and the other as original. Consider the following duplicates dictionary:
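{
    '1.jpg': ['2.jpg'],
    '2.jpg': ['1.jpg', '3.jpg'],
    '3.jpg': ['2.jpg']
}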
In this case, it is possible to remove only 2.jpg, which leaves 1.jpg and 3.jpg as non-duplicates of each other. However, it is also possible to remove both 1.jpg and 3.jpg, leaving only 2.jpg. The find_duplicates_to_remove method can thus return either of the outputs. In the above example, let's say that 1.jpg is retained, while its duplicate, 2.jpg, is marked as a duplicate. Once 2.jpg is marked as a duplicate, its own found duplicates are disregarded. Thus, 1.jpg and 3.jpg would not be considered duplicates. So, the final return would be:
['2.jpg']
This leaves 1.jpg and 3.jpg as non-duplicates in the directory. If the user does not wish to impose this heuristic, it is advised to use find_duplicates function and use a custom heuristic to mark a file as duplicate.
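As an illustrative sketch of such a custom heuristic (not part of the package), which keeps a file unless it has already been marked for removal because of an earlier key:
def files_to_remove(duplicates):
    """Greedy heuristic over a duplicates dictionary returned by find_duplicates (scores=False)."""
    marked = set()
    for key, dups in duplicates.items():
        if key in marked:
            # This file is already considered a duplicate of a kept file,
            # so its own duplicates are not marked because of it.
            continue
        for dup in dups:
            if dup != key:
                marked.add(dup)
    return sorted(marked)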
If an image in the image directory can't be loaded, no encodings are generated for the image. Hence, the image is disregarded for deduplication and does not appear in the returned list.
To deduplicate an image directory using perceptual hashing, with a maximum allowed hamming distance of 12, and the returned list saved to file 'my_duplicates.json', use the following:
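An illustrative snippet, again assuming the PHash class:
from imagededup.methods import PHash

phasher = PHash()
duplicates = phasher.find_duplicates_to_remove(image_dir='path/to/image/directory',
                                               max_distance_threshold=12,
                                               outfile='my_duplicates.json')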
To deduplicate an image directory using cnn, with a minimum cosine similarity of 0.85 and the returned list saved to file 'my_duplicates.json', use the following:
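A corresponding sketch for the cnn method:
from imagededup.methods import CNN

cnn_encoder = CNN()
duplicates = cnn_encoder.find_duplicates_to_remove(image_dir='path/to/image/directory',
                                                   min_similarity_threshold=0.85,
                                                   outfile='my_duplicates.json')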
"},{"location":"user_guide/plotting_duplicates/","title":"Plotting duplicates of an image","text":"
Once a duplicate dictionary corresponding to an image directory has been obtained (using find_duplicates), duplicates for an image can be plotted using plot_duplicates method as below:
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir, duplicate_map, filename, outfile=None)
where filename is the file for which duplicates are to be plotted.
image_dir: Directory where all image files are present.
duplicate_map: A dictionary representing retrieved duplicates with filenames as key and a list of retrieved duplicate filenames as value. A duplicate_map with scores can also be passed (obtained from find_duplicates function with scores attribute set to True).
filename: Image file name for which duplicates are to be plotted.
outfile: Optional, name of the file the plot should be saved to. None by default.
Checks the sanity of the input image numpy array for the cnn method and converts a grayscale numpy array to RGB by repeating the array three times along the third dimension if a 2-dimensional image array is provided.
Loads an image given its path. Returns an array version of the image, optionally resized and converted to grayscale. Only allows images of the types described by the img_formats argument.
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicate_map,
                filename='path/to/image.jpg')
To allow users to use custom models for encoding generation, we provide a CustomModel construct which serves as a wrapper for a user-defined feature extractor. The CustomModel consists of the following attributes:
name: The name of the custom model. Can be set to any string.
model: A PyTorch model object, which is a subclass of torch.nn.Module and implements the forward method. The output of the forward method should be a tensor of shape (batch_size x features). Alternatively, a __call__ method is also accepted.
transform: A function that transforms a PIL.Image object into a PyTorch tensor. Should correspond to the preprocessing logic of the supplied model.
CustomModel is provided while initializing the cnn object and can be used in the following 2 scenarios:
1. Using the models provided with the imagededup package. There are 3 models provided currently: MobilenetV3, ViT and EfficientNet.
from imagededup.methods import CNN

# Get CustomModel construct
from imagededup.utils import CustomModel

# Get the prepackaged models from imagededup
from imagededup.utils.models import ViT, MobilenetV3, EfficientNet

# Declare a custom config with CustomModel, the prepackaged models come with a name and transform function
custom_config = CustomModel(name=EfficientNet.name,
                            model=EfficientNet(),
                            transform=EfficientNet.transform)

# Use model_config argument to pass the custom config
cnn = CNN(model_config=custom_config)

# Use the model as usual
...
2. Using a user-defined custom model.
from imagededup.methods import CNN

# Get CustomModel construct
from imagededup.utils import CustomModel

# Import necessary pytorch constructs for initializing a custom feature extractor
import torch
from torchvision.transforms import transforms

# Declare custom feature extractor class
class MyModel(torch.nn.Module):
    transform = transforms.Compose(
        [
            transforms.ToTensor()
        ]
    )
    name = 'my_custom_model'

    def __init__(self):
        super().__init__()
        # Define the layers of the model here

    def forward(self, x):
        # Do something with x
        return x

custom_config = CustomModel(name=MyModel.name,
                            model=MyModel(),
                            transform=MyModel.transform)

cnn = CNN(model_config=custom_config)

# Use the model as usual
...
It is not necessary to bundle name and transform functions with the model class. They can be passed separately as well.
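A brief sketch of this alternative, reusing the MyModel class from the previous snippet but supplying an illustrative name and a standalone transform function:
from torchvision.transforms import transforms
from imagededup.utils import CustomModel

# Standalone preprocessing function, not attached to the model class
my_transform = transforms.Compose([transforms.ToTensor()])

custom_config = CustomModel(name='my_standalone_model',
                            model=MyModel(),
                            transform=my_transform)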
Examples for both scenarios can be found in the examples section.
CustomModel is a named tuple that can be used to initialize a custom PyTorch model. Its arguments are:
name: The name of the custom model. Default is 'default_model'.
model: The PyTorch model object, which is a subclass of torch.nn.Module and implements the forward method. The output of the forward method should be a tensor of shape (batch_size x features). Alternatively, a __call__ method is also accepted. Default is None.
transform: A function that transforms a PIL.Image object into a PyTorch tensor that will be applied to each image before being fed to the model. Should correspond to the preprocessing logic of the supplied model. Default is None.
class MobilenetV3
__init__: def __init__()
Initializes a MobilenetV3 model, cuts it at the global average pooling layer and returns the output features.
forward: def forward(x)

class ViT
__init__: def __init__()
Initializes a ViT model, takes the mean of the final encoder layer outputs and returns those as features for a given image.
forward: def forward(x)

class EfficientNet
__init__: def __init__()
Initializes an EfficientNet model, cuts it at the global average pooling layer and returns the output features.
forward: def forward(x)