Command line tool to find and identify duplicate images

Find Duplicate Images

This project provides a command line program that searches for duplicates of a given image. Given a target image and a target directory, the program reports which images in the directory match the target image.

The application works by using a special image hash function. A standard hash function such as SHA-1 cannot be used because it only matches images that are byte-for-byte identical: even a small difference in the bytes produces a completely different hash value. An image hash function instead hashes a version of the image that has been distilled down to its core structure, for example by removing high frequency details, so visually similar images produce the same or nearly the same hash. This application is built using the ImageHash Python package.
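To make the contrast with SHA-1 concrete, here is a minimal sketch of a simplified "average hash" written with only the standard library. It is an illustration, not the algorithm this tool necessarily uses: the helper names (average_hash, hamming_distance, sha1_hex) are hypothetical, and the 4x4 grid of brightness values stands in for an image that has already been shrunk and converted to grayscale, which is the step that discards fine detail in a real implementation.

```python
import hashlib


def average_hash(pixels):
    """Return a bit string with '1' wherever a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if value > mean else "0" for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def sha1_hex(pixels):
    """Cryptographic hash of the raw pixel bytes, for comparison."""
    return hashlib.sha1(bytes(v for row in pixels for v in row)).hexdigest()


# Stand-in for a downscaled grayscale image (values 0-255).
original = [
    [200, 210, 30, 20],
    [190, 205, 25, 35],
    [40, 30, 220, 215],
    [20, 45, 210, 225],
]

# The same picture with every pixel made slightly brighter.
noisy = [[min(255, value + 3) for value in row] for row in original]

print(sha1_hex(original) == sha1_hex(noisy))  # False
print(hamming_distance(average_hash(original), average_hash(noisy)))  # 0
```

The SHA-1 digests differ because the bytes differ, while the average hashes are identical because the pattern of bright and dark regions is unchanged. Image hashes are typically compared by Hamming distance, so a small nonzero distance can still be treated as a match.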

A common application for this tool is sifting through a collection of images used to train a machine learning model, where you don't want duplicates. Imagine combining images from various sources: the same image may appear more than once, at different quality levels or even with watermarks. You also want to prevent training images from appearing in your test set.

Prerequisites

  1. Python 3
  2. (optional) Poetry

Installation

There are several ways you can install the application:

  • With Poetry and GNU Make: make install
  • With just Poetry: poetry install --no-dev
  • With pip: pip install .

Usage

Once you have the application installed, in the command line run:

find_duplicates path_image dir_search

where path_image is the path of the target image (the image whose duplicates you are looking for) and dir_search is the directory to search for duplicates of the target.
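For readers who want the same kind of search from Python code, the following is a rough sketch of how it could be done with the ImageHash package. It is not this project's actual implementation: the function names, the hasher and max_distance parameters, the list of file suffixes, and the choice of imagehash.phash are all assumptions made for illustration.

```python
from pathlib import Path

# Assumption: the file types worth checking; adjust as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def imagehash_phash(path):
    """Hash one image with ImageHash's perceptual hash (assumed choice)."""
    # Imported lazily so the search function can be used with another hasher.
    import imagehash
    from PIL import Image

    with Image.open(path) as image:
        return imagehash.phash(image)


def find_duplicates(path_image, dir_search, hasher=imagehash_phash, max_distance=0):
    """Return the image files under dir_search whose hash is close to the target's.

    Subtracting two ImageHash objects gives their Hamming distance, so
    max_distance=0 keeps only images with an identical hash.
    """
    target_hash = hasher(path_image)
    matches = []
    for candidate in sorted(Path(dir_search).rglob("*")):
        if not candidate.is_file() or candidate.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        if hasher(candidate) - target_hash <= max_distance:
            matches.append(candidate)
    return matches
```

With Pillow and ImageHash installed, calling find_duplicates("path_image", "dir_search") would return a list of matching paths. Raising max_distance loosens the match, which may help with watermarked or recompressed copies at the cost of more false positives.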

License

This project is distributed under the MIT license. Please see LICENSE for more information. The test images come from Unsplash, which provides freely usable images under its own license. While attribution is not required by that license, it is encouraged.