# Find duplicate files

This is a Perl script that locates duplicate files (i.e., files with identical contents but different inodes). The script itself adds only O(n) time and O(1) space overhead; sorting the checksum entries is delegated to GNU sort, which takes O(n log n) time and O(n) space. By keeping its own overhead this small, the script should be among the most efficient tools available on GitHub for this purpose.
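
The approach is essentially sort-then-scan: compute one checksum per file, sort the entries so that identical checksums become adjacent, and scan for runs of repeated checksums. As an illustration only (not the script itself, and without the script's special handling of empty files), the same idea can be expressed with standard GNU tools:

# one checksum per file, sorted so identical checksums are adjacent,
# then print groups of repeated checksums (the first 32 characters are the MD5 digest)
find images -maxdepth 1 -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate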

File duplication is detected using String::CRC32 and Digest::MD5. You can use either one or both of them as the checksum.

## Usage

Usage: find-duplicate [OPTIONS] {DIR,FILE}*
Options:
  --checksum=TYPE  Specify the type (crc, md5, both) of checksum. Default is crc.
  --digest         Don't find duplicates. Print checksums of the input files instead.
  --pipe           Don't compute checksums. Read checksums from STDIN instead.
  --parallel=N     Change the number of sorts run concurrently to N.
  --verbose        Print debug messages.
  --help           Print this message.

Given at least one DIR/FILE argument, the script checks the files contained in DIR or listed in FILE (which is expected to be the output of something like ls -1 DIR or find DIR). Note that the script ignores sub-directories and empty files. To check files recursively, you have to feed the filenames to the script using find. For example, to find duplicate files in the directory images using MD5 as the checksum, you can write

find-duplicate --checksum=md5 images

which is equivalent to (see discussion below):

find-duplicate --checksum=md5 --digest images | find-duplicate --pipe

Note that the above invocations do not check files in sub-directories. To check files recursively, use

find images | find-duplicate --checksum=md5 
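
The FILE form works the same way; you can save the listing first and pass the list file to the script (images.list is just an arbitrary name):

find images > images.list
find-duplicate --checksum=md5 images.list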

## Discussion

In practice, computing checksums is usually the most time-consuming stage in the process of finding duplicate files. Hence, it is sometimes preferable to separate this stage from the rest of the process. An advantage of this two-stage processing is that it lets you compute the checksums in parallel. For example, suppose you want to compare the files in folders B1, ..., Bn. Instead of invoking find-duplicate B1 ... Bn directly, you can parallelize the computation of checksums as follows:

for i in $(seq 1 $n); do # spawn $n processes
    find-duplicate --digest B$i > B$i.checksum &
done
wait 
cat *.checksum | find-duplicate --pipe && rm *.checksum
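
If GNU parallel is available, the same fan-out can be written more compactly (this is an equivalent of the loop above, not a feature of the script):

# compute the checksums of B1 ... Bn with one job per folder, then merge
parallel 'find-duplicate --digest B{} > B{}.checksum' ::: $(seq 1 $n)
cat *.checksum | find-duplicate --pipe && rm *.checksum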

For a slightly more convoluted example, consider the situation where you want to compare the files in folder A against those in folders B1, ..., Bn, but you don't want to compare the files in B1, ..., Bn with each other. In this case, computing the checksums first and then comparing the folders using the pre-computed checksums is far more efficient than directly comparing each pair of folders.

find-duplicate --digest A > A.checksum
for i in $(seq 1 $n); do # spawn $n processes
    (   find-duplicate --digest B$i > B$i.checksum
        (   flock 200   # take an exclusive lock on fd 200 before writing to STDOUT
            cat A.checksum B$i.checksum | find-duplicate --pipe
            rm B$i.checksum
        ) 200>/tmp/.lock
    ) &
done
wait

Here we use flock to prevent the processes from writing to STDOUT simultaneously.
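
If you would rather not rely on flock, an alternative sketch is to let each comparison write its report to its own file and concatenate the reports at the end (the .duplicates suffix is arbitrary):

find-duplicate --digest A > A.checksum
for i in $(seq 1 $n); do # spawn $n processes
    (   find-duplicate --digest B$i > B$i.checksum
        cat A.checksum B$i.checksum | find-duplicate --pipe > B$i.duplicates
        rm B$i.checksum
    ) &
done
wait
cat B*.duplicates   # print all reports in one go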