/Duplicates

Find duplicated files and list them to stdout. During search output broken symlinks to stderr.

Primary LanguageCGNU General Public License v3.0GPL-3.0

Readme for program duplicates.

The aim of the program is to report files which are duplicated, ie
copied. To attain that aim the pathnames to all files under the user
input dir or dirs are recorded along with the size in bytes and inode
number. This file is then sorted on size, inode then pathname. That
file is then processed so that, any file with a unique size is
discarded. Then for any group of files having the same size and inode
number, all but the last one is discarded. That group necessarily are
all linked, hard linked or symlinked, it does not matter. At that poiint
files of the same size but different inodes are checked to see if any
byte differs within the first 128 Kbytes. Obviously if they diifer a
files is discarded. If not then the pair of files are md5summed. If the
hashes match then both files are recorded as being duplicates, otherwise
the first file is discarded.

The result so far is in order of md5sum, inode, then path. This is of
very little use to a human, so the groups of duplicates are grouped
by the path name of the first item in the group. That is then output to
stdout. If you redirect this to a file, such file may be used as input
to 'processdups' for manual processing.

I copied the md5.c, md5.h and other requirements from GNU coreutils-8.21
The necessary files were available after ./configure && make on that
package. I had tried libmhash but it took about 20% to 30% longer to
hash a file just ander 1 gig.