Readme for program duplicates. The aim of the program is to report files which are duplicated, ie copied. To attain that aim the pathnames to all files under the user input dir or dirs are recorded along with the size in bytes and inode number. This file is then sorted on size, inode then pathname. That file is then processed so that, any file with a unique size is discarded. Then for any group of files having the same size and inode number, all but the last one is discarded. That group necessarily are all linked, hard linked or symlinked, it does not matter. At that poiint files of the same size but different inodes are checked to see if any byte differs within the first 128 Kbytes. Obviously if they diifer a files is discarded. If not then the pair of files are md5summed. If the hashes match then both files are recorded as being duplicates, otherwise the first file is discarded. The result so far is in order of md5sum, inode, then path. This is of very little use to a human, so the groups of duplicates are grouped by the path name of the first item in the group. That is then output to stdout. If you redirect this to a file, such file may be used as input to 'processdups' for manual processing. I copied the md5.c, md5.h and other requirements from GNU coreutils-8.21 The necessary files were available after ./configure && make on that package. I had tried libmhash but it took about 20% to 30% longer to hash a file just ander 1 gig.
rlp1938/Duplicates
Find duplicated files and list them to stdout. During search output broken symlinks to stderr.
CGPL-3.0