dedup: A Haskell repository from thumphries

#deduplication tool

Command line tool to list or delete redundant files in a directory tree.

Usage: dedup PATH [-r|--recursive] [-d|--delete] [-l|--longer-names]
                  [-k|--list-keepers]

WARNING: This was an afternoon hack. Not exhaustively tested, and doesn't handle all IO errors gracefully. In particular, use the -d flag with care.

dedup identifies duplicates by first comparing file sizes, then SHA256 hashes. For each group of identical files, it selects the one with the shortest filename (by default) as the 'keeper', then prints the others to stdout as deletion candidates.
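The two-pass grouping described above can be sketched roughly as follows. This is an illustration only, not dedup's actual source: the `FileInfo` record and the `bucketsBy` and `duplicateGroups` functions are hypothetical names, and the digest is a plain string field that stands in for a SHA256 hash, so the sketch needs nothing beyond `base` and `containers`.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical record: a path, its size in bytes, and a digest string
-- standing in for the file's SHA256 hash.
data FileInfo = FileInfo
  { filePath   :: FilePath
  , fileSize   :: Integer
  , fileDigest :: String
  } deriving (Eq, Show)

-- Bucket items by a key, keeping only buckets with at least two members.
-- Items keep their input order inside each bucket.
bucketsBy :: Ord k => (a -> k) -> [a] -> [[a]]
bucketsBy key xs =
  filter ((> 1) . length) . Map.elems $
    Map.fromListWith (flip (++)) [(key x, [x]) | x <- xs]

-- Pass 1 groups by size; pass 2 splits each same-size bucket by digest.
-- Files with a unique size are dropped before any digest is compared.
duplicateGroups :: [FileInfo] -> [[FileInfo]]
duplicateGroups = concatMap (bucketsBy fileDigest) . bucketsBy fileSize
```

Comparing sizes first is the cheap filter: a file whose size is unique cannot have a duplicate, so only files that share a size would need to be hashed.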

Alternatively, supply the -d flag to delete every deletion candidate. Use this with care: a race condition during deletion can lose data. You may prefer to first salvage the 'keep' candidates, which can be listed with -k.

Pass -l or --longer-names to keep longer filenames. By default, only the duplicate with the shortest name will be kept.
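As a rough sketch of the keeper choice, the function below splits one group of identical files into a keeper and its deletion candidates. The `Preference` type and `pickKeeper` function are hypothetical names, not taken from dedup's source. Two further assumptions: name length is measured on the whole path string, and ties are broken by input order. The README specifies neither.

```haskell
import Data.List (sortOn)
import Data.Ord (Down (..))

-- Hypothetical preference mirroring the default behaviour and the -l flag.
data Preference = ShorterNames | LongerNames

-- Split one group of identical files into (keeper, deletion candidates).
-- Returns Nothing for an empty group. sortOn is stable, so paths of equal
-- length keep their input order (an assumed tie-break).
pickKeeper :: Preference -> [FilePath] -> Maybe (FilePath, [FilePath])
pickKeeper pref paths =
  case ordered of
    []              -> Nothing
    (keeper : rest) -> Just (keeper, rest)
  where
    ordered = case pref of
      ShorterNames -> sortOn length paths
      LongerNames  -> sortOn (Down . length) paths
```

For example, `pickKeeper ShorterNames ["photo (copy).jpg", "photo.jpg"]` keeps `photo.jpg`, while `LongerNames` keeps `photo (copy).jpg`.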

##Examples

List deletion candidates for a folder: dedup /path/to/purge
Actually delete all such candidates from a folder: dedup -d /path/to/purge
... delete recursively: dedup -r -d /path/to/purge
List the files to be kept for a folder: dedup -k /path/to/purge

##Race condition

If a file on the 'keep' list is deleted or modified while dedup -d is running, that data (and all redundant copies) will be lost entirely. For this reason, it is best to back up the keepers first, with something like dedup -k /path/to/purge | xargs -I{} cp "{}" /backup/path/
