/finddup

Duplicate file finder

Primary LanguagePythonMIT LicenseMIT

finddup is a Python3 script that searches directories for duplicate files using the filecmp module. It will print a list of files are duplicates of an 'original' copy. Note that only the duplicates (not the 'original') are output, making the list suitable for further action (e.g. use xargs to move/remove).

WARNING This should be considered alpha software. Use it with caution on any files/data that are important to you. Note that this script itself will not modify anything, but using the resulting output should be done with care.

Usage

    finddup [-h] [-v] [dirs [dirs ...]]

finddup can be provided a list of directories. It will consider the order of directories to be a "priority" ordering for selecting original/duplicate pairs. Files selected as original will be from earlier directories in the list, and duplicates from the latter directories if possible. Note that due to the underlying use of os.walk, the preference for original/duplicate files in any one directory tree is arbitrary.

Examples

Find duplicates recursively in current directory:

    finddup

Find duplicates recursively across a few directories (anywhere on filesystem):

    finddup dir1 dir2 /tmp/dir3

Find duplicates, with verbose output:

    finddup -v dir1

Get help:

    finddup -h

Development

'invoke' is used for common tasks.

Environment:

    virtualenv -p python3 venv
    source ./venv/bin/activate
    pip install -r requirements.txt

Unit tests:

    invoke test

Also:

    tox
    invoke html       # sphinx docs
    invoke html -l    # sphinx-autobuild

TODO

  • Add -0 support to list files with NULL delimiters (for xargs -0)
  • Support continuing from last known comparison across process quits
  • Revisit progress method to avoid up-front calcuation; consider bytes, not just file count
  • Add -i (or similar) switch to ignore directories / regexps?