/duped

Find duplicate files and produce deletion list based on user specifications

Primary LanguagePython

Simple tool for identifying duplicate files and providing a list of files to delete based on your specifications.  This tool does not perform any file removal of it's own unless you specifically direct it to by using the 'delete' command.

Let's say that you may have at some point made multiple copies of a directory as a backup and now wish to just delete any files which are duplicates in those backups but not touch the original directory at all:

~/dups/original
~/dups/copy1
~/dups/copy2
~/dups/copy3

Identifying duplicates is a multiple stage process. First you build the hash database.

j@c:~/duped$ python3 duped.py build --skip .git ~/dups
calculating file hashes

Working directory is /home/j/duped/duped_results_11738

Now you have a working directory that has a database of file hashes.  Next, let's parse that database and generate lists of files to keep or delete based on the directories we specify.  The arguments to the process command are the work directory location and the directories you are willing to delete files from.

j@c:~/duped$ python3 duped.py process --work_dir /home/j/duped/duped_results_11738 ~/dups/copy1 ~/dups/copy2 ~/dups/copy3
processing files
writing results into /home/j/duped/duped_results_11738
keep
delete
error
j@c:~/duped$ 

In the working directory you have a list of files that are proposed for deletion and a list of files that are to keep.  Only files in the copy[1-3] directories that you speicified in the 'process' command will be added to the delete list.  You can now review these file lists and see if they make sense.  Then you can either use your own tools to delete files from the delete list or run the 'delete' command with the exact same arguments as the 'process' command and the tool will delete the files for you.

This tool was made for my specific use but is provided in case it's useful for anyone else.