Deduplicator

Command line tool to summarize file tree information and provide macros for deduplication.

Scripts for

  • Finding and deleting duplicate files
  • Finding and removing empty directories

File List

deduplicate.py
Script that generates the list of duplicate files and optionally deletes files from that list.
deemptydir.py
Script that generates the list of directories containing no files and optionally deletes them.
dupfilters.py
Class defining the functions used by deduplicate.py for sorting instances of duplicate files. Offers a simple interface for extending with additional sort functions.
deduplicate.ini
A configuration file referenced by deduplicate.py during some sorting functions. All values are examples and should be modified to fit your use. Place a modified copy in the base directory being searched to define a local configuration for that search.

All Python files tested exclusively with Python 3.5.3 on Debian 9.

Usage

NOTE: The scheme for comparing files is to first compare file size and then compare a CRC checksum of the first 4 MiB of each file. File header information is not used. This means two files which are the same size and have an identical leading 4 MiB will be considered duplicates. This may be a problem for multiple versions of a disk image (where size is fixed and data is identical for significant portions of the file). This will likely not be changed unless an issue is opened.
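A minimal sketch of that comparison key, assuming zlib's CRC-32 (the script's actual checksum routine is not shown here):

```python
import os
import zlib

CHUNK = 4 * 1024 * 1024  # leading 4 MiB used for the checksum

def compare_key(path):
    """Return the (size, crc) pair used to group candidate duplicates."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        crc = zlib.crc32(f.read(CHUNK))
    return size, crc

# Two files are treated as duplicates when their keys match:
# compare_key(a) == compare_key(b)
```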

Initialization

deduplicate.py <dir> build

Builds a list of the files in <dir> and in each subdirectory (recursively), saving each list as .deduplicator_record. Uses these lists to find which files are identical and writes a list of all duplicate instances to <dir>/deduplicator_summary.

Options:

--full
If any directory already contains a .deduplicator_record, rename it to .deduplicator_record_prev and generate a new one.
--light
If any directory already contains a .deduplicator_record, generate a new one reusing any data it holds for files still in that directory. Currently does not save the old file.

Default behavior (neither option) is to skip generating any .deduplicator_record which already exists.
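The record-handling policy can be sketched as follows. This is an illustration only; in particular, the record layout written here (name, size, CRC per line) is an assumption, not the script's actual format:

```python
import os
import zlib

RECORD = '.deduplicator_record'
CHUNK = 4 * 1024 * 1024

def build_records(base, full=False, light=False):
    for dirpath, _, filenames in os.walk(base):
        record = os.path.join(dirpath, RECORD)
        if os.path.exists(record):
            if full:
                os.replace(record, record + '_prev')  # keep the previous record
            elif not light:
                continue  # default: skip directories that already have a record
            # --light would reuse cached entries for surviving files (not shown)
        with open(record, 'w') as out:
            for name in filenames:
                if name.startswith('.deduplicator_'):
                    continue  # never index the tool's own bookkeeping files
                path = os.path.join(dirpath, name)
                with open(path, 'rb') as f:
                    crc = zlib.crc32(f.read(CHUNK))
                out.write('{}\t{}\t{:08x}\n'.format(name, os.path.getsize(path), crc))
```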

Building a deduplicator_index file

deduplicate.py <dir> index

Builds a list of all files like the build command, but saves it to a file named deduplicator_index. This file can then be moved to another directory, which is then scanned for duplicates: running the build command in a directory containing a deduplicator_index file will consider the files listed in that file when building the deduplicator_summary. This means deduplicator can find duplicate files across different folders or even filesystems. Any file whose name contains the string deduplicator_index will be read, so multiple indexes can be integrated into a single summarization pass (deduplicator_index_1, deduplicator_index_2, …).
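As a sketch, collecting every index is a substring match on the filenames in <dir>; how the entries are merged into the summary is omitted:

```python
import os

def find_index_files(base):
    """Collect every file in base whose name contains 'deduplicator_index'."""
    return [os.path.join(base, name)
            for name in os.listdir(base)
            if 'deduplicator_index' in name]

# Matches deduplicator_index, deduplicator_index_1, deduplicator_index_2, ...
```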

Listing and Deleting Files

deduplicate.py <dir> list SORT
deduplicate.py <dir> delete SORT

Reads the list of duplicates at <dir>/deduplicator_summary and sorts them into primary and duplicate instances according to the function SORT. list prints these sorted lists and exits; delete also deletes each file marked as a duplicate after listing it. (Normal usage is to run list first, to check that instances were sorted as intended, before running delete.)

SORT function keywords:

depth
sort instances by the number of directories in each file path
length
sort instances by the length of each filename
date
sort instances by their last modified date
dlist
sort instances located in particular subdirectories as duplicates (returns 1 on match)
plist
sort instances located in particular subdirectories as primaries (returns 0 on match)

Each sort function returns a numeric value for each instance in the list. Instances with the lowest value are considered primary instances.
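A sketch of what the first three sort functions might compute per instance (the names mirror the keywords above; dupfilters.py may differ in detail):

```python
import os

def depth(path):
    """Number of directories in the file path."""
    return path.rstrip(os.sep).count(os.sep)

def length(path):
    """Length of the filename alone."""
    return len(os.path.basename(path))

def date(path):
    """Last-modified time; older files sort first."""
    return os.path.getmtime(path)
```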

For dlist and plist, the script will first check the <dir> directory and then the directory of deduplicate.py for a deduplicate.ini file containing a list of the directories to sort by.
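The layout of deduplicate.ini is not specified here, so the following is a hypothetical illustration only: the [sort] section name, the comma-separated directory lists, and reading the file with configparser are all assumptions.

```python
import configparser

# Hypothetical deduplicate.ini layout -- the real keys may differ:
#   [sort]
#   dlist = /mnt/backup, /tmp/copies
#   plist = /home/user/originals
config = configparser.ConfigParser()
config.read('deduplicate.ini')
dlist_dirs = [d.strip()
              for d in config.get('sort', 'dlist', fallback='').split(',')
              if d.strip()]
```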

Options:

-a, --all
Consider all instances with the lowest sort value as primary. (Default: keep the instance with the lexicographically first filepath)
-p, --printall
Print entries in sorted duplicate list where all instances are primary. (Default: only list copies when at least one instance is sorted as a duplicate)
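How primaries could be selected under the default and --all behaviors, as a sketch:

```python
def split_instances(paths, sort_fn, keep_all=False):
    """Partition duplicate instances into (primary, duplicates)."""
    best = min(sort_fn(p) for p in paths)
    tied = sorted(p for p in paths if sort_fn(p) == best)
    if keep_all:           # --all: every lowest-value instance is primary
        primary = tied
    else:                  # default: lexicographically first filepath only
        primary = tied[:1]
    duplicates = [p for p in paths if p not in primary]
    return primary, duplicates
```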

Delete operations with deduplicator_index files

Delete operations can be run from the summary as long as no files from the external index are being deleted, i.e. the index file contains the primary locations. The best way to do this is to run deduplicate.py <dir> delete plist -a, specifying the index file as a primary directory in deduplicate.ini.

Listing Duplicate Directories

deduplicate.py <dir> dirs

Reads <dir>/deduplicator_summary to check whether <dir> or any subdirectory contains copies of all the files located in any other of these directories. Prints a list of these directories, each followed by the paths of the directories that contain only files also present in it.

For example: if **DIR_A** contains all the files in **DIR_B**, and **DIR_B** contains all the files in **DIR_C**, dirs would return

  • DIR_A
    • DIR_B
    • DIR_C
  • DIR_B
    • DIR_C
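A sketch of the underlying subset test, assuming a mapping from each directory to the set of content keys of its files has already been built from deduplicator_summary:

```python
def covering_dirs(dir_files):
    """dir_files: dict mapping directory path -> set of file content keys.

    Yields (outer, inner) where every file in inner also exists in outer.
    """
    for outer, outer_keys in dir_files.items():
        for inner, inner_keys in dir_files.items():
            if inner != outer and inner_keys and inner_keys <= outer_keys:
                yield outer, inner

# With the example above: DIR_A covers DIR_B and DIR_C; DIR_B covers DIR_C.
```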

Removing .deduplicator_* Files

deduplicate.py <dir> clean

Deletes the .deduplicator_record and .deduplicator_record_prev files from <dir> (if they exist) and from each nested subdirectory.

Finding and Deleting Empty Folders

deemptydir.py <dir>

Finds all empty directories (directories containing no files and no nonempty subdirectories) within <dir> and lists them to the console, ignoring any .deduplicator_* files in the process. Directories which contain only empty subdirectories are listed instead of their subdirectories.
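A sketch of that bottom-up test, treating .deduplicator_* files as if they were absent:

```python
import os

def empty_dirs(base):
    """Return the topmost directories under base that contain no real files."""
    empty = set()
    for dirpath, dirnames, filenames in os.walk(base, topdown=False):
        real_files = [f for f in filenames if not f.startswith('.deduplicator_')]
        subdirs = [os.path.join(dirpath, d) for d in dirnames]
        if not real_files and all(d in empty for d in subdirs):
            empty.add(dirpath)
            empty.difference_update(subdirs)  # report the parent, not its children
    return sorted(empty)
```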

Options:

-d
Delete all listed empty directories