Scripts for
- Finding and deleting duplicate files
- Finding and removing empty directories
`deduplicate.py`
- Script that generates the list of duplicate files and optionally deletes files from that list.

`deemptydir.py`
- Script that generates the list of directories containing no files and optionally deletes them.

`dupfilters.py`
- Class defining the functions used by `deduplicate.py` for sorting instances of duplicate files. All sorting functions are defined in this class, with a simple interface for extending it with additional functions.

`deduplicate.ini`
- Configuration file referenced by `deduplicate.py` during some sorting functions. All values are examples and should be modified to fit your use. Place a modified copy in the base directory being searched to define a local configuration for that search.
All Python files tested exclusively with Python 3.5.3 on Debian 9.
NOTE: The scheme for comparing files is to first compare file size, and then to compare a CRC checksum of the first 4 MiB of the file. File header information is not used. This means two files which are the same size and have an identical leading 4 MiB will be considered duplicates. This may be a problem for multiple versions of a disk image (where size is fixed and data is identical for significant portions of the file). This will likely not be changed unless an issue is opened.
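For illustration, a minimal sketch of this comparison scheme (function and variable names here are illustrative, not the script's actual internals):

```python
import os
import zlib

CHUNK = 4 * 1024 * 1024  # leading 4 MiB used for the checksum

def file_signature(path):
    """Return the (size, CRC-of-first-4MiB) pair used to detect duplicates."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        crc = zlib.crc32(f.read(CHUNK))
    return (size, crc)

def are_duplicates(path_a, path_b):
    # Two files are treated as duplicates when their signatures match.
    return file_signature(path_a) == file_signature(path_b)
```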
### `deduplicate.py <dir> build`

Builds a list in `<dir>` and in each subdirectory (recursive) of the files in that directory. Saves each list as `.deduplicator_record`. Uses these lists to find which files are identical and writes a list of all duplicate instances to `<dir>/deduplicator_summary`. (A sketch of this walk follows the options below.)
- `--full` - If any directory already contains a `.deduplicator_record`, rename it to `.deduplicator_record_prev` and generate a new one.
- `--light` - If any directory already contains a `.deduplicator_record`, generate a new one using any data it has for files still in that directory. Currently does not save the old file.

Default behavior (neither option) is to skip generating any `.deduplicator_record` which already exists.
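Conceptually, `build` amounts to a recursive walk along these lines, reusing the `file_signature` sketch above (the JSON record layout is an assumption; the real `.deduplicator_record` format may differ):

```python
import json
import os

def build_records(base):
    """Write a .deduplicator_record in each directory under base."""
    for dirpath, dirnames, filenames in os.walk(base):
        record = {}
        for name in filenames:
            if name.startswith('.deduplicator'):
                continue  # skip the tool's own bookkeeping files
            # Record layout is illustrative: filename -> (size, crc).
            record[name] = file_signature(os.path.join(dirpath, name))
        with open(os.path.join(dirpath, '.deduplicator_record'), 'w') as f:
            json.dump(record, f)
```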
### `deduplicate.py <dir> index`

Builds a list of all files like the `build` command but saves it to a file named `deduplicator_index`. This file can then be moved to another directory, which is then scanned for duplicates: running the `build` command in a directory with a `deduplicator_index` file will consider the files listed in that file when building the `deduplicator_summary`. This means deduplicator can find duplicate files across different folders or even filesystems. Any file whose name contains the string `deduplicator_index` will be read, so multiple indexes can be integrated into a single summarization process (`deduplicator_index_1`, `deduplicator_index_2`, ...).
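For example, to check a photo directory against a backup drive on another filesystem (all paths here are hypothetical):

```
deduplicate.py /mnt/backup index
mv /mnt/backup/deduplicator_index ~/photos/deduplicator_index_backup
deduplicate.py ~/photos build
```

The resulting `~/photos/deduplicator_summary` then includes matches against the files listed in the backup's index.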
### `deduplicate.py <dir> list SORT`
### `deduplicate.py <dir> delete SORT`

Reads the list of duplicates at `<dir>/deduplicator_summary` and sorts them into primary and duplicate instances according to the function `SORT`. `list` prints these sorted lists and exits; `delete` deletes each file marked as duplicate after listing. (Normal use would typically involve running `list` to check whether instances were sorted as intended before running `delete`.)
- `depth` - sort instances by the number of directories in each file path
- `length` - sort instances by the length of each filename
- `date` - sort instances by their last-modified date
- `dlist` - sort instances within particular subdirectories as duplicate (returns 1 on match)
- `plist` - sort instances within particular subdirectories as primary (returns 0 on match)
Each sort function returns a numeric value for sorting each list instance. Instances with the lowest value are considered primary instances.
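For illustration, a `depth`-style function could be as simple as counting path separators; the exact signature expected by `dupfilters.py` may differ:

```python
import os

def depth(path):
    """Sort value for one instance: fewer path components sorts first.

    Instances with the lowest value are treated as primary, so a file
    nested less deeply would be kept.
    """
    return path.count(os.sep)
```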
For `dlist` and `plist`, the script will first check the `<dir>` directory and then the directory of `deduplicate.py` for a `deduplicate.ini` file containing a list of the directories to sort by.
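Purely as a hypothetical illustration of such a configuration (consult the shipped `deduplicate.ini` for the real sections and keys):

```ini
; Hypothetical layout -- the actual deduplicate.ini format may differ.
[plist]
dirs = /data/originals

[dlist]
dirs = /data/downloads, /data/tmp
```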
- `-a, --all` - Consider all instances with the lowest sort value as primary. (Default: keep only the instance with the lexicographically first filepath as primary.)
- `-p, --printall` - Print entries in the sorted duplicate list where all instances are primary. (Default: only list entries where at least one instance is sorted as a duplicate.)
Delete operations can be done from the summary if no files from the external index are being deleted, i.e. the index contains the primary locations. The best way to do this is to run `deduplicate.py <dir> delete plist -a` and specify the index file as a primary directory in `deduplicate.ini`.
### `deduplicate.py <dir> dirs`

Reads `<dir>/deduplicator_summary` to check whether `<dir>` or any subdirectory contains copies of all the files located in any other of these directories. Prints each such directory followed by the path of every directory whose files it fully contains. (A sketch of this check follows the example below.)
For example: if **DIR_A** has all the files at **DIR_B** and **DIR_B** has all the files at **DIR_C**, `dirs` would return

- DIR_A
  - DIR_B
  - DIR_C
- DIR_B
  - DIR_C
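Conceptually this is a subset test over each directory's set of file signatures; a sketch under that assumption (names are illustrative):

```python
def contained_directories(dir_files):
    """dir_files maps each directory path to the set of file signatures it holds.

    Yields (outer, inner) pairs where every file in inner also appears in
    outer, i.e. inner's contents are fully duplicated within outer.
    """
    for outer, outer_set in dir_files.items():
        for inner, inner_set in dir_files.items():
            if inner != outer and inner_set and inner_set <= outer_set:
                yield (outer, inner)
```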
### `deduplicate.py <dir> clean`

Deletes the `.deduplicator_record` and `.deduplicator_record_prev` files from `<dir>` (if they exist) and from each nested subdirectory.
### `deemptydir.py <dir>`

Finds all empty directories (directories containing no files and no nonempty subdirectories) within `<dir>` and lists them to the console, ignoring any `.deduplicator_*` files in the process. Directories which contain only empty subdirectories are listed instead of their subdirectories. (A sketch of this test follows the option below.)
- `-d` - Delete all listed empty directories
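The underlying test can be sketched as a bottom-up walk (a sketch, not the script's actual implementation):

```python
import os

def find_empty_dirs(base):
    """Return the topmost directories containing no real files."""
    empty = set()
    # Walk bottom-up so each directory's children are classified first.
    for dirpath, dirnames, filenames in os.walk(base, topdown=False):
        files = [f for f in filenames if not f.startswith('.deduplicator')]
        children_empty = all(os.path.join(dirpath, d) in empty
                             for d in dirnames)
        if not files and children_empty:
            empty.add(dirpath)
    # List a directory instead of its empty subdirectories.
    return sorted(d for d in empty if os.path.dirname(d) not in empty)
```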