dedup
shell script to remove duplicate files
Background
History
Greatly inspired by rdfind, but limited by lack of certain features and inability to add them due to lack of C++ knowledge, I decided to do what? Right, to implement my own version, in plain shell, using a database.
Idea
Idea is simple: have a database with a table, with one row for each file, and columns for each property (like size, first and last bytes, checksum, etc). Different scriptlets will add such columns, and an SQL query will delete unique rows. After that, each group of duplicate rows should be hardlinked to each other.
To properly work with already hardlinked files (they have same inodes), we have two tables: "main" one with one file per inode (this can be trivially implemented by adding a UNIQUE constraint to the 'inode' column), and another one with inode-to-all-filenames mapping (actually, with list of all files).
Usage
Requirements
- shell (currently, busybox)
- sqlite3
- sed, find
Sample scripts
You can find sample script is in sample.sh. Run it, passing as arguments list of directories to scan for duplicates. But you probably should look at it first and maybe comment out last lines and check the database to see what is going to happen.