Deduplicating Archiver
by Solomon Jennings
The following set of tools, written in bash and Perl, creates a deduplicated archive of a directory and offers several compression methods.
To try it for yourself:
./dear -g archive test
./undear -c archive.tar.gz
Two packages need to be installed for dear and undear to work.
The first is a Perl package that computes the MD5 of a file:
sudo cpan install Digest::MD5::File
The second is ncompress (install shown for Ubuntu):
sudo apt-get install ncompress
./dear [compression method] [archive name] [directory]
Compression options are:
- gzip compression
- bzip2 compression
- compress compression
Or no switch to just create a tar file.
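As a rough illustration of how the compression choices above could map onto tar, here is a minimal sketch. It assumes GNU tar, and the method names (gzip, bzip2, compress, none) and both function names are hypothetical labels chosen for this example; they are not dear's actual switches or internals.

```sh
#!/bin/sh
# Hypothetical helper: map a compression method name to GNU tar options.
# The method names are illustrative only, not dear's real switches.
tar_flags() {
    case "$1" in
        gzip)     echo "-czf" ;;   # gzip, typically .tar.gz
        bzip2)    echo "-cjf" ;;   # bzip2, typically .tar.bz2
        compress) echo "-cZf" ;;   # compress (needs ncompress), typically .tar.Z
        none)     echo "-cf"  ;;   # plain tar, no compression
        *)        echo "unknown method: $1" >&2; return 1 ;;
    esac
}

# Usage: create_archive METHOD ARCHIVE DIRECTORY
create_archive() {
    flags=$(tar_flags "$1") || return 1
    # -C keeps only the directory's own name inside the archive.
    tar "$flags" "$2" -C "$(dirname "$3")" "$(basename "$3")"
}
```

For example, create_archive gzip archive.tar.gz test would pack the test directory with gzip compression.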
An archive can be unpacked:
./undear [duplicate method] [archive name]
Duplicate handling options are:
- restore the duplicate
- symbolically link the duplicate
- remove any duplicates, keeping only the first original found
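The three duplicate handling behaviours can be pictured with a small sketch like the one below. The mode names (restore, symlink, remove) and the function name are hypothetical labels for this example and are not undear's actual switches; ORIGINAL should be an absolute path so that a symbolic link to it resolves correctly.

```sh
#!/bin/sh
# Usage: handle_duplicate MODE ORIGINAL DUPLICATE
# MODE names are hypothetical labels for the three behaviours listed above.
handle_duplicate() {
    case "$1" in
        restore)  # recreate the duplicate as a real copy of the original
            mkdir -p "$(dirname "$3")" && cp -p "$2" "$3" ;;
        symlink)  # recreate the duplicate as a symbolic link to the original
            mkdir -p "$(dirname "$3")" && ln -s "$2" "$3" ;;
        remove)   # keep only the original; nothing is recreated
            : ;;
        *)
            echo "unknown mode: $1" >&2; return 1 ;;
    esac
}
```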
Dear works in the following way:
- The folder to archive is copied to a temporary directory (/tmp/) to avoid possible destruction of the original data
- All duplicates are removed and a metadata file is created, recording where each original file is located and where its duplicates should be restored
- The folder is archived, removing its files in the process
- The archive is moved to the output directory
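The steps above can be sketched roughly as follows. This is only an outline under stated assumptions: the function name is hypothetical, the deduplication step is left as a placeholder comment, and the temporary copy is simply deleted after archiving rather than having its files removed during archiving.

```sh
#!/bin/sh
# Usage: archive_safely INPUT_DIR OUTPUT_ARCHIVE
# Hypothetical outline of the copy, archive, and move steps.
archive_safely() {
    input=$1
    output=$2
    work=$(mktemp -d /tmp/dear.XXXXXX) || return 1
    name=$(basename "$input")

    # 1. Work on a copy so the original data is never modified.
    cp -a "$input" "$work/$name" || return 1

    # 2. Deduplication would run here against "$work/$name" (omitted).

    # 3. Archive the working copy as an uncompressed tar file.
    tar -cf "$work/$name.tar" -C "$work" "$name" || return 1

    # 4. Move the finished archive to its destination and clean up.
    mv "$work/$name.tar" "$output" && rm -rf "$work"
}
```

Because the archive is built inside the temporary directory and only moved at the end, the output path can safely point inside the input folder.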
Undear works in the following way:
- The archive is unpacked
- Using the metadata file and the restore option provided, the duplicates are restored
- The metadata is removed
- The folder is moved to the correct location
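A minimal sketch of this unpack and restore flow is shown below. The metadata file name (.dear_metadata), its format (one "duplicate<TAB>original" pair per line, relative to the unpacked folder), and both function names are assumptions made for this example; the real metadata layout is not described here. Only the plain restore behaviour is shown.

```sh
#!/bin/sh
# Usage: restore_from_metadata EXTRACTED_DIR
# Assumes EXTRACTED_DIR/.dear_metadata holds "duplicate<TAB>original" lines,
# both relative to EXTRACTED_DIR. File name and format are hypothetical.
restore_from_metadata() {
    root=$1
    meta="$root/.dear_metadata"
    [ -f "$meta" ] || return 0
    tab=$(printf '\t')
    while IFS="$tab" read -r dup orig; do
        [ -n "$dup" ] || continue
        mkdir -p "$(dirname "$root/$dup")"
        cp -p "$root/$orig" "$root/$dup" || return 1   # restore the duplicate
    done < "$meta"
    rm -f "$meta"                                      # metadata is removed
}

# Usage: unpack_and_restore ARCHIVE FOLDER_NAME DEST_DIR
unpack_and_restore() {
    archive=$1; name=$2; dest=$3
    work=$(mktemp -d /tmp/undear.XXXXXX) || return 1
    tar -xf "$archive" -C "$work" || return 1          # unpack the archive
    restore_from_metadata "$work/$name" || return 1    # restore duplicates
    mv "$work/$name" "$dest/" && rm -rf "$work"        # move folder into place
}
```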
Deduplication happens in the following way:
- The input directory is traversed
- An MD5 hash is created for each file and stored in a Perl hash
- If the MD5 already exists, the file is removed, and its location and the original's location are stored in a metadata file
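The same idea can be sketched in shell, using md5sum in place of Digest::MD5::File and an awk array in place of the Perl hash. This is an illustration rather than the tool's own code: the function name and the "duplicate<TAB>original" metadata format are assumptions, the metadata file is expected to live outside the directory being scanned, and file names containing tabs or newlines are not handled.

```sh
#!/bin/sh
# Usage: dedup_dir DIRECTORY METADATA_FILE
# Removes duplicate files and writes "duplicate<TAB>original" lines
# (a hypothetical format) to METADATA_FILE, which must be outside DIRECTORY.
dedup_dir() {
    dir=$1
    meta=$2
    tab=$(printf '\t')
    # Traverse the directory in a stable order and hash every file.
    find "$dir" -type f | LC_ALL=C sort | while IFS= read -r file; do
        printf '%s\t%s\n' "$(md5sum < "$file" | cut -d' ' -f1)" "$file"
    done | awk -F'\t' '
        ($1 in seen) { print $2 "\t" seen[$1]; next }  # hash seen before: duplicate
                     { seen[$1] = $2 }                  # first file with this hash
    ' | while IFS="$tab" read -r dup orig; do
        # Remove the duplicate and record where it was and where the original is.
        rm -f -- "$dup" && printf '%s\t%s\n' "$dup" "$orig"
    done > "$meta"
}
```

The first file encountered with a given hash is kept as the original, so the traversal order decides which copy survives.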
Notes:
- Filenames with spaces are supported
- Symbolic links are handled as per the -a switch of cp (GNU cp -a copies links as links rather than following them)
- dear must be executed from the same directory of [...]. However, any input and output directory path can be passed to dear
- If the destination of the archive is within the folder to compress, the archive will be created and then moved into the desired folder
- There must be enough room on the disk for a complete copy of the input folder plus the archive
- An MD5 collision is possible but extremely unlikely (unless the files are specially crafted); if one occurs, a false duplicate may be detected
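One possible way to guard against a false duplicate from an MD5 collision is to compare the two files byte for byte before removing anything. The sketch below is a suggestion only and is not described as part of dear; the function name is hypothetical.

```sh
#!/bin/sh
# Usage: is_true_duplicate FILE_A FILE_B
# Succeeds only if both files have identical contents, byte for byte.
is_true_duplicate() {
    cmp -s "$1" "$2"
}

# Example use inside a deduplication loop (paths are placeholders):
#   if is_true_duplicate "$candidate" "$original"; then
#       rm -f -- "$candidate"
#   fi
```

The extra comparison costs one additional read of both files, but only for pairs whose hashes already match.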