Store fewer bytes thanks to backreferences.

Primary language: C.  License: GNU General Public License v2.0 (GPL-2.0).

undup - compress files by consolidating duplicate data

undup compresses an input stream by watching for blocks that have already
appeared and replacing each duplicate with a backreference to the earlier
copy.  Integrity is ensured by validating a SHA-256 digest of the entire
stream at reconstruction time.
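
The idea of replacing repeated blocks with backreferences can be sketched
in a few lines of C.  This is only an illustration under stated
assumptions, not undup's actual algorithm or file format: the fixed
4-byte block size, the FNV-1a hash (standing in for a cryptographic
hash), the linear scan over earlier blocks, and the names `token_t` and
`dedup_blocks` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumptions for illustration only; none of this is undup's real format. */
#define BLOCK_SIZE 4      /* tiny, so small inputs are easy to follow */
#define MAX_BLOCKS 1024   /* upper bound on blocks this sketch handles */

/* One output token per input block. */
typedef struct {
    int    is_backref;  /* 1: duplicate of an earlier block, 0: literal */
    size_t index;       /* backref: index of the earlier block;
                           literal: this block's own index */
} token_t;

/* FNV-1a 64-bit hash, a cheap stand-in for a cryptographic hash. */
static uint64_t fnv1a(const unsigned char *p, size_t n)
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Split `in` into BLOCK_SIZE blocks (the last may be shorter) and emit
 * one token per block.  Returns the number of tokens written. */
static size_t dedup_blocks(const unsigned char *in, size_t len,
                           token_t *out, size_t out_cap)
{
    uint64_t seen[MAX_BLOCKS];
    size_t nblocks = (len + BLOCK_SIZE - 1) / BLOCK_SIZE;

    assert(nblocks <= MAX_BLOCKS && nblocks <= out_cap);

    for (size_t b = 0; b < nblocks; b++) {
        size_t off = b * BLOCK_SIZE;
        size_t n = (len - off < BLOCK_SIZE) ? len - off : BLOCK_SIZE;
        uint64_t h = fnv1a(in + off, n);

        out[b].is_backref = 0;
        out[b].index = b;

        /* Every earlier block is full-sized, so only full blocks can match.
         * The hash filters candidates; memcmp confirms a real match. */
        for (size_t prev = 0; n == BLOCK_SIZE && prev < b; prev++) {
            if (seen[prev] == h &&
                memcmp(in + prev * BLOCK_SIZE, in + off, BLOCK_SIZE) == 0) {
                out[b].is_backref = 1;
                out[b].index = prev;
                break;
            }
        }
        seen[b] = h;
    }
    return nblocks;
}
```

For the 14-byte input "ABCDEFGHABCDXY" this yields four tokens, with the
third marked as a backreference to block 0.  A practical tool would
replace the quadratic scan with a hash table keyed on the block hash.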

undup is intended to be pipelined with a general-purpose compressor such as
gzip, bzip2, or xz.

USAGE
-----

tar cf - dir | undup | xz > dir.tar.undup.xz
xzcat dir.tar.undup.xz | undup -d -o dir.tar; tar xf dir.tar

SAMPLE RESULTS
--------------

% for r in 3.0 3.1 3.2 3.3-rc1; do
    git archive --format=tar --prefix=linux-$r/ v$r | tar -C /tmp/linuxes -xf -
done
% tar -C /tmp -cf linuxes.tar linuxes
% du -shc /tmp/linuxes/*
500M    /tmp/linuxes/linux-3.0
504M    /tmp/linuxes/linux-3.1
511M    /tmp/linuxes/linux-3.2
518M    /tmp/linuxes/linux-3.3-rc1
2.0G    total

File sizes:

1833635840   linuxes.tar
 937173504   linuxes.tar.undp
 404399664   linuxes.tar.gz
 316914845   linuxes.tar.bz2
 270460412   linuxes.tar.xz
 203023371   linuxes.tar.undp.gz
 167099750   linuxes.tar.lrz
 159673153   linuxes.tar.undp.bz2
 138929420   linuxes.tar.undp.xz


format   ratio    pipelined w/ undup
------   -----    ------------------
undp      1.95
gzip      4.53       9.03
bzip2     5.78      11.48
xz        6.78      13.19
lrzip    10.97
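
Each ratio in the table is the size of linuxes.tar divided by the size of
the corresponding output file from the list above.  A minimal C sketch of
that arithmetic follows; the function name `compression_ratio` is
hypothetical and not part of undup.

```c
#include <assert.h>
#include <stdint.h>

/* Compression ratio = original size / compressed size (both in bytes). */
static double compression_ratio(uint64_t original_bytes,
                                uint64_t compressed_bytes)
{
    assert(compressed_bytes > 0);  /* avoid dividing by zero */
    return (double)original_bytes / (double)compressed_bytes;
}
```

For example, compression_ratio(1833635840, 159673153) gives about 11.48,
the undup + bzip2 entry, and compression_ratio(1833635840, 404399664)
gives about 4.53, the plain gzip entry.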

Timings for undup + compressors on a Core i7 L 640 @ 2.13 GHz (2.9 GHz Turbo):

First, we time the undup phase.  This consumes a significant amount
of memory (for undup 0.2, about 105 MB of RAM to store hashes for the
1.8 GB linuxes.tar) and can be pipelined, but to get the most
reproducible timing results, we've run each phase separately.

undup linuxes.tar 47.26s user 4.15s system 97% cpu 52.885 total

Second, we compare times for various compressors to compress
linuxes.tar.undp.

gzip   35.81s user 0.72s system 96% cpu 37.817 total
bzip2 117.79s user 0.45s system 99% cpu 1:58.66 total
xz    606.51s user 1.31s system 99% cpu 10:09.72 total

undup + bzip2 achieves an 11.48x compression ratio while consuming only
about 165 seconds of user CPU time (47.26s + 117.79s) when the two phases
run separately.  Elapsed time when they run as a pipeline is reasonably
similar:

undup 59.64s user 3.93s system 32% cpu 3:14.76 total
bzip2 138.65s user 1.05s system 71% cpu 3:14.73 total
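
To compare the elapsed times quoted above, the "total" field has to be
converted to seconds first.  The sketch below assumes the field is either
plain seconds ("52.885"), minutes and seconds ("1:58.66"), or hours,
minutes, and seconds; the helper name `elapsed_seconds` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Parse an elapsed-time field such as "52.885", "1:58.66" or "1:02:03.5"
 * into seconds.  Returns -1.0 if the string matches none of these forms. */
static double elapsed_seconds(const char *s)
{
    int hours = 0, minutes = 0;
    double seconds = 0.0;

    assert(s != NULL);

    /* Try the longest form first; sscanf reports how many fields matched. */
    if (sscanf(s, "%d:%d:%lf", &hours, &minutes, &seconds) == 3)
        return hours * 3600.0 + minutes * 60.0 + seconds;
    if (sscanf(s, "%d:%lf", &minutes, &seconds) == 2)
        return minutes * 60.0 + seconds;
    if (sscanf(s, "%lf", &seconds) == 1)
        return seconds;
    return -1.0;
}
```

With this helper, the separate runs add up to 52.885 + 118.66 = 171.545
seconds of elapsed time, against 194.76 seconds for the pipelined run.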

This compares favorably to lrzip 0.608, which achieves a 10.97x ratio after
consuming 913 seconds of CPU time (lrzip is multithreaded by default):

lrzip -v -w 10 linuxes.tar 913.08s user 14.99s system 298% cpu 5:10.78 total