undup - compress files by consolidating duplicate data undup tries to compress an input stream by watching for blocks that have previously appeared. It replaces the duplicated data with a backreference. Integrity is ensured by validating a SHA256 across the entire stream at reconstruction time. undup is intended to be pipelined with a general-purpose compressor such as gzip, bzip2, or xz. USAGE ----- tar cf - dir | undup | xz > dir.tar.undup.xz xzcat dir.tar.undup.xz | undup -d -o dir.tar; tar xf dir.tar SAMPLE RESULTS -------------- % for r in 3.0 3.1 3.2 3.3-rc1; do git archive --format=tar --prefix=linux-$r/ v$r | tar -C /tmp/linuxes -xf - done % tar -C /tmp -cf linuxes.tar linuxes % du -shc /tmp/linuxes/* 500M /tmp/linuxes/linux-3.0 504M /tmp/linuxes/linux-3.1 511M /tmp/linuxes/linux-3.2 518M /tmp/linuxes/linux-3.3-rc1 2.0G total File sizes: 1833635840 linuxes.tar 937173504 linuxes.tar.undp 404399664 linuxes.tar.gz 316914845 linuxes.tar.bz2 270460412 linuxes.tar.xz 203023371 linuxes.tar.undp.gz 167099750 linuxes.tar.lrz 159673153 linuxes.tar.undp.bz2 138929420 linuxes.tar.undp.xz format ratio pipelined w/ undup ------ ----- ------------------ undp 1.95 gzip 4.53 9.03 bzip2 5.78 11.48 xz 6.78 13.19 lrzip 10.97 Timings for undup + compressors on Core i7 L 640 @ 2.13GHz (2.9 GHz Turbo) First, we time the undup phase. This consumes a significant amount of memory (for undup 0.2, about 105 MB of RAM to store hashes for the 1.8 GB linuxes.tar) and can be pipelined, but to get the most reproducible timing results, we've run each phase separately. undup linuxes.tar 47.26s user 4.15s system 97% cpu 52.885 total Second, we compare times for various compressors to compress linuxes.tar.undp. gzip 35.81s user 0.72s system 96% cpu 37.817 total bzip2 117.79s user 0.45s system 99% cpu 1:58.66 total xz 606.51s user 1.31s system 99% cpu 10:09.72 total undup + bzip2 achieves an 11.48x compression ratio while consuming only 165 seconds of CPU time; elapsed time for a pipeline is reasonably similar: undup 59.64s user 3.93s system 32% cpu 3:14.76 total bzip2 138.65s user 1.05s system 71% cpu 3:14.73 total This compares favorably to lrzip 0.608, which achieves a 10.97x ratio after consuming 913 seconds of CPU time (lrzip is multithreaded by default): lrzip -v -w 10 linuxes.tar 913.08s user 14.99s system 298% cpu 5:10.78 total