mafintosh/tar-fs

Improve sorting for better compression

fholzer opened this issue · 5 comments

Compression could be improved further by sorting files not just within a directory, but across all files of the archive, so that similar files are grouped together even when they live in different directories. This would require a major change to the implementation, though: instead of walking the entries and immediately triggering packing, the process would need to be split into three steps:

  1. walk the fs and stat each file
  2. sort
  3. pack

Of course this means that the more files one compresses, the more fs.stat data is held in memory until packing begins.
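To make the idea concrete, here is a rough sketch of what the separated flow could look like when built directly on tar-stream (this is not the actual POC; the directory walk is simplified, the root path is a placeholder, and any comparator, such as the extension-based one further down, could be plugged into the sort step):

const fs = require('fs');
const path = require('path');
const zlib = require('zlib');
const tar = require('tar-stream');

const root = '/path/to/project'; // placeholder

// 1. walk the fs and stat each file (all of this stays in memory until packing)
const walk = (dir, out = []) => {
    for (const name of fs.readdirSync(dir)) {
        const abs = path.join(dir, name);
        const stat = fs.statSync(abs);
        if (stat.isDirectory()) walk(abs, out);
        else if (stat.isFile()) out.push({ abs, name: path.relative(root, abs), stat });
    }
    return out;
};
const files = walk(root);

// 2. sort across all entries of the archive, not per directory
files.sort((a, b) => a.name.localeCompare(b.name, 'en', { sensitivity: 'base' }));

// 3. pack in the sorted order
const pack = tar.pack();
for (const file of files) {
    // buffering is fine for small files; streams would need pack.entry's callback form
    pack.entry(
        { name: file.name, size: file.stat.size, mode: file.stat.mode, mtime: file.stat.mtime },
        fs.readFileSync(file.abs)
    );
}
pack.finalize();
pack.pipe(zlib.createGzip()).pipe(fs.createWriteStream('project.tar.gz'));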

I made a small POC of this in my fork. (This is a hack and by no means tested thoroughly! Its only purpose is to show what impact this has on compression.)

I compressed a tiny project of mine, including the node_modules folder, which totals about 54 MB uncompressed.

tar-fs version         sort    size in bytes
2.1.1                  false   12092668
2.1.1                  true    12092705
fork based on master   false   12092667
fork based on master   true    12133054
fork based on master   func*   11848151

Where func* is a comparator that compares the result of Node's path.extname of the two file names, falling back to comparing the full file names when the extension comparison returns 0.

const path = require('path');

// Group entries by file extension first, then by full name, so similar
// file types end up next to each other in the archive.
const compressionFileSortComparator = (a, b) => {
    const extA = path.extname(a.name);
    const extB = path.extname(b.name);

    const res = extA.localeCompare(extB, 'en', { sensitivity: 'base' });
    if (res !== 0) {
        return res;
    }
    return a.name.localeCompare(b.name, 'en', { sensitivity: 'base' });
};
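Applied to a few header-like objects (the paths are made up for illustration), entries end up grouped by extension first:

const entries = [
    { name: 'src/index.js' },
    { name: 'README.md' },
    { name: 'lib/util.js' },
    { name: 'docs/intro.md' }
];

entries.sort(compressionFileSortComparator);
// => lib/util.js, src/index.js, docs/intro.md, README.md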

Of course results vary depending on the input; even in my testing, just passing sort: true actually produced a slightly larger archive. Using the comparator above, however, gave about 2% better compression.

Nice! The tar-stream module does most of the heavy lifting for this one, in case you wanna make a compression-oriented version. Having all file names in memory is probably not worth it for this module, I'd imagine.

If I understood correctly, tar-stream accepts entry details plus strings/buffers or streams in calls to its pack.entry(). But when people use streams, they might do things like calling pack.entry with stream A and then waiting for that stream to end before calling pack.entry again for another stream. I'm not sure how you could handle reordering of entries in such a scenario; that would have to be done by whoever is calling pack.entry. In tar-fs it's easier, because it is tar-fs itself that makes those calls to tar-stream's pack.entry.
I didn't take an in-depth look at tar-stream though, so maybe I'm just missing something?
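To make the scenario concrete, this is roughly the sequential streaming pattern I mean, going by tar-stream's documented pack.entry() streaming form (a sketch; the file names are made up):

const fs = require('fs');
const tar = require('tar-stream');

const pack = tar.pack();

// Entry A is streamed in; the next entry can only start once A has ended,
// so A's position in the archive is already fixed before B is even added.
const a = pack.entry({ name: 'a.bin', size: fs.statSync('a.bin').size }, (err) => {
    if (err) throw err;

    // Entry B is added only after A finished, so the caller decides the order.
    const b = pack.entry({ name: 'b.bin', size: fs.statSync('b.bin').size }, (err) => {
        if (err) throw err;
        pack.finalize();
    });
    fs.createReadStream('b.bin').pipe(b);
});
fs.createReadStream('a.bin').pipe(a);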

Or did I misunderstand, and you meant to say that this should be taken care of by whoever is calling tar-stream/tar-fs? In that case, feel free to close this.
Cheers/f

I meant to say that tar-stream is the workhorse of this module, so if you need something specific for your use case, making a specialised tar-fs yourself isn't too hard :)

Alright, closing this then.