Improve sorting for better compression
fholzer opened this issue · 5 comments
Compression could be further improved by sorting files not just within a directory, but across all files of the archive, so that similar files end up grouped together even when they live in different directories. This would require a major change to the implementation, though: instead of walking the entries and immediately triggering packing, the process would need to be separated into three steps:
- walk the fs and stat each file
- sort
- pack
Of course this means that the more files one compresses, the more fs.stat data will be held in memory until packing begins.
I made a small POC of this in my fork. (This is a hack and by no means tested thoroughly! Its only purpose is to show what impact this has on compression.)
I compressed a tiny project of mine including the node_modules folder, which uncompressed totals about 54MB.
tar-fs version | sort | size in bytes |
---|---|---|
2.1.1 | false | 12092668 |
2.1.1 | true | 12092705 |
fork based on master | false | 12092667 |
fork based on master | true | 12133054 |
fork based on master | func* | 11848151 |
Where func is a comparator that compares the results of Node's path.extname applied to the file names, falling back to comparing the full file names in case the extension comparison results in 0.
```js
const path = require('path');

// Sort by file extension first, fall back to the full file name on ties.
const compressionFileSortComparator = (a, b) => {
  const extA = path.extname(a.name);
  const extB = path.extname(b.name);
  const res = extA.localeCompare(extB, 'en', { sensitivity: 'base' });
  if (res !== 0) {
    return res;
  }
  return a.name.localeCompare(b.name, 'en', { sensitivity: 'base' });
};
```
Of course results vary depending on the input; even in my testing, just providing `sort: true` actually made compression slightly worse, while the comparator above yielded about 2% better compression.
Nice! The tar-stream module does most of the heavy lifting of this one, in case you wanna make a compression oriented version. Having all file names in memory is prob not worth it for this module I'd imagine.
If I understood correctly, tar-stream accepts entry details plus strings/buffers or streams in calls to its `pack.entry()`. Though when people use streams, they might call `pack.entry` with stream A, then wait for that stream to end before calling `pack.entry` again for another stream. I'm not sure how reordering of entries could be handled in such a scenario; it would need to be done by whoever is calling `pack.entry`. In tar-fs it's easier, because it is tar-fs itself that makes those calls to tar-stream's `pack.entry`.
I didn't have an in-depth look at tar-stream though, so maybe I'm just missing something?
Or did I misunderstand and you meant to say that this should be taken care of by whomever is calling tar-stream/fs? In that case, feel free to close this.
Cheers/f
I meant to say that tar-stream is the workhorse of this module, so if you need something specific for your usecase making a specialised tar-fs yourself isn’t too hard :)
Alright, closing this then.