archiverjs/node-archiver

Slower ZIP creation after upgrade to 0.11

Closed this issue · 39 comments

I am on a Windows machine running node 0.10.28. I am using this module via grunt-contrib-compress.

To compress this folder structure:

[screenshot: folder structure]

These are the results with different grunt-contrib-compress versions:
v0.12.0 took over 4 minutes and produced an 18913229 byte zip file.
v0.11.0 took over 4 minutes and produced a zip file.
v0.10.0 took 45 seconds and produced an 18913229 byte zip file.
v0.9.0 took 45 seconds and produced an 18809351 byte zip file.
v0.8.0 took 46 seconds and produced an 18809351 byte zip file.

grunt-contrib-compress 0.11 is where they updated to archiver 0.11. Versions 0.10 and below use archiver 0.9 and are ~4 times faster on my Windows machine, so I would assume the regression is in there somewhere.

Interesting stats. Does setting the highWaterMark: 1024 * 1024 * 16 option help?

In regards to archiver 0.9 vs 0.11: there has been a lot of movement, with the decoupling of the zip output stream and more usage of file stat to support things like mode. There would essentially be 42k stat calls (x2, since both compress and archiver need to run them).

In regards to archive size, there have been a few file header fixes that add slightly more size per file.

I'm not too worried about file size, as the archives seem to be correct. I'm mainly interested in saving 3 minutes of my life :) I'll modify the highWaterMark value if you are curious about the results. Which file do I set this value in? As far as the stat calls go, I think that's all over my head.

It goes in your Gruntfile, in the options for the task. Stat calls are just one possibility for what is taking longer.

The highWaterMark should allow more memory to be used to buffer between the streams, and thus less downtime between files, since zip is a serial process.

So something like:

deployFiles = [
    '**',
    '!build-report.txt',
    '!util/**',
    '!jasmine-favicon-reporter/**',
    '!**/*.uncompressed.js',
    '!**/*consoleStripped.js',
    '!**/*.min.*',
    '!**/tests/**',
    '!**/bootstrap/test-infra/**',
    '!**/bootstrap/less/**'
],
...
compress: {
    main: {
        options: {
            archive: 'deploy/deploy.zip',
            highWaterMark: 1024 * 1024 * 16
        },
        files: [{
            src: deployFiles,
            dest: './',
            cwd: 'dist/',
            expand: true
        }]
    }
}

Correct.

Going on 6 minutes with the highWaterMark... looks like it made it worse.

7 minutes, and a file size of 17183089 bytes. Is the compression level affecting speed?

Hmm, the default is to use the default compression level for the OS. How long does it take if you pass store: true just under expand: true? Remove the highWaterMark also.
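
That is, roughly this change to the files entry from the config above, with highWaterMark removed from options (everything else stays the same):

files: [{
    src: deployFiles,
    dest: './',
    cwd: 'dist/',
    expand: true,
    store: true
}]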

What kind of specs do you have? It could be a combination of things, including CPU speed, memory, and hard drive speed. Either way, that's a lot of files to process, which I believe is what you're seeing, as older versions of compress used an archiver that assumed more rather than verifying with FS lookups.

I'm on a MacBook Pro running 64-bit Windows Server in Fusion. I think the VM has 7 GB of RAM and 2 processors, with 70 GB assigned to it from the MBP's SSD.

http://i.imgur.com/tUZONW0.png

With store: true it took about 6:30.

Yeah, I'm thinking part of it has to do with stat. I made some tweaks to the archiver core and saw about a 10ms drop by making the stat calls run in a queue-like system rather than just before appending a file. I'll have you try the newer version once it's available; by those measures it should be about 2 minutes faster, if not more, since I'm also going to allow reuse of the stat data that compress gets.

hmm

sounds great?!

I wonder how archiver's performance would compare to a native tool like zip or tar; I'll do some benchmarks.

@silverwind would love to see the results of this!

Also, you could try spawning a zip or tar command to compare against the Node implementations (:

Native tools will be faster due to being pure C. How much faster, I don't know, as Node's fs bindings are C also.
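
For reference, spawning and timing a native zip from Node could look roughly like this (a sketch; the archive name and source directory here are placeholders):

var spawn = require('child_process').spawn;

var start = Date.now();
// -r: recurse into the directory, -q: quiet output
var zip = spawn('zip', ['-rq', 'native.zip', 'dist/']);

zip.on('close', function (code) {
    console.log('native zip exited with ' + code + ' after ' + (Date.now() - start) + ' ms');
});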

@ctalkington did you cut that version you wanted me to try out?

Not yet; it will most likely be part of 0.12 due to some changes in the API and options. It will most likely come out in 2 weeks, depending on free time.

Please do something to optimize this. When zipping a big git repository, it is very slow and consumes a lot of resources:

[screenshot: resource usage]

And I'm on an SSD... 😞

@IonicaBizau how many files did that entail? What was the resulting zip size? And how long did it take?

The directory being zipped is 70 MB, and using zip -r foo.zip mydir, mydir was archived quickly:

real    0m10.109s
user    0m6.050s
sys 0m3.064s

For the same directory, the archiving process with this module started on Fri Jan 16 2015 16:04:15 GMT+0200 (EET) and hasn't finished yet (after 50 minutes).

Also, the processor jumps to 100%.

[screenshot from 2015-01-16 16:05:03]

The node process was consuming 1.4 GB, but now it is consuming 2 GB...

[screenshot from 2015-01-16 16:12:23]

So I finally got around to comparing archiver, yazl, and native zip. This is on OS X with io.js 1.0.2 (which seems around 10% faster than node 0.10).

Moderate tree, 4549 files:

zip: 6.0s
yazl: 7.9s
archiver: 10.1s

Linux tree, 48410 files:

zip: 64s
yazl: 115s
archiver: 45min and still going

I tried trees of various sizes, and I think after a certain number of files, archiver doesn't finish the job. Right now the last write to the archive happened 3 minutes ago, with the CPU sitting at 100%.

@ctalkington Test with the contents of https://github.com/torvalds/linux/archive/master.zip
@IonicaBizau Please do tell how many files you're zipping; the size seems pretty irrelevant to this issue.

@silverwind thanks, I'd also be curious to know whether it's better or worse when store: true is used (i.e. removing Node's zlib from the mix).

Also, was this using bulk? Does yazl stat files?

Also, if I had to guess, on the moderate tree the 3s difference comes from archiver automatically sorting out what the input is, plus the internal streaming.

The Linux tree, though, seems like it's stuck somewhere in the queue.

@silverwind do you have the script you used to compare, so that I can fiddle with it?

Here you go: https://gist.github.com/silverwind/ac8ec0c33753057cafe7

npm install yazl archiver graceful-fs
node bench.js [folder in same dir]

Had to use graceful-fs because of EMFILE errors, but it's an easy switch in the require() if you want to use vanilla fs. Also, I didn't compare the resulting zip contents, but yazl zips ended up a bit smaller every run.
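
For anyone who doesn't want to dig through the gist, the archiver side of the timing boils down to something like this (a rough sketch using the bulk mapping form; graceful-fs swapped in for fs as noted, and the output filename is arbitrary):

var fs = require('graceful-fs');
var archiver = require('archiver');

var dir = process.argv[2];
var output = fs.createWriteStream('bench-archiver.zip');
var archive = archiver('zip');
var start = Date.now();

// timing stops once the output file stream closes
output.on('close', function () {
    console.log('archiver took ' + (Date.now() - start) + ' ms');
});

archive.pipe(output);
archive.bulk([{ expand: true, cwd: dir, src: ['**'] }]);
archive.finalize();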

@silverwind "Please do tell how many files you're zipping" -- I forgot to include that; I know it makes a difference. The directory I am zipping is a git repository containing 219623 files.

$ find . -type f | wc -l
219623

That's way more files than I ever thought this library would be used for. I'm guessing the bottleneck is the queuing, but I'll have to dig into it to try and figure out what changes when you get into big volumes of files.

EDIT: it does seem archiver does the job but doesn't fully finalize/close, and ends up leaking memory or similar.

I've been testing some things; it would appear that finalize does get run, things just never close.

EDIT: this also appears to only affect zip; tar goes through fine. I'm wondering if it has to do with ZIP64 kicking in.

EDIT2: so the process gets through to the point of ending the zip-stream. It would seem like the stream is backing up.

@silverwind can you confirm whether running your tests back to back changes the results? I'm noticing on Windows that the drive buffer or something seems to be speeding things up; the ~5000-file test drops to 14s, but sometimes it jumps to a minute.

@ctalkington I get pretty consistent results on OS X:

yazl took 8516 ms
archiver took 10437 ms

yazl took 8530 ms
archiver took 10592 ms

yazl took 8717 ms
archiver took 10482 ms

yazl took 8819 ms
archiver took 10442 ms

yazl took 8549 ms
archiver took 10563 ms

Probably Windows prefetch or something.

I use this module in the github-contributions project. If you'd like to test, check out the 1.0.0 branch first.

@ctalkington Supposing I want to fix this issue, where should I start? Where do things get fishy?

@IonicaBizau I've been looking at it. It seems related to the streaming getting backed up. I have noticed the new directory function to be a bit more reliable.

Maybe you could test that with your samples.

Is the fix pushed to the repository? How can I test it?

0.14.0 has the new directory helper, albeit in its most basic form.

If you want to throw your massive payload at it, it'd be good to know whether it gets through or hangs like before.
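
Basic usage of the helper is along these lines (a minimal sketch; the output path and directory names are just examples):

var fs = require('fs');
var archiver = require('archiver');

var output = fs.createWriteStream('out.zip');
var archive = archiver('zip');

// fires once the zip has been fully written out
output.on('close', function () {
    console.log('done, ' + archive.pointer() + ' total bytes');
});

archive.pipe(output);
archive.directory('mydir/', 'mydir');
archive.finalize();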

Wow, testing it! Thanks!

Also, let's move this to #114 as it's a slightly different issue.

Closing this out, as it's been a mix of issues. Feel free to compare your results with the latest release and open a new issue if you still see such a slowdown.