Slower ZIP creation after upgrade to 0.11
I am on a Windows machine running node 0.10.28. I am using this module from grunt-contrib-compress to compress this folder structure. When using different grunt-contrib-compress versions, these are the results:
v0.12.0 took over 4 minutes and produced an 18913229 byte zip file.
v0.11.0 took over 4 minutes and produced a zip file.
v0.10.0 took 45 seconds and produced an 18913229 byte zip file.
v0.9.0 took 45 seconds and produced an 18809351 byte zip file.
v0.8.0 took 46 seconds and produced an 18809351 byte zip file.
Version 0.11 is where they updated to archiver 0.11. Version 0.10 and below use archiver 0.9 and are ~4 times faster on my Windows machine. I would assume the regression is in there somewhere.
Interesting stats. Does setting the highWaterMark: 1024 * 1024 * 16 option help?
In regards to archiver 0.9 vs 0.11: there has been a lot of movement, with the decoupling of the zip output stream and more usage of file stat to support things like mode. There would essentially be 42k stat calls (x2, since both compress and archiver need to run them).

In regards to archive size, there have been a few file header fixes that add slightly more size per file.
I'm not too worried about file size as the archives seem to be correct. I'm mainly interested in saving 3 minutes of my life :) I'll modify the highWaterMark value if you are curious about the results. What file does this value go in? As far as the stat calls go, I think that's all over my head.
It goes in your Gruntfile options for the task. Stat calls are just one possibility for what is taking longer.
The highWaterMark should allow more memory to be used to buffer between the streams, and thus less downtime between files, since zip is a serial process.
so something like:

```js
deployFiles = [
  '**',
  '!build-report.txt',
  '!util/**',
  '!jasmine-favicon-reporter/**',
  '!**/*.uncompressed.js',
  '!**/*consoleStripped.js',
  '!**/*.min.*',
  '!**/tests/**',
  '!**/bootstrap/test-infra/**',
  '!**/bootstrap/less/**'
],
...
compress: {
  main: {
    options: {
      archive: 'deploy/deploy.zip',
      highWaterMark: 1024 * 1024 * 16
    },
    files: [{
      src: deployFiles,
      dest: './',
      cwd: 'dist/',
      expand: true
    }]
  }
}
```
correct.
Going on 6 minutes with the highWaterMark... looks like it made it worse.

7 minutes and a file size of 17183089 bytes. Is the compression level affecting speed?
Hmm, the default is to use the default compression level for the OS. How long does it take if you pass store: true, just under expand: true? Remove the highWaterMark also.
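Taken literally, the suggested change would look roughly like this (a sketch based on the Gruntfile above, with store: true placed in the files entry as described and highWaterMark dropped):

```js
compress: {
  main: {
    options: {
      archive: 'deploy/deploy.zip'
      // highWaterMark removed
    },
    files: [{
      src: deployFiles,
      dest: './',
      cwd: 'dist/',
      expand: true,
      store: true // store entries without compression
    }]
  }
}
```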
What kind of specs do you have? It could be a combination of things, including CPU speed, memory, and hard drive speed. Either way, that's a lot of files to process, which I believe is what you're seeing, as older versions of compress used an archiver that assumed more vs verifying with FS lookups.
Yeah, I'm thinking part of it has to do with stat. I made some tweaks to the archiver core and saw about a 10ms drop by making the stat calls run in a queue-like system vs just before appending a file. I'll have you try the newer version once it's available; by those measures it should be about 2 minutes faster, if not more, since I'm also going to allow reuse of the stat data that compress gets.
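From the calling side, reusing stat data might look roughly like this (a sketch only; at the time of this thread the stats entry option is an assumption about the planned API, and the paths are illustrative):

```js
// Sketch: pass already-known fs.Stats to archiver so it can skip its own
// stat call for each entry (assumed `stats` entry option; illustrative paths).
var fs = require('fs');
var archiver = require('archiver');

var archive = archiver('zip');
archive.pipe(fs.createWriteStream('deploy/deploy.zip'));

var stats = fs.statSync('dist/index.html'); // stat done once, e.g. by compress
archive.append(fs.createReadStream('dist/index.html'), {
  name: 'index.html',
  stats: stats // reused instead of letting archiver stat the file again
});

archive.finalize();
```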
I wonder how archiver's performance would compare to a native tool like zip or tar; I'll do some benchmarks.
@silverwind would love to see the results of this!
Also, you could try spawning a zip or tar command to compare node implementations (:
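A minimal sketch of that kind of comparison, assuming the system zip binary is on PATH (the output file and directory names are illustrative):

```js
// Time the native `zip` binary for the same directory, for comparison with
// the node implementations. Assumes `zip` is on PATH; paths are illustrative.
var execFile = require('child_process').execFile;

var start = Date.now();
execFile('zip', ['-r', '-q', 'native.zip', 'dist/'], function (err) {
  if (err) throw err;
  console.log('native zip took', Date.now() - start, 'ms');
});
```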
Native tools will be faster due to being pure C. How much faster, I don't know, as node's fs bindings are C also.
@ctalkington did you cut that version you wanted me to try out?
Not yet; it will most likely be part of 0.12 due to some changes in the API and options. It will most likely come out in 2 weeks, depending on free time.
@IonicaBizau how many files did that entail? What was the resulting zip size, and how long did it take?
The zipped directory is 70MB, and using zip -r foo.zip mydir, mydir was archived quickly:

```
real    0m10.109s
user    0m6.050s
sys     0m3.064s
```
For the same directory, the archiving process started on Fri Jan 16 2015 16:04:15 GMT+0200 (EET), and it hasn't finished yet (after 50 minutes). Also, the processor jumps to 100%. The node process was consuming 1.4 GB, but now it is consuming 2 GB...
So I finally got around to comparing archiver, yazl and native zip. This is on OS X with iojs 1.02 (which seems around 10% faster than node 0.10).
Moderate tree, 4549 files:
zip: 6.0s
yazl: 7.9s
archiver: 10.1s
Linux tree, 48410 files:
zip: 64s
yazl: 115s
archiver: 45min and still going
I tried variously sized trees, and I think after a certain number of files, archiver doesn't finish the job. Right now the last write to the archive happened 3 minutes ago, with the CPU sitting at 100%.
@ctalkington Test with the contents of https://github.com/torvalds/linux/archive/master.zip
@IonicaBizau Please do tell how many files you're zipping; the size seems pretty irrelevant to this issue.
@silverwind thanks, I'd also be curious to know if it's better or worse when store: true is used (i.e. removing node's zlib from the mix).
Also, was this using bulk? Does yazl stat files?

Also, if I had to guess, on the moderate tree the 3s difference comes from archiver automatically sorting out what the input is, plus the internal streaming. The Linux tree, though, seems like it's stuck somewhere in the queue.
@silverwind do you have the script you used to compare, so that I can fiddle with it?
Here you go: https://gist.github.com/silverwind/ac8ec0c33753057cafe7

```
npm install yazl archiver graceful-fs
node bench.js [folder in same dir]
```
Had to use graceful-fs because of EMFILE errors, but it's an easy switch in the require() if you want to use vanilla fs. Also, I didn't compare the resulting zip contents, but yazl zips ended up a bit smaller every run.
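For readers without the gist handy, a rough approximation of the archiver half of such a benchmark (a sketch only: the file walk and output path are illustrative, graceful-fs is swapped in for fs to avoid EMFILE errors as described above, and the gist times yazl the same way):

```js
// Rough approximation of the benchmark idea: walk a directory, append every
// file to an archiver zip stream, and report the elapsed time on close.
var fs = require('graceful-fs');
var path = require('path');
var archiver = require('archiver');

var dir = process.argv[2];

// collect all file paths under dir
function walk(p, list) {
  fs.readdirSync(p).forEach(function (name) {
    var full = path.join(p, name);
    if (fs.statSync(full).isDirectory()) walk(full, list);
    else list.push(full);
  });
  return list;
}

var files = walk(dir, []);
var start = Date.now();

var archive = archiver('zip');
var out = fs.createWriteStream('archiver.zip');
out.on('close', function () {
  console.log('archiver took', Date.now() - start, 'ms');
});
archive.pipe(out);

files.forEach(function (file) {
  archive.append(fs.createReadStream(file), { name: path.relative(dir, file) });
});
archive.finalize();
```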
@silverwind Please do tell how many files you're zipping -- I forgot to include that; I know it makes a difference. The directory I am zipping is a git repository containing 219623 files:

```
$ find . -type f | wc -l
219623
```
That's way more than I ever thought this library would be used for. I'm guessing the bottleneck is the queuing, but I'll have to dig into it to try and figure out what changes when you get into big volumes of files.

EDIT: it does seem archiver does the job but doesn't fully finalize / close, and ends up memory leaking or similar.
I've been testing some things; it would appear that finalize does get run, things just never close.

EDIT: this also appears to only affect zip; tar goes through fine. I'm wondering if it has to do with ZIP64 kicking in.

EDIT2: so the process gets through to the point of ending the zip-stream. It would seem like the stream is backing up.
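One way to watch for that symptom, as a sketch (assumes a plain file output stream; the output file name and the entries to append are placeholders):

```js
// Sketch: log when finalize is called vs when the output stream actually
// closes, to see whether the pipeline stalls after the zip stream ends.
var fs = require('fs');
var archiver = require('archiver');

var archive = archiver('zip');
var out = fs.createWriteStream('big.zip');

archive.on('error', function (err) { console.error('archiver error:', err); });
out.on('close', function () { console.log('output stream closed'); });

archive.pipe(out);

// ...append a large number of entries here, e.g. with the same loop as the
// benchmark sketch above...

archive.finalize();
console.log('finalize called; waiting for close...');
```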
@silverwind can you confirm whether running your tests back to back changes the results? I'm noticing on Windows that a drive buffer or something seems to speed it up, the ~5000-file test dropping to 14s, but sometimes it jumps to a minute.
@ctalkington I get pretty consistent results on OS X:
```
yazl took 8516 ms
archiver took 10437 ms
yazl took 8530 ms
archiver took 10592 ms
yazl took 8717 ms
archiver took 10482 ms
yazl took 8819 ms
archiver took 10442 ms
yazl took 8549 ms
archiver took 10563 ms
```
Probably Windows prefetch or something.
I use this module in the github-contributions project. If you'd like to test, check out the 1.0.0 branch first.
@ctalkington Supposing I want to fix this issue, where should I start? Where are things getting fishy?
@IonicaBizau I've been looking at it. It seems related to the streaming getting backed up. I have noticed the new directory function to be a bit more reliable; maybe you could test that on your samples.
Is the fix pushed to the repository, or how can I test it?
0.14.0 has the new directory helper, albeit in its most basic form. If you want to throw your massive payload at it, it'd be good to know whether it gets through or hangs like before.
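For anyone trying that, a minimal sketch of the directory helper usage (file and directory names are illustrative; passing false as the second argument puts the directory's contents at the root of the archive):

```js
// Minimal sketch of the 0.14.0 directory helper; paths are illustrative.
var fs = require('fs');
var archiver = require('archiver');

var archive = archiver('zip');
archive.pipe(fs.createWriteStream('out.zip'));

// queue everything under my-huge-dir/ at the root of the archive
archive.directory('my-huge-dir/', false);

archive.finalize();
```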
Wow, testing it! Thanks!
Also, let's move this to #114 as it's a slightly different issue.
Closing this out as it's been a mix of issues. Feel free to compare your results with the latest release and open a new issue if you still see such a slowdown.