Performance bottleneck with a large number of files
Closed this issue · 13 comments
This has been reported in #94, but I feel it deserves its own issue.
Basically, when you get north of, say, 7000 entries in an archive, things really start to slow down, to the point of hanging and leaking memory.
👍
Initial research points to a breakdown in the flow of data between the streams. This is most likely due to the number of streams involved in the process; if it's not, it sure doesn't help the debugging side of things.
dest <- archiver <- zipoutputstream <- zlib/checksum <- source file
Is there any example of how to use the directory function?
It also might be interesting to see how things behave when a stream just discards the data vs. writing to an fs stream.
@IonicaBizau archive.directory('mydir');
It's basic at this point (no cwd adjustments, etc.), though that's something I want to look into.
IonicaBizau/github-contributions@4321dac
I'm not sure this is the correct way. How can I specify the output directory?
@IonicaBizau it doesn't work too well for those cases yet. It's mainly for testing, inspired by what @silverwind had to do for testing in #94.
EDIT: give me a few to look at dest handling, as I had a rough version going already; it just wasn't commit-ready.
@ctalkington I'm not sure if you're doing this yet, but it might be worth a try to use graceful-fs, if this problem is related to having too many files open.
@silverwind I don't see that being the issue, as it's all streams (i.e. in-memory) and the actual fs.createReadStream ones are only ever opened one by one.
@IonicaBizau 0.14.1 might make your case a little easier :) see README
I should also note that directory does NOT solve the actual bottleneck; it's just an improvement that came out of the testing.
@ctalkington Upgraded to 0.14.2. Using the directory call, the directory is archived quickly. I didn't test on a non-SSD, but I guess it should work as well.
How do I set the output directory? When I unzip the zip file, it generates a home/ionicabizau/Documents/github-contributions/lib/... directory structure. How can I avoid that?
@IonicaBizau so you're saying this helper actually fixes the issue for a repo of your size? Also see
https://github.com/ctalkington/node-archiver#directorydirpath-destpath-data
Yes, the issue is fixed!
I released the latest version of github-contributions, where I used the archiver module. A retweet would help! 😄
@IonicaBizau thanks for the feedback. I'm not sure it's fixed for all cases from my testing, but good to hear nonetheless.