When packaging a large number of fragmented files, it is too slow
jellybobbin opened this issue · 4 comments
env:
Windows 10, x64
16.0 GB RAM
Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz
When packaging a directory tree of many small files (about 300,000 files, roughly 20 KB each), writing slows down badly once more than 200,000 files have been added: each additional file takes about 5 seconds.
cargo run --example write_dir test text.zip
There are about 300,000 files in test, each about 20 KB.
Is this expected?
I do not think so. Could you try profiling on your end? I would suggest cargo-flamegraph for that (though I am not sure whether it works on Windows).
Having the profiling info (a flamegraph, in this case) would help a lot in identifying the root cause of the problem.
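For reference, such a run could look roughly like this (a sketch, assuming cargo-flamegraph installs and works on your platform; the example name and arguments mirror the command you posted):

cargo install flamegraph
cargo flamegraph --example write_dir -- test text.zip

By default the profile is written to flamegraph.svg in the current directory.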
If it's scaling badly with the number of files we've put into the archive, this could be a problem with how we track metadata. The profile would help a lot in pinpointing the issue.
And yes, 5 seconds is definitely excessive.
@jellybobbin Have you tested with https://crates.io/crates/zip_next?
(300,000 * 176 bytes) ≈ 50 MB for the metadata copies. So the metadata probably isn't the problem; even if that's a minor performance cost, it won't eat 5 seconds per file.
I've attempted to reproduce with this example:
use std::{fs, io, io::Write};

fn main() {
    let mut newarchive = io::BufWriter::new(fs::File::create("test.zip").unwrap());
    let mut zip = zip::ZipWriter::new(&mut newarchive);
    let options = zip::write::FileOptions::default()
        .compression_method(zip::CompressionMethod::Deflated);

    // Fill the buffer with random bytes so Deflate has some entropy to work on.
    let mut bytes = [0u8; 8096];
    for chunk in bytes.chunks_exact_mut(8) {
        chunk.copy_from_slice(&random().to_le_bytes());
    }

    let mut dirname = 0;
    // Write 300,000 files...
    for i in 0..300_000 {
        // ...spread over fragmented directories (16 files per directory on average).
        if random() < (u64::MAX / 16) {
            dirname += 1;
        }
        let start = std::time::Instant::now();
        let name = format!("{dirname}/{i}.txt");
        zip.start_file(name, options).unwrap();
        // Each file gets at least 20 KB of content; the repeated write calls
        // duplicate the buffer a little, but that doesn't matter for timing.
        let mut written = 0;
        while written < 20 * 1024 {
            written += zip.write(&bytes).unwrap();
        }
        println!("{}: {:?}", i, start.elapsed());
    }
    zip.finish().unwrap();
}

// Cheap pseudo-random u64 without pulling in the rand crate: a freshly seeded
// RandomState hasher finishes to a different value on each call.
fn random() -> u64 {
    use std::hash::*;
    std::collections::hash_map::RandomState::new().build_hasher().finish()
}
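In case anyone wants to rerun this, a minimal manifest along these lines should be enough (the package name and zip version are assumptions; any recent release with the default deflate feature should behave the same). Build with --release, since Deflate in a debug build is far slower:

[package]
name = "zip-repro"   # hypothetical package name
version = "0.1.0"
edition = "2021"

[dependencies]
zip = "0.6"          # assumed version; default features include deflate

Then cargo run --release prints the per-file timings.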
Even after all the files are written, I'm getting writes on the order of microseconds. I'll close the issue until we can reproduce the original problem :)