zip-rs/zip-old

When packaging a large number of fragmented files, it is too slow

jellybobbin opened this issue · 4 comments

env:
Windows 10, x64
16.0 GB RAM
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

When packaging a large number of fragmented file directories (about 300,000 files, roughly 20 KB each), writing becomes very slow once more than 200,000 files have been added: each additional file takes about 5 seconds.

cargo run --example write_dir test text.zip

There are about 300,000 files in test, each about 20 KB.

Is this expected?

I do not think so. Could you please try profiling it on your end? I would suggest cargo-flamegraph for that (although I don't know whether it works on Windows).

Having the profiling output (a flamegraph, in this case) would help us a lot in identifying the root cause of the problem.
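
For reference, and assuming cargo-flamegraph is installed (cargo install flamegraph), an invocation along these lines should produce a flamegraph.svg you can attach here; the exact arguments mirror the reproduction command above:

cargo flamegraph --example write_dir -- test text.zip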

If it's scaling badly with the number of files we've put into the archive, this could be a problem with how we track metadata. The profile would help a lot in pinpointing the issue.
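
To illustrate what "scaling badly with the number of files" could look like, here is a purely hypothetical sketch (NaiveWriter and Entry are made-up names, not the zip crate's actual internals): if every start_file checked the new name against all previously added entries with a linear scan, the total work would grow quadratically with the number of files.

// Hypothetical sketch, not the real ZipWriter: a writer that keeps its entries
// in a Vec and does a linear duplicate-name check on every new file pays
// O(n) per start_file, i.e. O(n^2) over n files.
struct Entry {
    name: String, // plus offsets, CRCs, sizes, ... in a real writer
}

struct NaiveWriter {
    entries: Vec<Entry>,
}

impl NaiveWriter {
    fn start_file(&mut self, name: &str) -> Result<(), String> {
        // Linear scan over everything written so far.
        if self.entries.iter().any(|e| e.name == name) {
            return Err(format!("duplicate entry: {name}"));
        }
        self.entries.push(Entry { name: name.to_owned() });
        Ok(())
    }
}

fn main() {
    let mut w = NaiveWriter { entries: Vec::new() };
    // 10,000 entries already means ~50 million comparisons; at 300,000 entries
    // the quadratic term reaches ~4.5e10 comparisons.
    for i in 0..10_000u32 {
        w.start_file(&format!("{i}.txt")).unwrap();
    }
    // Keeping a HashSet<String> of names next to the Vec would make the
    // duplicate check O(1) amortized and remove the quadratic behaviour.
    println!("{} entries", w.entries.len());
}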

And yes, 5 seconds is definitely excessive.

(300,000 * 176 B) ≈ 50 MB for copies. So the metadata probably isn't the problem: even if that's a minor performance cost, it won't eat 5 s per file.

// I've attempted to reproduce with this example
use std::{fs, io, io::Write};

fn main() {
    let mut newarchive = io::BufWriter::new(fs::File::create("test.zip").unwrap());
    let mut zip = zip::ZipWriter::new(&mut newarchive);
    let options = zip::write::FileOptions::default()
        .compression_method(zip::CompressionMethod::Deflated);
    
    // Reusable ~8 KB buffer, filled once with pseudo-random bytes.
    let mut bytes = [0u8; 8096];
    for chunk in bytes.chunks_exact_mut(8) {
        chunk.copy_from_slice(&random().to_le_bytes());
    }
    let mut dirname = 0;
    // writing 300,000 files
    for i in 0..300000 {
        // in fragmented directories (16 files on avg)
        if random() < (u64::MAX / 16) {
            dirname += 1;
        }
        let start = std::time::Instant::now();
        let name = format!("{dirname}/{}.txt", i);
        zip.start_file(name, options).unwrap();
        // ~20 KB of content in each; random bytes give some entropy, though repeatedly writing the same buffer duplicates content a little
        let mut written = 0;
        while written < 20 * 1024 {
            written += zip.write(&bytes).unwrap();
        }
        println!("{}: {:?}", i, start.elapsed());
    }
    zip.finish().unwrap();
}

// Quick-and-dirty random u64 with no external dependencies: every RandomState
// is randomly seeded, so finishing an empty hasher yields an unpredictable value.
fn random() -> u64 {
    use std::hash::*;
    std::collections::hash_map::RandomState::new().build_hasher().finish()
}

Even by the time all the files have been written, individual file writes are still on the order of microseconds. I'll close the issue until we can reproduce the original problem :)