Larger file size with compression than without
Closed this issue · 3 comments
I was puzzled to find that JLD2 files written with compression are quite often larger than those written without: 2.8M versus 2.3M in the code below (I discovered this when I wanted to store an array of DataFrames). For the data below, zip 3.0 reduces the compressed file to 1.2M, which is close to the file size JLD2 achieves when the data is completely flat.
julia> using Pkg
julia> Pkg.activate(temp = true);
Activating new project at `/tmp/jl_BE21K1`
julia> Pkg.add(["JLD2", "CodecZlib"])
julia> Pkg.status()
Status `/tmp/jl_BE21K1/Project.toml`
[944b1d66] CodecZlib v0.7.6
[033835bb] JLD2 v0.5.10
julia> using JLD2, CodecZlib
julia> data = [[randn(10) for _ in 1:120] for _ in 1:100];
julia> fname = tempname() * ".jld2";
julia> jldsave(fname, false; data);
julia> run(`du -h $fname`);
2.3M /tmp/jl_uRqrgLkncj.jld2
julia> fname_compressed = tempname() * ".jld2";
julia> jldsave(fname_compressed, true; data);
julia> run(`du -h $fname_compressed`);
2.8M /tmp/jl_aRgVstKzmD.jld2
julia> fname_compressed_flat = tempname() * ".jld2";
julia> jldsave(fname_compressed_flat, true; data = vcat(vcat(data...)...));
julia> run(`du -h $fname_compressed_flat`);
904K /tmp/jl_WHSFsXIBQX.jld2
julia> zipname = tempname() * ".zip";
julia> run(`zip $zipname $fname_compressed`);
adding: tmp/jl_aRgVstKzmD.jld2 (deflated 59%)
julia> run(`du -h $zipname`);
1.2M /tmp/jl_kHETil8d6r.zip
julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × AMD Ryzen 7 PRO 5850U with Radeon Graphics
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
Hi @jbrea,
What you are seeing happens because your test data is not really compressible with lossless compression: random floating-point numbers have a lot of entropy.
When you use
julia> data = [[randn(10) for _ in 1:120] for _ in 1:100];
julia> jldsave(fname_compressed, true; data);
JLD2 will try to compress each of the 100*120 = 12000 short arrays individually. As said above, they can't really be compressed, and on top of that a bit of metadata for the compression library has to be added to each one.
When you use the "external" compression, what you are seeing is the compression of the JLD2 metadata.
(For every individual array, there is some metadata describing the element type, the shape, and so on. That metadata is exactly the same 12000 times.)
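The per-array overhead described above can be reproduced outside of JLD2. The following is a minimal sketch that calls CodecZlib directly on the raw bytes, once per short array and once on the concatenated data. It only mimics the effect and does not reproduce JLD2's on-disk layout; `to_bytes` is a helper made up for this illustration.

```julia
using CodecZlib

# Hypothetical helper for this sketch: raw bytes of a numeric vector.
to_bytes(x) = collect(reinterpret(UInt8, x))

arrays = [randn(10) for _ in 1:1200]   # many short Float64 arrays, 80 bytes each

raw_total = sum(sizeof, arrays)

# One zlib stream per array: each stream carries its own header and checksum.
per_array_total = sum(length(transcode(ZlibCompressor, to_bytes(a))) for a in arrays)

# One zlib stream over all values at once.
flat_total = length(transcode(ZlibCompressor, to_bytes(reduce(vcat, arrays))))

println("raw:       ", raw_total, " bytes")
println("per array: ", per_array_total, " bytes")
println("flat:      ", flat_total, " bytes")
```

On random Float64 data the per-array total should come out above the raw size, because every stream adds framing bytes while the payload barely shrinks.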
For comparison, here is what you get with a modified example that stores 10 random integers from 1:10 per array:
julia> data = [[rand(1:10, 10) for _ in 1:120] for _ in 1:100]; # 10 random integers in the interval 1:10
julia> fname = tempname() * ".jld2";
julia> jldsave(fname, false; data);
julia> run(`du -h $fname`);
2.2M /tmp/jl_l6bg9sbqrY.jld2
julia> fname_compressed = tempname() * ".jld2";
julia> jldsave(fname_compressed, true; data);
julia> run(`du -h $fname_compressed`);
2.0M /tmp/jl_f4la5I3a7Q.jld2
julia> zipname = tempname() * ".zip";
julia> run(`zip $zipname $fname_compressed`);
adding: tmp/jl_f4la5I3a7Q.jld2 (deflated 81%)
julia> run(`du -h $zipname`);
372K /tmp/jl_Zov0zCfy1i.zip
julia> fname_compressed_flat = tempname() * ".jld2";
julia> jldsave(fname_compressed_flat, true; data = vcat(vcat(data...)...));
julia> run(`du -h $fname_compressed_flat`);
88K /tmp/jl_BWDcvgvCX6.jld2
julia> fname_uncompressed_flat = tempname() * ".jld2";
julia> jldsave(fname_uncompressed_flat, false; data = vcat(vcat(data...)...));
julia> run(`du -h $fname_uncompressed_flat`);
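940K /tmp/jl_JGJqdqjeH8.jld2
# Note: zipname is reused from above, and zip adds to an existing archive, so the size reported below most likely includes the earlier entry as well.
julia> run(`zip $zipname $fname_uncompressed_flat`);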
adding: tmp/jl_JGJqdqjeH8.jld2 (deflated 91%)
julia> run(`du -h $zipname`);
460K /tmp/jl_Zov0zCfy1i.zip
Here you can see that the most efficient approach is JLD2 compression with a single flattened dataset.
When working with floating-point numbers, I would not recommend using compression.
It involves significant compute, and the compression ratio is usually too small to be worth it.
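Flattening only helps if the nested structure can be rebuilt after loading. A minimal sketch of one way to do that, assuming every inner array has the same length (as in the example data); the variable names are chosen for this illustration only:

```julia
# Nested data with the same shape as in the example above.
data = [[randn(10) for _ in 1:120] for _ in 1:100]

# Record the shape so the nesting can be rebuilt later.
n_outer, n_inner, n_elem = length(data), length(data[1]), length(data[1][1])

# reduce(vcat, ...) avoids splatting thousands of arguments into vcat.
flat = reduce(vcat, reduce(vcat, data))   # Vector{Float64} of length 120_000

# Rebuild the nested structure from the flat vector (column-major layout).
cube = reshape(flat, n_elem, n_inner, n_outer)
restored = [[cube[:, j, i] for j in 1:n_inner] for i in 1:n_outer]
```

The flat vector and the three sizes could then be stored together, for example with `jldsave(fname, true; flat, n_outer, n_inner, n_elem)`. For ragged data, a vector of per-array lengths would be needed instead of three fixed sizes.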
I see, thanks for the explanation. Would it therefore make sense to check whether the metadata can be compressed when jldsave is called with compression?
No, that does not really make sense. JLD2 should always be able to open a file, even without compression libraries installed. (It may not be able to read the dataset, but it can at least say which library needs to be loaded to do so.)
Of course, you can always try to externally compress the whole file.
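External compression of the whole file can also be done from within Julia. The following is a rough sketch using CodecZlib's gzip codec; `gzip_file` and `gunzip_file` are hypothetical helpers, not part of JLD2, and they read the whole file into memory, which is only reasonable for small files. The file has to be decompressed again before JLD2 can open it.

```julia
using CodecZlib

# Hypothetical helper: write a gzip-compressed copy next to the original file.
function gzip_file(path::AbstractString)
    gz_path = path * ".gz"
    write(gz_path, transcode(GzipCompressor, read(path)))
    return gz_path
end

# Hypothetical helper: restore the original bytes to out_path.
function gunzip_file(gz_path::AbstractString, out_path::AbstractString)
    write(out_path, transcode(GzipDecompressor, read(gz_path)))
    return out_path
end

# Stand-in for a file with highly repetitive content.
path = tempname()
write(path, repeat("same metadata block\n", 1000))

gz = gzip_file(path)
println(filesize(path), " -> ", filesize(gz), " bytes")

roundtrip = gunzip_file(gz, tempname())
```

With a real JLD2 file, `path` would be the file written by `jldsave`, and the restored copy is what `jldopen` or `load` would be pointed at.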