zstd --patch-from very inefficient for files about 2GiB
reyqn opened this issue · 6 comments
Describe the bug
I'm using patch-from to generate diffs for very big files, and the files being over the 2GiB limit I truncate them before calling zstd.
However, if I truncate them to be 2GiB, patch-from is very inefficient (truncating at 2GB makes the diffs about 40MiB, 2GiB generates a 1.8GiB diff, that's more than 40x less efficient).
I noticed than if I truncate somewhere between 2GB and 2GiB, the more I approach 2GiB the less patch-from is efficient.
To Reproduce
Steps to reproduce the behavior:
- Take two files of different versions about 2GiB
- Call zstd --patch-from v1 v2
- The generated zst file is huge, even if zstd is usually capable of compressing way better
Expected behavior
The generated zst file should be about the same size for two 1GiB file than for one 2GiB file
Desktop (please complete the following information):
- OS: Windows
- Version 11
- I used zstd.exe from release 1.5.6
Additional context
As an aside, I'm not sure how to use paramgrill for choosing compression levels with zstd --patch-from. Can paramgrill help with this? Should I run it against the old file? The new one? A concatenation of both? Or not at all?
Thanks!
- It is generally expected that, as the file size grows, finding back the right reference into the dictionary becomes more difficult. This can probably be improved over time, but this is a fuzzy problem: the exact impact depends on the scenario, i.e. the type of content and the compression parameters.
- Speaking of which, which compression level are you using in your tests ?
paramgrill
is unfortunately not designed to support--patch-from
.
I tried compression levels from 1 to 19, and even negative ones, they don't make much of a difference.
I just found it quite surprising that truncating the files at a different point could have a 40x impact on ratio. I'm curious if you would have recommendations for where I should truncate files. Right now I thought 2GB was good, it's not much less than the hard limit, and I didn't see it backfire yet.
It's too bad paramgrill's result cannot translate at all to patch-from, it would have been nice.
The classical test we do to check --patch-from
efficiency is the linux kernel tarball update.
Right now, if I'm using the latest 2 releases as an example, compressing linux-6.12-rc5.tar
from linux-6.12-rc4.tar
, which are both ~1.5 GB in size, I'm getting a patch of ~30 MB. Which seems fine.
I noticed than if I truncate somewhere between 2GB and 2GiB, the more I approach 2GiB the less patch-from is efficient.
Which sizes have you tried?
If memory serves right, I tried :
- 0x77777777 which is about 2GB with a result of about 45MiB diff
- 0x7F000000 with a result of about 47MiB
- 0x7FF00000 with a result of about 50MiB
- 0x7FFF0000 with a result of about 450MiB
- 0x7FFFF000 with a result of about 1.5GiB
- 0x7FFFFFFF with a result of about 1.8GiB
I don't have access to a lot of more than 2GiB versioned files, but I originally changed my truncate point from 2GiB to 1GiB after noticing the same issue for other files, so I thought this wasn't an isolated case, and there might be a soft limit to what patch-from can do.
OK, this is really at the tail end that things worsen,
after 0x7FF00000
which is still fine, and already very close to the full 2 GB (that's basically the last MB).
Things get worse at 0x7FFF0000
, which is just 64 KB shy of the 2 GB limit.
Since a full block is 128 KB, I wouldn't be surprised if there was some kind of limit there.
This is exactly the kind of explanation I hoped for.
Zstd has been nice to work with. Thanks for your work and your quick and insightful answers.