/zipsavings

Python script to count and print savings from compression and other info using 7z

Primary LanguagePythonMIT LicenseMIT

zipsavings

zipsavings is a simple Python script that uses subprocess.Popen to invoke 7z l (and others, see Exes below) on each given archive and print stats about it.

Robust in the face of errors, does not hang in case 7z prompts for a password.

$ zipsavings ./test/dracula.zip ./test/x.tar ./test/x.tar.zst ./test/*.cso ./test/*.7z --time
ERROR: ./test/fake-bad-file.cso: no CISO or ZISO 4 magic bytes.
ERROR: ./test/fake-short-file.cso: fread failed = 4.
ERROR: ./test/dracula-encrypted.7z : Encrypted filenames.
There were 3 errors.
END OF ERRORS.

archive                  |size      |unpacked  |saved      |saved_percent|file_count|type
-------------------------|----------|----------|-----------|-------------|----------|----
./test/dracula.zip       |310.59 KiB|846.86 KiB|536.27 KiB |63.32%       |1         |zip
./test/x.tar             |10.0 KiB  |54 Bytes  |-9.95 KiB  |-18862.96%   |2         |tar
./test/x.tar.zst         |173 Bytes |10.0 KiB  |9.83 KiB   |98.31%       |1         |zst
./test/FreeDOS-FD12CD.cso|414.6 MiB |418.45 MiB|3.85 MiB   |0.92%        |1         |cso
./test/dracula.7z        |268.38 KiB|846.86 KiB|578.48 KiB |68.31%       |1         |7z
./test/dracula.zip.7z    |310.74 KiB|310.59 KiB|-149 Bytes |-0.05%       |1         |7z
./test/million-files.7z  |6.4 MiB   |5.72 MiB  |-699.06 KiB|-11.93%      |1000000   |7z
./test/snek.7z           |484.75 KiB|1.4 MiB   |946.87 KiB |66.14%       |6         |7z
Processed 8 files (3.34/s) out of 11 given (4.6/s) in 2.394 seconds.

It will also invoke (for files with extensions .cso and .zso) csoinfo, which is another small tool I made, to print information about cso and zso files. If it's missing you'll get errors but before csoinfo was made and added to zipsavings they were also errors since 7z can't parse cso/zso files.

You can get csoinfo from releases here: FRex/csoinfo.

You can get zstd from the official repo.

See Exes below for how to specify what 7z and csoinfo exe to run.

See Example usage below for more complex examples of invoking and output.

Further info

Made on Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)] on win32, might work on older versions too. If you need it for older version of Python 3 feel free to open an issue for it.

Run with --help or -h to get a help/usage message generated by argparse.

If you plan on using it or use it then please let me know and/or star this project so that I know my work on making it usable by others is not in vain.

It should never hang (for example, on 7z files with encrypted headers which make 7z l prompt for a password), crash or print garbage. On files that are mangled, encrypted 7z, files with not size info in the headers bz2, directories, non-archive files, etc. it should print an error and go on with processing and printing all the other files. The only was for it to print garbage is if 7z l itself got confused (see below in File formats section for an example of where I found such a file).

Note: field 'size' means physical size of file on disk (the size listed in 'Scanning the drive for archives:' part of 7z l output, as far as I know it's the same as the 'Physical Size = ' line but that line isn't present in outputs for some formats like gzip). It'll be the same as or slightly larger than (due to format headers, padding, filenames and so on) the sum of 'Compressed' column in 7z l output. Field 'unpacked' means the sum of 'Size' column in 7z l output and is sum of byte sizes of all files in the archive (loose files on disk might take up more or less space due to how filesystem allocates disk space exactly). Due to this uncompressed formats (iso, tar, etc.) and small archives or archives with incompressible data where padding, headers and filenames add more bytes than compression saves will show small negative savings.

Exes

zipsavings will look through PATH environment variable to find 7z and csoinfo and zstd (both without any extension and with .exe extension, on all OSes).

When looking for 7z it'll also look for 7za, and fallback to it if it's found, if both are found then 7z will be used.

To make zipsavings use other exes than ones found in PATH the environment variables ZIPSAVINGS_7ZEXE and ZIPSAVINGS_CSOINFOEXE and ZIPSAVINGS_ZSTDEXE or command line parameters --exe-7z= and --exe-csoinfo= and --exe-zstd= can be used.

If both command line parameters and environment variables are used to provide a path for a given exe then command line parameters take precedence.

It's okay to mix, e.g. let one exe be found in PATH but use environment variable or command line parameter for the other, or use command line paremeter for one of the exes and environment variable for the other one.

Options

Use -h or --help to see the full arguments list help generated by argparse.

Use --total or -t to print another entry at the end that is sum of all others, --sort=field or -s field to sort by a field (pass in wrong field name to get list of field names), add --reverse or -r to reverse the sort. Total is not sorted and always last.

Use --stdin-filelist to pass list of files from stdin (like from running find -type f or similar) due to bash/shell saying argument list is too long to contain all your files.

Use --walk-dir or --list-dir to walk a dir tree for files or use all files in a dir (but not it's subdirs).

Use --extensions to handle unknown types of tars and similar, given that the original file is next to the compressed one. For example if you have file compressed with zstandard called file.tar.zst and next to it is original file.tar then with this option sizes of both files are taken and used as compressed and uncompressed size respectively. That way any unknown single file compression can be handled (last extension is stripped and used with dot as the type of the archive, file count is 1, both files must exist on disk).

File formats

It works fully (both compression stats and file count) for rar, 7z, zip, some exe (NSIS installers), etc.

In case of iso and tar (without additional compression around it) the file count is accurate but compression will be slightly negative (see 'Note' about output field meanings above).

In case of cso and zso the file count is set to 1 since they compress a single iso file each.

In case of xz and gzip the file count will be 1 (since these aren't archives but simple compression layers around single file, just usually used with tar) but compression will be accurate (except for gzip where the original size is reported modulo 2^32 due to format limitation so for files that were bigger than 4 GiB before compression it'll be inaccurate and report high negative savings due to 'unpacked' field being too small).

You can use the option --guess-gzip-unpacked to make zipsavings try see if any other file has the same unpacked size modulo 2^32 as the gzip files being analyzed, and use that unpacked size for the gzip too. This option can be useful e.g. when comparing how well did gzip, xz and 7z compress the same big file. Only gzip will have its unpacked size modulo 2^32, so with this option all three will have correct unpacked sizes, correct saved amount and saved percent, etc. To analyze both xz/7z and gzip but print out only information about gzip files (so the totals are unaffected and the list uncluttered) the option --whitelist-type gzip can be used. See the example use of it below in examples section. Beware of the (very slim) chance of false positive if unrelated file has such unpacked size that is same modulo 2^32 to one of the gzip files. --guess-gzip-unpacked-file and --guess-gzip-unpacked-size (or multi-arg shell glob friendly equivalents --guess-gzip-unpacked-files and --guess-gzip-unpacked-sizes) can be used to pass original unpacked file (only to take size of) or the size itself (as a number) respectively to use as extra guesses when guessing unpacked gzip size. Without --guess-gzip-unpacked the only sizes used in guessing are ones from the last two options (unpacked sizes of other files are not considered).

In case of bzip2 (another compression often used with tar) an error will be printed as Size column in 7z l output is empty (bz2 file format has no header field saying how big the original uncompressed file was).

Some files give very unusual results if they confuse 7z or csoinfo itself (or if they were crafted with intention of confusing/misleading them). For example, running zipsavings on an entire tree of files using --walk-dir I ran across a .o file made by GHC (Glasgow Haskell Compiler) on Windows 10 that 7z l reports as being a bit over 50 TiB (54975648497664 bytes, exactly 50 TiB and 64 MiB) unpacked. It even broke a certain fragile part of 7z l output parsing code due to how wide that number is in the output (and I never ran across an archive that had such crazy sizes in them and thus don't have such an archive in files I test zipsavings on each commit with).

It doesn't unpack the archive nor looks at filenames to warn about possible archive-in-archive scenarios that will make savings look really small (because the real savings are in inner archives, not the outer one that this tool analyzes).

In case of a split archive the file contains the first part will work, with the type listed as Split and all other fields except size (which will be the size of just the first file) being accurate. Other parts will error out with Can not open the file as archive. or (sometimes) Headers error, unconfirmed start of archive.

Example usage

I use bash and run zipsavings using a bash script named zipsavings in my PATH that does python /path/to/my/dir/zipsavings/__main__.py "$@" (python is Python 3). If you don't intend to tinker with the code you can instead pack it with python -m zipapp -c zipsavings and then run the resulting .pyz file with python /some/path/zipsavings.pyz in some batch/bash/sh script in your PATH, or pack it into a standalone executable file using some other tool. Adjust the first part of the examples accordingly to your set up.

$ python zipsavings.pyz zipsavings.pyz
archive       |size    |unpacked |saved   |saved_percent|file_count|type
--------------|--------|---------|--------|-------------|----------|----
zipsavings.pyz|5.34 KiB|13.71 KiB|8.37 KiB|61.07%       |6         |zip
$ zipsavings ./test/snek.7z
archive       |size      |unpacked|saved     |saved_percent|file_count|type
--------------|----------|--------|----------|-------------|----------|----
./test/snek.7z|484.75 KiB|1.4 MiB |946.87 KiB|66.14%       |6         |7z
$ zipsavings zero.data.xz zero.data.gz
archive     |size     |unpacked |saved     |saved_percent|file_count|type
------------|---------|---------|----------|-------------|----------|----
zero.data.xz|2.21 MiB |8.79 GiB |8.79 GiB  |99.98%       |1         |xz
zero.data.gz|39.26 MiB|808.0 MiB|768.74 MiB|95.14%       |1         |gzip

$ zipsavings zero.data.xz zero.data.gz --guess-gzip-unpacked
archive     |size     |unpacked|saved   |saved_percent|file_count|type
------------|---------|--------|--------|-------------|----------|----
zero.data.xz|2.21 MiB |8.79 GiB|8.79 GiB|99.98%       |1         |xz
zero.data.gz|39.26 MiB|8.79 GiB|8.75 GiB|99.56%       |1         |gzip
$ zipsavings --list-dir test --total --sort file_count --reverse --time
ERROR: test/a.bz2 : No size data in bzip2 format.
ERROR: test/b.notarchive : Can not open the file as archive.
ERROR: test/dracula-encrypted.7z : Encrypted filenames.
ERROR: test/dracula.txt : Can not open the file as archive.
ERROR: test/fake-bad-file.cso: no CISO or ZISO 4 magic bytes.
ERROR: test/fake-short-file.cso: fread failed = 4.
ERROR: test/FreeDOS-FD12CD.7z.002 : Headers error, unconfirmed start of archive.
ERROR: test/FreeDOS-FD12CD.7z.003 : Headers error, unconfirmed start of archive.
ERROR: test/FreeDOS-FD12CD.7z.004 : Headers error, unconfirmed start of archive.
ERROR: test/FreeDOS-FD12CD.7z.005 : Headers error, unconfirmed start of archive.
ERROR: test/FreeDOS-FD12CD.zip.002 : Headers error, unconfirmed start of archive.
ERROR: test/FreeDOS-FD12CD.zip.003 : Headers error, unconfirmed start of archive.
ERROR: test/FreeDOS-FD12CD.zip.004 : Headers error, unconfirmed start of archive.
ERROR: test/FreeDOS-FD12CD.zip.005 : Headers error, unconfirmed start of archive.
ERROR: test/random10megs.7z.002 : Can not open the file as archive.
ERROR: test/random10megs.7z.003 : Can not open the file as archive.
ERROR: test/random10megs.binary : Can not open the file as archive.
ERROR: test/random10megs.zip.002 : Can not open the file as archive.
ERROR: test/random10megs.zip.003 : Can not open the file as archive.
ERROR: test/wat.txt.bz2 : No size data in bzip2 format.
There were 20 errors.
END OF ERRORS.

archive                                |size      |unpacked  |saved      |saved_percent|file_count|type
---------------------------------------|----------|----------|-----------|-------------|----------|-----
test/million-files.7z                  |6.4 MiB   |5.72 MiB  |-699.06 KiB|-11.93%      |1000000   |7z
test/d8krhj4kasdu3~.swf                |11.38 MiB |11.36 MiB |-15.36 KiB |-0.13%       |2628      |SWF
test/FreeDOS-FD12CD.iso                |418.45 MiB|417.5 MiB |-971.99 KiB|-0.23%       |553       |Iso
test/NorthBuryGrove.rar                |966.21 MiB|2.17 GiB  |1.23 GiB   |56.55%       |198       |Rar5
test/Fedora-Xfce-Live-x86_64-28-1.1.iso|1.29 GiB  |1.37 GiB  |84.26 MiB  |6.0%         |39        |Iso
test/windirstat1_1_2_setup.exe         |630.59 KiB|2.16 MiB  |1.54 MiB   |71.47%       |23        |Nsis
test/snek.7z                           |484.75 KiB|1.4 MiB   |946.87 KiB |66.14%       |6         |7z
test/showframe.cab                     |1.33 KiB  |2.29 KiB  |990 Bytes  |42.15%       |2         |Cab
test/x.tar                             |10.0 KiB  |54 Bytes  |-9.95 KiB  |-18862.96%   |2         |tar
test/d.gz                              |22 Bytes  |0 Bytes   |-22 Bytes  |0%           |1         |gzip
test/d8krhj4kasdu3.swf                 |9.87 MiB  |11.38 MiB |1.51 MiB   |13.29%       |1         |SWFc
test/dracula.7z                        |268.38 KiB|846.86 KiB|578.48 KiB |68.31%       |1         |7z
test/dracula.zip                       |310.59 KiB|846.86 KiB|536.27 KiB |63.32%       |1         |zip
test/dracula.zip.7z                    |310.74 KiB|310.59 KiB|-149 Bytes |-0.05%       |1         |7z
test/fixpdfmag.tar.lzma                |1.29 KiB  |10.0 KiB  |8.71 KiB   |87.14%       |1         |lzma
test/FreeDOS-FD12CD.7z.001             |100.0 MiB |418.45 MiB|318.45 MiB |76.1%        |1         |Split
test/FreeDOS-FD12CD.cso                |414.6 MiB |418.45 MiB|3.85 MiB   |0.92%        |1         |cso
test/FreeDOS-FD12CD.zip.001            |100.0 MiB |418.45 MiB|318.45 MiB |76.1%        |1         |Split
test/FreeDOS-FD12CD.zso                |415.23 MiB|418.45 MiB|3.22 MiB   |0.77%        |1         |zso
test/random10megs.7z.001               |4.0 MiB   |10.0 MiB  |6.0 MiB    |60.0%        |1         |Split
test/random10megs.zip.001              |4.0 MiB   |10.0 MiB  |6.0 MiB    |60.0%        |1         |Split
test/wat.txt.gz                        |1.0 MiB   |1.0 MiB   |-186 Bytes |-0.02%       |1         |gzip
test/xz.xz                             |492.93 KiB|4.79 MiB  |4.31 MiB   |89.96%       |1         |xz
---------------------------------------|----------|----------|-----------|-------------|----------|-----
TOTAL(23)                              |3.69 GiB  |5.64 GiB  |1.96 GiB   |34.7%        |1003465   |SUM
Processed 23 files (7.98/s) out of 43 given (14.92/s) in 2.883 seconds.

Efficiency

This tool's main bottleneck is actual disk access and starting and running 7z processes themselves (one for each archive given).

All of the timings below are from repeated runs on a quad core (8 thread) Intel CPU on a laptop with plenty free RAM (for OS to cache into) with 7z.exe and Python 3 on an SSD and archives and this tool's code on an HDD.

Running with more cores or without OS caching the exes, code and archives into RAM will give different results but the point of 7z being the bottleneck and Python code being irrelevant still stands.

The above 23 analyzable archives among 43 files takes literally no time to run:

$ zipsavings test/* -t -s file_count -r --time 2>&1 | tail -n 1
Processed 23 files (8.69/s) out of 43 given (16.25/s) in 2.646 seconds.

Running it on a very large list of files (MiKTex local package repo) show that 7z is the real bottleneck:

$ find D:/MiKTexDownloadFiles -type f | zipsavings --stdin-filelist -t -s file_count -r --time 2>&1 | tail -n 1
Processed 3463 files (246.79/s) out of 3530 given (251.57/s) in 14.032 seconds.

Changing code to run and wait for 1 7z process at a time (by changing for file_group in split_into_portions(all_files, 8): to for file_group in split_into_portions(all_files, 1):) causes the tool to take predictably longer:

$ find D:/MiKTexDownloadFiles -type f | zipsavings --stdin-filelist -t -s file_count -r --time 2>&1 | tail -n 1
Processed 3463 files (69.61/s) out of 3530 given (70.96/s) in 49.747 seconds.