uktrade/stream-unzip

Doesn't support BZIP2 compression

Stranger65536 opened this issue · 8 comments

Can't unzip an archive created with Python's built-in zipfile module using compress_type=ZIP_BZIP2, compresslevel=9:

  File "/Users/trofiv/Projects/trofiv/es-backup/venv/lib/python3.9/site-packages/stream_unzip.py", line 424, in stream_unzip
    for file_name, file_size, unzipped_chunks in all():
  File "/Users/trofiv/Projects/trofiv/es-backup/venv/lib/python3.9/site-packages/stream_unzip.py", line 416, in all
    yield yield_file(yield_all, get_num, return_num_unused, return_bytes_unused, get_offset_from_start)
  File "/Users/trofiv/Projects/trofiv/es-backup/venv/lib/python3.9/site-packages/stream_unzip.py", line 376, in yield_file
    raise UnsupportedCompressionTypeError(compression)
stream_unzip.UnsupportedCompressionTypeError: 12
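
For anyone wanting to reproduce this, the 12 in the error is the ZIP compression method ID for bzip2, which Python exposes as zipfile.ZIP_BZIP2. Below is a minimal sketch, using only the standard library, that builds such an archive in memory and reads the method field back from the first local file header. The helper names and the sample payload are made up for illustration and are not part of stream-unzip.

```python
import io
import struct
import zipfile


def make_bzip2_zip(name: str, payload: bytes) -> bytes:
    """Build an in-memory ZIP whose single member is bzip2-compressed."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(
        buffer, "w", compression=zipfile.ZIP_BZIP2, compresslevel=9
    ) as archive:
        archive.writestr(name, payload)
    return buffer.getvalue()


def local_header_method(zip_bytes: bytes) -> int:
    """Read the compression method from the first local file header."""
    # Layout: signature (4 bytes), version needed (2), flags (2), method (2),
    # all little-endian.
    signature, _version, _flags, method = struct.unpack("<4sHHH", zip_bytes[:10])
    if signature != b"PK\x03\x04":
        raise ValueError("data does not start with a local file header")
    return method


# Hypothetical sample payload standing in for the real backup data.
zip_bytes = make_bzip2_zip("data.json", b'[{"id": 1}]' * 1000)
method = local_header_method(zip_bytes)
print(method, method == zipfile.ZIP_BZIP2)  # prints: 12 True
```

Passing the bytes of an archive like this to a reader that only handles Deflate/Deflate64 is what would be expected to produce the error above.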

Ah thanks for the report. So yes, you're right this isn't supported.

Can I ask your use case? Specifically - are you making ZIPs to be released so "anyone" should be able to open them, or just yourself?

I'm torn right now on whether bzip2 support should be added. So far it's sort of been a non-aim to support absolutely every possible ZIP file you can construct that's technically valid, but rather just the ones that you would expect to encounter "in the wild".

Actually, I asked a similar question about this a while back, to work out whether we should support compression mechanisms other than Deflate/Deflate64.

Since it looked like there weren't ZIP files in the wild that used a different mechanism at the time, I decided that it wasn't worth it.

Thanks for the response. Your point is quite valid: Deflate is the most common format. However, Python supports more formats as part of its standard library (here is a quote from the official documentation):

The ZIP file format specification has included support for bzip2 compression since 2001, and for LZMA compression since 2006. However, some tools (including older Python releases) do not support these compression methods, and may either refuse to process the ZIP file altogether, or fail to extract individual files.

So, since your library is a kind of a great problem-solver for some specific (but quite frequent) use-cases, it makes sense to support all the "built-in" Python zip compression mechanisms.

I think this would make your library more popular and more useful.

your library is a kind of a great problem-solver for some specific (but quite frequent) use-cases,

Flattery will get you everywhere :-)

But more seriously, do you have more detail on your specific use case? And maybe the answer to this question - why ZIP files with bz2 rather than, for example, a .tar.bz2 file? (I've seen .tar.bz2 files around the place I'm fairly sure)

But more seriously, do you have more detail on your specific use case?

Sure. My use case is to create a pretty large archive (over 100GB of raw JSON data; more specifically, a large JSON array of mid-size JSON objects) with the best compression possible, on the fly. The data is fetched externally in chunks (using a generator). And vice versa: I need to uncompress the same data on the fly and provide it as a generator to an external consumer.

Generally speaking, it's a kind of full backup of database collections with metadata alongside.

Why zip? Because it is the most widely supported cross-platform container format. I can check the content on almost any platform, as almost all modern systems support LZMA and BZIP2 with no extra tools (at least for reading).
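
To make the streaming shape of this use case concrete, here is a rough sketch of generator-in, generator-out bzip2 compression and decompression using only the standard library bz2 module. It works on a raw bzip2 stream and leaves out the ZIP container framing entirely, so it only illustrates the chunked pattern and is not how stream-unzip handles it. The fake_source generator and the function names are hypothetical stand-ins for the real data fetcher.

```python
import bz2
from typing import Iterable, Iterator


def compress_chunks(chunks: Iterable[bytes], level: int = 9) -> Iterator[bytes]:
    """Compress an iterable of byte chunks, yielding output as it is produced."""
    compressor = bz2.BZ2Compressor(level)
    for chunk in chunks:
        compressed = compressor.compress(chunk)
        if compressed:  # the compressor may buffer and return nothing yet
            yield compressed
    yield compressor.flush()  # emit whatever is still buffered


def decompress_chunks(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Decompress an iterable of bzip2 chunks without holding it all in memory."""
    decompressor = bz2.BZ2Decompressor()
    for chunk in chunks:
        decompressed = decompressor.decompress(chunk)
        if decompressed:
            yield decompressed


def fake_source() -> Iterator[bytes]:
    """Hypothetical stand-in for data fetched externally in chunks."""
    for i in range(1000):
        yield b'{"id": %d}\n' % i


original = b"".join(fake_source())
restored = b"".join(decompress_chunks(compress_chunks(fake_source())))
print(restored == original)  # prints: True
```

Neither side needs the whole payload at once, which is the property that matters for data in the 100GB range.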

Sounds reasonable so far, and from my initial investigation it doesn't need too much more code. I have added it in #40 if you want to give it a test?

Works like a charm! Thank you very much for such rapid work!