kuba--/zip

Multiple issues with zip file

HunterZ opened this issue · 7 comments

I'm trying to work with the following zip file in kubazip 0.2.6 from vcpkg in MSYS MinGW under Windows, and am running into a couple of issues: https://github.com/CarbonCommunity/Carbon/releases/download/production_build/Carbon.Windows.Release.zip

First, I get "file not found" errors when trying to extract it to a directory, regardless of whether I extract from a file or from memory.

Second, when walking the entries by index, kubazip reports that the empty directories contained in the archive are actually files (i.e. zip_entry_isdir() always returns zero).

7-zip and Windows built-in ZIP GUIs both recognize the empty folders as folders, but it's possible they're applying a different heuristic than kubazip/miniz?

I guess for now I'm going to handle all file I/O myself, and apply my own heuristic to override the result of zip_entry_isdir() as necessary.

As a workaround, I ended up walking the index and doing the following for each entry:

  • determine directory status via a combination of zip_entry_isdir() and checking for a trailing slash
  • call std::filesystem::create_directories() to create needed paths (full entry name for directory, or parent path for file)
  • in the case of a file, open via std::fstream() and call zip_entry_extract(), with a callback handler that uses std::fstream::write(), then close the file
kuba-- commented

Hello, generally the problem (file not found and isdir always returns false) is pretty simple - your zip archive does not follow zip standard. I'm not sure how it was created, but if you check zip spec (https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT).
Point 4.4.17 file name: (Variable), it's clear that:

"...All slashes
MUST be forward slashes '/' as opposed to
backwards slashes ''"

On windows we replace entry name \ by / but zip_isdir implementation is pretty simple - it just checks the last character for filename in zip central directory (https://github.com/kuba--/zip/blob/master/src/miniz.h#L6267):

if (*(p + MZ_ZIP_CENTRAL_DIR_HEADER_SIZE + filename_len - 1) == '/')

So, in your case it's \.

It's a different story if you ask for zip_entry_name via zip API. The function returns the zip entry name following zip spec (unless you define ZIP_RAW_ENTRYNAME):
https://github.com/kuba--/zip/blob/master/src/zip.c#L1086

#ifdef ZIP_RAW_ENTRYNAME
  zip->entry.name = STRCLONE(entryname);
#else
  zip->entry.name = zip_strrpl(entryname, entrylen, '\\', '/');
#endif

It's not my zip file, so none of this helps as far as I can tell. Windows and 7-zip don't have a problem with it, so it's confusing that kubazip can't handle it.

kuba-- commented

It's all about standards. The question is - why do we need to implement (to support) incorrect zip files?
7-zip ignores this kind of errors, but you can always find some archive which does not follow ZIP SPEC.
On 7-zip FAQ you can also read:

Why can't 7-Zip open some ZIP archives?

In 99% of these cases it means that the archive contains incorrect headers. Other ZIP programs can open some archives with incorrect headers, since these programs just ignore errors.
If you have such archive, please don't call the 7-Zip developers about it. Instead try to find the program that was used to create the archive and inform the developers of that program that their software is not ZIP-compatible.
There are also some ZIP archives that were encoded with methods unsupported by 7-Zip, for example, WAVPack (WinZip).

This is tiny open source project, and every PR is welcome, so if this feature is important to you, just go ahead...
So far, I think we have higher priorities than supporting non-standard zip files.

It's all about standards. The question is - why do we need to implement (to support) incorrect zip files?

The answer is because I decided to use your library and then found a use case for my application where it fails, and suddenly I'm forced to decide between finding a workaround or switching libraries.

In the end I took the workaround route, which also got me to switch to doing everything in RAM. I'm not sure I'd go back at this point even if it got fixed, but I thought it was worth documenting the opportunity for making the library a bit more robust.

7-zip ignores this kind of errors, but you can always find some archive which does not follow ZIP SPEC. On 7-zip FAQ you can also read:

Interesting that the 7-zip authors felt this issue worth addressing despite that disclaimer.

FYI, I decided to check out the .zip file in question on a Linux box using command-line 7z and busybox unzip to get a wider view of the .zip tool ecosystem - and it was a disaster:

unzip follows the .zip spec even more strictly than kubazip, extracting all entries as files in the output directory whose names contain literal \ characters - even the directory entries.

7z detects the directory entries, but still extracts them as single-level-deep entries with literal \'s preserved in their names. All other files are extracted the same as unzip (e.g. file with literal name carbon\managed\modules\Carbon.Modules.dll).

This makes me a lot more sympathetic to your original response, even if Windows' built-in archiver and 7-Zip File Manager both tolerate it.

Update: Looks like it's...drum-roll...Microsoft PowerShell that is responsible for creation of these illegal .zip files!
https://github.com/CarbonCommunity/Carbon/blob/a04828986605dbce437ba125db0fdc81b1ad9af4/Tools/Build/win/build.bat#L79

Edit: Turns out that powershell runs an ancient version by default, and the bug is fixed in newer versions which can be invoked via pwsh. I've made a suggestion to the project maintainer, but who knows if it will be implemented.

Update 2: The author agreed to the change, so eventually I shouldn't need to worry about supporting this wacky stuff anymore. Thanks for your patience!