meganz/webclient

ZIPs of Directories with Unicode Filenames get Corrupted

cyphar opened this issue · 3 comments

NOTE: You can work around this issue by using unzip -UU, which disables Info-ZIP's UTF-8 filename handling so that the raw UTF-8 bytes are used as the path names. This isn't ideal, but it works for folks who just want a quick fix.

If you try to create a simple directory with a few files, for instance:

辞典
├── 大辞林.txt
├── 広辞苑.txt
└── 新明解.txt

And upload this to MEGA (or manually recreate the directory there), the filenames get corrupted when you download it as a zip (using either the "standard download" or "download as a zip" option), most likely because of some kind of encoding issue:

% unzip ../辞典.zip
Archive:  ../辞典.zip
ޥ/.txt:  mismatching "local" filename (辞典/新明解.txt),
         continuing with "central" filename version
 extracting: ޥ/.txt
ޥ/զ.txt:  mismatching "local" filename (辞典/広辞苑.txt),
         continuing with "central" filename version
 extracting: ޥ/զ.txt
ޥ/ޥ׵.txt:  mismatching "local" filename (辞典/大辞林.txt),
         continuing with "central" filename version
 extracting: ޥ/ޥ׵.txt
% tree .
.
└── \336\245\327\325\340\251
    ├── \265\373\246\265\304\336\272\372.txt
    ├── \325\246\342\336\245\327\336\357\346.txt
    └── \325\361\272\336\245\327\265\327\371.txt
% tree -N .
.
└── ޥ
    ├── .txt
    ├── զ.txt
    └── ޥ׵.txt

Apparently the "local" filenames are not getting corrupted (given the output), but if you look at the central directory the names are all screwed up:

% unzip -Z ../辞典.zip
Archive:  ../辞典.zip
Zip file size: 505 bytes, number of entries: 3
-rw----     2.0 fat        0 bl stor 21-Sep-26 16:48 ޥ/.txt
-rw----     2.0 fat        0 bl stor 21-Sep-26 16:48 ޥ/զ.txt
-rw----     2.0 fat        0 bl stor 21-Sep-26 16:48 ޥ/ޥ׵.txt
3 files, 0 bytes uncompressed, 0 bytes compressed:  0.0%

After looking at the spec and the ZIP archives that both Info-ZIP and MEGA generate, it's unclear to me why unzip is confused. General purpose bit 11 is set in both the central directory and the local file headers, which means that Info-ZIP should be interpreting the pathnames as UTF-8.

However, I suspect it's getting confused because the MEGA zip file also includes the Unicode Path Extra Field (0x7075) extension (which the spec advises against -- it says you should only use one of the two mechanisms for UTF-8 paths) but, contrary to the spec, only sets it in the local file header and not in the central directory (resulting in Info-ZIP detecting it as a conflict). The solution would appear to be fairly simple -- either don't include the 0x7075 extension at all, or include it in the central directory as well.
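For anyone who wants to check an archive for this mismatch themselves, here is a rough Python sketch. It is not MEGA's or Info-ZIP's code. It reads the central directory with the standard zipfile module, then parses each 30-byte local file header by hand and reports whether bit 11 and the 0x7075 field show up in each place. The names extra_ids and compare_headers are made up for illustration, and it assumes a plain archive with no data prepended to it.

```python
import io
import struct
import zipfile

UTF8_FLAG = 0x0800        # general purpose bit 11 (UTF-8 filename flag)
UNICODE_PATH_ID = 0x7075  # Info-ZIP Unicode Path Extra Field
# Fixed part of a local file header: signature, version, flags, method,
# time, date, CRC, compressed size, uncompressed size, name len, extra len.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def extra_ids(extra: bytes) -> set[int]:
    """Return the header IDs present in a raw extra-field blob."""
    ids, pos = set(), 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        ids.add(header_id)
        pos += 4 + size
    return ids


def compare_headers(data: bytes) -> list[dict]:
    """Compare central-directory and local-header metadata per entry.

    Hypothetical helper; assumes header_offset is the real file offset.
    """
    reports = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            fields = LOCAL_HEADER.unpack_from(data, info.header_offset)
            if fields[0] != b"PK\x03\x04":
                raise ValueError("no local file header at recorded offset")
            local_flags, name_len, extra_len = fields[2], fields[9], fields[10]
            start = info.header_offset + LOCAL_HEADER.size + name_len
            local_extra = data[start:start + extra_len]
            reports.append({
                "name": info.filename,
                "central_bit11": bool(info.flag_bits & UTF8_FLAG),
                "local_bit11": bool(local_flags & UTF8_FLAG),
                "central_unicode_path": UNICODE_PATH_ID in extra_ids(info.extra),
                "local_unicode_path": UNICODE_PATH_ID in extra_ids(local_extra),
            })
    return reports
```

If the suspicion above is right, each entry of the MEGA-generated archive would report local_unicode_path as true and central_unicode_path as false, with both bit 11 values true.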

This logic was added by 201b50e, seemingly because bit 11 isn't supported on Windows? In any case, the extra data should be added to the central directory as well.
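To make the second option concrete, here is a minimal Python sketch of building the 0x7075 field once and reusing the same bytes for both headers. The layout (header ID, size, version 1, CRC-32 of the header filename, UTF-8 name) follows the spec's description of the Unicode Path Extra Field. The function name unicode_path_field is hypothetical and this is not the webclient's actual code.

```python
import struct
import zlib

UNICODE_PATH_ID = 0x7075  # Info-ZIP Unicode Path Extra Field


def unicode_path_field(header_name: bytes, unicode_name: str) -> bytes:
    """Build a 0x7075 extra field (hypothetical helper).

    header_name is the filename exactly as stored in the standard header
    field; its CRC-32 lets readers detect a stale Unicode name.
    """
    payload = (
        struct.pack("<BI", 1, zlib.crc32(header_name))  # version, NameCRC32
        + unicode_name.encode("utf-8")                  # UnicodeName
    )
    return struct.pack("<HH", UNICODE_PATH_ID, len(payload)) + payload


name = "辞典/大辞林.txt"
field = unicode_path_field(name.encode("utf-8"), name)

# The same bytes would go into both places so the two headers agree.
local_extra = field
central_extra = field
```

The point of the sketch is only that the field is computed once per entry and written to both the local file header and the central directory entry, rather than to the local header alone.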

Hello, thank you for the report. This has been escalated to our development team for further investigation and resolution.

Thank you.