jgm/zip-archive

CP437

brainsucker opened this issue · 7 comments

I'm having trouble batch processing zip files with the library. I'm getting Unicode exceptions like this one:

*** Exception: Cannot decode byte '\x94': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream

The problem is that the parser assumes filenames are encoded as UTF-8. According to the specification, however, the default is CP437, which is incompatible with UTF-8 for non-ASCII characters; examples are 'ä', 'ö', and 'ü'. The spec says the encoding is indicated by bit 11 of the general purpose bit flag, the language encoding flag (EFS). Currently the parser simply ignores the EFS. So my question is: are there specific reasons for implementing it this way, or would you merge a pull request that implements it faithfully to the spec? For now I have worked around it by silently ignoring decoding errors with decodeUtf8With and ignore, but that's rather ugly.
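For illustration, here is a minimal sketch of what a spec-faithful decode might look like. It is not the zip-archive API: the names cp437High, decodeCP437, decodeEntryName and lenientName are hypothetical, and the sketch assumes the caller has already read the 16-bit general purpose bit flag and the raw filename bytes from the entry header. Bytes below 0x80 are treated as plain ASCII.

```haskell
import Data.Bits (testBit)
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException, ignore)
import Data.Word (Word8, Word16)

-- | Upper half of CP437 (bytes 0x80..0xFF), 16 characters per row.
cp437High :: String
cp437High = concat
  [ "ÇüéâäàåçêëèïîìÄÅ"      -- 0x80
  , "ÉæÆôöòûùÿÖÜ¢£¥₧ƒ"      -- 0x90
  , "áíóúñÑªº¿⌐¬½¼¡«»"      -- 0xA0
  , "░▒▓│┤╡╢╖╕╣║╗╝╜╛┐"      -- 0xB0
  , "└┴┬├─┼╞╟╚╔╩╦╠═╬╧"      -- 0xC0
  , "╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀"      -- 0xD0
  , "αßΓπΣσµτΦΘΩδ∞φε∩"      -- 0xE0
  , "≡±≥≤⌠⌡÷≈°∙·√ⁿ²■\xA0"   -- 0xF0 (last entry is a no-break space)
  ]

-- | Decode one CP437 byte; bytes below 0x80 are treated as ASCII.
cp437ToChar :: Word8 -> Char
cp437ToChar w
  | w < 0x80  = toEnum (fromIntegral w)
  | otherwise = cp437High !! (fromIntegral w - 0x80)

-- | Decode a whole CP437 byte string. This cannot fail, because every
-- byte value has a mapping.
decodeCP437 :: B.ByteString -> T.Text
decodeCP437 = T.pack . map cp437ToChar . B.unpack

-- | Hypothetical helper: choose the decoder from the general purpose
-- bit flag. Bit 11 (EFS) set means UTF-8; otherwise fall back to CP437.
decodeEntryName :: Word16 -> B.ByteString -> Either UnicodeException T.Text
decodeEntryName flags raw
  | testBit flags 11 = TE.decodeUtf8' raw
  | otherwise        = Right (decodeCP437 raw)

-- | The workaround from the post, for comparison: invalid bytes are
-- silently dropped instead of being decoded.
lenientName :: B.ByteString -> T.Text
lenientName = TE.decodeUtf8With ignore
```

With the flag clear, the byte 0x94 from the exception above decodes to 'ö', whereas lenientName simply drops it, so the two approaches produce different filenames for the same entry.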

jgm commented

This was just a feature I didn't know enough about to implement.
I would accept a pull request!

Hi!

  • I have the same issue with CP1251-encoded Cyrillic filenames inside a ZIP archive.
  • The default implementation does not work for such cases.
  • I see there is an open PR about CP437.

Could you please tell me the current status of that PR and when it might be released?
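The bit flag discussed above only distinguishes UTF-8 from the CP437 default, so an archive whose names are actually CP1251 cannot announce that; the caller has to choose the legacy code page. A rough sketch under that assumption follows. The names cp1251ToChar and decodeNameWith are hypothetical, and the CP1251 mapping is deliberately partial: it covers ASCII, the letters А..я, and Ё/ё, and maps every other byte to U+FFFD.

```haskell
import Data.Bits (testBit)
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8, Word16)

-- | Partial CP1251 mapping (assumption: Russian filenames only).
cp1251ToChar :: Word8 -> Char
cp1251ToChar w
  | w < 0x80  = toEnum (fromIntegral w)                  -- ASCII
  | w >= 0xC0 = toEnum (0x0410 + fromIntegral w - 0xC0)  -- U+0410..U+044F
  | w == 0xA8 = '\x0401'                                 -- Ё
  | w == 0xB8 = '\x0451'                                 -- ё
  | otherwise = '\xFFFD'                                 -- not covered by this sketch

-- | Hypothetical helper: use UTF-8 when bit 11 is set and the bytes are
-- valid UTF-8; otherwise decode byte by byte with the caller's legacy
-- mapping.
decodeNameWith :: (Word8 -> Char) -> Word16 -> B.ByteString -> T.Text
decodeNameWith legacy flags raw
  | testBit flags 11, Right t <- TE.decodeUtf8' raw = t
  | otherwise = T.pack (map legacy (B.unpack raw))
```

Passing the mapping as an argument keeps the choice of code page with the caller, who is the only one in a position to know how the archive was produced.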

jgm commented

The PR is not open; it's closed. It was closed by @mrkkrp before I even had a chance to review it, and I'm not sure why.

@jgm Please note that the PR (or rather issue) you see referenced in this thread has nothing to do with the zip-archive package. It relates to my zip package, which has supported this properly from the very beginning. So I'm not sure what the problem is.

jgm commented

Sorry, @mrkkrp, I failed to notice that!

The way forward, I suppose, might be to switch to using your zip package in pandoc, instead of zip-archive. (And then I could deprecate zip-archive, which I wrote solely to use in pandoc.)

@jgm, I like the idea. There should be no problem with switching (the package is well tested and has been used by several people in production), but let me know if you have any questions or issues with that!

Thank you.

After switching from zip-archive to zip, I no longer get this error. However, I found another incorrect behaviour in the zip package, caused by CP437 (incorrect encoding recognition). I will open an issue in zip.