CP437
brainsucker opened this issue · 7 comments
I'm having trouble batch processing zip files with the lib. I'm getting unicode exceptions like this one:
*** Exception: Cannot decode byte '\x94': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream
The problem is that the parser assumes filenames are encoded as UTF-8. But according to the specification the default is CP437, which is incompatible with UTF-8 for non-ASCII characters such as 'ä', 'ö', 'ü'. The spec says the encoding is indicated by bit 11 of the general purpose bit flag (the language encoding flag, EFS). Currently the parser just ignores the EFS. My question is: are there specific reasons for implementing it this way, or would you merge a pull request that implements it faithfully to the spec? For now I have worked around it by silently ignoring decoding errors with decodeUtf8With and ignore, but that's rather ugly.
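To make the idea concrete, here is a minimal sketch of flag-aware filename decoding. It assumes only the bytestring and text packages; the names cp437High, cp437Char, decodeCP437 and decodeFileName are hypothetical and not part of the library's API. It also simplifies by mapping bytes below 0x80 straight to ASCII, so it is an illustration rather than a finished patch.

```haskell
import Data.Bits (testBit)
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')
import Data.Text.Encoding.Error (UnicodeException)
import Data.Word (Word8, Word16)

-- Upper half of CP437 (bytes 0x80..0xFF), 16 characters per row.
-- Hypothetical helper table, written out by hand for this sketch.
cp437High :: String
cp437High = concat
  [ "ÇüéâäàåçêëèïîìÄÅ"
  , "ÉæÆôöòûùÿÖÜ¢£¥₧ƒ"
  , "áíóúñÑªº¿⌐¬½¼¡«»"
  , "░▒▓│┤╡╢╖╕╣║╗╝╜╛┐"
  , "└┴┬├─┼╞╟╚╔╩╦╠═╬╧"
  , "╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀"
  , "αßΓπΣσµτΦΘΩδ∞φε∩"
  , "≡±≥≤⌠⌡÷≈°∙·√ⁿ²■\160"
  ]

-- Map one CP437 byte to a Unicode character.
-- Assumption: bytes below 0x80 are treated as plain ASCII.
cp437Char :: Word8 -> Char
cp437Char w
  | w < 0x80  = toEnum (fromIntegral w)
  | otherwise = cp437High !! (fromIntegral w - 0x80)

-- CP437 decoding cannot fail: every byte has a mapping.
decodeCP437 :: B.ByteString -> T.Text
decodeCP437 = T.pack . map cp437Char . B.unpack

-- Pick the decoder from the general purpose bit flag:
-- bit 11 (EFS) set means UTF-8, otherwise fall back to CP437.
decodeFileName :: Word16 -> B.ByteString -> Either UnicodeException T.Text
decodeFileName flags raw
  | testBit flags 11 = decodeUtf8' raw
  | otherwise        = Right (decodeCP437 raw)
```

With this, the byte '\x94' from the exception above decodes to 'ö' when the EFS bit is clear, while an entry with the bit set still goes through strict UTF-8 decoding and reports invalid input as a Left value instead of throwing.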
This was just a feature I didn't know enough about to implement.
I would accept a pull request!
Hi!
- I've got the same issue with CP1251-encoded Cyrillic filenames inside a ZIP archive.
- The default implementation does not work for such cases.
- I see an open PR about CP437.
Could you please tell me its current status and when it will be included in a release?
The PR is not open; it's closed. It was closed by @mrkkrp before I even had a chance to review it, and I'm not sure why.
Sorry, @mrkkrp, I failed to notice that!
The way forward, I suppose, might be to switch to using your zip package in pandoc, instead of zip-archive. (And then I could deprecate zip-archive, which I wrote solely to use in pandoc.)
@jgm, I like the idea. There should be no problem with switching (the package is well tested and has been used by several people in production), but let me know if you have any questions or issues with that!
Thank you.
After switching from zip-archive to zip I no longer get this error. But I found another wrong behaviour in the zip package, and it is caused by CP437 (incorrect encoding recognition). I will create an issue in zip.