jgm/zip-archive

"Couldn't extract ePub file"

Closed this issue · 7 comments

Explain the problem.

pandoc cannot read this epub file, and it gives no more information than the one line below. The same file extracts without issues as a plain zip archive, and ebook-convert from Calibre handles it too.

❯ pandoc -i s.epub --verbose
Couldn't extract ePub file

Pandoc version?
pandoc 3.1, Arch Linux

Input epub file: Software Design X-Rays Fix Technical Debt with Behavioral Code Analysis by Adam Tornhill.zip

jgm commented

I added some additional error reporting and now get

Couldn't extract ePub file: getWordsTilSig: signature not found before EOF

The problem lies in extracting the zip container, and this message is generated by jgm/zip-archive, which evidently doesn't think this is a valid zip. (unzip has no trouble unpacking it, however.)

jgm commented

From zip-archive code:

  skip (fromIntegral extraFieldLength) -- extra field
  compressedData <- if bitflag .&. 0O10 == 0
      then getLazyByteString (fromIntegral compressedSize)
      else -- If bit 3 of general purpose bit flag is set,
           -- then we need to read until we get to the
           -- data descriptor record.  We assume that the
           -- record has signature 0x08074b50; this is not required
           -- by the specification but is common.
           do raw <- getWordsTilSig 0x08074b50

I wonder if this is a case where the signature is different?

jgm commented

How was this epub produced, do you happen to know?

I unpacked it with unzip (unzip -d sd Software\ Design etc.) and then repacked it with zip (cd sd; zip -r ../sd.epub *). pandoc was then able to handle the repacked sd.epub.

jgm commented

I'll transfer this to zip-archive. This code was added in 4d66754.

@mistmist if you're still out there, perhaps you could take a look?

jgm commented

Grepping shows that we don't have the signature "P K 07 08" in this file.
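For anyone who wants to repeat that check without grep, here is a minimal sketch of the same search. The names descriptorSig and sigOffsets are made up for illustration and are not part of zip-archive. The only fact it relies on is that 0x08074b50 is stored little-endian, so on disk it shows up as the bytes 50 4b 07 08.

```haskell
import qualified Data.ByteString as B

-- Data descriptor signature 0x08074b50 as it appears on disk
-- (little-endian): 'P' 'K' 0x07 0x08.
descriptorSig :: B.ByteString
descriptorSig = B.pack [0x50, 0x4b, 0x07, 0x08]

-- Byte offsets of every occurrence of the needle in the haystack.
-- (Hypothetical helper, for illustration only.)
sigOffsets :: B.ByteString -> B.ByteString -> [Int]
sigOffsets needle = go 0
  where
    go off bs =
      let (before, rest) = B.breakSubstring needle bs
      in if B.null rest
           then []                       -- no further match
           else
             let pos = off + B.length before
             in pos : go (pos + 1) (B.drop (B.length before + 1) bs)
```

In GHCi, sigOffsets descriptorSig <$> B.readFile "s.epub" lists the offsets where the signature occurs; an empty list means the bytes do not occur anywhere in the file.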

jgm commented

The documentation says

4.3.9.3 Although not originally assigned a signature, the value
0x08074b50 has commonly been adopted as a signature value
for the data descriptor record. Implementers SHOULD be
aware that ZIP files MAY be encountered with or without this
signature marking data descriptors and SHOULD account for
either case when reading ZIP files to ensure compatibility.

So I guess this is just a case where there is no signature. In that case I assume the data descriptor is just the last 12 bytes before the start of the next local file header (or before whatever record follows instead, e.g. signature 0x08064b50 or 0x02014b50).
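A rough sketch of that idea, to make the assumption concrete. This is not the fix that went into zip-archive; Descriptor, le32, nextRecordSigs and splitAtDescriptor are hypothetical names. It assumes a non-Zip64 entry (three 32-bit fields, 12 bytes), assumes the input starts at the first byte of the compressed data, and only handles the signature-less form. Because the compressed data can contain "PK.." bytes by chance, a candidate is accepted only if the descriptor's compressed-size field matches the number of bytes in front of it.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, (.|.))
import Data.Word (Word32)

-- Fields of a non-Zip64 data descriptor (hypothetical type).
data Descriptor = Descriptor
  { ddCrc32            :: Word32
  , ddCompressedSize   :: Word32
  , ddUncompressedSize :: Word32
  } deriving (Eq, Show)

-- Decode a little-endian 32-bit value from the first four bytes.
le32 :: B.ByteString -> Word32
le32 bs = foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0
                (B.unpack (B.take 4 bs))

-- Records that may follow an entry, as on-disk byte sequences.
nextRecordSigs :: [B.ByteString]
nextRecordSigs =
  [ B.pack [0x50, 0x4b, 0x03, 0x04] -- local file header, 0x04034b50
  , B.pack [0x50, 0x4b, 0x01, 0x02] -- central directory header, 0x02014b50
  , B.pack [0x50, 0x4b, 0x06, 0x08] -- archive extra data record, 0x08064b50
  ]

-- Split input into (compressed data, descriptor, remaining input).
-- Scans for the next record signature and treats the 12 bytes before
-- it as the descriptor, accepting the split only when the size field
-- agrees with the amount of data found.
splitAtDescriptor :: B.ByteString -> Maybe (B.ByteString, Descriptor, B.ByteString)
splitAtDescriptor input = go 12
  where
    go pos
      | pos + 4 > B.length input = Nothing
      | B.take 4 (B.drop pos input) `elem` nextRecordSigs
      , let dataLen = pos - 12
      , let raw = B.take 12 (B.drop dataLen input)
      , let dd = Descriptor (le32 raw) (le32 (B.drop 4 raw)) (le32 (B.drop 8 raw))
      , fromIntegral (ddCompressedSize dd) == dataLen
      = Just (B.take dataLen input, dd, B.drop pos input)
      | otherwise = go (pos + 1)
```

A real reader would presumably try the signed 16-byte form first and also verify the CRC-32 after decompressing, since the size check alone can still accept a false match.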

sorry for the inconvenience, thanks @jgm for fixing this! (i was hoping to look into it this weekend)