ZJONSSON/node-unzipper

unzipper 0.11.5 completely breaks reading of docx files

jribbens opened this issue · 8 comments

Summary

unzipper 0.11.4 and earlier work fine for reading DOCX files (which are zip files). 0.11.5, however, is completely broken and crashes with Error: FILE_ENDED.

Sample code

import { readFile } from 'node:fs/promises'
import * as unzipper from 'unzipper'
const ARCHIVE = 'example0.docx'
const FILENAME = 'word/_rels/document.xml.rels'
console.log('opening archive')
const archive = await unzipper.Open.buffer(await readFile(ARCHIVE))
console.log('finding file')
const file = archive.files.find(file => file.path === FILENAME)
console.log('reading file')
const contents = await file.buffer()
console.log('file length', contents.length)

Output under 0.11.4

opening archive
finding file
reading file
file length 817

Output under 0.11.5

opening archive
finding file
reading file
node:internal/process/esm_loader:97
    internalBinding('errors').triggerUncaughtException(
                              ^

Error: FILE_ENDED
    at PullStream.pull (/home/jon/src/unzipper-test/node_modules/unzipper/lib/PullStream.js:78:28)
    at PullStream.emit (node:events:514:28)
    at PullStream.<anonymous> (/home/jon/src/unzipper-test/node_modules/unzipper/lib/PullStream.js:15:10)
    at PullStream.emit (node:events:526:35)
    at finish (node:internal/streams/writable:748:10)
    at finishMaybe (node:internal/streams/writable:733:9)
    at afterWrite (node:internal/streams/writable:507:3)
    at onwrite (node:internal/streams/writable:480:7)
    at cb (/home/jon/src/unzipper-test/node_modules/unzipper/lib/PullStream.js:38:14)
    at /home/jon/src/unzipper-test/node_modules/unzipper/lib/PullStream.js:71:91
    at afterWrite (node:internal/streams/writable:500:5)
    at afterWriteTick (node:internal/streams/writable:487:10)
    at process.processTicksAndRejections (node:internal/process/task_queues:81:21)

Node.js v18.17.1

Example input file

The example input file is downloadable here: https://unequivocal.eu/dl/example0.docx

This may be because of this case:
[screenshot of the relevant ZIP specification excerpt]

The way I read it, there can be a file footer when the CRC and length are not known ahead of time. This footer can be anywhere from 12 to 24 bytes long.
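
For reference, here is a rough sketch of where that 12 to 24 byte range comes from, based on the data descriptor layout in the ZIP spec (illustrative only, not unzipper code):

// Data descriptor that can follow the file data when the CRC and sizes are
// not known up front (general-purpose flag bit 3 set):
const SIGNATURE = 4; // optional 0x08074b50 marker
const CRC32 = 4;
const SIZE32 = 4;    // compressed and uncompressed size, 4 bytes each
const SIZE64 = 8;    // 8 bytes each in ZIP64 archives

const minFooter = CRC32 + 2 * SIZE32;             // 12 bytes: no signature, 32-bit sizes
const maxFooter = SIGNATURE + CRC32 + 2 * SIZE64; // 24 bytes: signature plus ZIP64 sizes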

So should we just bump the buffer to something like 100 bytes? I'll check it out on a .docx file.

So I created a simple docx file and noticed a very annoying discrepancy: the central directory shows an extraFieldLength of 0, but the local file header shows a positive length (a quick way to inspect the mismatch is sketched after the list below). This is a violation of the spec, as the central directory is supposed to be the source of truth. There are two ways to approach this:

  • Add a sizeable buffer (1k) to the length and hope that it will be enough
  • Recalculate the length on the basis of the local header and, if local length > directory length, call unzip again with the local length. This would lead to two separate calls whenever the local length is higher
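
Here is that quick inspection sketch; the variable names are hypothetical, while the field offsets come from the ZIP spec (extraFieldLength is a little-endian uint16 at offset 28 of the local file header and offset 30 of the central directory file header):

// buf is the full archive buffer; localHeaderOffset and centralDirOffset point
// at the start of the local file header and the matching central directory record.
const localExtraLen = buf.readUInt16LE(localHeaderOffset + 28);
const centralExtraLen = buf.readUInt16LE(centralDirOffset + 30);
console.log(localExtraLen, centralExtraLen); // for the docx above: a positive value vs 0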

The second option is a little annoying to implement since we are still using .then and can't simply return a new call to unzip. It might be worthwhile changing the function to async/await to make this easier.
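
As an illustration only, this is roughly the shape async/await would allow; the function name, the readLocalHeader helper, and the surrounding structure are hypothetical, not the actual unzipper internals:

// Hypothetical sketch: with async/await, retrying with a larger size is just a
// recursive call and a return, rather than threading a new promise through .then chains.
async function openEntry(source, offset, _password, directoryVars, length, entry) {
  const vars = await readLocalHeader(source, offset); // hypothetical helper
  const localSize = 30
    + (vars.extraFieldLength || 0)
    + (vars.fileNameLength || 0)
    + vars.compressedSize;
  if (localSize > length) {
    // the local header says the entry is bigger than the central directory claimed
    return openEntry(source, offset, _password, directoryVars, localSize, entry);
  }
  // ... continue streaming the entry using `length` bytes
}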

@jribbens @jpambrun can you please check this out:

var localSize = 30
  + 100 // add extra padding
  + (vars.extraFieldLength || 0)
  + (vars.fileNameLength || 0)
  + vars.compressedSize;

if (localSize > length) {
  entry.emit('streamRetry', localSize);
  return unzip(source, offset, _password, directoryVars, localSize, entry);
}
(part of #312)

Here, I compare the size calculated from the local header to the size requested using the central directory and retry the unzip method if the local header points to a larger stream.

It's an interesting approach, but I'm thinking that in a file where all entries are like that, it would stream the whole file twice and performance would be cut in half?

Another idea is to have configurable extra padding and error out when it's not enough, with a helpful message (e.g. "Local header for entry XYZ is 76 bytes larger than specified in the central directory; consider using a padding of 76 or more."), and let people tweak this for their application's use case?
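
A sketch of that error-out idea (hypothetical code, not part of the PR; the names mirror the earlier snippet, and vars.path is assumed to be the entry's file name):

// Hypothetical: fail loudly instead of retrying when the padding is too small.
if (localSize > length) {
  const shortfall = localSize - length;
  throw new Error(
    'Local header for entry ' + vars.path + ' is ' + shortfall +
    ' bytes larger than specified in the central directory; ' +
    'consider using a padding of at least ' + shortfall + '.'
  );
}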

Yet another idea is to continue with the short stream, issue a new range request only for the missing bytes, and switch the downloading streams at the end. That feels a bit complex and error-prone.

In this PR, if you look more closely, I have made the padding configurable, defaulting to 1k.

const padding = options && options.padding || 1000;
vars.stream = function(_password) {
  const totalSize = 30
    + padding // add an extra buffer
    + (vars.extraFieldLength || 0)
    + (vars.fileNameLength || 0)
    + vars.compressedSize;

So checking for localSize is just a fallback for when padding is not enough (and it also emits a streamRetry event that you can listen to).
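
For example, assuming options.padding is plumbed through the Open API the way the snippet suggests (the exact plumbing and the object that emits streamRetry are not confirmed here), the original repro could opt into a larger padding and watch for retries roughly like this:

const archive = await unzipper.Open.buffer(await readFile(ARCHIVE), { padding: 2000 });
const file = archive.files.find(file => file.path === FILENAME);
const stream = file.stream();
stream.on('streamRetry', localSize => {
  // the padding was still too small; the entry is re-read using localSize bytes
  console.warn('retrying with', localSize, 'bytes');
});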

@ZJONSSON, can confirm this fixes the problem in my tests.

A minor point: padding can't be set to zero in this case, as options.padding || 1000 will fall back to 1000.
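
A sketch of a fix for that, using a null/undefined check so an explicit 0 is respected (illustrative only):

// options.padding || 1000 coerces 0 back to 1000; checking for null/undefined
// instead lets callers disable the extra padding entirely.
const padding = options && options.padding != null ? options.padding : 1000;
// or, with newer syntax: const padding = options?.padding ?? 1000;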