morungos/node-word-extractor

Error: Max buffer length exceeded: attribValue

thegoatherder opened this issue · 10 comments

I ran into the error below while extracting text with v1.0.2 from a DOCX that is 396 KB (405,504 bytes).

I can't seem to get the whole call stack to print out for this, but here is what it gave me:

Exception has occurred: Error: Max buffer length exceeded: attribValue
Line: 1
Column: 97393
Char: 
  at error (/harvester/node_modules/sax/lib/sax.js:651:10)
    at checkBufferLength (/harvester/node_modules/sax/lib/sax.js:125:13)
    at SAXParser.write (/harvester/node_modules/sax/lib/sax.js:1505:7)
    at SAXStream.write (/harvester/node_modules/sax/lib/sax.js:239:18)
    at AssertByteCountStream.ondata (internal/streams/readable.js:745:22)
    at AssertByteCountStream.emit (events.js:376:20)
    at addChunk (internal/streams/readable.js:309:12)
    at readableAddChunk (internal/streams/readable.js:284:9)
    at AssertByteCountStream.Readable.push (internal/streams/readable.js:223:10)
    at AssertByteCountStream.Transform.push (internal/streams/transform.js:166:32)

Further up the call stack inspector in VS Code, it seems to originate from ./lib/word.js:48.
So I think the error is caused by this line at ./lib/word.js:45:

const buffer = Buffer.alloc(512);

Potential fixes:

  • Increase the default buffer size
  • Allow the buffer size to be configurable.

I'm afraid I cannot share the document in question in a public forum like this, but if you'd like to connect I can prepare an anonymised version and show it to you on a screen share. I can tell you that it has an embedded image that appears to be high resolution.

PS: More worrying to me than the error itself is that I don't seem to be able to catch it.
In the code sample below, the catch block is never hit when the error is thrown:

const WordExtractor = require('word-extractor')
const extractor = new WordExtractor()

extractor.extract(file.Path)
  .then((doc) => doSomething(doc))
  .catch((e) => console.log(e))

I need to be able to catch these errors - any ideas?

First of all, I think the right thing to do is call this two distinct issues. The inability to catch the error is arguably significantly worse, as that shouldn't be the case. I'll need to look more into the XML parser on that one, but I'll definitely raise a second issue to ensure that any errors from the XML side (obviously there shouldn't be any) are handled in a safe manner.
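For context, here is a minimal sketch of one way an error can escape a promise's .catch in Node. It is not confirmed to be what word-extractor does. The stack trace above shows the error being raised inside a stream data handler, which runs on a later tick than the code that created the promise. `FakeParser`, `throwsWithoutListener`, and `parseLater` are hypothetical names that use only Node's built-in events module, not sax-js or word-extractor. An 'error' event with no listener is thrown at the emit() call site, whereas forwarding the event to reject makes it catchable.

```javascript
const { EventEmitter } = require('events');

// Hypothetical stand-in for a streaming parser that reports problems by
// emitting 'error', as Node emitters and streams conventionally do.
class FakeParser extends EventEmitter {
  constructor(maxLength) {
    super();
    this.maxLength = maxLength;
  }

  write(chunk) {
    if (chunk.length > this.maxLength) {
      // With no 'error' listener attached, this emit() throws right here.
      this.emit('error', new Error('Max buffer length exceeded'));
      return;
    }
    this.emit('text', chunk);
  }
}

// Shows that an unhandled 'error' event becomes a thrown exception.
function throwsWithoutListener() {
  const parser = new FakeParser(4);
  try {
    parser.write('too long');
    return false;
  } catch (e) {
    return true;
  }
}

// Feeds the parser on a later tick, as a stream 'data' handler would.
// The 'error' listener is what turns the failure into a rejection;
// without it, the throw would surface as an uncaught exception instead.
function parseLater(chunks, maxLength) {
  return new Promise((resolve, reject) => {
    const parser = new FakeParser(maxLength);
    const pieces = [];
    parser.on('text', (text) => pieces.push(text));
    parser.on('error', reject);
    setImmediate(() => {
      for (const chunk of chunks) parser.write(chunk);
      // resolve() is a no-op if reject() has already been called.
      resolve(pieces.join(''));
    });
  });
}
```

With this shape, `parseLater(['ab', 'too long'], 4).catch(...)` receives the error. If the `parser.on('error', reject)` line is removed, the same failure is thrown from the timer callback, where no surrounding promise can see it.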

Based on your description, I should be able to mock some form of document manually that addresses the extra-long attribute issue, although it does look entirely possible that's an issue in a dependency. If I can mock it, I ought to be able to prove that one way or another, and have something safe enough to pass on for that repository if that's what's needed.

Thanks for the good, clear report, and I'm on it.

So the good news is that I've constructed a simple Word file that shows the problem. It's a sax-js issue, as far as I can tell: attributes longer than 65535 characters throw an error. XML allows attributes of any length, so this is out of line with the specification. sax-js does handle some cases where the buffer runs out of space, but not this one.

The buffer in question is not one of ours. We do allocate a small buffer, but mainly to sniff at the first few bytes just to check if it's an Open Office format or an OLE format.
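As an illustration of that kind of sniffing, here is a minimal sketch that classifies a file header by its leading magic bytes. It is not the library's actual code: `sniffFormat` and `startsWith` are hypothetical names. The signatures themselves are the standard ones: ZIP containers (which is what a .docx file is) begin with `PK\x03\x04`, and OLE compound files (legacy .doc) begin with `D0 CF 11 E0 A1 B1 1A E1`.

```javascript
// Standard file signatures (magic bytes).
const OLE_MAGIC = Buffer.from([0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]);
const ZIP_MAGIC = Buffer.from([0x50, 0x4b, 0x03, 0x04]);

// True when `header` begins with the bytes in `magic`.
// A header shorter than the signature never matches.
function startsWith(header, magic) {
  return header.subarray(0, magic.length).equals(magic);
}

// Hypothetical helper: classify a header read from the start of a file.
function sniffFormat(header) {
  if (startsWith(header, OLE_MAGIC)) return 'ole';
  if (startsWith(header, ZIP_MAGIC)) return 'zip';
  return 'unknown';
}
```

A small fixed-size read, such as the 512-byte buffer mentioned in the report, is more than enough for this check, since only the first eight bytes are inspected.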

This may well be poor error handling on our part. There's no sign that we need any attributes longer than that, and we'd be fine within sax-js if we can simply ignore these errors (and continue), so it may be that solving #38 is closely linked with this after all.

Sadly, initially, no such joy. We can certainly try to swallow the error and seem to continue, but in practice, that seems to more or less shut down future parsing anyway, so I'm not sure it's going to solve either of these issues. However, I think this is relatively easy to replicate in pure sax-js.

But then, when I look at sax-js, I think: there's no point. It hasn't been updated for years, so we should switch the XML layer, say to saxes. That also makes things easier if we do find an issue. So let's do that.

OK @thegoatherder as you might have noticed in other issues, this is definitely a sax-js issue, and sax-js isn't well-enough maintained to have addressed this (and we are not the first to encounter it). So I've created yet another issue which will mean switching the XML layer to a new dependency, saxes. This will be entirely transparent, but it's going to take a day or so to get it done.

Hi @thegoatherder -- I think I have a resolution here, and it's mostly done (checked in on the develop branch if you're brave). I want to give it some more checks tomorrow, but I have a few things on, so timing might be a bit unpredictable. I'll publish to npm in the next day or so. In the meantime, I am pretty sure I'll have this issue out of the door shortly.

Thanks for this - I'd like to say it was fun, and it kind of was.

Hi @morungos, my timing is also a bit unpredictable today, so I'll probably have to wait for the next release before I can test and give feedback. Thank you for your quick attention in addressing these issues. I'm extremely grateful to be using a repo with such a responsive and courteous maintainer.

The "attribute" thing was a plain XML attribute, so, for example, in:

<p data="xxxxxx">...

There is no limit to the size of the "xxxxxx" in the XML specifications. SAX APIs try to minimize memory, so the old library I used was crashing out when the value hit around 65k. The Open Office Word file formats (docx) are zip files full of XML, and clearly something somewhere had a long attribute, but the parser was crashing out before it handed the attribute to my code. The old library handled most cases where long text could be expected, but not this one.

All I did to replicate was unzip, manually add a ton of text to a random irrelevant XML attribute, zip it back up again and test. Then, switching to the different parser was easy, but actually getting the error handling right took a little longer to puzzle through.
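The replication idea can be sketched without any dependencies. This is only an illustration under stated assumptions: `buildLongAttributeXml` and `longestAttributeValue` are hypothetical helpers, and the regular expression is a deliberately naive scan for double-quoted values, not a real XML parser. The first function produces XML with an arbitrarily long attribute. The second reports the longest value found, which could help spot which XML part inside an unzipped docx carries an oversized attribute.

```javascript
// Build a tiny XML fragment whose single attribute value has the given length.
function buildLongAttributeXml(length) {
  return '<p data="' + 'x'.repeat(length) + '">text</p>';
}

// Return the length of the longest double-quoted value that follows an '='.
// Naive on purpose: it ignores single-quoted values and would also count
// matching patterns that happen to appear in text content.
function longestAttributeValue(xml) {
  let longest = 0;
  for (const match of xml.matchAll(/="([^"]*)"/g)) {
    longest = Math.max(longest, match[1].length);
  }
  return longest;
}
```

For example, `longestAttributeValue(buildLongAttributeXml(70000))` returns 70000, which is above the roughly 65k point where the old parser gave up.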

I should be able to publish in an hour or so.

OK, published now. 1.0.3 is now on npmjs. Let me know how it goes.