morungos/node-word-extractor

Extract from a buffer

Closed this issue · 10 comments

njlr commented

I am using Node.js and downloading .doc files using superagent. This gives me a buffer object that I would like to parse and extract text from. However, word-extractor only seems to support files.

How do I extract the text from a .doc in memory, not in a file?

That's mainly an issue with the underlying OLE implementation, which is very much wired to use files. All the logic that depends on fs is local to OleCompoundDoc, so one solution would be to build an alternative implementation of that class that is backed by a buffer rather than a file. Or, perhaps better, to refactor the file system access into a separate set of methods that could be overridden more easily.
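To make the second option concrete, here is a minimal sketch of what "a separate set of methods" could look like. It is not how OleCompoundDoc is actually structured: `FileSource`, `BufferSource`, and `readHeader` are hypothetical names, and the reads are synchronous only to keep the example short.

```javascript
const fs = require('fs')

// Hypothetical names: neither class exists in word-extractor.

// File-backed source: reads bytes from disk on demand.
class FileSource {
  constructor(filename) {
    this.fd = fs.openSync(filename, 'r')
  }

  // Returns up to `length` bytes starting at byte offset `position`.
  read(position, length) {
    const target = Buffer.alloc(length)
    const bytesRead = fs.readSync(this.fd, target, 0, length, position)
    return target.subarray(0, bytesRead)
  }

  close() {
    fs.closeSync(this.fd)
  }
}

// Buffer-backed source: same interface, no file system access.
class BufferSource {
  constructor(buffer) {
    this.buffer = buffer
  }

  // Returns up to `length` bytes starting at byte offset `position`.
  read(position, length) {
    return this.buffer.subarray(position, position + length)
  }

  close() {}
}

// Code written against the shared interface works with either source.
// 512 bytes is the size of a compound file header.
function readHeader(source) {
  return source.read(0, 512)
}
```

With something like this, the parsing code would only ever call `read(position, length)`, and the choice between a path and a buffer would be made once, when the source is constructed.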

It's a nice and important addition. If I can get the time for this, I will.

In the same boat here. Though refactoring sounds much more complicated than I assumed...

Also, this is pretty much the same issue as #3

I implemented buffer support at gmr-fms/node-word-extractor, if you guys are willing to switch to the npm package @gmr-fms/word-extractor. I didn't want to work with CoffeeScript, hence the JS source and slightly modified API.

const fs = require('fs')
const extract = require('@gmr-fms/word-extractor')

const buf = fs.readFileSync('path/to/file.doc')

extract.fromBuffer(buf).then(doc => {
  // do stuff with doc here
})

I really appreciate this library though, the code was very clean and easy to follow.

Wonderful work, @olsonpm. I've been a little snowed under here, so I think we can coordinate better and maybe work out how to re-merge these repositories. Like you, I've ditched CS in most repositories in favour of ES6, so that's no loss :-)

Sounds good to me. If you want to decaffeinate the code to your liking and let me know what API you have in mind for exposing buffer support, I'll create a PR.

Will do. I'm picking up a few bits of GitHub code for other reasons, so now is a good time. I'll decaffeinate tonight. I'll make that a separate issue for tracking.

Right, develop contains an updated codebase, using ES6 (more or less) and switched to Jest for testing. I'm planning a 0.2.0+ release at some stage, so now is a good time to merge in changes.

Hi,
Does it also work with mathematical equations and symbols? I have a docx file and I want to read text, math equations, and images. Is it possible to do so?
Thanks in advance.

No, it won't ever handle equations. They're embedded using OLE, and aren't in a format that Word can understand. I can't even find a definition in the official MS binary format pack: https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/ccd7b486-7881-484c-a137-51170af7cc22, so it's very possible it's still a closed specification.

Also, it seems MS killed off the Equation Editor anyway (https://www.theregister.co.uk/2018/01/16/microsoft_equation_editor_patched/) which makes it even less likely we can get the format.

I am sure the data is in there, just as part of the compound file structure it's separate from the Word contents.
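As a side note on the compound file structure: the question mentions a docx file, and a .docx is a ZIP archive rather than an OLE compound file, so it can help to check which container a buffer actually holds before parsing. A rough sketch, where `sniffContainer` is a hypothetical helper and not part of word-extractor, and only the leading signature bytes are inspected:

```javascript
// First eight bytes of every OLE compound file (.doc, .xls, .ppt, ...).
const OLE_SIGNATURE = Buffer.from([0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1])

// Local file header signature of a ZIP archive, the container .docx uses.
const ZIP_SIGNATURE = Buffer.from([0x50, 0x4b, 0x03, 0x04])

// Hypothetical helper: classifies a buffer by its leading bytes only.
// It does not validate the rest of the file.
function sniffContainer(buffer) {
  if (buffer.subarray(0, OLE_SIGNATURE.length).equals(OLE_SIGNATURE)) return 'ole'
  if (buffer.subarray(0, ZIP_SIGNATURE.length).equals(ZIP_SIGNATURE)) return 'zip'
  return 'unknown'
}
```

A buffer that comes back as 'ole' is at least a candidate for the compound file parser. Anything else would need a different code path.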

The article makes entertaining reading: it looks like MS lost the Equation Editor source code at some point :-)