Extract from a buffer
Closed this issue · 10 comments
I am using Node.js and downloading .doc files using superagent. This gives me a buffer object that I would like to parse and extract text from. However, word-extractor only seems to support files.
How do I extract the text from a .doc in memory, not in a file?
That's mainly an issue of the underlying OLE implementation, which is very much wired to use files. All the logic that depends on fs is local to OleCompoundDoc, so one solution would be to build an alternative implementation of that class that is backed by a buffer rather than a file. Or, perhaps better, to refactor the file system access into a separate set of methods that could be overridden more easily.
It's a nice and important addition. If I can get the time for this, I will.
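The refactoring suggested above could look roughly like the sketch below. It is only an illustration, not word-extractor's actual code: FileStorage, BufferStorage and hasOleHeader are hypothetical names, and the assumption is that the parser only ever needs a read(position, length) call. The 8-byte signature is the standard OLE compound file header.

```javascript
const fs = require('fs');

// Hypothetical storage backed by a file on disk.
// read(position, length) resolves to a Buffer with the requested bytes.
class FileStorage {
  constructor(filename) {
    this.filename = filename;
  }

  async read(position, length) {
    const handle = await fs.promises.open(this.filename, 'r');
    try {
      const chunk = Buffer.alloc(length);
      const { bytesRead } = await handle.read(chunk, 0, length, position);
      return chunk.subarray(0, bytesRead);
    } finally {
      await handle.close();
    }
  }
}

// Hypothetical storage backed by an in-memory Buffer, with the same interface.
class BufferStorage {
  constructor(buffer) {
    this.buffer = buffer;
  }

  async read(position, length) {
    return this.buffer.subarray(position, position + length);
  }
}

// Signature found at the start of every OLE compound file.
const OLE_MAGIC = Buffer.from([0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]);

// Parsing code written against the interface does not care where bytes come from.
async function hasOleHeader(storage) {
  const header = await storage.read(0, OLE_MAGIC.length);
  return header.equals(OLE_MAGIC);
}
```

With this shape, a downloaded buffer could be passed as new BufferStorage(buf), while existing callers keep using new FileStorage('path/to/file.doc').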
In the same boat here. Though refactoring sounds much more complicated than I assumed...
Also, this is pretty much the same issue as #3
I implemented buffer support at gmr-fms/node-word-extractor if you guys are willing to switch to the npm package @gmr-fms/word-extractor. I didn't want to work with CoffeeScript, hence the JS source and slightly modified API.
const fs = require('fs')
const extract = require('@gmr-fms/word-extractor')
const buf = fs.readFileSync('path/to/file.doc')
extract.fromBuffer(buf).then(doc => {
  // do stuff with doc here
})
I really appreciate this library though, the code was very clean and easy to follow.
Wonderful work, @olsonpm. I've been a little snowed under here, but I think we can coordinate better and maybe work out how to re-merge these repositories. Like you, I've ditched CS in most repositories in favour of ES6, so that's no loss :-)
Sounds good to me. If you want to decaffeinate the code to your liking and let me know what API you have in mind for exposing buffer support, I'll create a PR.
Will do. I'm picking up a few bits of GitHub code for other reasons, so now is a good time. I'll decaffeinate tonight. I'll make that a separate issue for tracking.
Right, develop contains an updated codebase, using ES6 (more or less) and switched to use Jest for testing. I'm planning a 0.2.0+ release at some stage, so now is a good time to merge in changes.
Hi,
Does it also work with mathematical equations and symbols? I have a docx file and I want to read text, math equations, and images. Is it possible to do so?
Thanks in advance.
No, it won't ever handle equations. They're embedded using OLE, and aren't in a format that Word can understand. I can't even find a definition in the official MS binary format pack: https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/ccd7b486-7881-484c-a137-51170af7cc22, so it's very possible it's still a closed specification.
Also, it seems MS killed off the Equation Editor anyway (https://www.theregister.co.uk/2018/01/16/microsoft_equation_editor_patched/) which makes it even less likely we can get the format.
I am sure the data is in there; it's just that, as part of the compound file structure, it's separate from the Word contents.
The article makes entertaining reading: it looks like MS lost the Equation Editor source code at some point :-)