dbashford/textract

Extract text from DOC files (Windows 10 64-bit)

SHocker-Yu opened this issue · 10 comments

"DOC extraction requires antiword be installed, link, unless on OSX in which case textutil (installed by default) is used."

OS: Windows 10 64-bit
I tried to install antiword.exe but it failed, and I don't know how to solve this problem...

zzzwx commented

Have you added the path to your antiword.exe file to the PATH environment variable?

@zzzwx Thanks for your reply, but antiword does not support Windows.

zzzwx commented

@SHocker-Yu I am using it on Windows (7 and 10).
Some good fellow actually compiled it for Windows; get it here: http://www-stud.rbi.informatik.uni-frankfurt.de/~markus/antiword/

@zzzwx I appreciate your kind reply.
I downloaded it, but antiword.exe flashes back when I try to run it.
OS: Windows 10
Have you come across this situation?
Could you tell me how to get it running successfully?

zzzwx commented

@SHocker-Yu What do you mean by "flash back"?

Here are the steps I followed to make it work on Windows:

0/ modify textract/lib/extractors/doc.js to fix a bug reported in a github issue

-        if ( error.toString().indexOf( 'is not a Word Document' ) ) {
+        if ( error.toString().indexOf( 'is not a Word Document' ) > 0 ) {

1/ download windows binary

2/ add the antiword directory to Windows' PATH environment variable

=> at this point it worked but only when the path to the doc file contained no spaces

3/ modify textract/lib/extractors/doc.js again to add quotes so that it reads the input path as is

-    var escapedPath = filePath.replace( /\s/g, '\\ ' );
+   var escapedPath = filePath/*.replace( /\s/g, '\\ ' )*/;

-    exec( 'antiword ' + escapedPath,
+   exec( 'antiword "' + escapedPath + '"',

=> at this point it worked for every path

4/ modify textract/lib/extractors/doc.js one last time to manage UTF8 encoding of output text

  - exec( 'antiword "' + escapedPath + '"',
  + exec( 'antiword -m UTF-8.txt "' + escapedPath + '"',

=> and after that it worked well all the time :)
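
For anyone wondering why step 0 matters: `indexOf` returns -1 when the phrase is absent, and -1 is truthy in JavaScript, so the unpatched condition fires for every error. The sketch below illustrates this under stated assumptions: `mentions` is a hypothetical helper (not part of textract) and both error messages are made up for the demonstration. The `> 0` in the patch is enough here because an `Error` stringifies with an "Error: " prefix, so a match never sits at index 0; `!== -1` is the more general form.

```javascript
// Hypothetical helper, not part of textract: reports whether an
// error's string form contains the given phrase.
function mentions(error, phrase) {
  // indexOf returns -1 when the phrase is absent; -1 is truthy,
  // so the result has to be compared explicitly.
  return error.toString().indexOf(phrase) !== -1;
}

const phrase = 'is not a Word Document';

// Made-up error messages, for illustration only.
const unrelated = new Error('antiword: command not found');
const notWord = new Error('file.doc is not a Word Document');

// The unpatched condition is truthy even for an unrelated error.
console.log(Boolean(unrelated.toString().indexOf(phrase))); // true

// The explicit comparison separates the two cases.
console.log(mentions(unrelated, phrase)); // false
console.log(mentions(notWord, phrase)); // true
```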

hope this helps you
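
A possible alternative to the manual quoting in step 3, offered as a sketch and not as what textract does: Node's `child_process.execFile` takes the arguments as an array and does not go through a shell, so a path with spaces should need neither escaping nor quotes. `antiwordArgs` and `extractDoc` are hypothetical names, and the example assumes antiword is reachable through PATH as in step 2.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: builds the antiword argument list.
// '-m UTF-8.txt' is the mapping option from step 4.
function antiwordArgs(filePath) {
  // The path is passed as a single argument, spaces included.
  return ['-m', 'UTF-8.txt', filePath];
}

// Hypothetical wrapper: runs antiword and hands the extracted text
// (or the error) to a Node-style callback.
function extractDoc(filePath, callback) {
  execFile('antiword', antiwordArgs(filePath), (error, stdout) => {
    if (error) {
      callback(error);
      return;
    }
    callback(null, stdout);
  });
}

// Show the arguments that would be passed for a path with spaces.
console.log(antiwordArgs('C:\\My Documents\\report.doc'));
```

Calling `extractDoc('C:\\My Documents\\report.doc', (err, text) => { ... })` would then run antiword on that file; this has not been checked against the Windows build linked above.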

@zzzwx I really appreciate your kindness. Sorry about my poor English; 'flash back' means 'crash'. I have had to work all day recently, so I am replying late, sorry. I have read your reply, and I will try it and then tell you the result.
Best wishes.

@zzzwx It works! Thank you so much!!!

FYI, I've implemented the changes from above across a few different commits the last few months (sorry so slow!).

Published as 2.1, thanks!

zzzwx commented

Hi @dbashford , thank you for your work