dbashford/textract

Problems with cyrillic symbols

TsvetkovAV opened this issue · 10 comments

When I run the following script with Node.js (for example, against a .doc file):
    var textract = require('textract');

    textract.fromFileWithPath('test.doc', function (error, text) {
      if (error) throw error;
      console.log(text);
    });

With a .doc file, all Cyrillic symbols are unreadable (although when I run catdoc directly, I can read them),
and with a .docx file, all Cyrillic symbols are removed.

Sorry, pressed 'Ctrl+Enter'

Can you provide or give me an idea of what I should test? Feel like this problem was solved awhile ago but there may have been a regression.

I have now found that the problem is the same for both file types.
All Cyrillic characters are removed; only Latin characters, punctuation, and numbers are displayed.

For example, the attached file: test.docx

And result:
c:\JSTest\catdoc>node test
textract not ready, retrying in .5 seconds
INFO: 'pdftotext' does not appear to be installed, so textract will be unable to
extract PDFs. http://www.foolabs.com/xpdf/
INFO: 'drawingtotext' does not appear to be installed, so textract will be unabl
e to extract DXFs.
, , - , - . . , , . 3.5. , , : ; , , , ; , ; , , , ( , , ); , : , , , , ; - , :
, , , , , ; : , , -, - , .., , ; ; .; ( , ; .; , ..); ; ; ; . . 4. , , , , , ,
, . 4.1. , . : BCG < >; /, . , , / (< >), - ; ; . 4.2. . 1 : < - >. 2 - < >; < >
; < >; < >. 3 - < >, , . . 5. - : , , ( , , ); < > < >, . 5.1. , , . , , , ( ),
, , , , , . , , , , . : , ; , , ; , ; , , ; , , ; , , , , ; , . , , . . 1. , . ,
. 2. , , : , ( , , , , ), , , . 3. . 4. . . . 5. : , , , , , .. . , , (, ) ().
5.2. , - . , (R) (W) ; W, , ( ); RM, RI - . . 3. , W. , . W, , R; C -, N, ( ); ,
W C , , P, -; R, , ; , R W, L, - L . , W R , , . , , R, W. 2011 DOCX

If a file contains only Latin characters, textract works correctly.
Could you help me with this problem?
I also found the following.
If I comment out these lines:
    if ( options.preserveLineBreaks ) {
      // text = text.replace( WHITELIST_PRESERVE_LINEBREAKS, ' ' );
    } else {
      // text = text.replace( WHITELIST_STRIP_LINEBREAKS, ' ' );
    }
in your code in textract\lib\extract.js, the text formatting (paragraphs, spaces) comes back, and the removed Cyrillic characters appear, but as question marks ('?').
This holds for a .doc file with the same text as test.docx (attached in my previous comment). For the .docx file, only the formatting changes; the Cyrillic characters are still removed.
Thank you.

This should be good to go. It was only happening for .docx and .odtx because I had an extra text-stripping regex that hadn't been updated to include all the non-Latin characters.

Try 1.2.0.
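
For anyone wondering how a stripping regex produces output like the comma-and-digit soup above, here is a minimal sketch. The two patterns are hypothetical stand-ins, not the actual WHITELIST_* constants from textract, and the sample string is made up. The Unicode-aware variant assumes a Node.js version that supports `\p{...}` property escapes (10 or later).

```javascript
// Hypothetical ASCII-only whitelist: anything that is not a Latin letter,
// digit, listed punctuation mark, space, or newline is replaced.
const ASCII_WHITELIST = /[^A-Za-z0-9 .,;:!?()<>\/\-\n]/g;

// Hypothetical Unicode-aware whitelist: \p{L} matches letters in any script,
// \p{N} matches numbers in any script (requires the `u` flag).
const UNICODE_WHITELIST = /[^\p{L}\p{N} .,;:!?()<>\/\-\n]/gu;

function stripWith(whitelist, text) {
  // Replace disallowed characters with spaces, then collapse runs of spaces.
  return text.replace(whitelist, ' ').replace(/ {2,}/g, ' ').trim();
}

// Made-up sample mixing Cyrillic and Latin text.
const sample = 'Отчет BCG за 2011 год, раздел 4.1.';

console.log(stripWith(ASCII_WHITELIST, sample));   // Cyrillic words are gone
console.log(stripWith(UNICODE_WHITELIST, sample)); // text is left intact
```

With the ASCII-only pattern, only the Latin word, the digits, and the punctuation survive, which matches the symptom reported for .docx files.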

I'm having the exact same issue.

This is my config.

    var buffer = new Buffer(base64, 'base64'),
        type = 'application/msword',
        config = {
          preserveLineBreaks: true
        };
    textract.fromBufferWithMime(type, buffer, config, function(error, text) {
      if (error) {
        console.log(error);
        return(error);
      } else {
        return(null,text);
      }
    });

NOTE: This only happens with .doc files. I'm also on a Mac.

Can you give me a sample doc? (And maybe open a new issue with it to track?)

Sure.

Here is #71

Thanks