Problems with cyrillic symbols

Question

TsvetkovAV opened this issue 9 years ago · 10 comments

When I run the following JS file with Node.js (for example, against a .doc file):

    var textract = require('textract');

    textract.fromFileWithPath('test.doc', function( error, text ) {
      if (error) throw error;
      console.log(text);
    });

all Cyrillic characters from the .doc file are unreadable (although when I run Catdoc directly, I can read them),
and with a .docx file all Cyrillic characters are removed.

dbashford commented 9 years ago

?

Answer 1 · 2015-11-11T19:11:41.000Z

Sorry, pressed 'Ctrl+Enter'

Answer 2 · 2015-11-11T19:17:07.000Z

Can you provide a sample file or give me an idea of what I should test? I feel like this problem was solved a while ago, but there may have been a regression.

Answer 3 · 2015-11-11T19:31:52.000Z

I have now found that both file types have the same problem:
all Cyrillic characters are removed, and only Latin characters, punctuation symbols, and numbers are displayed.
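A quick way to confirm this symptom is to count the Cyrillic characters in the extracted text. The snippet below is only a diagnostic sketch: `countCyrillic` is a hypothetical helper, not part of textract, and it assumes a Node.js version that supports Unicode property escapes in regular expressions.

```javascript
// Hypothetical diagnostic helper (not part of textract).
// Matches any character whose Unicode script is Cyrillic.
const CYRILLIC = /\p{Script=Cyrillic}/gu;

function countCyrillic(text) {
  // String.prototype.match returns null when nothing matches.
  const matches = text.match(CYRILLIC);
  return matches ? matches.length : 0;
}

// A healthy extraction keeps its Cyrillic letters...
console.log(countCyrillic('Привет, world'));
// ...while output like the one reported in this issue has none.
console.log(countCyrillic(', , - . 3.5 DOCX'));
```

If the count is zero for a document known to contain Russian text, the characters were lost somewhere between extraction and output.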

Answer 4 · 2015-11-11T19:35:44.000Z

test.docx
For example

And result:
c:\JSTest\catdoc>node test
textract not ready, retrying in .5 seconds
INFO: 'pdftotext' does not appear to be installed, so textract will be unable to extract PDFs. http://www.foolabs.com/xpdf/
INFO: 'drawingtotext' does not appear to be installed, so textract will be unable to extract DXFs.
, , - , - . . , , . 3.5. , , : ; , , , ; , ; , , , ( , , ); , : , , , , ; - , :
, , , , , ; : , , -, - , .., , ; ; .; ( , ; .; , ..); ; ; ; . . 4. , , , , , ,
, . 4.1. , . : BCG < >; /, . , , / (< >), - ; ; . 4.2. . 1 : < - >. 2 - < >; < >
; < >; < >. 3 - < >, , . . 5. - : , , ( , , ); < > < >, . 5.1. , , . , , , ( ),
, , , , , . , , , , . : , ; , , ; , ; , , ; , , ; , , , , ; , . , , . . 1. , . ,
. 2. , , : , ( , , , , ), , , . 3. . 4. . . . 5. : , , , , , .. . , , (, ) ().
5.2. , - . , (R) (W) ; W, , ( ); RM, RI - . . 3. , W. , . W, , R; C -, N, ( ); ,
W C , , P, -; R, , ; , R W, L, - L . , W R , , . , , R, W. 2011 DOCX

Answer 5 · 2015-11-15T22:28:56.000Z

If a file contains only Latin characters, textract works correctly.
Could you help me with this problem?
Here is what I found. If I comment out the following lines in textract\lib\extract.js:

    if ( options.preserveLineBreaks ) {
      // text = text.replace( WHITELIST_PRESERVE_LINEBREAKS, ' ' );
    } else {
      // text = text.replace( WHITELIST_STRIP_LINEBREAKS, ' ' );
    }

then the text formatting (paragraphs, spaces) comes back, and the removed Cyrillic characters come back too, but as question marks ('?').
That is true for a .doc file with the same text as test.docx (which I attached in my previous comment). For the .docx file only the text formatting changes; the removed Cyrillic characters stay removed.
Thank you.
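To illustrate why the whitelist replacement described above can wipe out Cyrillic text, here is a minimal sketch. `LATIN_ONLY_WHITELIST` is a hypothetical stand-in, not the actual `WHITELIST_PRESERVE_LINEBREAKS` or `WHITELIST_STRIP_LINEBREAKS` regex from textract; it only assumes that the real patterns replace every character outside an allowed set with a space.

```javascript
// Hypothetical Latin-only whitelist (NOT the real textract regex):
// anything that is not a Latin letter, digit, whitespace, or common
// punctuation mark is treated as junk.
const LATIN_ONLY_WHITELIST = /[^A-Za-z0-9 \n.,;:!?()<>\/\-]/g;

function stripNonWhitelisted(text) {
  // Replace every character outside the whitelist with a space.
  return text.replace(LATIN_ONLY_WHITELIST, ' ');
}

const sample = 'Привет, world. 3.5';
const stripped = stripNonWhitelisted(sample);

// Each of the six Cyrillic letters becomes a space; the rest survives.
console.log(JSON.stringify(stripped)); // "      , world. 3.5"
```

This matches the shape of the output reported earlier in the thread: punctuation, digits, and Latin letters survive, while everything else collapses into whitespace.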

Answer 6 · 2015-11-23T15:38:46.000Z

This should be good to go. It was only happening for .docx and .odtx, as I had an extra text-stripping regex that wasn't updated to include all the non-Latin characters.

Try 1.2.0.
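For readers wondering what "including the non-Latin characters" can look like in a stripping regex, the sketch below uses Unicode property escapes. It is an assumption for illustration only, not the change that shipped in 1.2.0; `UNICODE_WHITELIST` and `stripKeepingLetters` are hypothetical names, and the `u` flag with `\p{...}` requires a reasonably recent Node.js.

```javascript
// Hypothetical Unicode-aware whitelist (not the actual textract fix):
// keep letters and digits from any script, whitespace, and common
// punctuation; replace everything else with a space.
const UNICODE_WHITELIST = /[^\p{L}\p{N}\s.,;:!?()<>\/-]/gu;

function stripKeepingLetters(text) {
  return text.replace(UNICODE_WHITELIST, ' ');
}

// \u0007 (BEL) stands in for the kind of control junk a stripper targets.
const cleaned = stripKeepingLetters('Привет, world\u0007 3.5');

// The control character becomes a space; the Cyrillic letters are kept.
console.log(JSON.stringify(cleaned)); // "Привет, world  3.5"
```

The design choice here is to whitelist by Unicode category rather than by explicit character ranges, so scripts other than Latin and Cyrillic are preserved as well.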

Answer 7 · 2016-01-28T22:51:12.000Z

I'm having the exact same issue.

This is my config:

    var buffer = new Buffer(base64, 'base64'),
        type = 'application/msword',
        config = {
          preserveLineBreaks: true
        };
    textract.fromBufferWithMime(type, buffer, config, function(error, text) {
      if (error) {
        console.log(error);
        return(error);
      } else {
        return(null,text);
      }
    });

NOTE: This only happens with .doc files. I'm also on a Mac.

Answer 8 · 2016-01-29T11:01:27.000Z

Can you give me a sample doc? (And maybe open a new issue with it to track?)

Answer 9 · 2016-01-29T14:00:11.000Z

Sure.

Here is #71

Thanks