dbashford/textract

Problems processing ODT files

tansaku opened this issue · 5 comments

We created an ODT file with the following contents:

The quick brown fox jumped over the lazy dog

Mary Mary quite contrary

Note below how the last line is repeated, and how we lose the double return (i.e. the extra break between the two sentences).

$ node
> const textract = require('textract')
undefined
> textract.fromFileWithPath('./test/english.odt', { preserveLineBreaks: true }, function (error, text) { console.log(text)})
undefined
> The quick brown fox jumped over the lazy dog
Mary Mary quite contrary
Mary Mary quite contrary

The problem is even worse for longer files. Has anyone else experienced this, or have any idea how to fix it?

Duplicate line is definitely a problem.

Extra line break would have been removed on purpose. First I've heard of someone wanting to preserve the whitespace.

I can look into both things. No timetable, though, wicked busy.

More complicated documents seem to have a more complicated pattern:

I just created an odt with:

One

two

three

four

five

six

and after importing get:

Onetwothreefourfivesixtwothreefourfivesixthreefourfivesixfourfivesixfivesixsix

So, the bug is definitely recursive!
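For what it's worth, that output is exactly what you get by concatenating every suffix of the paragraph list (all six, then the last five, then the last four, and so on). That would be consistent with each paragraph's extracted text also including all the paragraphs after it, but that is only a guess at the cause, not something confirmed in textract's source. A minimal sketch of the pattern, where the helper name `concatSuffixes` is made up for illustration and nothing here calls textract:

```javascript
// Hypothetical helper (not part of textract): joins every suffix of the
// paragraph list, i.e. paragraphs[0..], then paragraphs[1..], and so on.
function concatSuffixes(paragraphs) {
  let out = '';
  for (let i = 0; i < paragraphs.length; i++) {
    out += paragraphs.slice(i).join('');
  }
  return out;
}

// The six paragraphs from the test document above.
const paragraphs = ['One', 'two', 'three', 'four', 'five', 'six'];

// The text reported after importing, copied from the comment above.
const observed =
  'Onetwothreefourfivesixtwothreefourfivesixthreefourfivesixfourfivesixfivesixsix';

console.log(concatSuffixes(paragraphs) === observed); // true
```

With only two paragraphs the same rule gives first, second, second, which resembles the repeated last line in the original report.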

Do you have an example file I can use to test with?

Have been able to duplicate this, so I'm good. I'm less concerned about the double return than I am about the repeating.

This is fixed. Oddly enough, I was already seeing this bug in my tests; I just stupidly modified the tests (for ott files) to match the erroneous output, as it didn't look hugely bad.

I have a few other small things I want to do before releasing, but this should be out in a day or so.