Problems processing ODT files
tansaku opened this issue · 5 comments
We created an ODT file with the following contents:
The quick brown fox jumped over the lazy dog
Mary Mary quite contrary
Note below how the last line is repeated, and how we lose the double return (i.e. the blank line between the two sentences):
$ node
> const textract = require('textract')
undefined
> textract.fromFileWithPath('./test/english.odt', { preserveLineBreaks: true }, function (error, text) { console.log(text)})
undefined
> The quick brown fox jumped over the lazy dog
Mary Mary quite contrary
Mary Mary quite contrary
The problem is even worse for longer files. Has anyone else experienced this, or does anyone have an idea how to fix it?
Duplicate line is definitely a problem.
Extra line break would have been removed on purpose. First I've heard of someone wanting to preserve the whitespace.
I can look into both things. No timetable, though, wicked busy.
More complicated documents seem to have a more complicated pattern:
I just created an odt with:
One
two
three
four
five
six
and after importing get:
Onetwothreefourfivesixtwothreefourfivesixthreefourfivesixfourfivesixfivesixsix
So, the bug is definitely recursive!
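The output above is exactly the concatenation of every suffix of the paragraph list ("One".."six", then "two".."six", and so on). A minimal sketch of one way such a pattern can arise, assuming each paragraph somehow ends up nested inside the previous one and the extractor then collects the full text of every paragraph node. This is a hypothetical model for illustration only, not textract's actual code; `nest`, `textOf`, `allNodes` and `buggyExtract` are made-up names.

```javascript
// Hypothetical model of the repetition pattern; NOT textract's real implementation.

// Build a chain where each paragraph is (wrongly) a child of the previous one.
function nest(paragraphs) {
  let root = null;
  for (let i = paragraphs.length - 1; i >= 0; i--) {
    root = { text: paragraphs[i], children: root ? [root] : [] };
  }
  return root;
}

// Full text of a node: its own text plus the text of all descendants.
function textOf(node) {
  return node.text + node.children.map(textOf).join('');
}

// Every node in the tree, in document order.
function allNodes(node) {
  return [node, ...node.children.flatMap(allNodes)];
}

// Collecting the full text of *every* paragraph node repeats each nested
// paragraph once per ancestor, which yields the suffix-concatenation pattern.
function buggyExtract(paragraphs) {
  const root = nest(paragraphs);
  if (!root) return '';
  return allNodes(root).map(textOf).join('');
}

console.log(buggyExtract(['One', 'two', 'three', 'four', 'five', 'six']));
```

Under this assumed nesting, the sketch reproduces the string reported above for the six-line document. Whether the real cause is nesting of this kind is not established in this thread.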
Do you have an example file I can use to test with?
Have been able to duplicate this, so I'm good. I'm less concerned about the double return than I am about the repeating.
This is fixed. Oddly enough, I was already seeing this bug in my tests; I had just stupidly modified the tests (for ott files) to match the erroneous output, as it didn't look hugely bad.
I have a few other small things I want to do before releasing, but this should be out in a day or so.
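For anyone wanting to guard against this kind of repetition in their own tests, here is a rough sketch of a check that flags expected lines that do not appear exactly once in the extracted text. The helper names `countOccurrences` and `findRepeatedLines` are hypothetical and not part of textract; the check also assumes that no expected line is a substring of another.

```javascript
// Hypothetical test helpers; not part of textract.

// Count non-overlapping occurrences of `needle` in `haystack`.
function countOccurrences(haystack, needle) {
  if (needle === '') return 0;
  return haystack.split(needle).length - 1;
}

// Return the expected lines that are missing or repeated in the extracted text.
function findRepeatedLines(extracted, expectedLines) {
  return expectedLines.filter(
    (line) => countOccurrences(extracted, line) !== 1
  );
}

// Example using the output reported at the top of this issue.
const reported = [
  'The quick brown fox jumped over the lazy dog',
  'Mary Mary quite contrary',
  'Mary Mary quite contrary',
].join('\n');

console.log(
  findRepeatedLines(reported, [
    'The quick brown fox jumped over the lazy dog',
    'Mary Mary quite contrary',
  ])
);
```

In a real test, `extracted` would be the `text` passed to the `textract.fromFileWithPath` callback, and the test would fail if the returned array is non-empty.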