dbashford/textract

Characters ignored by textract

thiagorova opened this issue · 13 comments

Hi David,
I was working on a Word file (docx), and I noticed that some characters get cut off by textract (to be honest, I've only noticed one so far).
It's this symbol: ―
Thanks!

When you say cut off, you mean not extracted? So it isn't in the output?

exactly... this line was in the file:
"―Devemos buscar"
the output from textract was:
" Devemos buscar"

By the way: I tested it both in node and using the terminal (I'm using Linux 14.04).

I am having issues along similar lines: @ is extracted as à, and {} and [] are omitted.

If I may ask a question:
I am using textract in node.js, and the extracted text disregards newlines. I am not sure how to turn them on in the options.

For the newline there's actually an option in textract:
textract.fromFileWithPath(path, {preserveLineBreaks: true}, function (error, text) { /* use text here */ });
The object passed as the second argument accepts many options, preserveLineBreaks being the one that keeps your new lines.
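A minimal sketch of how that call could be wrapped, in case it helps. The helper name extractWithLineBreaks and the file path in the usage comment are made up for illustration; only fromFileWithPath and preserveLineBreaks come from textract itself. The textract module is passed in as an argument so the helper has no hard dependency on it.

```javascript
// Hypothetical helper (not part of textract): wraps the callback-style
// fromFileWithPath call in a Promise and always asks for line breaks.
function extractWithLineBreaks(textract, filePath) {
  return new Promise(function (resolve, reject) {
    textract.fromFileWithPath(
      filePath,
      { preserveLineBreaks: true }, // keep new lines instead of collapsing them
      function (error, text) {
        if (error) {
          reject(error);
          return;
        }
        resolve(text);
      }
    );
  });
}

// Usage, assuming textract is installed (npm install textract)
// and './report.docx' is a placeholder path:
// const textract = require('textract');
// extractWithLineBreaks(textract, './report.docx').then(console.log, console.error);
```

Without the option, the default behaviour (new lines removed) still applies.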
any news on the characters being badly extracted?

What he said. =)

textract was built to just be a raw text extractor, mostly for things like indexing text for search, not necessarily for readability. So new lines weren't preserved. Preservation of new lines was added later as an option, but removal of new lines is still the default.

@thiagorova On this ticket: I usually update textract every couple of months, knocking a bunch of things out at once, and I'm coming up on that soon. Probably by next weekend.

@shishirrawat could you open a separate ticket to capture any problems with specific character extraction? Thanks!

sure.
thanks for the heads up!

hi!
I'm sorry to break it to you (and please know that I'm very thankful for the last update), but I noticed that, after updating textract, that character was still disappearing. I looked into which character it was and found that it is the horizontal bar (U+2015).
Also, regarding things that are being ignored by textract (I can create a new topic, if you want):

  • Incorporate a communications person
    became
    -Incorporate a communications person

I haven't fixed anything yet =)

oh sorry haha...
i saw 2.1.2 (i had been using 2.1.1) and i thought it was the update. my bad

Yet another symbol omitted by textract is № (\u2116).

№ was handled in another commit; the commit above handles ―. It's hard to tell if there were other issues with other characters. Please open issues if there were!
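For anyone trying to work out which characters went missing before opening an issue, here is a rough sketch that compares a known input string with the extracted text and lists the characters that vanished, along with their code points. findDroppedChars is a hypothetical helper, not something textract provides, and it only catches characters that are absent from the output entirely. Whitespace is skipped, since textract normalizes it on purpose.

```javascript
// Hypothetical helper: report characters present in `original`
// that do not appear anywhere in `extracted`.
function findDroppedChars(original, extracted) {
  const kept = new Set(Array.from(extracted));
  const dropped = new Map(); // char -> "U+XXXX", de-duplicated, in order of appearance
  for (const ch of Array.from(original)) {
    if (/\s/.test(ch) || kept.has(ch)) continue; // ignore whitespace and surviving chars
    const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
    dropped.set(ch, 'U+' + hex);
  }
  return Array.from(dropped, function (entry) {
    return { char: entry[0], codePoint: entry[1] };
  });
}

// Sample strings modelled on the lines quoted in this thread.
const report = findDroppedChars('\u2015Devemos buscar \u2116 5', ' Devemos buscar  5');
console.log(report);
```

A substituted character (such as • coming out as -) also shows up in the report, because the original character is missing from the output.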