Support for non-scanned documents (.doc, .docx, regular PDF)

Question

Opened this issue 8 years ago · 4 comments

Hi @jlsutherland, and thanks for this cool module. OCR is a hard problem, and you provide a pretty efficient and simple solution.

Would you be interested in a PR that adds text extraction for non-scanned documents? I think it fits the module name "doc2text" quite well, but maybe you want to stick with just OCR. Let me know.

Answer 1 · 2016-09-05T15:32:20.000Z

We already have a bunch of code for that at my company that I'm going to refactor, so I can provide a PR. (We mainly use textract (http://textract.readthedocs.io/en/latest/) with a few tricks.)
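To give a rough idea of what extraction from a non-scanned file involves, here is a minimal sketch for the .docx case using only the Python standard library. It is not textract's API and not the code mentioned above; `docx_to_text` is a hypothetical helper name. It relies on the fact that a .docx file is a zip archive whose main text lives in `word/document.xml`, and it ignores headers, footers, and footnotes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the paragraph text of a .docx file (path or binary file object).

    Hypothetical helper for illustration; not part of doc2text or textract.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling `docx_to_text("report.docx")` (file name assumed) would return one line per paragraph. Old binary .doc files and PDFs need different handling, which is where a library like textract helps.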

Answer 2 · 2016-09-05T18:01:12.000Z

Hi @rcatajar, thank you for the compliment, and thank you for your contributions!

Yes, I think that would be very useful, and I would be interested in a PR.

I have a few ideas on how we might fold in the code. For instance, it could be useful to check whether a document has any readable, extractable embedded text before doing the OCR transformations.
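A minimal sketch of that check, under some assumptions: `looks_readable` and `extract_text` are hypothetical names, the two extractors are placeholder callables (for example a textract wrapper and the existing OCR pipeline), and the thresholds are illustrative rather than tuned.

```python
def looks_readable(text, min_chars=50, min_word_ratio=0.6):
    """Heuristic: enough characters, mostly letters, digits, or whitespace.

    The thresholds are illustrative assumptions, not tuned values.
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    wordlike = sum(1 for ch in stripped if ch.isalnum() or ch.isspace())
    return wordlike / len(stripped) >= min_word_ratio


def extract_text(path, embedded_extractor, ocr_extractor):
    """Try embedded text first; fall back to OCR if it is missing or unreadable.

    Both extractors are hypothetical callables that take a path and return str.
    """
    try:
        text = embedded_extractor(path)
    except Exception:
        # Treat any extraction failure as "no usable embedded text".
        text = ""
    if looks_readable(text):
        return text, "embedded"
    return ocr_extractor(path), "ocr"
```

Returning the source label alongside the text lets the caller see which path was taken, which could help when deciding whether the heuristic is too strict or too loose.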

Do you think you could put something together?

Answer 3 · 2016-09-07T09:08:54.000Z

I have a busy week, but I'll take a look and submit a PR by the end of the week.

Answer 4 · 2016-09-09T12:56:47.000Z

Hey @rcatajar, wanted to check in. How's it coming?