jlsutherland/doc2text

Support for non-scanned documents (.doc, .docx, regular pdf)

Opened this issue · 4 comments

Hi @jlsutherland, and thanks for this cool module. OCR is a hard problem, and you provide a pretty efficient and simple solution.

Would you be interested in a PR adding text extraction for non-scanned documents? I think it fits the module name "doc2text" quite well, but maybe you want to stick with just OCR. Let me know.

We already have a fair amount of code for that at my company, which I'm going to refactor, so I can provide a PR. (We mainly use textract (http://textract.readthedocs.io/en/latest/) with a few tricks.)

Hi @rcatajar, thank you for the compliment, and thank you for your contributions!

Yes, I think that would be very useful, and I would be interested in a PR.

I have a few ideas on how we might fold in the code. For instance, it could be useful to see if a document has any (readable) extractable embedded text before doing the transformations.
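As a rough sketch of that check (not code from doc2text or from the proposed PR), something like the following could decide whether the embedded text is good enough to skip OCR. The names `looks_readable` and `extract_or_ocr`, the two callables it takes, and both thresholds are hypothetical and chosen only for illustration.

```python
def looks_readable(text, min_chars=50, min_ratio=0.7):
    """Return True if `text` looks like real embedded text rather than noise.

    Both thresholds are illustrative guesses, not values taken from doc2text.
    """
    stripped = "".join(text.split())  # drop all whitespace
    if len(stripped) < min_chars:
        return False
    # Share of the remaining characters that are letters or digits.
    readable = sum(ch.isalnum() for ch in stripped)
    return readable / len(stripped) >= min_ratio


def extract_or_ocr(path, extract_embedded, run_ocr):
    """Try embedded-text extraction first and fall back to OCR.

    `extract_embedded` and `run_ocr` are hypothetical callables that take a
    path and return a str; the caller supplies them.
    """
    try:
        text = extract_embedded(path)
    except Exception:
        # Treat any extraction failure as "no embedded text".
        text = ""
    if looks_readable(text):
        return text, "embedded"
    return run_ocr(path), "ocr"
```

Here `extract_embedded` could, for example, wrap `textract.process(path)` and decode the bytes it returns, while `run_ocr` would wrap the existing OCR pipeline. The sketch leaves both to the caller so that it makes no assumptions about either library.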

Do you think you could put something together?

I have a busy week, but I'll take a look and submit a PR by the end of the week.

Hey @rcatajar, wanted to check in. How's it coming?