dbashford/textract

Reading files from S3

rodrigogalindez opened this issue · 7 comments

Hi David!

Trying to create an endpoint in an Express server like this:

app.get('/textract', function(req, res, next) {
  textract("https://s3.amazonaws.com/testbucket1a2b3c/test.pdf", function(error, text) {
    console.log(error);
    res.end();
  });
});

Console returns [Error: File at path [[ https://s3.amazonaws.com/testbucket1a2b3c/test.pdf ]] does not exist.]

What does this mean exactly? Does textract only work with local files? (In this case my file is uploaded to S3.) Thanks!

Yep, only local files.

OK, thanks. Any plan to make it work with remote files?

That seems a bit like scope creep on a singularly focused module. But I can consider adding such a thing. It probably makes more sense as something wrapped around textract instead of embedded within. The files would still need to be written locally.
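For anyone who needs this in the meantime, here is a minimal sketch of the "wrapped around textract" approach described above: download the remote file to a local temp file, run a local-file extractor on it, then clean up. The names `tempPathFor` and `extractFromUrl` are hypothetical helpers and not part of textract; `extract` is assumed to be any function with the `(path, callback)` shape used in the original post. Keeping the file extension on the temp file is an assumption about how the extractor chooses a parser.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const http = require('http');
const https = require('https');
const crypto = require('crypto');

// Hypothetical helper: build a unique temp path that keeps the remote
// file's extension (assumption: the extractor uses it to pick a parser).
function tempPathFor(fileUrl) {
  const ext = path.extname(new URL(fileUrl).pathname);
  const name = 'remote-' + crypto.randomBytes(8).toString('hex') + ext;
  return path.join(os.tmpdir(), name);
}

// Hypothetical wrapper: download fileUrl to a temp file, call
// extract(localPath, cb) on it, then delete the temp file.
function extractFromUrl(fileUrl, extract, callback) {
  const localPath = tempPathFor(fileUrl);
  const client = fileUrl.startsWith('https:') ? https : http;

  // Remove the temp file (ignoring cleanup errors) before reporting back.
  const finish = (error, text) => {
    fs.unlink(localPath, () => callback(error, text));
  };

  client.get(fileUrl, (response) => {
    if (response.statusCode !== 200) {
      response.resume(); // drain the response so the socket is released
      return callback(new Error('Download failed with status ' + response.statusCode));
    }
    const out = fs.createWriteStream(localPath);
    response.pipe(out);
    out.on('finish', () => extract(localPath, finish));
    out.on('error', finish);
  }).on('error', callback);
}
```

With the call style from the original post, usage would look something like `extractFromUrl("https://s3.amazonaws.com/testbucket1a2b3c/test.pdf", textract, function(error, text) { ... })`. This only covers objects that can be fetched with a plain GET; a private S3 object would need a signed URL or the AWS SDK instead.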

Alright, thanks. Looking forward to your implementation. Apache Tika is very complicated to install, and yours is the only good text extractor that's written in node as far as I know.

Doing some refactoring and think I'll include this. Will be a few days.

Awesome. I've implemented textract in an order form for translation agencies (clients upload documents and the app returns the word count and pricing) and it works very well. All the files are stored on an AWS instance for now, and textract runs on the same instance. I will refactor the app to work with S3 when it's ready. If it helps, here's how I plan to use textract:

textract was born out of the contracting work I did that involved uploading resumes, extracting the text from them, loading solr with the resume text for searching, and tossing the resume itself into S3. So that all sounds familiar. =)

Working through a set of enhancements over the next few days. This'll be one of them.