Reading files from S3

Question

rodrigogalindez opened this issue 9 years ago · 7 comments

Hi David!

Trying to create an endpoint in an Express server like this:

    app.get('/textract', function(req, res, next) {
      textract("https://s3.amazonaws.com/testbucket1a2b3c/test.pdf", function(error, text) {
        console.log(error);
        res.end();
      });
    });

Console returns [Error: File at path [[ https://s3.amazonaws.com/testbucket1a2b3c/test.pdf ]] does not exist.]

What does this mean exactly? Does textract only work with local files? (In this case my file is uploaded to S3.) Thanks!

Answer 1 · 2015-07-14T22:21:28.000Z

Yep, only local files.

Answer 2 · 2015-07-14T22:24:11.000Z

OK, thanks. Any plan to make it work with remote files?

Answer 3 · 2015-07-14T22:31:50.000Z

That seems a bit like scope creep on a singularly focused module. But I can consider adding such a thing. It probably makes more sense as something wrapped around textract instead of embedded within. The files would still need to be written locally.
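To illustrate the "wrapped around textract" idea, here is a minimal sketch of such a wrapper, not anything from textract itself. It downloads the remote file to a temporary local path and hands that path to a local-file extractor. The names `tempPathFor` and `extractFromUrl` are hypothetical, and the `extract` parameter is assumed to have the `(path, callback)` shape used in the question. The sketch does not follow redirects, and a private S3 object would need a signed URL or the AWS SDK instead of a plain GET.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const http = require('http');
const https = require('https');
const crypto = require('crypto');

// Build a unique temp path that keeps the remote file's extension,
// so an extractor that looks at the extension still sees ".pdf", ".docx", etc.
// (hypothetical helper, not part of textract)
function tempPathFor(fileUrl) {
  const ext = path.extname(new URL(fileUrl).pathname);
  const name = 'download-' + crypto.randomBytes(8).toString('hex') + ext;
  return path.join(os.tmpdir(), name);
}

// Download fileUrl to a temp file, run extract(localPath, cb) on it,
// then delete the temp file and report the result.
// `extract` is any local-file extractor with a (path, callback) signature.
function extractFromUrl(fileUrl, extract, callback) {
  const localPath = tempPathFor(fileUrl);
  const client = fileUrl.startsWith('https:') ? https : http;
  let finished = false;

  // Clean up the temp file (ignoring unlink errors) and call back once.
  const finish = (err, text) => {
    if (finished) return;
    finished = true;
    fs.unlink(localPath, () => callback(err, text));
  };

  client.get(fileUrl, (res) => {
    if (res.statusCode !== 200) {
      res.resume(); // discard the body so the socket is released
      return finish(new Error('Download failed with status ' + res.statusCode));
    }
    const out = fs.createWriteStream(localPath);
    res.on('error', finish);
    out.on('error', finish);
    out.on('finish', () => extract(localPath, finish));
    res.pipe(out);
  }).on('error', finish);
}
```

With the call style from the question, usage would look like `extractFromUrl(url, textract, function(error, text) { ... })`. Passing the extractor in as an argument keeps the download logic separate from the module, which matches the point that the file still has to be written locally first.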

Answer 4 · 2015-07-14T22:35:05.000Z

Alright, thanks. Looking forward to your implementation. Apache Tika is very complicated to install, and yours is the only good text extractor that's written in node as far as I know.

Answer 5 · 2015-07-23T11:41:32.000Z

Doing some refactoring and think I'll include this. Will be a few days.

Answer 6 · 2015-07-23T18:31:29.000Z

Awesome. I've implemented textract in an order form for translation agencies (clients upload documents and the app returns the number of words & pricing) and it works very well. All the files are stored in an AWS instance for now and textract is in the same instance as well. I will refactor the app to work with S3 when it's ready. If it helps, here's how I plan to use textract:

1. Client uploads a file to S3 (I use ng-file-upload: https://github.com/danialfarid/ng-file-upload)
2. Endpoint runs textract with path = S3 path and returns estimates
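
The "returns estimates" step could be sketched roughly as follows, assuming the extracted text arrives as a plain string. The function names and the per-word rate are hypothetical and only illustrate the word count and pricing idea; they are not taken from the actual app.

```javascript
// Count whitespace-separated words in the extracted text.
function countWords(text) {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Turn a word count into a price estimate, rounded to two decimals.
// ratePerWord is a made-up parameter for illustration.
function estimate(text, ratePerWord) {
  const words = countWords(text);
  return { words: words, price: Number((words * ratePerWord).toFixed(2)) };
}
```

An endpoint would call `estimate(text, rate)` inside the textract callback and send the result back to the client.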

Answer 7 · 2015-07-23T18:33:26.000Z

textract was born out of the contracting work I did that involved uploading resumes, extracting the text from them, loading solr with the resume text for searching, and tossing the resume itself into S3. So that all sounds familiar. =)

Working through a set of enhancements over the next few days. This'll be one of them.