Overview
This code, written to be executed as an AWS Lambda function, uses the Slate module to extract the text from a PDF file, and then indexes that text to an ElasticSearch cluster. It is designed to be invoked when a PDF document is put to an S3 bucket.
A few implementation notes:
- Because this is just a simple PoC, the only text data index to Elasticsearch is on the first page
- Play around with the Lambda timeout time to set something that works for document sizes you're placing in the S3 bucket
- For smaller PDF docs, I've observed memory utilization (in CWL) of low 10s of Mbytes
- This assumes some familiarity with AWS Lambda basics (configuring events sources, invocation policies, etc)
- Specify a suffix of 'pdf' to make sure it's only executing for pdf files
To be implemented:
- Signing of POSTs to Elasticsearch endpoints using SigV4, instead of using python modules