Index your pile of papers with Amazon Textract, Amazon Comprehend and Amazon Elasticsearch Service

Overview

Background

You probably have piles of printed documents in your company, carefully stored in cabinets, waiting for an audit, for destruction, or just in case they are needed at some point. What if they were actually a gold mine, containing precious information about the history of the company that you could use to predict and optimize your future? Or maybe you simply want to archive them securely in the cloud for compliance.

Or maybe you receive thousands of documents from your customers each day and want to digitize them and automate their processing.

Whatever the use case, the pipeline is almost always the same:

Scan the documents.
Digitize them with an OCR (Optical Character Recognition) tool.
Process each document: enrich it, combine it with other data, etc.
Index it for later search.

To achieve this process today, many companies extract data from documents manually, which is slow and error-prone, or use OCR software that requires manual customization or hard-coded configuration to match each specific document format.

Amazon Textract overcomes these challenges by using machine learning to "read" virtually any type of document and extract text and data without custom code. Once the information is captured, you can use other services such as Amazon Comprehend to gain insights from it (key phrases, people, dates, ...), Amazon Translate to translate it, and Elasticsearch to index it for later search.

New (Nov 16, 2020): You can now use Textract with handwritten documents. It also supports 5 new languages (French, Spanish, Portuguese, Italian and German) in addition to English.

Objective of the workshop

The workshop will demonstrate the usage of the following AWS services to achieve the process mentioned above:

Amazon Textract to extract text and data from scanned documents.
Amazon Comprehend to extract the essence of documents, such as entities.
Amazon Elasticsearch Service to index documents for further search.

The following services will also be used:

AWS Lambda will trigger and coordinate the full pipeline.
Amazon S3 will store the scanned documents.
Amazon Cognito will be used to secure access to Kibana (provided with Amazon Elasticsearch Service to visualize data).
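As a rough sketch of how Lambda can sit at the front of this pipeline, the handler below pulls the bucket name and object key out of an incoming event. It assumes the function is invoked by an S3 object-created notification, which is an assumption made here (the labs define the actual trigger); `documents_from_event` and `lambda_handler` are illustrative names, not code from the workshop.

```python
from urllib.parse import unquote_plus


def documents_from_event(event):
    """Return (bucket, key) pairs found in an S3 notification event."""
    documents = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the event (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        documents.append((bucket, key))
    return documents


def lambda_handler(event, context):
    # Assumption: the Textract, Comprehend and indexing steps would be
    # called here for each document; this sketch only lists the documents.
    documents = documents_from_event(event)
    return {"documents": [f"s3://{bucket}/{key}" for bucket, key in documents]}
```

Keeping the event parsing in its own function makes it easy to try locally with a hand-written event before deploying anything.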

Note: We'll use the Python programming language. If you are not familiar with this language, don't worry: the workshop is guided and the code is provided. Just be aware that indentation determines the structure of the code and statements. If you encounter the following error, please review the spaces and tabs in your code: “unindent does not match any outer indentation level”.
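To see what triggers that message, the short sketch below compiles a deliberately mis-indented snippet and catches the resulting error. The snippet and its `handler` function are made up for illustration.

```python
# The last line of the snippet is indented by 6 spaces, which matches
# neither the 8-space body of the `if` nor the 4-space function body.
source = (
    "def handler(event):\n"
    "    if event:\n"
    "        return 'ok'\n"
    "      return 'empty'\n"
)

try:
    compile(source, "<example>", "exec")
    message = None
except IndentationError as err:
    # Python reports the problem before running any of the code.
    message = err.msg

print(message)
```

Re-indenting the last line to 4 spaces (the function body level) makes the snippet compile.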

Business case

In this workshop, we'll analyze documents from the Apollo Program, more precisely flight journals from Apollo 11 and Apollo 13, two well-known missions to the Moon. If you are interested in space history, you can find those documents here. For the workshop, you can find a subset in documents.

Image Credit: NASA, source

LAB 0 - Setup the environment

To start the workshop, we need to set up an environment with CloudFormation.

Proceed to Lab 0 - Setup the environment to complete setup.

LAB 1 - Extract text from documents with Amazon Textract

The first step uses AWS Lambda to call the Amazon Textract API and extract text from documents.
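A minimal sketch of the synchronous call, assuming the scanned image is already in S3 and that the client is a boto3 Textract client. The function name `extract_lines` and the object key in the usage comment are illustrative; the lab code may be organized differently.

```python
def extract_lines(textract_client, bucket, key):
    """Return the text of every LINE block Textract detects in one image."""
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    # The response mixes PAGE, LINE and WORD blocks; a LINE block
    # carries the text of one full line.
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]


# Typical use inside Lambda (needs boto3 and AWS credentials):
#   import boto3
#   lines = extract_lines(boto3.client("textract"),
#                         "workshop-textract-xyz", "some-scan.png")
```

Passing the client in as a parameter keeps the function easy to exercise with a stub, without calling AWS.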

Lab 1a - Synchronous | Lab 1b - Asynchronous
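In the asynchronous variant, Textract runs a job and the results are fetched afterwards, possibly over several pages of output. The sketch below assumes a job was already started with `start_document_text_detection` and has finished; `collect_job_lines` is an illustrative name, not code from the labs.

```python
def collect_job_lines(textract_client, job_id):
    """Gather LINE texts from all result pages of a finished Textract job."""
    lines = []
    token = None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = textract_client.get_document_text_detection(**kwargs)
        if page["JobStatus"] != "SUCCEEDED":
            raise RuntimeError(f"Textract job {job_id} is {page['JobStatus']}")
        lines.extend(
            block["Text"]
            for block in page.get("Blocks", [])
            if block["BlockType"] == "LINE"
        )
        # A missing NextToken means this was the last page of results.
        token = page.get("NextToken")
        if not token:
            return lines
```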

LAB 2 - Extract entities with Amazon Comprehend

In this step, we'll use Amazon Comprehend to extract entities from the text.
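As a sketch of what this step can look like, the function below calls Comprehend's `detect_entities` and groups the results by entity type. The `min_score` threshold and the name `entities_by_type` are assumptions made for illustration, and the client is expected to be a boto3 Comprehend client.

```python
from collections import defaultdict


def entities_by_type(comprehend_client, text, language="en", min_score=0.8):
    """Group the entities Comprehend detects in `text` by their type."""
    response = comprehend_client.detect_entities(Text=text, LanguageCode=language)
    grouped = defaultdict(set)
    for entity in response.get("Entities", []):
        # Skip low-confidence detections (the threshold is an arbitrary choice).
        if entity["Score"] >= min_score:
            grouped[entity["Type"]].add(entity["Text"])
    # Sets remove duplicates; sorting makes the output stable.
    return {kind: sorted(values) for kind, values in grouped.items()}
```

The synchronous Comprehend API limits the size of the input text, so long documents may need to be split into chunks before calling it.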

Lab 2a - Synchronous | Lab 2b - Asynchronous

LAB 3 - Index content in Amazon Elasticsearch Service

In this third step, we will store the content of the document, plus the detected entities, in Amazon Elasticsearch Service so that we can search it later.
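One possible shape for this step is sketched below: build a single document that holds the extracted text and the entities, then send it to the index. The field names (`name`, `content`, one lower-cased field per entity type) and both function names are hypothetical, and the client is assumed to be an elasticsearch-py 7.x style client that accepts the payload as `body`.

```python
def build_index_document(key, lines, entities):
    """Combine extracted lines and entities into one searchable document."""
    document = {"name": key, "content": "\n".join(lines)}
    # One field per entity type, lower-cased (for example "person", "date").
    for kind, values in entities.items():
        document[kind.lower()] = values
    return document


def index_document(es_client, index_name, key, lines, entities):
    document = build_index_document(key, lines, entities)
    # Using the S3 key as the document id means re-processing the same
    # file overwrites the earlier entry instead of duplicating it.
    return es_client.index(index=index_name, id=key, body=document)
```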

Lab 3a - Synchronous | Lab 3b - Asynchronous

Cleanup

The first thing to do is to empty the S3 bucket created in Lab 0. Go to S3, open your workshop-textract-xyz bucket, and delete all files.
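If you prefer to empty the bucket from a script rather than the console, a minimal sketch with a boto3 S3 client could look like this. `empty_bucket` is an illustrative name, and the sketch does not handle buckets with versioning enabled, where old object versions would also need to be deleted.

```python
def empty_bucket(s3_client, bucket):
    """Delete every object in `bucket`; return how many were removed."""
    paginator = s3_client.get_paginator("list_objects_v2")
    deleted = 0
    for page in paginator.paginate(Bucket=bucket):
        # An empty bucket (or empty page) has no "Contents" entry.
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted


# Typical use (needs boto3 and AWS credentials):
#   import boto3
#   empty_bucket(boto3.client("s3"), "workshop-textract-xyz")
```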

In the role where you manually added the ComprehendReadOnly policy, remove that policy manually (and do the same for TranslateReadOnly or TextractFullAccess in the asynchronous version).

If you used the CloudFormation templates during the labs, go to the CloudFormation stacks console and delete the stacks you created.
Resources you created manually must also be removed manually:
- remove the SNS Topic
- remove the second Lambda function (documentAnalysis)

LameLemon/workshop-textract-comprehend-es