Amazon Textract Workbench

Quote from the official website:

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Today, many companies manually extract data from scanned documents such as PDFs, images, tables, and forms, or through simple OCR software that requires manual configuration (which often must be updated when the form changes). To overcome these manual and expensive processes, Textract uses ML to read and process any type of document, accurately extracting text, handwriting, tables, and other data with no manual effort. You can quickly automate document processing and act on the information extracted, whether you’re automating loans processing or extracting information from invoices and receipts. Textract can extract the data in minutes instead of hours or days. Additionally, you can add human reviews with Amazon Augmented AI to provide oversight of your models and check sensitive data.

This repo gives you a place to experiment with the tooling and walks you through a step-by-step tutorial on using the geometric context detected in an image to make tagging key-value pairs easier and more accurate with Amazon Textract. The application runs on SageMaker Studio Lab and combines multiple libraries such as Hugging Face, spaCy, and Textractor. It also uses the recently launched "Queries" functionality in Textract.

end-to-end-demo.mov

Getting started

Requirements

A SageMaker Studio Lab account. See this explainer video to learn more about it.
Python==3.7
Streamlit
TensorFlow==2.5.0
PyTorch>=1.10
Hugging Face Transformers
Other libraries (see environment.yml)

Step by step tutorial

Clone repo, install dependencies, and launch your app

Follow the steps shown in launch_app.ipynb. Click on Copy to project in the top right corner. This opens the Studio Lab web interface and asks whether you want to clone the entire repo or just the Notebook. Clone the entire repo and click Yes when asked about building the Conda environment automatically. You will then be running in a Python environment with Streamlit and Gradio already installed along with other libraries.

Your Streamlit app will be running on f'https://{studiolab_domain}.studio.{studiolab_region}.sagemaker.aws/studiolab/default/jupyter/proxy/6006/'
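As a small illustration of the URL pattern above, the sketch below fills in the two placeholders. The helper name is hypothetical and the domain and region values are made-up examples; only the URL template and port 6006 come from the line above.

```python
def studiolab_proxy_url(studiolab_domain: str, studiolab_region: str, port: int = 6006) -> str:
    """Build the Studio Lab Jupyter proxy URL for an app listening on `port`."""
    return (
        f"https://{studiolab_domain}.studio.{studiolab_region}"
        f".sagemaker.aws/studiolab/default/jupyter/proxy/{port}/"
    )


# Made-up example values; replace them with the ones from your own Studio Lab session.
example_url = studiolab_proxy_url("abc123", "us-east-2")
print(example_url)
```

Both values can be read from the address bar of your browser while Studio Lab is open.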

Pre-process image and compare modified vs original

This is example code to implement SauvolaNet, an end-to-end document binarization solution:

import os
import sys
from os.path import exists as path_exists
import cv2
import numpy as np
import pandas as pd
import streamlit as st
from PIL import Image

# Clone SauvolaNet on first run and make its modules importable
path_repo_sauvolanet = 'dependencies/SauvolaNet'
if not path_exists(path_repo_sauvolanet):
    os.system(f'git clone https://github.com/Leedeng/SauvolaNet.git {path_repo_sauvolanet}')
sys.path.append(f'{path_repo_sauvolanet}/SauvolaDocBin/')
pd.set_option('display.float_format','{:.4f}'.format)
from dataUtils import collect_binarization_by_dataset, DataGenerator
from testUtils import prepare_inference, find_best_model
from layerUtils import *
from metrics import *

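@st.experimental_singleton
def sauvolanet_load_model(model_root = f'{path_repo_sauvolanet}/pretrained_models/'):
    # Load the pretrained .h5 checkpoint found in model_root
    model = None
    for this in os.listdir(model_root):
        if this.endswith('.h5'):
            model_filepath = os.path.join(model_root, this)
            model = prepare_inference(model_filepath)
            print(model_filepath)
    if model is None:
        # Fail with a clear message instead of an UnboundLocalError
        raise FileNotFoundError(f'No .h5 model found in {model_root}')
    return model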

def sauvolanet_read_decode_image(model, im):
    rgb = np.array(im)
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    x = gray.astype('float32')[None, ..., None]/255.
    pred = model.predict(x)
    return Image.fromarray(pred[0,...,0] > 0)

...

with st.spinner():
    sauvolanet_model = sauvolanet_load_model()
    modified_image = sauvolanet_read_decode_image(sauvolanet_model,input_image)
    st.success('Done!')
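As its name suggests, SauvolaNet builds on classic Sauvola thresholding. For intuition only, here is a minimal NumPy sketch of the classic rule, where each pixel is compared against a local threshold T = m * (1 + k * (s / R - 1)) computed from the local mean m and standard deviation s. This is not the SauvolaNet model or its code; the function name and the window, k, and R values are illustrative assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sauvola_binarize(gray, window=15, k=0.2, r=128.0):
    """Classic Sauvola thresholding; returns True for background, False for ink.

    `window` must be odd; `k` and `r` are illustrative values, not tuned ones.
    """
    gray = np.asarray(gray, dtype=np.float64)
    pad = window // 2
    # Reflect-pad so every pixel has a full window around it
    padded = np.pad(gray, pad, mode="reflect")
    windows = sliding_window_view(padded, (window, window))
    mean = windows.mean(axis=(-2, -1))
    std = windows.std(axis=(-2, -1))
    threshold = mean * (1.0 + k * (std / r - 1.0))
    return gray > threshold


# Synthetic page: light background (200) with a dark horizontal stroke (20)
page = np.full((32, 32), 200.0)
page[10:13, :] = 20.0
binary = sauvola_binarize(page)
print(binary.shape, binary[11].any(), binary[0].all())
```

The learned model replaces these hand-picked parameters, which is why the tutorial code above calls a pretrained network instead of a fixed formula.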

Finally, we use streamlit-image-comparison to display the original and modified images next to each other.

from streamlit_image_comparison import image_comparison

with st.expander("See modified image"):
    image_comparison(
        img1=input_image, img2=modified_image,
        label1='Original', label2='Modified',
    )

Here's a demo:

streamlit-image-comparison.mov

Make API request to Textract with `Queries`

Amazon recently released a Textract feature called "Queries". You can think of it as visual question answering (VQA): you ask questions about your scanned documents in natural language, and Textract returns the most likely answer based on the image and language context. You can see the official documentation here and a sample Notebook here.

Here's some sample code:

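import boto3

textract = boto3.client('textract')

# Read the document from local disk (document_path is a placeholder for your own file)
with open(document_path, 'rb') as f:
    imageBytes = f.read()

# Call Textract AnalyzeDocument with the QUERIES feature
response = textract.analyze_document(
    Document={'Bytes': imageBytes},
    FeatureTypes=["QUERIES"],
    QueriesConfig={
        "Queries": [{
            "Text": "What is the year to date gross pay?",
            "Alias": "PAYSTUB_YTD_GROSS"
        },
        {
            "Text": "What is the current gross pay?",
            "Alias": "PAYSTUB_CURRENT_GROSS"
        }]
    })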
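To read the answers back, a minimal sketch is shown below. It assumes the response follows the documented block layout, where each QUERY block links to its QUERY_RESULT blocks through an ANSWER relationship. The helper name and the sample response values are made up for illustration, so check them against a real response before relying on this.

```python
def extract_query_answers(response):
    """Map each query alias (or query text, if no alias) to (answer, confidence) pairs."""
    blocks = response.get("Blocks", [])
    blocks_by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        query = block["Query"]
        key = query.get("Alias") or query["Text"]
        answers[key] = []
        # A query without any ANSWER relationship simply keeps an empty list
        for relationship in block.get("Relationships") or []:
            if relationship["Type"] != "ANSWER":
                continue
            for answer_id in relationship["Ids"]:
                result = blocks_by_id.get(answer_id)
                if result and result.get("BlockType") == "QUERY_RESULT":
                    answers[key].append((result["Text"], result["Confidence"]))
    return answers


# Hand-written stand-in for a Textract response (values are invented)
sample_response = {"Blocks": [
    {"BlockType": "QUERY", "Id": "q1",
     "Query": {"Text": "What is the current gross pay?", "Alias": "PAYSTUB_CURRENT_GROSS"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
    {"BlockType": "QUERY_RESULT", "Id": "a1", "Text": "$ 452.43", "Confidence": 87.0},
    {"BlockType": "QUERY", "Id": "q2",
     "Query": {"Text": "What is the year to date gross pay?", "Alias": "PAYSTUB_YTD_GROSS"}},
]}
query_answers = extract_query_answers(sample_response)
print(query_answers)
```

Keying the result by alias keeps downstream code independent of the exact question wording.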

And this is how it looks on the Streamlit application that you will deploy with this repo:

textract-queries.mov

Use response with Amazon Comprehend, Hugging Face, spaCy, etc.

The following example parses the output from Amazon Textract and passes the text to a Hugging Face summarization pipeline.

from transformers import pipeline
from trp import Document

def parse_response(response):
    # Flatten the Textract response into a single space-separated string
    doc = Document(response)
    words = [word.text
             for page in doc.pages
             for line in page.lines
             for word in line.words]
    return ' '.join(words)

@st.experimental_singleton
def load_model_pipeline(task, model_name):
    # Cache the pipeline so the model weights are loaded only once per session
    return pipeline(task, model=model_name)

...

with st.spinner('Downloading model weights and loading...'):
    pipe = load_model_pipeline(task="summarization", model_name=options)
    summary = pipe(
        parse_response(st.session_state.response), 
        max_length=130, min_length=30, do_sample=False)

with st.expander('View response'):
    st.write(summary)
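If you prefer not to depend on the response parser, a rough equivalent of parse_response can be sketched directly on the raw response by joining the text of its LINE blocks. The helper name and the sample response are illustrative assumptions, and the output should be similar to, but is not guaranteed to be identical to, the parser-based version above.

```python
def lines_to_text(response):
    """Join the text of all LINE blocks, in the order they appear in the response."""
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and block.get("Text")
    ]
    return " ".join(lines)


# Hand-written stand-in for a Textract response (values are invented)
sample_response = {"Blocks": [
    {"BlockType": "PAGE", "Id": "p1"},
    {"BlockType": "LINE", "Id": "l1", "Text": "Employee pay stub"},
    {"BlockType": "WORD", "Id": "w1", "Text": "Employee"},
    {"BlockType": "LINE", "Id": "l2", "Text": "Gross pay 452.43"},
]}
plain_text = lines_to_text(sample_response)
print(plain_text)
```

WORD blocks are skipped on purpose, since their text is already contained in the LINE blocks.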

(Additional) Extract Information Using "Geometric Context" and Amazon Textract

Quote from this repository by Martin Schade.

To find information in a document based on geometry with this library the main advantage over defining x,y coordinates where the expected value should be is the concept of an area. An area is ultimately defined by a box with x_min, y_min, x_max, y_max coordinates but can be defined by finding words/phrases in the document and then use to create the area. From there, functions to parse the information in the area help to extract the information. E. g. by defining the area based on the question like 'Did you feel fever or feverish lately?' we can associate the answers to it and create a new key/value pair specific to this question.
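The area idea described in the quote can be sketched in plain Python as follows. This is not the API of the quoted library: every class and function name here is hypothetical, the coordinates are invented, and the sketch only handles the simple case where the answer sits to the right of the question on the same row.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    """Axis-aligned box in normalized page coordinates (0..1)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class Word:
    text: str
    box: Box


def find_phrase(words: List[Word], phrase: str) -> Optional[List[Word]]:
    """Return the first run of consecutive words matching `phrase`, if any."""
    tokens = phrase.lower().split()
    for start in range(len(words) - len(tokens) + 1):
        run = words[start:start + len(tokens)]
        if [w.text.lower() for w in run] == tokens:
            return run
    return None


def area_right_of(anchor: List[Word], page_x_max: float = 1.0) -> Box:
    """Area from the anchor's right edge to the page edge, on the same rows."""
    return Box(
        x_min=max(w.box.x_max for w in anchor),
        y_min=min(w.box.y_min for w in anchor),
        x_max=page_x_max,
        y_max=max(w.box.y_max for w in anchor),
    )


def words_in_area(words: List[Word], area: Box) -> List[Word]:
    """Words whose center point falls inside the area."""
    hits = []
    for w in words:
        cx = (w.box.x_min + w.box.x_max) / 2
        cy = (w.box.y_min + w.box.y_max) / 2
        if area.x_min <= cx <= area.x_max and area.y_min <= cy <= area.y_max:
            hits.append(w)
    return hits


# Invented layout: two question rows, each with an answer on the right
words = [
    Word("Did", Box(0.05, 0.10, 0.09, 0.12)),
    Word("you", Box(0.10, 0.10, 0.14, 0.12)),
    Word("feel", Box(0.15, 0.10, 0.19, 0.12)),
    Word("feverish?", Box(0.20, 0.10, 0.30, 0.12)),
    Word("Yes", Box(0.70, 0.10, 0.74, 0.12)),
    Word("Cough?", Box(0.05, 0.30, 0.12, 0.32)),
    Word("No", Box(0.70, 0.30, 0.73, 0.32)),
]
anchor = find_phrase(words, "feel feverish?")
area = area_right_of(anchor)
answer = [w.text for w in words_in_area(words, area)]
print({"feel feverish?": answer})
```

Anchoring the area on a phrase rather than on fixed x,y coordinates means the lookup keeps working when the question moves on the page.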

You can find a notebook sample with a step by step tutorial under notebook_samples/extract-info-geometric-context.ipynb.

Next steps: How to do large scale document processing with Amazon Textract

The Document Understanding Solution (DUS) delivers an easy-to-use web application that ingests and analyzes files, extracts text from documents, identifies structural data (tables, key value pairs), extracts critical information (entities), and creates smart search indexes from the data. Additionally, files can be uploaded directly to and analyzed files can be accessed from an Amazon Simple Storage Service (Amazon S3) bucket in your AWS account.

This solution uses AWS artificial intelligence (AI) services that address business problems that apply to various industry verticals:

Search and discovery: Search for information across multiple scanned documents, PDFs, and images

Compliance: Redact information from documents

Workflow automation: Easily plugs into your existing upstream and downstream applications

Additionally, serverless endpoints are a great way to create microservices that can be easily used within your document processing pipelines. Quote from this blog post by Philipp Schmid:

Amazon SageMaker Serverless Inference is a new capability in SageMaker that enables you to deploy and scale ML models in a Serverless fashion. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic similar to AWS Lambda. Serverless Inference is ideal for workloads which have idle periods between traffic spurts and can tolerate cold starts. With a pay-per-use model, Serverless Inference is a cost-effective option if you have an infrequent or unpredictable traffic pattern.

Additional resources

References

@inproceedings{9506664,
  author={Li, Deng and Wu, Yue and Zhou, Yicong},
  booktitle={The 16th International Conference on Document Analysis and Recognition (ICDAR)},
  title={SauvolaNet: Learning Adaptive Sauvola Network for Degraded Document Binarization},
  year={2021},
  pages={538--553},
  doi={10.1007/978-3-030-86337-1_36}
}

@article{zhang2022practical,
  title={Practical Blind Denoising via Swin-Conv-UNet and Data Synthesis},
  author={Zhang, Kai and Li, Yawei and Liang, Jingyun and Cao, Jiezhang and Zhang, Yulun and Tang, Hao and Timofte, Radu and Van Gool, Luc},
  journal={arXiv preprint},
  year={2022}
}

Disclaimer

The content provided in this repository is for demonstration purposes and not meant for production. You should use your own discretion when using the content.
The ideas and opinions outlined in these examples are my own and do not represent the opinions of AWS.
