/tesseract-aws-lambda

AWS Lambda function to run tesseract OCR

Tesseract OCR on AWS Lambda

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

The idea is to use a Docker container to simulate the AWS Lambda environment, which makes it possible to build binaries against the AWS Lambda Linux environment. In this example I have built Leptonica and the Tesseract Open Source OCR Engine.

The whole idea is leveraged from here

Prerequisites

To get started you need Docker. This is a very basic Lambda example and was tested on the AWS Lambda Python 3.8 environment. AWS deployment is automated using the Serverless Framework.

Install the serverless framework

# Install serverless globally
npm install serverless -g

Generate AWS access keys

Follow the AWS tutorial to create access keys for your user.

Set up AWS access keys with the Serverless Framework

Follow the Serverless tutorial

Building tesseract in Docker

docker build -t tesseract .
mkdir build
docker run -v $PWD/build:/tmp/build tesseract sh /tmp/build_tesseract.sh

Create a Lambda layer

mkdir layer
unzip build/tesseract.zip -d layer
mkdir -p layer/python/lib/python3.8/site-packages/
pip install pytesseract -t layer/python/lib/python3.8/site-packages/

Verify that the layer folder has been created and contains the following entries

ls layer
tesseract   # compiled tesseract binary
tessdata    # tesseract language data (eng)
lib         # compiled library dependencies
python      # Python dependencies
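
The check above can also be scripted before packaging. The following is a minimal sketch, not part of the project: the function name `missing_layer_entries` is hypothetical, and the expected entries are simply the four names listed above.

```python
from pathlib import Path

# Entries the listing above expects to find inside the layer folder.
EXPECTED_ENTRIES = ("tesseract", "tessdata", "lib", "python")


def missing_layer_entries(layer_dir="layer"):
    """Return the expected entries that are absent from layer_dir (hypothetical helper)."""
    root = Path(layer_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

Calling `missing_layer_entries("layer")` from the project root returns an empty list when all four entries are present, and otherwise the names that still need to be built or installed.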

Package the Lambda layer

serverless package

Deploy Tesseract on AWS Lambda

serverless deploy
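
Once deployed, Lambda extracts layer contents under /opt, so the layer's tesseract binary and tessdata folder end up at /opt/tesseract and /opt/tessdata. The sketch below shows one way a handler might point pytesseract at them. It is an illustration under assumptions, not the project's actual handler: the helper names are hypothetical, and it assumes Tesseract 4 or later, where TESSDATA_PREFIX names the tessdata directory itself.

```python
import os
import posixpath


def layer_environment(layer_root="/opt"):
    """Paths to the layer's Tesseract files at runtime (hypothetical helper)."""
    return {
        "tesseract_cmd": posixpath.join(layer_root, "tesseract"),
        "TESSDATA_PREFIX": posixpath.join(layer_root, "tessdata"),
    }


def configure_tesseract(layer_root="/opt"):
    """Point pytesseract at the layer's binary and language data."""
    paths = layer_environment(layer_root)
    # Assumption: Tesseract 4+, which reads language data from TESSDATA_PREFIX.
    os.environ["TESSDATA_PREFIX"] = paths["TESSDATA_PREFIX"]
    # Imported lazily so this module loads even where pytesseract is absent.
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = paths["tesseract_cmd"]
    return pytesseract
```

The compiled libraries in lib need no extra configuration in this sketch, because the Lambda runtime already includes /opt/lib in LD_LIBRARY_PATH.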

Test OCR Lambda function

The Lambda function accepts a JSON POST request. The URL is the one printed by the serverless deploy command.

{
  "image64": "base64 encoded image"
}
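
As a rough illustration of the request format above, the following sketch builds the JSON body from an image file and posts it using only the Python standard library. The function names are hypothetical, and the response is returned as raw text because the response format is not described here.

```python
import base64
import json
import urllib.request


def build_request_body(image_bytes):
    """Wrap raw image bytes in the JSON body shown above."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"image64": encoded})


def decode_request_body(body):
    """Inverse of build_request_body: recover the raw image bytes."""
    return base64.b64decode(json.loads(body)["image64"])


def post_image(url, image_path):
    """POST an image file to the endpoint and return the raw response text."""
    with open(image_path, "rb") as f:
        body = build_request_body(f.read()).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

To try it, pass the URL printed by serverless deploy and the path of a local image to `post_image`. `decode_request_body` mirrors what a handler would need to do to get the image bytes back before running OCR.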

Built With