lambda-text-extractor is a Python 3.6 app that works with the AWS Lambda architecture to extract text from common binary document formats.
Some of its key features are:
- out of the box support for many common binary document formats (see section on Supported Formats),
- scalable PDF parsing using OCR in parallel using AWS Lambda and asyncio,
- creation of text searchable PDFs after OCR,
- serverless architecture makes deployment quick and easy,
- detailed instructions for preparing the libraries and dependencies necessary for processing binary documents, and
- sensible Unicode handling.
lambda-text-extractor supports many common and legacy document formats:
- Portable Document Format (.pdf),
  - PDFs with a text layer using Poppler utilities,
  - PDFs with OCR using Tesseract and Ghostscript 9.21 for PDF manipulation,
- Microsoft Word 2, 6, 7, 97, 2000, 2002 and 2003 (.doc) using Antiword with fallback to Catdoc,
- Microsoft Word 2007 OpenXML files (.docx) using python-docx,
- Microsoft PowerPoint 2007 OpenXML files (.pptx) using python-pptx,
- Microsoft Excel 5.0, 97-2003, and 2007 OpenXML files (.xls, .xlsx) using xlrd,
- OpenDocument 1.2 (.odm, .odp, .ods, .odt, .oth, .otm, .otp, .ots, .ott) using odfpy,
- Rich Text Format (.rtf) using UnRTF v0.21.9,
- XML files and HTML web pages (.html, .htm, .xml) using lxml,
- CSV files (.csv) using Python csv module,
- Images (.tiff, .jpg, .jpeg, .png) using Tesseract, and
- Plain text files (.txt)
Due to the size of the code and dependencies (and AWS Lambda's 50MB package limit), the extraction system is split into two Lambda functions: simple and ocr.
ocr supports extracting text from images and "image" PDFs, while simple handles text extraction from the remaining formats.
The side benefit of splitting into two functions is that we can configure the memory requirements of the two functions independently.
We use apex as our development toolchain to deploy the AWS Lambda functions; the code for the two Lambda functions is found in the functions directory.
To deploy to AWS (note that the -D argument refers to dry run mode):
apex -D deploy
You need to ensure your IAM role has lambda:InvokeAsync permissions, and s3:PutObject permissions on the output bucket.
Generally, we would advise using a dedicated bucket with auto-delete lifecycle rules for the temporary storage.
You can set the IAM role and other configuration options in project.json.
The speed of parsing depends on CPU, which is controlled by the amount of memory allocated to your Lambda functions.
For our needs, we find that 512MB for simple and 1024MB for ocr is a good balance between performance and cost.
The simple function expects an event with
- document_uri: A URI containing the document to extract text from, e.g., s3://bucket/key.pdf.
- temp_uri_prefix (optional): A URI prefix where temporary files can be stored. Defaults to <document_uri>-temp if not set.
- text_uri (optional): A URI where the extracted text will be stored, e.g., s3://bucket/key.txt. Defaults to <document_uri>.txt if not set.
- disable_ocr (optional): Whether to disable the OCR feature. Defaults to False.
aws lambda invoke --function-name textractor_simple --payload '{"document_uri": "https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf", "temp_uri_prefix": "s3://bucket/", "text_uri": "s3://bucket/tracemonkey.txt"}' -
aws s3 cp s3://bucket/tracemonkey.txt -
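The same call can be made from Python. The sketch below is illustrative only: build_simple_event and invoke_simple are hypothetical helper names, not part of this project, and the helper merely spells out on the client side the defaults documented above (the function applies them itself when the fields are omitted). It assumes boto3 is installed and AWS credentials are configured.

```python
import json


def build_simple_event(document_uri, temp_uri_prefix=None, text_uri=None,
                       disable_ocr=False):
    """Return an event dict with the documented defaults made explicit.

    Hypothetical helper: it only mirrors the defaults described in the README.
    """
    return {
        "document_uri": document_uri,
        # Documented default: <document_uri>-temp
        "temp_uri_prefix": temp_uri_prefix or document_uri + "-temp",
        # Documented default: <document_uri>.txt
        "text_uri": text_uri or document_uri + ".txt",
        "disable_ocr": disable_ocr,
    }


def invoke_simple(event, function_name="textractor_simple"):
    """Invoke the function synchronously and return its decoded response."""
    import boto3  # imported lazily so build_simple_event works without boto3

    response = boto3.client("lambda").invoke(
        FunctionName=function_name,
        Payload=json.dumps(event).encode("utf-8"),
    )
    return json.loads(response["Payload"].read())
```

For example, build_simple_event("s3://bucket/key.pdf") yields a text_uri of s3://bucket/key.pdf.txt, matching the default described above.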
The simple function automatically falls back to the ocr function when:
- the file is a PDF (i.e., ends with .pdf),
- the text content is shorter than 32 characters, and
- disable_ocr is False.
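The three fallback conditions can be written as a single predicate. This is a minimal sketch, not the project's actual code: the function name should_fall_back_to_ocr is hypothetical, and the case-insensitive extension check is an assumption.

```python
OCR_FALLBACK_MIN_CHARS = 32  # threshold quoted in the list above


def should_fall_back_to_ocr(document_uri, extracted_text, disable_ocr=False):
    """Return True when all three documented fallback conditions hold."""
    # Assumption: the extension is compared case-insensitively.
    is_pdf = document_uri.lower().endswith(".pdf")
    too_short = len(extracted_text) < OCR_FALLBACK_MIN_CHARS
    return is_pdf and too_short and not disable_ocr
```

A scanned PDF whose text layer is empty passes the check, while a .docx file or a call with disable_ocr set to True never does.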
The ocr function expects the same event as simple, with the following additional fields:
- searchable_pdf_uri: A URI where the searchable version of the PDF file is stored. Defaults to <document_uri>.searchable.pdf.
- create_searchable_pdf: Whether to create searchable PDFs. Defaults to True.
- page: The page number on which to perform PDF OCR extraction. Defaults to all pages.
Searchable PDF creation may take significantly longer than just text extraction. As there are multiple steps in OCR PDF extraction, there are several additional variables (set through environment variables) to configure its behavior.
- MERGE_SEARCHABLE_PDF_DURATION: The maximum number of seconds to take for searchable PDF merging. Defaults to 90 seconds.
- RETURN_RESULTS_DURATION: The number of seconds to reserve at the end for compiling results and returning them. Defaults to 3 seconds.
- TEXTRACT_OUTPUT_WAIT_BUFFER_TIME: The number of seconds to reserve for the overhead in async wait of each page's OCR Lambda functions to return. Defaults to 5 seconds.
For more details about how PDF OCR extraction works, see the section on PDF OCR Extraction.
aws lambda invoke --function-name textractor_ocr --payload '{"document_uri": "https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf", "temp_uri_prefix": "s3://bucket/", "text_uri": "s3://bucket/tracemonkey.txt", "searchable_pdf_uri": "s3://bucket/tracemonkey.searchable.pdf"}' -
aws s3 cp s3://bucket/tracemonkey-5.txt -
Due to the slow nature of OCR on images and AWS Lambda's 300-second execution limit, we used a hack (i.e., another Lambda invocation) to OCR the pages of a PDF in parallel, while using S3 as our temporary store.
When we determine that a PDF needs to be processed using OCR (i.e., simple text extraction yields < 512 bytes), we automatically invoke ocr and wait for the results asynchronously for each page of the PDF (we use asyncio and aiobotocore to achieve this).
The page field in event determines which page we want to OCR for that function call.
Basically, the steps for OCR extraction are as follows:
- Determine the number of pages in the PDF using pdfinfo. We find that this subprocess call is faster (and more robust) than using a Python PDF library like PyPDF2.
- Invoke ocr on each page of the document by passing in the page field. We store the intermediate output (i.e., extracted text and searchable PDFs for each page) in the temp_uri_prefix folder. We wait for the Lambda function calls in step 2 to complete using await.
- We download the intermediate outputs to the Lambda function's local filesystem.
- We combine the intermediate text and searchable PDF, ignoring missing pages and files. The missing information will be stored in the metadata of the final text_uri and searchable_pdf_uri as missing_text_pages and missing_searchable_pdf_pages respectively.
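The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: the names parse_page_count, count_pages and ocr_all_pages are hypothetical, invoke_page stands in for the real per-page Lambda call made through aiobotocore, pages are assumed to be numbered from 1, and the per-page texts are simply joined with newlines.

```python
import asyncio
import re
import subprocess


def parse_page_count(pdfinfo_output):
    """Extract the page count from the 'Pages:' line printed by pdfinfo."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line in pdfinfo output")
    return int(match.group(1))


def count_pages(pdf_path):
    """Step 1: run pdfinfo (from Poppler) as a subprocess."""
    result = subprocess.run(["pdfinfo", pdf_path],
                            stdout=subprocess.PIPE, check=True)
    return parse_page_count(result.stdout.decode("utf-8", errors="replace"))


async def ocr_all_pages(num_pages, invoke_page, timeout):
    """Steps 2 to 4: fan out one call per page, wait, combine what arrived.

    invoke_page is a coroutine function taking a page number and returning
    that page's text (hypothetical stand-in for the per-page ocr invocation).
    """
    if num_pages == 0:
        return "", []
    tasks = {asyncio.ensure_future(invoke_page(page)): page
             for page in range(1, num_pages + 1)}
    done, pending = await asyncio.wait(list(tasks), timeout=timeout)
    for task in pending:
        task.cancel()  # pages that did not finish in time are dropped
    texts = {}
    for task in done:
        if not task.cancelled() and task.exception() is None:
            texts[tasks[task]] = task.result()
    # Pages that timed out or failed are reported rather than raised.
    missing = [p for p in range(1, num_pages + 1) if p not in texts]
    combined = "\n".join(texts[p] for p in sorted(texts))
    return combined, missing
```

Returning the missing page numbers alongside the combined text mirrors the missing_text_pages metadata described above: a slow or failed page degrades the result instead of failing the whole document.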
Steps 2 and 3 are done concurrently and asynchronously, and we set a timeout based on
REMAINING_TIME - MERGE_SEARCHABLE_PDF_DURATION - RETURN_RESULTS_DURATION - TEXTRACT_OUTPUT_WAIT_BUFFER_TIME
where REMAINING_TIME is the amount of time remaining after step 1.
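As a worked example, with 300 seconds remaining and the default settings, the timeout is 300 - 90 - 3 - 5 = 202 seconds. The sketch below computes the same budget from the environment variables; the function name ocr_wait_timeout is hypothetical, and clamping the result at zero is an assumption rather than documented behavior. Inside a Lambda handler, the remaining time could be taken from context.get_remaining_time_in_millis() / 1000.

```python
import os

# Defaults as documented for the ocr function's environment variables.
DEFAULTS = {
    "MERGE_SEARCHABLE_PDF_DURATION": 90,
    "RETURN_RESULTS_DURATION": 3,
    "TEXTRACT_OUTPUT_WAIT_BUFFER_TIME": 5,
}


def ocr_wait_timeout(remaining_seconds, environ=None):
    """Seconds available for waiting on the per-page OCR calls."""
    environ = os.environ if environ is None else environ
    reserved = sum(float(environ.get(name, default))
                   for name, default in DEFAULTS.items())
    # Assumption: never return a negative timeout.
    return max(0.0, remaining_seconds - reserved)
```

Lowering MERGE_SEARCHABLE_PDF_DURATION to 60 with the same 300 seconds remaining would leave 232 seconds for the per-page calls.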
Based on our experience, merging searchable PDFs takes quite a while (and depends on the number of pages you have).
On average, it can take about 60 seconds to merge 100 pages of searchable PDFs.
If this is an issue for you, you might want to modify the code to fix the paths of the intermediate outputs and combine them yourself outside the Lambda infrastructure.
Currently, we use random UUIDs for the filenames of each intermediate output page.
The relevant part of the code is in the _invoke_textract_ocr_tasks method.
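To illustrate the difference, the sketch below contrasts a random UUID filename with a fixed, page-based one that an external merge job could predict. Both helpers and the page-0001 naming scheme are hypothetical; the actual naming logic lives in _invoke_textract_ocr_tasks and may differ.

```python
import uuid


def random_page_uri(temp_uri_prefix, extension):
    """Current style (as described): a random UUID per intermediate output."""
    return "{}/{}.{}".format(temp_uri_prefix.rstrip("/"), uuid.uuid4(), extension)


def fixed_page_uri(temp_uri_prefix, page, extension):
    """Alternative: a predictable name derived from the page number.

    Hypothetical scheme, zero-padded so that names sort in page order.
    """
    return "{}/page-{:04d}.{}".format(temp_uri_prefix.rstrip("/"), page, extension)
```

With fixed names, a process outside Lambda can list the prefix, sort the keys, and merge the per-page PDFs without needing the mapping from page numbers to UUIDs.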
For OCR extraction on individual pages, we use Ghostscript to render the page into an image with basic image processing and then use Tesseract to do the text extraction.
If create_searchable_pdf is enabled, Tesseract is used to directly create a searchable PDF.
After that, we use pdftotext for regular text extraction from the searchable PDF (instead of running Tesseract twice).
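The per-page pipeline can be sketched as a list of command lines. This is an approximation under stated assumptions: the function name page_ocr_commands, the png16m device, the 300 dpi resolution and the file naming are all illustrative choices, and the project's actual Ghostscript options (including its image processing) are not reproduced here.

```python
def page_ocr_commands(pdf_path, page, workdir, create_searchable_pdf=True, dpi=300):
    """Build the commands for OCR of a single page (nothing is executed)."""
    image = "{}/page-{}.png".format(workdir, page)
    base = "{}/page-{}".format(workdir, page)
    commands = [[
        # Ghostscript: render only the requested page to a PNG image.
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE", "-sDEVICE=png16m",
        "-r{}".format(dpi),
        "-dFirstPage={}".format(page), "-dLastPage={}".format(page),
        "-sOutputFile={}".format(image), pdf_path,
    ]]
    if create_searchable_pdf:
        # Tesseract writes <base>.pdf with a text layer ...
        commands.append(["tesseract", image, base, "pdf"])
        # ... and pdftotext reads that layer instead of running OCR twice.
        commands.append(["pdftotext", base + ".pdf", base + ".txt"])
    else:
        # Plain text output: Tesseract writes <base>.txt directly.
        commands.append(["tesseract", image, base])
    return commands
```

Each command could then be run in order with subprocess.run(command, check=True), provided the gs, tesseract and pdftotext binaries are available in the execution environment.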
If anybody knows of a better pattern for processing PDFs, do feel free to submit a pull request!
For more information on how we prepped the Lambda execution environment to run all these external software and libraries, see Building Binaries.