PDF - The most machine-readable document format ever! Right? 🙈
Extracting text from PDF files is not a simple operation. PDF was never meant to be a format to read data from: its purpose is to provide an accurate way of reproducing documents and make them portable to any system.
- How to read PDF files with RPA Framework!
Still, it is possible to automatically read and extract invoice data from PDF documents and save the data to an Excel file. No more manual copying and pasting!
This robot processes randomly generated PDF invoices with Amazon Textract and saves the extracted invoice data in an Excel file.
The robot provides three tasks:
Create Invoices
- Generates random PDF invoices and uploads them to an Amazon S3 bucket.
- Saves the generated PDF invoices to the output directory for debugging purposes.
Process PDF invoices with Amazon Textract
- Reads the invoices from the Amazon S3 bucket.
- Processes the invoices with Amazon Textract.
- Saves the extracted invoice data in an Excel file in the output directory.
- Finally, deletes the PDF invoices from the Amazon S3 bucket.
Delete Files From Amazon S3 Bucket
- A utility task for deleting the PDF invoices from the Amazon S3 bucket.
- Can be executed separately when you want to empty the Amazon S3 bucket.
- Called by the Process PDF invoices with Amazon Textract task in the teardown phase.
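The processing flow and its teardown can be sketched in Python. This is only an illustration and not the robot's own code: it assumes `s3` and `textract` are boto3-style clients (for example, created with `boto3.client("s3")` and `boto3.client("textract")`), and the function names `extract_lines`, `delete_invoices`, and `process_invoices` are hypothetical. Writing the results to Excel is left out.

```python
def extract_lines(textract_response):
    """Return the text of every LINE block in a Textract response dict."""
    return [
        block["Text"]
        for block in textract_response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def delete_invoices(s3, bucket):
    """Delete every object listed in the bucket and return the deleted keys.

    Assumption: the bucket holds fewer than 1000 objects, so a single
    list_objects_v2 call is enough (no pagination handling here).
    """
    listing = s3.list_objects_v2(Bucket=bucket)
    keys = [obj["Key"] for obj in listing.get("Contents", [])]
    for key in keys:
        s3.delete_object(Bucket=bucket, Key=key)
    return keys


def process_invoices(s3, textract, bucket):
    """Extract text lines from each PDF in the bucket, then always clean up."""
    results = {}
    try:
        listing = s3.list_objects_v2(Bucket=bucket)
        for obj in listing.get("Contents", []):
            key = obj["Key"]
            if not key.lower().endswith(".pdf"):
                continue  # skip anything that is not a PDF invoice
            # Synchronous call; assumes single-page PDFs. Multi-page
            # documents would need the asynchronous Textract API instead.
            response = textract.detect_document_text(
                Document={"S3Object": {"Bucket": bucket, "Name": key}}
            )
            results[key] = extract_lines(response)
    finally:
        # Mirrors the teardown phase: the bucket is emptied even on failure.
        delete_invoices(s3, bucket)
    return results
```

The `try`/`finally` plays the role of the task teardown: the cleanup runs whether or not the extraction succeeds, which is why running the delete task separately is optional.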
The robot requires access to the Amazon S3 and Amazon Textract services. It needs an AWS access key (AWS_KEY), the key ID (AWS_KEY_ID), and the AWS region (AWS_REGION). See the Amazon Textract Developer Guide for details on the service.
Set up Robocorp Vault either locally or in Control Room.
For a local run, use the following configuration:
/Users/username/vault.json:
{
  "aws": {
    "AWS_KEY": "aws-key",
    "AWS_KEY_ID": "aws-key-id",
    "AWS_REGION": "us-east-1"
  }
}
devdata/env.json:
{
  "RPA_SECRET_MANAGER": "RPA.Robocorp.Vault.FileSecrets",
  "RPA_SECRET_FILE": "/Users/username/vault.json"
}
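Before a local run, it can help to check that the vault file is valid JSON and that the aws secret contains all three keys. The helper below is a hypothetical standalone sketch using only the Python standard library; it is not part of the robot, which reads the secret through the RPA.Robocorp.Vault library configured in devdata/env.json.

```python
import json
from pathlib import Path

# Keys the robot expects in the "aws" secret, per the configuration above.
REQUIRED_KEYS = ("AWS_KEY", "AWS_KEY_ID", "AWS_REGION")


def load_aws_secret(vault_path, secret_name="aws"):
    """Load one secret from a local vault JSON file and check its keys."""
    vault = json.loads(Path(vault_path).read_text(encoding="utf-8"))
    if secret_name not in vault:
        raise KeyError(f"secret {secret_name!r} not found in {vault_path}")
    secret = vault[secret_name]
    # Treat empty strings as missing so placeholder-free files fail early.
    missing = [key for key in REQUIRED_KEYS if not secret.get(key)]
    if missing:
        raise ValueError(
            f"secret {secret_name!r} is missing: {', '.join(missing)}"
        )
    return secret
```

Calling `load_aws_secret("/Users/username/vault.json")` returns the secret as a dictionary, or raises an error naming the missing keys.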
For a Control Room run, create a new vault entry in Control Room.
- Enter aws as the name.
- Provide values for the AWS_KEY, AWS_KEY_ID, and AWS_REGION keys.
- Run the Create Invoices task to create the PDF invoices.
- Run the Process PDF invoices with Amazon Textract task to process the PDF invoices and generate the Excel file with the data extracted from the invoices.
Optional: Run the Delete Files From Amazon S3 Bucket task if you want to delete the PDF invoices from the Amazon S3 bucket (the Process PDF invoices with Amazon Textract task does this automatically in the teardown phase).
When running in Control Room, add the Create Invoices and Process PDF invoices with Amazon Textract tasks as process steps.