This is a serverless service that uses Amazon Textract to extract text from images and PDFs. Clients communicate with the service through EventBridge, and the service is deployed with the AWS CDK.
- AWS CDK
- AWS Textract
- AWS EventBridge
- AWS DynamoDB
- AWS S3
- AWS Lambda
- AWS SQS
For this architecture, we need to consider the following:
- All communication with the service goes through EventBridge, which decouples the service from its clients.
- For Amazon Textract I used `QUERIES` to extract the text from the images and PDFs. The idea is to select only the text that we need instead of extracting the whole document. This is cheaper than using `FORMS` and `TABLES`.
- The configuration for the `QUERIES` is stored in a DynamoDB table. This allows us to change the configuration without redeploying the service.
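
To make the `QUERIES` idea concrete, here is a minimal sketch of what a request and its answers can look like. It is not the code from this repository: the interfaces are simplified local mirrors of the Textract `AnalyzeDocument` shapes, and the function names `buildQueriesRequest` and `readAnswers` are hypothetical.

```typescript
// Simplified local mirrors of the Textract shapes used below (not the SDK types).
interface TextractQuery {
  Text: string;
  Alias?: string;
}

interface TextractBlock {
  Id: string;
  BlockType: string;
  Text?: string;
  Confidence?: number;
  Query?: TextractQuery;
  Relationships?: { Type: string; Ids: string[] }[];
}

// Build an AnalyzeDocument-style request that asks only the given questions.
function buildQueriesRequest(bucket: string, key: string, queries: TextractQuery[]) {
  return {
    Document: { S3Object: { Bucket: bucket, Name: key } },
    FeatureTypes: ["QUERIES"],
    QueriesConfig: { Queries: queries },
  };
}

// Map each query alias (or its text when no alias is set) to the first answer found.
function readAnswers(blocks: TextractBlock[]): Record<string, string> {
  const byId: Record<string, TextractBlock> = {};
  for (const b of blocks) byId[b.Id] = b;

  const answers: Record<string, string> = {};
  for (const block of blocks) {
    if (block.BlockType !== "QUERY" || !block.Query) continue;
    const field = block.Query.Alias ?? block.Query.Text;
    for (const rel of block.Relationships ?? []) {
      if (rel.Type !== "ANSWER") continue;
      for (const id of rel.Ids) {
        const answer = byId[id];
        if (answer && answer.Text !== undefined && !(field in answers)) {
          answers[field] = answer.Text;
        }
      }
    }
  }
  return answers;
}
```

In a real Lambda the request object would be passed to the Textract client and `readAnswers` would receive the `Blocks` array from the response. Keeping both as plain functions makes them easy to test without AWS access.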
The configuration table is a DynamoDB table with the following structure:
| pk | sk | query | type |
|---|---|---|---|
| document-A | address | ["Address"] | query |
| document-A | phone_number | ["Phone number"] | query |
- The `pk` is the document name and the `sk` is the field name.
- The `query` is the list of phrases that Textract should look for to extract that field from the document.
- The `type` can be `query` or `existance`. The `existance` type is used to check whether a word exists in the document.
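
As an illustration of how these rows could drive the extraction, the sketch below splits them by `type`. It is an assumption about the approach, not the repository's implementation: `ConfigRow`, `toTextractQueries`, and `checkExistence` are hypothetical names, the `signature` row in the usage example is invented, and the existence check is assumed to be a case-insensitive substring match over the document text.

```typescript
// One configuration row, following the table structure above.
interface ConfigRow {
  pk: string; // document name
  sk: string; // field name
  query: string[]; // phrases for this field
  type: "query" | "existance";
}

// Rows of type "query" become Textract queries, with the field name as the alias.
// Assumption: when a row has several phrases, an index is appended to keep aliases distinct.
function toTextractQueries(rows: ConfigRow[]): { Text: string; Alias: string }[] {
  const queries: { Text: string; Alias: string }[] = [];
  for (const row of rows) {
    if (row.type !== "query") continue;
    row.query.forEach((text, i) => {
      const alias = row.query.length === 1 ? row.sk : `${row.sk}_${i}`;
      queries.push({ Text: text, Alias: alias });
    });
  }
  return queries;
}

// Rows of type "existance" are checked against the document text.
// Assumption: a field "exists" when any of its words appears, ignoring case.
function checkExistence(rows: ConfigRow[], documentText: string): Record<string, boolean> {
  const haystack = documentText.toLowerCase();
  const result: Record<string, boolean> = {};
  for (const row of rows) {
    if (row.type !== "existance") continue;
    result[row.sk] = row.query.some((word) => haystack.indexOf(word.toLowerCase()) !== -1);
  }
  return result;
}

// Usage with one row from the table and one invented "existance" row.
const sampleRows: ConfigRow[] = [
  { pk: "document-A", sk: "address", query: ["Address"], type: "query" },
  { pk: "document-A", sk: "signature", query: ["Signed"], type: "existance" },
];
const sampleQueries = toTextractQueries(sampleRows);
const sampleExistence = checkExistence(sampleRows, "Signed by the applicant");
```

Because the rows live in DynamoDB, adding a field only requires a new row; the mapping code stays the same.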
- Clone the repository.
- Install the dependencies: `pnpm install`
- Set up the AWS credentials: `aws configure`
- In `bin/document-extract.ts`, fill in the `AWS_ACCOUNT` and `AWS_REGION` variables.
- Deploy the service: `cdk deploy`
- Fill in the configuration table with the documents and fields that you want to extract. The example folder contains a sample configuration; run `npx ts-node configuration.ts` to load it into the table.
- Upload the document to the S3 bucket. To upload the example document from the example folder, run `npx ts-node upload.ts`.
- Send an event to the service. To send the example event from the example folder, run `npx ts-node send-event.ts`.
- Go to the AWS Console and check the results in the DynamoDB table.
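
For readers who want to send the event from their own client instead of the example script, here is a rough sketch of an EventBridge `PutEvents` entry. The entry fields (`Source`, `DetailType`, `Detail`, `EventBusName`) follow the EventBridge API, but every value and every key inside `Detail` is a hypothetical placeholder; the real contract is whatever `send-event.ts` in this repository sends.

```typescript
// Simplified local mirror of an EventBridge PutEvents entry (not the SDK type).
interface PutEventsEntry {
  Source: string;
  DetailType: string;
  Detail: string; // EventBridge expects the detail as a JSON string
  EventBusName?: string;
}

// Build the entry for one document. All values below are assumptions for illustration.
function buildExtractEntry(
  documentName: string,
  bucket: string,
  key: string,
  busName?: string,
): PutEventsEntry {
  const detail = { documentName, bucket, key }; // hypothetical detail fields
  const entry: PutEventsEntry = {
    Source: "example.client", // hypothetical source name
    DetailType: "document.extract.requested", // hypothetical detail type
    Detail: JSON.stringify(detail),
  };
  if (busName !== undefined) entry.EventBusName = busName;
  return entry;
}

// Usage: the result would go into the Entries array of a PutEvents call.
const sampleEntry = buildExtractEntry("document-A", "my-bucket", "uploads/document-a.pdf");
```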
This is an example of the configuration table:
This is an example of the input document:
This is an example of the result:
Any feedback is welcome!