
PDF to ALTO

Service that listens for incoming messages from SQS. On receipt of a message, this service downloads the referenced PDF, extracts an ALTO file for each page and uploads the results to the specified S3 bucket.

If the COMPLETED_TOPIC_ARN env var is specified, a completion notification will be raised.
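
The overall flow is a long-polling loop against the incoming queue. The sketch below shows the shape of that loop using boto3; it is illustrative only (the queue name and sleep interval come from the env vars described under Configuration, and the real handler also downloads the PDF, runs pdfalto and uploads the ALTO files).

# Sketch only: long-poll the incoming queue and hand each message to the processor
import json
import os
import time

import boto3

sqs = boto3.resource("sqs", region_name=os.environ.get("AWS_REGION", "eu-west-1"))
queue = sqs.get_queue_by_name(QueueName=os.environ["INCOMING_QUEUE"])

while True:
    messages = queue.receive_messages(MaxNumberOfMessages=1, WaitTimeSeconds=20)
    if not messages:
        time.sleep(int(os.environ.get("MONITOR_SLEEP_SECS", "30")))
        continue
    for message in messages:
        body = json.loads(message.body)  # the incoming message described below
        # ... download body["pdfLocation"], run pdfalto per page, upload to body["outputLocation"] ...
        message.delete()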

Message Format

The incoming message has the following shape:

{
  "pdfLocation": "https://www.hq.nasa.gov/alsj/a17/A17_FlightPlan.pdf",
  "pdfIdentifier": "a17_flightplan",
  "outputLocation": "s3://pdf-to-alto/a17_flightplan_alto"
}

Where

  • pdfLocation - the URL the PDF can be downloaded from.
  • pdfIdentifier - a unique identifier for the PDF. If omitted, a random UUID will be used.
  • outputLocation - the S3 location where the final ALTO files will be written, with or without a preceding s3:// and with no trailing /.

(See sample.json)

The completed notification message echoes back the original message with a "numberOfFiles" property added.
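
For example, a completion notification for the sample message above might look like the following (illustrative only; the actual numberOfFiles value depends on the PDF's page count):

{
  "pdfLocation": "https://www.hq.nasa.gov/alsj/a17/A17_FlightPlan.pdf",
  "pdfIdentifier": "a17_flightplan",
  "outputLocation": "s3://pdf-to-alto/a17_flightplan_alto",
  "numberOfFiles": 124
}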

The generated ALTO files will be placed in outputLocation. The name of each file depends on the value of the PREPEND_ID env var (where i is the 0-based page index), as sketched below:

  • If true: f"{pdfIdentifier}-{i:04d}.xml"
  • Otherwise: f"{i:04d}.xml"
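
A minimal sketch of that naming rule (the PREPEND_ID default used here is an assumption, since it is not listed under Configuration):

# Naming rule sketch; a PREPEND_ID default of "true" is assumed
import os

def alto_filename(pdf_identifier: str, i: int) -> str:
    prepend_id = os.environ.get("PREPEND_ID", "true").lower() == "true"
    return f"{pdf_identifier}-{i:04d}.xml" if prepend_id else f"{i:04d}.xml"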

Technology

This is a Python script that utilises the following libraries:

  • pdfalto - C lib used to generate ALTO files.
  • PyMuPDF - Python lib used to query the PDF for page dimensions.
  • lxml - Used to parse ALTO files and update scaled values (sketched below).
  • requests - Used to download PDF files.
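
As an illustration of how PyMuPDF and lxml fit together for the RESCALE_ALTO step, here is a minimal sketch; it assumes the standard ALTO v3 namespace and only updates the Page dimensions, whereas the service's actual rescaling may adjust more values:

# Illustration only: set an ALTO page's dimensions from the source PDF
import fitz  # PyMuPDF
from lxml import etree

ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}  # assumed ALTO version

def rescale_alto(alto_path: str, pdf_path: str, page_index: int) -> None:
    rect = fitz.open(pdf_path)[page_index].rect  # page size in PDF points

    tree = etree.parse(alto_path)
    for page in tree.iterfind(".//alto:Page", namespaces=ALTO_NS):
        page.set("WIDTH", str(rect.width))
        page.set("HEIGHT", str(rect.height))
    tree.write(alto_path, xml_declaration=True, encoding="UTF-8")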

Configuration

The following environment variables can be used to configure the app:

  • DOWNLOAD_CHUNK_SIZE - chunk size for downloading the PDF (default: 2048)
  • WORKING_FOLDER - local working folder for storing generated files (default: ./work)
  • REMOVE_WORK_DIR - whether to clean up the working dir on completion (default: True)
  • RESCALE_ALTO - whether to rescale the generated ALTO to the page (default: True)
  • MONITOR_SLEEP_SECS - how long to sleep between long-polling operations when no messages are received (default: 30)
  • AWS_REGION - the AWS region being used (default: eu-west-1)
  • INCOMING_QUEUE - the name of the SQS queue to monitor for incoming messages (required)
  • COMPLETED_TOPIC_ARN - the ARN of a topic to post completion notifications to (optional)
  • LOCALSTACK - whether LocalStack is being used (default: False)
  • LOCALSTACK_ADDRESS - address of the LocalStack instance (default: http://localhost:4566)

(See .env.dist for a sample .env file)
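
For example, a minimal .env for running against the LocalStack setup described below might look like this (values illustrative):

AWS_REGION=eu-west-1
INCOMING_QUEUE=incoming
COMPLETED_TOPIC_ARN=arn:aws:sns:eu-west-1:000000000000:completed-topic
LOCALSTACK=True
LOCALSTACK_ADDRESS=http://localhost:4566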

Running Locally

There is a multi-stage Dockerfile that builds the pdfalto binary and copies it to a new stage.

docker-compose.yml will build and start the main Python app alongside a LocalStack instance.

# build and start image using LocalStack
docker-compose up

# build image
docker build --tag pdf-to-alto:local .

# run docker image and listen to queue
docker run --env-file .env -it --rm --name pdf-to-alto pdf-to-alto:local

# run docker image to process a single PDF
docker run -it --rm --name pdf-to-alto \
  pdf-to-alto:local \
  opt/app/app/pdf_processor.py https://text.example/test.pdf my-pdf-identifier s3://pdf-bucket/alto

Note: Building pdfalto from source takes a few minutes

LocalStack

The docker-compose.local.yml file will spin up a LocalStack instance and configure a few resources for local testing:

  • An S3 bucket titled "pdf-to-alto"
  • An SNS topic "incoming-topic" with an SQS subscription to "incoming"
  • An SNS topic "completed-topic" with an SQS subscription to "completed"

docker-compose -f docker-compose.local.yml up

To use LocalStack, set the LOCALSTACK and LOCALSTACK_ADDRESS env vars (see above).
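
Inside the app this typically amounts to pointing the boto3 clients at the LocalStack endpoint. A minimal sketch (not necessarily the service's exact wiring):

# Sketch: route boto3 at LocalStack when LOCALSTACK is enabled
import os

import boto3

use_localstack = os.environ.get("LOCALSTACK", "False").lower() == "true"
endpoint = os.environ.get("LOCALSTACK_ADDRESS", "http://localhost:4566") if use_localstack else None

sqs = boto3.resource(
    "sqs",
    region_name=os.environ.get("AWS_REGION", "eu-west-1"),
    endpoint_url=endpoint,  # None falls through to the real AWS endpoint
)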

When using the aws-cli with LocalStack, the --endpoint-url option needs to be specified. Below are some handy commands to use when testing:

# raise sample notification using sample.json
aws --endpoint-url=http://localhost:4566 sns publish --topic-arn arn:aws:sns:eu-west-1:000000000000:incoming-topic --message file://sample.json --region eu-west-1

# clear incoming queue
aws --endpoint-url=http://localhost:4566 sqs purge-queue --queue-url "http://localstack:4566/000000000000/incoming" --region eu-west-1

# check number of 'completed' notifications raised
aws --endpoint-url=http://localhost:4566 sqs get-queue-attributes --queue-url "http://localstack:4566/000000000000/completed" --attribute-names All --region eu-west-1

# check contents of s3
aws --endpoint-url=http://localhost:4566 s3 ls pdf-to-alto --recursive --region eu-west-1