ADSScanExplorerPipeline

Logic

The pipeline loops through the input folder structure, identifying journal volumes and comparing their file status against the ingestion database to detect any updates. The input folder should contain three subfolders:

  • bitmaps -- image files
  • lists -- metadata files
  • ocr -- OCR files

#TODO Write more about the file structure
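
As a conceptual sketch of the detection step described above, assuming a plain dict of known checksums as a stand-in for the ingestion database and hypothetical checksum logic:

# Conceptual sketch only: the real pipeline compares file status against the
# ingestion database; here a dict of previously seen checksums stands in for it.
import hashlib
from pathlib import Path

def volume_checksum(volume_dir: Path) -> str:
    """Hash the names and sizes of all files belonging to a volume folder."""
    h = hashlib.sha256()
    for f in sorted(volume_dir.rglob('*')):
        if f.is_file():
            h.update(f.name.encode())
            h.update(str(f.stat().st_size).encode())
    return h.hexdigest()

def find_new_or_updated(input_folder: str, known: dict) -> dict:
    """Return {volume_name: 'New' | 'Update'} for volumes whose files changed."""
    status = {}
    for volume_dir in sorted(Path(input_folder, 'bitmaps').iterdir()):
        if not volume_dir.is_dir():
            continue
        checksum = volume_checksum(volume_dir)
        if volume_dir.name not in known:
            status[volume_dir.name] = 'New'
        elif known[volume_dir.name] != checksum:
            status[volume_dir.name] = 'Update'
    return status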

Setup

  • The pipeline needs, at a minimum, a database to run the baseline ingestion pipeline.
  • An OpenSearch instance is needed to index the associated OCR files.
  • An S3 bucket is needed to upload the actual image files.

Pipeline

Start by setting up the pipeline container. Make sure to set the input folder (with all image files, top files and OCR files) under volumes in docker-compose.yaml. This mounts the folder into the container, making it accessible when running the pipeline. Also make sure to set the S3 bucket keys in config.py.
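
The S3 settings in config.py might look something like the sketch below; the exact variable names here are assumptions, so check config.py for the real keys:

# Hypothetical excerpt of config.py -- the actual setting names may differ.
S3_ACCESS_KEY_ID = 'xxxxxxxx'        # credentials for the S3 bucket
S3_SECRET_ACCESS_KEY = 'xxxxxxxx'
S3_BUCKET = 'scan-explorer-images'   # bucket that receives the uploaded image files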

docker compose -f docker/pipeline/docker-compose.yaml up -d

This will start a Celery instance. In a dev environment you can run without a RabbitMQ broker by setting CELERY_ALWAYS_EAGER=True in config.py.
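
For example, in config.py:

# Run Celery tasks eagerly (synchronously, in-process) so no RabbitMQ broker is needed -- dev only.
CELERY_ALWAYS_EAGER = True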

OpenSearch

Set up the OpenSearch docker container:

docker compose -f docker/os/docker-compose.yaml -f docker/os/{environment}.yaml up -d

Set up the index by running the following through the pipeline container:

docker exec -it ads_scan_explorer_pipeline python setup_os.py [--re-create] [--update-settings]

Database

Set up a PostgreSQL container:

docker compose -f docker/postgres/docker-compose.yaml up -d

Prepare the database:

docker exec -it postgres bash -c "psql -c \"CREATE ROLE scan_explorer WITH LOGIN PASSWORD 'scan_explorer';\""
docker exec -it postgres bash -c "psql -c \"CREATE DATABASE scan_explorer_pipeline;\""
docker exec -it postgres bash -c "psql -c \"GRANT CREATE ON DATABASE scan_explorer_pipeline TO scan_explorer;\""
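
The database connection in config.py then needs to point at this database. With the role, password and database name created above it would look roughly like the line below; the setting name and host are assumptions that depend on your config.py and docker network:

# Hypothetical config.py entry -- the actual setting name, host and port may differ.
SQLALCHEMY_URL = 'postgresql://scan_explorer:scan_explorer@postgres:5432/scan_explorer_pipeline'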

Set up the tables by running the following through the pipeline container:

docker exec -it ads_scan_explorer_pipeline python setup_db.py [--re-create] 

Usage

The pipeline can be run in a few different ways.

For a pure dry run, to see which volumes would be detected without writing anything to the database, run:

docker exec -it ads_scan_explorer_pipeline python run.py --input-folder=/opt/ADS_scans_sample/ NEW --dry-run=True

Just check which volumes in the input folder are new or have updated files. The detected volumes will be added to the database under the "journal_volume" table:

docker exec -it ads_scan_explorer_pipeline python run.py --input-folder=/opt/ADS_scans_sample/ NEW --process=False

Process all volumes with new or updated status, updating the database and OCR index but leaving out the heavier task of uploading all image files to the S3 bucket:

docker exec -it ads_scan_explorer_pipeline python run.py --input-folder=/opt/ADS_scans_sample/ --upload-files=n --index-ocr=y --upload-db=y NEW --process=True

Process one or more volumes by id, disregarding previous status (they will be processed or reprocessed). The id is either a volume id (UUID) or journal + volume. Multiple ids can be given comma-separated:

docker exec -it ads_scan_explorer_pipeline python run.py --input-folder=/opt/ADS_scans_sample/ --upload-files=y --index-ocr=y SINGLE --id=lls..1969,c949f56b-cef6-43ea-b34c-cf5cc1bcdd41