Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Readux Ingest ECDS

Django app for Readux ingest specific to ECDS' infrastructure.

  1. Install
  2. Settings
  3. Process
    1. Local Ingest
    2. Bulk Ingest
    3. Remote Ingest

Install

pip install git+https://github.com/ecds/readux-ingest-ecds@develop

Add readux_ingest_ecds to INSTALLED_APPS in config/settings/local.py:

INSTALLED_APPS += ['readux_ingest_ecds']

Create and run the migrations.

python manage.py migrate readux_ingest_ecds

Settings

NOTE: All values are simple strings.

| Setting | Value |
| --- | --- |
| IIIF_MANIFEST_MODEL | Model reference, e.g. 'iiif.Manifest' |
| IIIF_IMAGE_SERVER_MODEL | Model reference, e.g. 'iiif.ImageServer' |
| IIIF_RELATED_LINK_MODEL | Model reference, e.g. 'iiif.RelatedLink' |
| IIIF_CANVAS_MODEL | Model reference, e.g. 'iiif.Canvas' |
| IIIF_COLLECTION_MODEL | Model reference, e.g. 'iiif.Collection' |
| IIIF_OCR_MODEL | Model reference, e.g. 'iiif.OCR' |
| INGEST_TMP_DIR | Absolute path where files will be temporarily stored. |
| INGEST_PROCESSING_DIR | Absolute path where Lambda will look for images. |
| INGEST_OCR_DIR | Absolute path where OCR files will be preserved. |
| INGEST_TRIGGER_BUCKET | S3 bucket that will trigger the PTiff Lambda function. |
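As an illustration, the settings above might look like this in config/settings/local.py. The model references are the examples from the table; the directory paths and the bucket name are placeholders invented for this sketch and must be replaced with values from your own deployment.

```python
# Sketch of the ingest settings. All values are plain strings.

# Model references in Django's "app_label.ModelName" form (examples from the table).
IIIF_MANIFEST_MODEL = "iiif.Manifest"
IIIF_IMAGE_SERVER_MODEL = "iiif.ImageServer"
IIIF_RELATED_LINK_MODEL = "iiif.RelatedLink"
IIIF_CANVAS_MODEL = "iiif.Canvas"
IIIF_COLLECTION_MODEL = "iiif.Collection"
IIIF_OCR_MODEL = "iiif.OCR"

# Placeholder absolute paths (assumptions, not defaults of the app).
INGEST_TMP_DIR = "/data/ingest/tmp"
INGEST_PROCESSING_DIR = "/data/ingest/processing"
INGEST_OCR_DIR = "/data/ingest/ocr"

# Placeholder bucket name (assumption).
INGEST_TRIGGER_BUCKET = "example-ingest-trigger"
```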

Process

Local Ingest

A person uploads a zip file with the following internal structure.

.
├──
│   └── metadata.(csv|tsv|xlsx)
│   └── images
│   │   └── 0000X.(tiff|jpg|png|gif|webp)
│   └── ocr
│   │   └── 0000X.(txt|tsv|xml|hocr)

Image Files

The "images" directory should contain all of the images, named sequentially with numbers. Images can be in any format other than PDF. Non-pyramidal TIFFs will be converted during the ingest process.

OCR Files

Each OCR file's name should match the name of its corresponding image. Readux currently supports hOCR, ALTO, and tab-delimited (TSV) files.
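Before uploading, it can help to check a zip file against the rules above. The following is a minimal sketch using only the Python standard library; the function name check_ingest_zip and the exact checks are assumptions made for illustration and are not part of this app.

```python
import zipfile
from pathlib import PurePosixPath

# OCR extensions taken from the directory listing above.
OCR_EXTENSIONS = {".txt", ".tsv", ".xml", ".hocr"}


def check_ingest_zip(zip_path):
    """Return a list of problems found in the zip file (empty if none)."""
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries; keep only files.
        paths = [PurePosixPath(n) for n in archive.namelist() if not n.endswith("/")]

    # A file belongs to "images" or "ocr" if that is its immediate parent directory.
    images = [p for p in paths if p.parent.name == "images"]
    ocr_files = [p for p in paths if p.parent.name == "ocr"]
    problems = []

    if not images:
        problems.append("no files found in an 'images' directory")
    for path in images:
        if path.suffix.lower() == ".pdf":
            problems.append(f"{path}: PDF is not an accepted image format")

    # Each OCR file must share its base name with an image.
    image_stems = {p.stem for p in images}
    for path in ocr_files:
        if path.suffix.lower() not in OCR_EXTENSIONS:
            problems.append(f"{path}: unexpected OCR file extension")
        if path.stem not in image_stems:
            problems.append(f"{path}: no image named {path.stem}.*")
    return problems
```

Calling check_ingest_zip("volume.zip") on a well-formed archive returns an empty list; each string in a non-empty result names one file to fix.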

Metadata File

The optional metadata file should be a spreadsheet. CSV is best, but TSV and Excel files are supported. The table below lists the supported column headers.

| Header | Description |
| --- | --- |
| PID | Unique identifier. If it is missing, Readux will assign one. |
| Label | Volume title. If the title is extremely long, you can abbreviate it and put the rest in the Summary. |
| Summary | All descriptive information. You can use the HTML tag `<br/>` to add line breaks to the text. |
| Author | Last name, first name, dates. Separate multiple authors with semicolons. |
| Published city | City from the publisher information. |
| Published date | Date of publication. |
| Published date edtf | Date of publication in Extended Date/Time Format (EDTF), used for search. A single year is unchanged (1688 stays 1688), but a range is written with a slash (1688-1690 becomes 1688/1690). |
| Publisher | Publisher from the publisher information. |
| PDF | Link to a file, if available (optional). |
| Scanned by | Usually "Emory Libraries". |
| Identifier | The library call number. |
| Identifier uri | Link to the item in the library database. |
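A metadata file with these headers can be generated with Python's csv module. The sketch below is illustrative only: the helper names to_edtf and metadata_csv are invented here, the header spellings are copied from the table above, and the date helper handles only the two cases the table describes (a single year and a year range).

```python
import csv
import io
import re

# Column headers as listed in the table above.
HEADERS = [
    "PID", "Label", "Summary", "Author", "Published city", "Published date",
    "Published date edtf", "Publisher", "PDF", "Scanned by",
    "Identifier", "Identifier uri",
]


def to_edtf(published_date):
    """Turn 'YYYY-YYYY' into the interval 'YYYY/YYYY'; leave anything else unchanged."""
    text = published_date.strip()
    match = re.fullmatch(r"(\d{4})\s*-\s*(\d{4})", text)
    if match:
        return f"{match.group(1)}/{match.group(2)}"
    return text


def metadata_csv(rows):
    """Return CSV text for a list of dicts keyed by the headers above."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=HEADERS)  # missing keys become ""
    writer.writeheader()
    for row in rows:
        row = dict(row)  # copy so the caller's dict is not modified
        # Derive the EDTF column from the display date unless it was given.
        row.setdefault("Published date edtf", to_edtf(row.get("Published date", "")))
        writer.writerow(row)
    return buffer.getvalue()


# Made-up sample row, for illustration only.
print(metadata_csv([{"Label": "Example Volume", "Published date": "1688-1690"}]))
```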

How It Works

When the zip file is uploaded, the metadata file is read and a new manifest/volume is created. A background job then starts unpacking the image and OCR files, and the person is redirected to the edit form for the new manifest.

The background job saves the OCR files and places all of the image files in a staging directory. As the image files are unpacked, each file name is added to a text file. That text file is then uploaded to a specific S3 bucket. When the file is saved to the bucket, an AWS Lambda function converts each file in the list to a PTIFF and saves it in the image directory for the IIP server.
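The staging and trigger steps described above can be sketched roughly as follows. This is not the app's actual implementation: the function names stage_images and upload_trigger_file are hypothetical, the staging directory and bucket are passed in as arguments rather than read from settings, and the upload assumes the third-party boto3 package with AWS credentials already configured.

```python
import zipfile
from pathlib import Path, PurePosixPath


def stage_images(zip_path, staging_dir, list_path):
    """Unpack files under 'images/' into staging_dir and list their names in list_path."""
    staging_dir = Path(staging_dir)
    staging_dir.mkdir(parents=True, exist_ok=True)
    names = []
    with zipfile.ZipFile(zip_path) as archive, \
            open(list_path, "w", encoding="utf-8") as listing:
        for info in archive.infolist():
            member = PurePosixPath(info.filename)
            if info.is_dir() or member.parent.name != "images":
                continue
            # Write by base name only, so archive paths cannot escape staging_dir.
            (staging_dir / member.name).write_bytes(archive.read(info))
            # One file name per line; the Lambda function would read this list.
            listing.write(member.name + "\n")
            names.append(member.name)
    return names


def upload_trigger_file(list_path, bucket, key=None):
    """Upload the list file to S3; the bucket event is what would start the Lambda."""
    import boto3  # third-party; imported lazily so staging works without it

    key = key or Path(list_path).name
    boto3.client("s3").upload_file(str(list_path), bucket, key)
```

Keeping the upload in its own function means the unpacking step can be run and tested locally without AWS access.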

Bulk Ingest

Coming soon...

Remote Ingest

Coming soon...