Scripts to export collections from ArchivesSpace and index them in ArcLight

ArchivesSpace, ArcLight, and Hyrax Workflow

This repo contains documentation and scripts for how the M.E. Grenander Department of Special Collections & Archives connects ArchivesSpace, ArcLight, and Hyrax and keeps everything synced together. It contains:

  • Documentation for uploading digital objects in Hyrax using existing description
  • Overnight exporting and indexing scripts that update data between each service

Updated documentation for this repo is on our documentation site:

Uploading Digital Objects to Hyrax with Existing Description

Uploading Digital Objects to Hyrax

  1. Go to Hyrax and log in, or create an account and request upload access.

    • Let Greg know when you create an account and return when you have upload permissions.
  2. Once you have upload permissions, go to ArcLight and find the file that represents the digital object you want to upload. From the URI, copy the long string of letters and numbers right after the “aspace_”. This is the unique ArchivesSpace ID for that record.

    • Notice the collection ID is in the URI as well.

Screenshot of getting ID from ArcLight URL

  3. In your Dashboard, select “Works” in the left-side menu.

Screenshot of adding a new work in Hyrax

  4. Select the “Add new work” button on the right side.

Screenshot of adding a new work in Hyrax

  5. For most cases, select “Digital Archival Objects” and then the “Create Work” button.

Screenshot of adding a new DAO in Hyrax

  6. In the “Descriptions” tab, enter only the ArchivesSpace ID and the collection number.

Screenshot of pasting an ASpace ID while creating a new DAO in Hyrax

  7. Select the “Load Record” button to pull additional metadata from ArcLight (JavaScript file).

Screenshot of automating import of metadata from ArcLight.

  8. Add additional metadata. Resource Type and Rights Statement are required, while “Additional fields” are not.

Screenshot of selecting a resource type.

  9. In the “Files” tab, browse and upload any files represented by the ArcLight record. These can be PDFs, Office documents (doc, docx, ppt, xlsx, etc.), or any image file.

Screenshot of uploading a binary file to Hyrax

  10. Select the visibility of the work on the right side, and save the work.

Screenshot of selecting the visibility and saving a new work in Hyrax.
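
Step 2 above asks for the string after “aspace_” in the ArcLight URI. For anyone who wants to script that step, a minimal Python sketch might look like the following. It assumes the record identifier is the last path segment of the URI and that the collection ID sits directly before “aspace_”. The function name and the example URL are hypothetical and are not part of this repo.

```python
from urllib.parse import urlparse


def split_arclight_id(uri):
    """Return (collection_id, aspace_id) parsed from an ArcLight record URI.

    Assumption: the last path segment looks like '<collection>aspace_<id>'.
    """
    # Take the last path segment, ignoring any trailing slash.
    record_id = urlparse(uri).path.rstrip("/").split("/")[-1]
    # Split once on the literal marker described in step 2.
    collection_id, marker, aspace_id = record_id.partition("aspace_")
    if not marker or not aspace_id:
        raise ValueError(f"no 'aspace_' ID found in {uri!r}")
    return collection_id, aspace_id


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    print(split_arclight_id(
        "https://example.org/catalog/apap101aspace_0123456789abcdef"
    ))
```

The two returned values correspond to the collection number and the ArchivesSpace ID entered in the “Descriptions” tab.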

Overnight Export and Indexing Scripts

High-Level Overview

Diagram of how these scripts work to keep different services interconnected.

What Each Script Does

  • Each night, exportPublicData.py uses ArchivesSnake to query ArchivesSpace for resources updated since the last run.
  • For collections with the complete set of DACS-minimum elements, it exports EAD 2002 files. For collections with only abstracts and extents, it saves the data to pipe-delimited CSVs.
  • It also builds a CSV of local subjects and collection IDs.
  • All this data is pushed to GitHub.
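
As a rough illustration of the "updated since the last run" query, the sketch below asks ArchivesSpace for the IDs of resources modified after a given timestamp. It is not the code in exportPublicData.py. It assumes a client object with a requests-style get() method (as ArchivesSnake's ASnakeClient provides) and relies on the all_ids and modified_since query parameters of the ArchivesSpace resources endpoint. The function name and the repository ID are hypothetical.

```python
def updated_resource_ids(client, repo_id, last_run):
    """Return IDs of resources modified since `last_run` (POSIX seconds).

    `client` is assumed to behave like asnake.client.ASnakeClient:
    get(path, params=...) returns a response with a .json() method.
    """
    response = client.get(
        f"repositories/{repo_id}/resources",
        params={
            "all_ids": "true",                # return a bare list of IDs
            "modified_since": int(last_run),  # whole seconds since the epoch
        },
    )
    return response.json()
```

With ArchivesSnake installed and configured, this could be called as updated_resource_ids(client, 2, last_run) after creating and authorizing an ASnakeClient, where 2 stands in for the repository ID and last_run is the timestamp saved by the previous run.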

Indexing Shell Scripts

  • Later, collection data is updated with git pull, and indexNewEAD.sh uses find -mtime -1 to locate EAD files modified in the past day and indexes them into the ArcLight Solr instance.
  • There are also additional indexing shell scripts for ad hoc updates.
    • indexAllEAD.sh reindexes all EAD files
    • indexOneEAD.sh indexes only one EAD by collection ID (./indexOneEAD.sh apap101)
    • indexOneNDPA.sh indexes one NDPA EAD file, necessary because they have the same collection ID prefixes
    • indexNewNoLog.sh indexes one EAD file, but logs to stdout instead of a log file
    • indexOneURL.sh indexes via a URL instead of from disk (not actively used)
  • Finally, processNewUploads.py queries the Hyrax Solr index for new uploads that are connected to ArchivesSpace ref_ids, but do not have accession numbers.
  • It downloads the new binaries and metadata and creates basic Archival Information Packages (AIPs) using bagit-python
  • It then uses ArchivesSnake to add a new Digital Object Record in ArchivesSpace that links to the object in Hyrax
  • Last, it adds a new accession ID in Hyrax
  • (Also check out Noah Huffman's talk that probably does this better [Direct Link].)
  • A simple library that converts POSIX timestamps and ISO 8601 dates to DACS-compliant display dates.
  • exportPublicData.py uses this to make dates for the static browse pages.
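
To make the date conversion concrete, here is a small sketch of what turning ISO 8601 dates and POSIX timestamps into display dates can look like. It is not the library in this repo. It assumes the year-month-day display order commonly used for DACS dates (for example "1906 March 17"), handles only single dates (no ranges), and interprets timestamps as UTC. Both function names are hypothetical.

```python
import calendar
from datetime import datetime, timezone


def iso_to_dacs(iso_date):
    """Convert 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD' to a display date."""
    parts = iso_date.split("-")
    display = [parts[0]]  # the year is always present
    if len(parts) > 1:
        # Spell out the month name (English in the default locale).
        display.append(calendar.month_name[int(parts[1])])
    if len(parts) > 2:
        # int() drops the leading zero from the day.
        display.append(str(int(parts[2])))
    return " ".join(display)


def posix_to_dacs(timestamp):
    """Convert a POSIX timestamp to a display date, interpreted as UTC."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return iso_to_dacs(moment.strftime("%Y-%m-%d"))


if __name__ == "__main__":
    print(iso_to_dacs("1985-01-12"))
    print(posix_to_dacs(0))
```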

Example crontab

# get new image from Bing
0 2 * * * source /home/user/.bashrc; pyenv activate aspaceExport && python /opt/lib/ArchivesSpace-ArcLight-Workflow/image_a_day.py 1>> /media/SPE/indexing-logs/image_a_day.log 2>&1 && pyenv deactivate

# export data from ASpace
0 0 * * * source /home/user/.bashrc; pyenv activate aspaceExport && python /opt/lib/ArchivesSpace-ArcLight-Workflow/exportPublicData.py 1>> /media/SPE/indexing-logs/export.log 2>&1 && pyenv deactivate

# pull new EADs from GitHub
30 0 * * * echo "$(date) $line git pull" >> /media/SPE/indexing-logs/git.log && git --git-dir=/opt/lib/collections/.git --work-tree=/opt/lib/collections pull 1>> /media/SPE/indexing-logs/git.log 2>&1

# Index modified apap collections
5 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "apap"

# Index modified ua collections
15 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "ua"

# Index modified ndpa collections
25 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "ndpa"

# Index modified ger collections
35 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "ger"

# Index modified mss collections
45 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "mss"

# Download new Hyrax uploads and create new ASpace digital objects
0 2 * * * source /home/user/.bashrc; pyenv activate processNewUploads && python /opt/lib/ArchivesSpace-ArcLight-Workflow/processNewUploads.py 1>> /media/SPE/indexing-logs/processNewUploads.log 2>&1 && pyenv deactivate
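
The nightly indexNewEAD.sh entries above rely on find -mtime -1 to pick up EAD files changed in the past day. For readers more comfortable in Python, a rough equivalent of that selection step is sketched below. It only lists files and does not index anything. The function name, the *.xml pattern, and the example directory layout are assumptions, not details taken from the shell scripts.

```python
import time
from pathlib import Path


def recently_modified(directory, pattern="*.xml",
                      max_age_seconds=24 * 60 * 60, now=None):
    """List files under `directory` modified within the last day.

    Mirrors `find <directory> -name '*.xml' -mtime -1`: a file qualifies
    when its modification time is less than 24 hours before `now`.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    return sorted(
        path
        for path in Path(directory).rglob(pattern)  # recursive, like find
        if path.is_file() and path.stat().st_mtime > cutoff
    )


if __name__ == "__main__":
    # Hypothetical location of one collection prefix's EAD files.
    for ead in recently_modified("/opt/lib/collections/apap"):
        print(ead)
```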