/vault_migration

Tools, ideas, and data for repository migration


Repository Migrations

Tools, ideas, and data.

Semantics: EQUELLA objects are items with attachments. Invenio objects are records with files. EQUELLA has taxonomies; Invenio has vocabularies. We use these terms consistently so it's clear what format an object is in (e.g. python migrate/record.py item.json > record.json converts an item into a record).

Setup & Tests

poetry install # get dependencies
poetry shell # enter venv
python -m spacy download en_core_web_lg # download spacy model for Named Entity Recognition
pytest -v migrate/tests.py # run tests

Migrate scripts that create records require an INVENIO_TOKEN or TOKEN variable in our environment or .env file. To create a token: sign in as an admin and go to Applications > Personal access tokens.
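As a rough illustration of the token lookup described above, a script could resolve the token as sketched below. The helper name get_token and the precedence (INVENIO_TOKEN before TOKEN) are assumptions made for this example, not necessarily what the migrate scripts do.

```python
import os


def get_token(env=None):
    """Return the first non-empty access token found in the environment.

    Hypothetical helper: the lookup order is an assumption for illustration.
    """
    env = os.environ if env is None else env
    # Assumed precedence: INVENIO_TOKEN first, then the shorter TOKEN
    for name in ("INVENIO_TOKEN", "TOKEN"):
        value = (env.get(name) or "").strip()
        if value:
            return value
    raise RuntimeError("Set INVENIO_TOKEN or TOKEN in the environment or .env file")
```

Passing a plain dict instead of os.environ makes the lookup easy to check without touching the real environment.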

Vocabularies

Invenio uses vocabularies to represent a number of fixtures beyond just subject headings, like names, description types, and creator roles. They're stored under the app_data directory and loaded when an instance is initialized. Many of our controlled lists in contribution wizards and EQUELLA taxonomies will be mapped to vocabularies.

The taxos dir contains exported EQUELLA taxonomies and tools for working with them. The vocab dir contains YAML files for Invenio vocabularies.

Subjects

We create two subject vocabularies: one for Library of Congress subjects with URIs from one of their authorities and one for CCA local subjects not present in any LC authority.
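The split between the two vocabularies might look roughly like the sketch below. The CSV column names (term, uri), the scheme names, and the slug-style local IDs are all assumptions for illustration. The real logic lives in migrate/mk_subjects.py and may differ.

```python
import csv
import io


def split_subjects(rows):
    """Sort rows into LC entries (have a URI) and local entries (no URI)."""
    lc, local = [], []
    for row in rows:
        term = (row.get("term") or "").strip()
        uri = (row.get("uri") or "").strip()
        if not term:
            continue  # skip blank rows
        if uri:
            # Assumption: the authority URI doubles as the vocabulary ID
            lc.append({"id": uri, "scheme": "lc", "subject": term})
        else:
            # Assumption: local IDs are a simple slug of the term
            slug = term.lower().replace(" ", "-")
            local.append({"id": slug, "scheme": "cca_local", "subject": term})
    return lc, local


# Made-up sample data; the URI is a placeholder, not a real LC identifier
sample = "term,uri\nArt,https://example.org/lc/art\nCampus History,\n"
lc_entries, local_entries = split_subjects(csv.DictReader(io.StringIO(sample)))
```

Each resulting list could then be serialized to its YAML file with a library such as PyYAML.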

Download our subjects sheet and run python migrate/mk_subjects.py data/subjects.csv to create the YAML vocabularies in the vocab dir (lc.yaml and cca_local.yaml) as well as migrate/subjects_map.json which is used to convert the text of VAULT subject terms into Invenio identifiers or ID-less keyword subjects.
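To make the role of the map concrete, here is a minimal sketch of the lookup, assuming subjects_map.json is a flat mapping of term text to vocabulary ID (its actual structure may differ). The function name to_subject is hypothetical. The two output shapes follow the Invenio record schema, where a subject is either a vocabulary reference by id or free text under subject.

```python
def to_subject(term, subjects_map):
    """Convert a VAULT subject term into an Invenio subject dict."""
    text = term.strip()
    subject_id = subjects_map.get(text)
    if subject_id:
        # Term exists in a vocabulary: reference it by ID
        return {"id": subject_id}
    # Unknown term: fall back to an ID-less keyword subject
    return {"subject": text}


# Made-up map contents for illustration only
example_map = {"Art": "https://example.org/lc/art"}
known = to_subject("Art", example_map)
keyword = to_subject("Zines", example_map)
```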

Copy the YAML vocabularies into the app_data/vocabularies directory of our Invenio instance. The site needs to be rebuilt to load the changes (invenio-cli services destroy and then invenio-cli services setup again). Eventually (Invenio v12) there will be a CLI command to alter vocabularies without rebuilding the site.

Creating Records in Invenio

  • migrate/record.py: Converts EQUELLA item JSON into Invenio record JSON
  • migrate/api.py: Converts an item and POSTs it to Invenio to create a record
  • migrate/import.py: Imports an item directory (created by the export tool) with its attachments to Invenio

To use these scripts, we must create a personal access token for an administrator account in Invenio:

  1. Sign in as an admin
  2. Go to Applications > Personal access tokens
  3. Create one—its name and the user:email scope (as of v12) do not matter
  4. Copy it to clipboard and Save
  5. Paste in .env and/or set it as an env var, e.g. set -x INVENIO_TOKEN xyz in fish (fish separates the name and value with a space, not =)

Below, we migrate a VAULT item to an Invenio record and post it to Invenio.

set -x INVENIO_TOKEN your_token_here
poetry run python migrate/api.py items/item.json # example output below
HTTP 201
https://127.0.0.1:5000/api/records/k7qk8-fqq15/draft
HTTP 202
{"id": "k7qk8-fqq15", "created": "2024-05-31T15:26:17.972009+00:00", ...
https://127.0.0.1:5000/records/k7qk8-fqq15
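The two status codes in the output above are consistent with the usual two-step flow of the Invenio REST API: a POST to /api/records creates a draft (201), and a POST to the draft's publish action publishes it (202). The sketch below only builds those two requests with the standard library. The function names are hypothetical and this is not the code in migrate/api.py.

```python
import json
from urllib.request import Request


def build_create_request(base_url, token, record):
    """Build the POST request that creates a draft from record JSON."""
    return Request(
        f"{base_url}/api/records",
        data=json.dumps(record).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def build_publish_request(base_url, token, record_id):
    """Build the POST request that publishes an existing draft."""
    return Request(
        f"{base_url}/api/records/{record_id}/draft/actions/publish",
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )


create = build_create_request("https://127.0.0.1:5000", "xyz", {"metadata": {}})
publish = build_publish_request("https://127.0.0.1:5000", "xyz", "k7qk8-fqq15")
```

Sending these with urllib.request.urlopen against a local instance would likely also need an SSL context that accepts the development certificate.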

You can sometimes trip over yourself because Poetry automatically loads the .env file in the project root, which might contain an outdated personal access token. If API calls fail with 403 errors, check that the TOKEN and/or INVENIO_TOKEN environment variables are set correctly.

Rerunning the script with the same input creates a new record each time; it does not update existing ones.

Items

We could write scripts to directly take an item from EQUELLA using its API, perform a metadata crosswalk, and post it to Invenio. Alternatively, we could work with local copies of items, perhaps created by the equella_scripts collection export tool.

We need to load the necessary fixtures, including user accounts, before adding to Invenio. For instance, the item owner needs to already be in Invenio before we can add them as owner of a record. If we attempt to load a record with a subject id that doesn't exist yet, we get a 500 error.
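One way to avoid the 500 error is to check a record's subject IDs against the loaded vocabularies before posting. The sketch below assumes subjects sit under metadata.subjects, as in the Invenio record schema, and that known_ids is a set gathered from the vocabulary YAML files. The helper name is hypothetical.

```python
def missing_subject_ids(record, known_ids):
    """Return subject IDs in the record that are absent from known_ids."""
    subjects = record.get("metadata", {}).get("subjects", [])
    # Keyword subjects have no "id" key and need no fixture, so skip them
    return [s["id"] for s in subjects if "id" in s and s["id"] not in known_ids]


# Made-up record and IDs for illustration only
record = {
    "metadata": {
        "subjects": [
            {"id": "https://example.org/lc/art"},
            {"id": "not-loaded-yet"},
            {"subject": "free text keyword"},
        ]
    }
}
missing = missing_subject_ids(record, {"https://example.org/lc/art"})
```

A migration script could refuse to post, or log a warning, whenever the returned list is not empty.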

We download metadata for all items using equella-cli and a script like this:

#!/usr/bin/env fish
set total (eq search -l 1 | jq '.available')
set length 50 # can only download up to 50 at a time
set pages (math floor $total / $length)
for i in (seq 0 $pages)
  set start (math $i \* $length)
  echo "Downloading items $start to" (math $start + $length)
  # NOTE: no attachment info, use "--info all" for both attachments & metadata
  eq search -l $length --info metadata --start $start > json/$i.json
end
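The paging arithmetic in the script above can be restated in Python to see which start offsets it requests. This is only a sketch that mirrors the loop; the function name page_starts is made up for the example.

```python
def page_starts(total, length=50):
    """Return the start offsets the loop requests for a given item count."""
    pages = total // length  # same as floor(total / length)
    # seq 0 $pages is inclusive, so there are pages + 1 requests
    return [i * length for i in range(pages + 1)]


offsets = page_starts(120)
```

When the total is an exact multiple of the page length, the final request starts at the total and would return no items, which is harmless but produces an empty last file.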

Metadata Crosswalk

We can use the item.metadata XML of existing VAULT items for testing. Generally, run poetry run python migrate/record.py items/item.json | jq to see the resulting Invenio record JSON. See our crosswalk diagrams.

Schemas:

It's likely our schema is outdated/inaccurate in places.

How to map a field:

  • Add a brief description to the mermaid diagram in docs/crosswalk.html
  • Write a test in tests.py with your input XML and expected record output
  • Write a Record method in migrate.py & use it in the Record::get() dict
  • Run tests, optionally run a record migration as described above
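The steps above might look like the following for a title field. Everything here is hypothetical: the class name SketchRecord, the XML path, and the "Untitled" fallback are assumptions made for illustration and do not reflect the project's actual Record class or crosswalk.

```python
import xml.etree.ElementTree as ET


class SketchRecord:
    """Stand-in for a Record class: one property per mapped field."""

    def __init__(self, item):
        # Assumption: item["metadata"] holds the item's XML as a string
        self.xml = ET.fromstring(item["metadata"])

    @property
    def title(self):
        # Hypothetical path to the title element in the item XML
        node = self.xml.find("./mods/titleInfo/title")
        if node is not None and node.text:
            return node.text.strip()
        return "Untitled"  # assumed fallback, not the project's rule

    def get(self):
        # Each mapped field is added to this dict
        return {"metadata": {"title": self.title}}


def test_title():
    """Input XML and expected record output, in the style of tests.py."""
    item = {
        "metadata": "<xml><mods><titleInfo><title> Example </title></titleInfo></mods></xml>"
    }
    assert SketchRecord(item).get()["metadata"]["title"] == "Example"
```

Writing the test first keeps the expected output explicit before the method exists.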

LICENSE

ECL Version 2.0