/aspace_preservica_db

A reporting database for collating and doing stuff with ArchivesSpace and Preservica metadata

Primary LanguagePythonMIT LicenseMIT

aspace_preservica_db

A reporting database for collating and doing stuff with ArchivesSpace and Preservica metadata

Tables

The tables in the database and how they are generated

archival_object

This table is populated by running the archival_object_table.sql query against the ArchivesSpace database.

Column Description Type
id The archival object id int
repo_id The repository id int
root_record_id The parent resource id int
parent_id The parent archival object int
ref_id The ref id varchar(255)
component_id The component unique id varchar(255)
title The archival object title varchar(8704)
publish 1 published, 0 unpublished int
level Hierarchical level (i.e. file) varchar(255)
preservica_collection_id Found in the note field varchar(255)
extent Concatenated extent value varchar(255)
physical_containers Concatenated container data mediumtext
create_time ArchivesSpace creation time timestamp
m_time ArchivesSpace last modified timestamp

digital_object

This table is populated by running the digital_object_table.sql query against the ArchivesSpace database.

Column Description Type
id The digital object id int
digital_object_id The Preservica DelUnit id varchar(255)
archival_object_id The linked record id int
title The digital object title varchar(255)
publish 1 published, 0 unpublished int
has_content_link 1 has link. 0 no link int
create_time ArchivesSpace creation time timestamp
m_time ArchivesSpace creation time timestamp

digital_object_component

This table is populated by running the digital_object_component_table.sql query against the ArchivesSpacee database.

Column Description Type
id The digital object component id int
repo_id The repository id int
root_record_id The root digital object id int
parent_id The parent digital object id int
component_id The component unique id varchar(255)
title The component title varchar(8704)
publish 1 published, 0 unpublished int
create_time ArchivesSpace creation time timestamp
m_time ArchivesSpace creation time timestamp

restriction

This table is populated by running the restrictions_table.sql query against the ArchivesSpace database. This query can only be run on a local/test version of YUL ArchivesSpace, as it requires SQL common table expressions (CTEs) that only exist in MySQL 8+. The production version of YUL ArchivesSpace should be upgraded to MySQL 8 in early 2021.

Column Description Type
id The archival object id int
resource_note_text The resource-level note longtext
lvl The note level int
path The restriction text hierarchy longtext
type_path The restriction type hierarchy longtext
end_path The restriction date hierarchy longtext

hierarchy

This table is populated by running the hierarchies.sql query against the ArchivesSpace database. The hierarchies table is a view that only exists in a local/test version of YUL ArchivesSpace, as it requires SQL common table expressions (CTEs) that only exist in MySQL 8+. The production version of YUL ArchivesSpace should be upgraded to MySQL 8 in early 2021.

Column Description Type
id The archival object id int
root_record_id The resource id int
repo_id The repository id int
full_path The full hierarchy longtext
lvl The hierarchical level int

collection

This table is populated by extracting data from Preservica collection XML files. The collection IDs are stored in notes within ArchivesSpace archival object records, and are extracted by running the get_collection_ids.sql query against the ArchivesSpace database.

Column Description Type
id Preservica collection id varchar(255)
parent_collection_id Preservica parent collection varchar(255)
collection_code Preservica collection code varchar(255)
security_tag Preservica security tag varchar(255)
title Preservica collection title varchar(8704)
create_time Local database create time timestamp
m_time Local database last modified timestamp

deliverable_unit

This table is populated by extracting data from Preservica deliverable unit XML files. The deliverable unit IDs are stored in digital object records in ArchivesSpace, and are extracted by running the get_deliverable_unit_ids.sql query against the ArchivesSpace database.

Column Description Type
id The digital object id int
collection_id The Preservica collection id varchar(255)
parent_deliverable_unit The parent deliverable unit id varchar(255)
root_parent_del_unit The root deliverable unit id varchar(255)
deliverable_unit_level The deliverable unit level int
digital_surrogate yes or no varchar(255)
coverage_from Start date timestamp
coverage_to End date timestamp
security_tag Security tag varchar(255)
create_time Local database create time timestamp
m_time Local database last modified timestamp

manifestation

This table is populated by extracting data from Preservica deliverable unit XML files.

Column Description Type
id Preservica manifestation id varchar(255)
deliverable_unit_id Deliverable unit id varchar(255)
typeref The manifestation type varchar(255)
summary Preservica summary varchar(8704)
create_time Local database create time timestamp
m_time Local database last modified timestamp

digital_file

This table is populated by extracting data from Preservica digital file XML files. The list of digital files is derived from Preservica deliverable unit XML files.

Column Description Type
id The digital file id int
manifestation_id The manifestation id int
file_set_id The file set id int
filesize The file size in bytes int
filemoddate The file modification date timestamp
filename The name of the file varchar(255)
working_path The relative path to the file varchar(255)
format_puid The format unique id varchar(255)
format_name The format name mediumtext
mimetype The format mimetype varchar(255)
create_time Local database creation time timestamp
m_time Local database last modified timestamp

Extracting Data from Source Databases

How to extract data from ArchivesSpace and Preservica

ArchivesSpace

Data is extracted from ArchivesSpace via several SQL queries

Preservica

Data is extracted from Preservica by calling the Preservica API, as access to the Preservica database is restricted.

Initializing the aspace_preservica_db Database

How to create and populate the database using extracted data from ArchivesSpace and Preservica

Run process_xml_for_db.py

The first step in initializing the aspace_preservica_db database is running the process_xml_for_db.py script on extracted Preservica XML files.

Run db_manager.py

Once the CSV spreadsheets are prepared for ingest, it is time to run the db_manager.py script to create and populate the database.

Run SQL updates to fix field formatting

After the data is ingested, a little bit of clean-up is necessary.

Derive parent deliverable unit data

Updating the aspace_preservica_db Database

Checking for new data

Adding new data to the database

Design Notes and To-Dos

Reconciling missing data

Some data was missed during the initial load:

  • Collection data
  • Child deliverable unit data
  • Digital object component data
  • Restriction data

Consider re-designing manifestation table, since there are duplicate manifestation IDs; or at least sort out those issues.