/dpres-siptools

Pre-Ingest Tool for creating submission information packages

NOTE: This project is deprecated and is no longer maintained. Please use the new Pre-Ingest Library instead: https://github.com/Digital-Preservation-Finland/dpres-siptools-ng

Pre-Ingest Tool

This tool is intended to be used for generating an OAIS SIP for digital preservation. It produces a METS document (mets.xml) that contains metadata for digital preservation required by the specifications used in the Finnish national Digital Preservation Services. The tool contains code for extracting metadata and for creating and digitally signing the METS document.

The aim is to provide digital preservation services for culture and research to ensure the access and use of materials far into the future. Documentation and specifications for the digital preservation services can be found at: http://digitalpreservation.fi

The Pre-Ingest Tool currently supports the specification version 1.7.6.

Release notes and backwards compatibility

See RELEASE_NOTES.rst

Requirements

Installation and usage require Python 3.9 or newer. The software is tested with Python 3.9 on AlmaLinux 9.

Installation using RPM packages (preferred)

Installation on Linux distributions is done by using the RPM Package Manager. See how to configure the PAS-jakelu RPM repositories to set up the necessary software sources.

After the repository has been added, the package can be installed by running the following command:

sudo dnf install python3-dpres-siptools

Scripts

import-description
for adding a descriptive metadata section to a METS document.
premis-event
for creating digital provenance metadata.
import-object
for adding technical metadata for digital objects to a METS document.
create-mix
for creating MIX metadata for image files.
create-addml
for creating ADDML metadata for csv files.
create-audiomd
for creating AudioMD metadata for audio streams.
create-videomd
for creating VideoMD metadata for video streams.
compile-structmap
for creating the file section and structural map.
compile-mets
for compiling all previously created metadata files in a METS document.
sign-mets
for digitally signing the submission information package.
compress
for wrapping the created submission information package directory to a TAR file.
create-agent
helper script for creating detailed agent metadata to be used with the premis-event script.

Usage

In order to build a SIP for digital preservation, use the scripts in the following order. These scripts produce a digitally signed METS document in the parametrized folder 'workspace'.
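The order can be sketched as a small Python driver. This is a minimal sketch, not part of the tool: the command lines are copied from the examples in the sections below with their sample paths and identifiers, the file format specific scripts (create-mix and so on) are left out for brevity, and the names PIPELINE and run_pipeline are hypothetical.

```python
import shlex
import subprocess

# Command lines copied from the examples in this README, in the order the
# scripts are meant to be run. All paths and identifiers are sample values.
PIPELINE = [
    "import-object 'tests/data/structured' --workspace ./workspace",
    "premis-event creation '2016-10-13T12:30:55' --workspace ./workspace "
    "--event_target 'tests/data/structured' --event_detail Testing "
    "--event_outcome success --event_outcome_detail 'Outcome detail' "
    "--agent_name 'Demo Application' --agent_type software",
    "import-description "
    "'tests/data/import_description/metadata/dc_description.xml' "
    "--workspace ./workspace --dmdsec_target 'tests/data/structured' "
    "--dmd_source 'my database' --dmd_agent 'database client' 'software' "
    "--remove_root",
    "compile-structmap --workspace ./workspace",
    "compile-mets ch 'CSC' 'e48a7051-2247-4d4d-ae90-44c8ee94daca' "
    "--workspace ./workspace --copy_files --clean",
    "sign-mets tests/data/rsa-keys.crt --workspace ./workspace",
    "compress ./workspace --tar_filename sip.tar",
]


def run_pipeline(command_lines, runner=subprocess.run):
    """Run each command in order and stop at the first failure."""
    for line in command_lines:
        argv = shlex.split(line)
        # check=True makes subprocess.run raise CalledProcessError
        # when a script exits with a non-zero status.
        runner(argv, check=True)


# Dry run: show which script is called at each step without executing it.
for step, line in enumerate(PIPELINE, start=1):
    print(step, shlex.split(line)[0])
```

Calling run_pipeline(PIPELINE) executes the steps for real and requires the scripts to be installed. The runner parameter exists only so that the order can be checked without running anything.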

For a short description of the optional arguments that are not listed here, run:

<scriptname> --help

Import digital objects and create general technical metadata

You can create technical metadata elements of a METS document from files located in the folder tests/data/structured as follows:

import-object 'tests/data/structured' --workspace ./workspace

You may use this script as many times as needed to import all your digital objects. The script also accepts many other options on the command line. See:

import-object --help

For information on provenance metadata created during the importing of digital objects, see the section on Provenance metadata in the packaging process below.

Create file format specific technical metadata

If your dataset contains image data, create MIX metadata for each of the image files:

create-mix path/to/images/image.tif --workspace ./workspace

ADDML metadata for a CSV file can be created by running:

create-addml path/to/csv_file.csv --workspace ./workspace --charset 'UTF8' --sep 'CR+LF' --quot '"' --delim ';'

The --header flag should be given if the CSV file has a header row. The --sep flag defines the character used to separate records and --delim the character used to separate fields. --quot defines the quotation character used.
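When the command is generated from a program, the quoting of values such as '"' and ';' is easy to get wrong. The following is a minimal Python sketch that builds the argument list instead of a shell string. The helper name build_create_addml_command is hypothetical, only the flags shown above are used, and --header is assumed to be a flag that takes no value.

```python
import shlex


def build_create_addml_command(csv_path, workspace, charset, sep, quot,
                               delim, header=False):
    """Return the create-addml argument list for one CSV file."""
    command = [
        "create-addml", csv_path,
        "--workspace", workspace,
        "--charset", charset,
        "--sep", sep,      # record separator
        "--quot", quot,    # quotation character
        "--delim", delim,  # field separator
    ]
    if header:
        # Assumption: --header is a plain flag without a value.
        command.append("--header")
    return command


command = build_create_addml_command(
    "path/to/csv_file.csv", "./workspace",
    charset="UTF8", sep="CR+LF", quot='"', delim=";", header=True,
)
# shlex.join adds the shell quoting needed for '"' and ';'.
print(shlex.join(command))
```

An argument list like this can be passed to subprocess.run as is, so no shell quoting is needed at all.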

AudioMD metadata for an audio stream file can be created by running:

create-audiomd path/to/audio/audio.wav --workspace ./workspace

If a video container file contains audio stream data, the create-audiomd script above needs to be run for all audio streams in the video files.

VideoMD metadata for a video stream file can be created by running:

create-videomd path/to/video/video.wav --workspace ./workspace

Call the scripts above for each file needed in your data set.
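For a larger data set the choice of script can be automated. The following is a minimal Python sketch that picks the scripts by file extension. The mapping and the helper name metadata_commands are assumptions made for illustration, the .mkv entry assumes a video container that also holds audio streams, and CSV files are left out because create-addml takes the additional options shown above.

```python
from pathlib import Path

# Illustrative mapping only; extend it to match the formats in your data set.
SCRIPTS_BY_SUFFIX = {
    ".tif": ["create-mix"],
    ".wav": ["create-audiomd"],
    # Assumption: the container holds both video and audio streams,
    # so both VideoMD and AudioMD metadata are created.
    ".mkv": ["create-videomd", "create-audiomd"],
}


def metadata_commands(path, workspace="./workspace"):
    """Return one argument list per metadata script needed for the file."""
    scripts = SCRIPTS_BY_SUFFIX.get(Path(path).suffix.lower(), [])
    return [[script, str(path), "--workspace", workspace]
            for script in scripts]


for sample in ["path/to/images/image.tif", "path/to/video/video.mkv"]:
    for command in metadata_commands(sample):
        print(" ".join(command))
```

Files with an extension that is not in the mapping produce an empty list, so they are skipped rather than passed to the wrong script.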

Create provenance metadata

The following example shows how to create digital provenance metadata for a METS document. Values for the parameters --event_outcome and --event_type must be taken from predefined lists:

premis-event creation '2016-10-13T12:30:55' --workspace ./workspace --event_target 'tests/data/structured' --event_detail Testing --event_outcome success --event_outcome_detail 'Outcome detail' --agent_name 'Demo Application' --agent_type software

The argument --event_target is the object (file or directory) to which the event applies. If the argument is not given, the target is the whole dataset. Do not use the argument --event_target for directories if the structural map is created from an EAD3 structure with compile-structmap. If the argument --agent_name is not given, agent metadata is not created.

You may call this script several times to create multiple provenance metadata sections.

If several digital objects are linked to the same event and agent, use --event_target multiple times. You may also want to consider using --linking_object and --add_object_links in the following way:

premis-event --linking_object source path/to/source_file --add_object_links ...

This will create an object link to the event with the given role source. --linking_object may be used several times. Using --event_target is the same as using --linking_object with the role target. The role is stored only if --add_object_links is also used.

The helper script create-agent can be used to create detailed agent metadata and to link several agents to the same event. If used, this helper script must be run before the premis-event script. Unlike the other scripts, it does not produce ready XML data but collects metadata in a JSON file. This JSON data is then passed to the premis-event script as an argument. An example of how to use the script:

create-agent 'my software' --agent_type software --agent_version 1.0 --agent_role 'executing program' --create_agent_file 'my_event_1'

This will create an agent that is a piece of software used to execute something. The '--agent_role' argument specifies the role of the agent in relation to the event and is used when linking the agent to the event. The required argument '--create_agent_file' is the name of the JSON file that collects the agent metadata. If multiple agents are created for the same event by running the create-agent script several times, they should all use the same value for the '--create_agent_file' argument. This value is then passed on to premis-event like this:

premis-event creation '2016-10-13T12:30:55' --workspace ./workspace --event_detail Testing --event_outcome success --event_outcome_detail 'Outcome detail' --create_agent_file 'my_event_1'

The premis-event script will then create the actual XML data for every agent in the "my_event_1" JSON file and link the agent(s) to the event created by the script. Note that when the '--create_agent_file' argument is used, it overrides any agent information passed to the premis-event script with the arguments '--agent_name' and '--agent_type'. The '--create_agent_file' value should be unique for each event, presuming that the events have different agents linked to them.
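The ordering constraint (all create-agent calls first, then one premis-event call with the same file name) can be sketched in Python as follows. This is an illustration only: the helper names and the second agent are hypothetical, and the commands use just the arguments shown in the examples above.

```python
import shlex


def create_agent_commands(agents, agent_file):
    """One create-agent call per agent, all collecting to the same file."""
    return [
        [
            "create-agent", agent["name"],
            "--agent_type", agent["type"],
            "--agent_role", agent["role"],
            "--create_agent_file", agent_file,
        ]
        for agent in agents
    ]


def premis_event_command(event_type, timestamp, agent_file,
                         workspace="./workspace"):
    """The premis-event call that links every agent collected in agent_file."""
    return [
        "premis-event", event_type, timestamp,
        "--workspace", workspace,
        "--event_detail", "Testing",
        "--event_outcome", "success",
        "--event_outcome_detail", "Outcome detail",
        "--create_agent_file", agent_file,
    ]


agents = [
    {"name": "my software", "type": "software", "role": "executing program"},
    # Hypothetical second agent, added to show several agents on one event.
    {"name": "my other software", "type": "software",
     "role": "executing program"},
]

# The create-agent calls must come before the premis-event call.
commands = create_agent_commands(agents, "my_event_1")
commands.append(
    premis_event_command("creation", "2016-10-13T12:30:55", "my_event_1"))

for command in commands:
    print(shlex.join(command))
```

For a second event with different agents, the same helpers would be called with a different file name, for example 'my_event_2'.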

Add existing descriptive metadata

The import-description script appends descriptive metadata into a METS XML wrapper. The metadata must be in an accepted format:

import-description 'tests/data/import_description/metadata/dc_description.xml' --workspace ./workspace --dmdsec_target 'tests/data/structured' --dmd_source 'my database' --dmd_agent 'database client' 'software' --remove_root

The argument '--remove_root' removes the root element from the given descriptive metadata. This may be needed if the metadata is given in a container element belonging to another metadata format. If the argument is not given, the descriptive metadata is included in full. The argument '--dmdsec_target <target>' is the directory to which the descriptive metadata applies. If the argument is not given, the target is the whole dataset. Do not use the argument --dmdsec_target if the structural map is created from an EAD3 structure with compile-structmap.

Importing multiple descriptive metadata files for the same --dmdsec_target is currently not supported. However, it is possible to add multiple descriptive metadata files when each of them has a different target.
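When many descriptive metadata files are imported from a script, the one-file-per-target restriction can be checked before anything is run. The following is a minimal Python sketch with a hypothetical helper name and hypothetical file paths.

```python
from collections import Counter


def duplicate_targets(imports):
    """Return the --dmdsec_target values used by more than one file.

    `imports` is a list of (metadata_file, target) pairs.
    """
    counts = Counter(target for _metadata_file, target in imports)
    return [target for target, count in counts.items() if count > 1]


# Hypothetical plan: two files point at the same target, which is not supported.
planned_imports = [
    ("metadata/dc_images.xml", "data/images"),
    ("metadata/dc_images_extra.xml", "data/images"),
    ("metadata/dc_documents.xml", "data/documents"),
]

conflicts = duplicate_targets(planned_imports)
if conflicts:
    print("More than one metadata file for target(s):", ", ".join(conflicts))
```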

For information on provenance metadata created during the importing of descriptive metadata, see the section on Provenance metadata in the packaging process below.

Compile file section and structural map

The folder structure of a dataset is turned into files containing the file section and structural map of the METS document:

compile-structmap --workspace ./workspace

Optionally, the structural map can be created from a given EAD3 structure instead of the folder structure. In that case a valid EAD3 file is given with the --dmdsec_loc argument:

compile-structmap --workspace ./workspace --structmap_type 'EAD3-logical' --dmdsec_loc tests/data/import_description/metadata/ead3_test.xml

Compile METS document and Submission Information Package

Compile a METS document file from the previous results:

compile-mets ch 'CSC' 'e48a7051-2247-4d4d-ae90-44c8ee94daca' --workspace ./workspace --copy_files --clean

The argument --copy_files copies the files to the workspace. The argument --clean cleans the workspace from the METS parts created in previous scripts.

Digitally sign the METS document:

sign-mets tests/data/rsa-keys.crt --workspace ./workspace

Create a TAR file:

compress ./workspace --tar_filename sip.tar
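Before submitting the package, it can be useful to confirm that the TAR file actually contains the METS document. The following is a minimal Python sketch using only the standard library. The helper name tar_contains_mets is hypothetical, and the check assumes only that a file named mets.xml is somewhere in the archive.

```python
import sys
import tarfile
from pathlib import PurePosixPath


def tar_contains_mets(tar_path):
    """Return True if the archive has a member whose file name is mets.xml."""
    with tarfile.open(tar_path) as archive:
        # Compare base names so that both "mets.xml" and "./mets.xml" match.
        return any(PurePosixPath(name).name == "mets.xml"
                   for name in archive.getnames())


if __name__ == "__main__":
    # Usage: python check_sip.py sip.tar
    for path in sys.argv[1:]:
        print(path, tar_contains_mets(path))
```

This checks only for the presence of the file. It does not validate the METS document or the digital signature.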

Adding native files to package with corresponding normalized files

A native file is an original file which is applicable only for bit-level preservation. Use the --bit_level flag to mark a file for bit-level preservation. The flag is not required if file-scraper is able to grade the file as a native file. Using the native file functionality requires a migrated file suitable for preservation and a normalization event. In this case the import-object script must be run before the premis-event script. Use the value normalization or migration as event type in premis-event. Here is the basic functionality:

import-object --bit_level ... path/to/native_file
import-object ... path/to/migrated_file
premis-event normalization ... --linking_object source path/to/native_file --linking_object outcome path/to/migrated_file --add_object_links
...

Sometimes a migration may be a combination of multiple source and/or outcome files. In such a case, use import-object for each of them and create the migration event using --linking_object multiple times. For example, to combine two native files into one migrated file, do the following:

import-object ... path/to/native_file
import-object ... path/to/another_native_file
import-object ... path/to/migrated_file
premis-event migration ... --linking_object source path/to/native_file --linking_object source path/to/another_native_file --linking_object outcome path/to/migrated_file --add_object_links
...

Some of the required parameters, for example the timestamp and --event_detail, are omitted from the examples above. They must still be given when running the scripts.
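The repeated --linking_object arguments can be generated from lists of source and outcome files. The following is a minimal Python sketch with a hypothetical helper name. The values passed in extra_args are taken from the premis-event example earlier in this document; check premis-event --help for the full set of required arguments.

```python
import shlex


def linking_event_command(event_type, timestamp, sources, outcomes,
                          extra_args=()):
    """Build a premis-event call linking source and outcome files."""
    if event_type not in ("normalization", "migration"):
        raise ValueError("event type must be normalization or migration")
    command = ["premis-event", event_type, timestamp, *extra_args]
    for path in sources:
        command += ["--linking_object", "source", path]
    for path in outcomes:
        command += ["--linking_object", "outcome", path]
    # The roles are stored only when --add_object_links is also given.
    command.append("--add_object_links")
    return command


command = linking_event_command(
    "migration", "2016-10-13T12:30:55",
    sources=["path/to/native_file", "path/to/another_native_file"],
    outcomes=["path/to/migrated_file"],
    extra_args=["--workspace", "./workspace", "--event_detail", "Testing",
                "--event_outcome", "success"],
)
print(shlex.join(command))
```

As described above, import-object must already have been run for every file named in sources and outcomes.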

For a native file, file format well-formedness validation is skipped in the import-object script.

Please note that importing native files in a submission information package for the Finnish National Digital Preservation Services requires acceptance from the service beforehand. If you are planning to use this feature, please contact the service for more information.

Provenance metadata in the packaging process

The Pre-Ingest Tool documents the packaging process by creating provenance metadata as PREMIS events and agents when running the scripts. The following scripts will produce provenance metadata when running them:

import-object
creates metadata extraction, validation, message digest calculation and format identification type events, depending on the arguments supplied to the script. This provenance metadata documents the creation of the technical metadata and the software used in that process
import-description
creates a metadata extraction type event, documenting the source of the descriptive metadata
compile-structmap
creates a creation type event, documenting the creation of the structural metadata

The script import-object has two arguments relating to provenance metadata, --event_target and --event_datetime. The first argument --event_target allows the provenance metadata to be linked to a specific part of the contents, for example the package root, regardless of the file path(s) given to the script. The second argument --event_datetime sets the timestamp of the event, which allows reusing the same provenance metadata each time import-object is run:

import-object 'tests/data/structured' --workspace ./workspace --event_datetime 2020-06-05 --event_target '.'

The example above allows import-object to be run multiple times for different file paths while still creating the provenance metadata only once, with the timestamp 2020-06-05, and linking the provenance metadata to the package root (.). This is also the default behaviour of the import-object script (the timestamp is the current date without the time, and the target is the package root).

Note that it is highly recommended to use both arguments if import-object is run separately for each individual digital object in a package! By supplying the same values for these arguments each time the script is run, all digital objects will link to the same provenance metadata in the METS document.

For documenting the source of the descriptive metadata, the script import-description has two arguments: --dmd_source and --dmd_agent. These are used for documenting the source, e.g. database or system, for the descriptive metadata and the agent used to export the metadata from the source, e.g. a database client or API.
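The recommendation can be followed by fixing both values once and reusing them for every call. The following is a minimal Python sketch. The helper name import_object_commands and the individual file paths are hypothetical, and only the arguments shown in the example above are used.

```python
import shlex


def import_object_commands(paths, workspace, event_datetime,
                           event_target="."):
    """One import-object call per path, all sharing the same event values."""
    return [
        [
            "import-object", path,
            "--workspace", workspace,
            "--event_datetime", event_datetime,
            "--event_target", event_target,
        ]
        for path in paths
    ]


# Hypothetical file paths; the timestamp is decided once for the whole run.
commands = import_object_commands(
    ["data/file_1.txt", "data/file_2.txt"],
    workspace="./workspace",
    event_datetime="2020-06-05",
)
for command in commands:
    print(shlex.join(command))
```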

For a native file, validation type events are not created.

Including supplementary files in the package

The Pre-Ingest Tool supports adding supplementary files as part of the SIP. These supplementary files are files that are not part of the actual contents to be preserved, but are needed in order to document the contents in some way. These supplementary files are put in a separate METS fileGrp with a USE attribute value documenting the role of these files. A separate METS structMap is also created for these files.

The supplementary files must be valid files in a file format supported by the Digital Preservation Services. They are imported as normal digital objects by the import-object script. However, the option --supplementary must be used when importing these files to mark them as supplementary:

import-object 'tests/data/text-file.txt' --workspace ./workspace --supplementary xml_schema

Currently, the only supplementary type supported is "xml_schema".

Mapping XML schema files in the package

XML schema files that are added to the SIP as supplementary files must be mapped to the schemaLocation or noNamespaceSchemaLocation values in the XML contents. This is done by running the script define-xml-schemas. This script will create a PREMIS representation type object containing all the mapped values to the schema files. The script is given pairs consisting of a URI reference, corresponding to the schemaLocation or noNamespaceSchemaLocation value, and the relative path to the schema file, by using the required --uri_pairs option:

define-xml-schemas --uri_pairs http://localhost/my_schema.xsd file://schemas/my_schema.xsd --workspace ./workspace

The --uri_pairs option is repeatable for all schemas to be included in the SIP.
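With several schemas, the repeated option can be generated from a mapping. The following is a minimal Python sketch with a hypothetical helper name. The first pair is the one from the example above and the second pair is invented for illustration.

```python
import shlex


def define_xml_schemas_command(uri_to_path, workspace="./workspace"):
    """Build a define-xml-schemas call with one --uri_pairs per schema."""
    command = ["define-xml-schemas"]
    for uri, schema_path in uri_to_path.items():
        command += ["--uri_pairs", uri, schema_path]
    command += ["--workspace", workspace]
    return command


schemas = {
    "http://localhost/my_schema.xsd": "file://schemas/my_schema.xsd",
    # Hypothetical second schema.
    "http://localhost/my_other_schema.xsd":
        "file://schemas/my_other_schema.xsd",
}
command = define_xml_schemas_command(schemas)
print(shlex.join(command))
```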

Note that these schema files must also be imported as digital objects with the import-object script, using the --supplementary option to mark them as supplementary.

Additional notes

This software is able to collect metadata and check well-formedness of a limited set of file formats. Please see the file-scraper repository for more information.

The Pre-Ingest Tool does not support well-formedness checks of the following file formats:

  • text/csv file
  • text/xml file against XML schema or schematron files

Should you append these files to your workspace, use the --skip_wellformed_check argument on them.

Installation using Python Virtualenv for development purposes

The packages python3-devel, openssl-devel, swig and gcc are required on your system to install M2Crypto, which is used for signing the packages with a digital signature.

Create a virtual environment:

python3 -m venv venv

Run the following to activate the virtual environment:

source venv/bin/activate

Install the required software with the following commands:

pip install --upgrade pip==20.2.4 setuptools
pip install -r requirements_github.txt
pip install .

See the README from file-scraper repository for additional installation requirements: https://github.com/Digital-Preservation-Finland/file-scraper/blob/master/README.rst

To deactivate the virtual environment, run deactivate. To reactivate it, run the source command above.

Copyright

Copyright (C) 2018 CSC - IT Center for Science Ltd.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.