Islandoracon 2017 Post-Conference Session "Move to Islandora Kit: For all your migration and batch loading preparation needs"

MIK workshop at Islandoracon 2017

Workshop overview

  • Your instructors
    • Marcus Barnes, The Digital Scholarship Unit (DSU) at the University of Toronto Scarborough Library
    • Mark Jordan, Simon Fraser University Library
  • Duration: 3 hours
    • First half hour: installing MIK
    • Next 2 hours: hands-on exercises
    • Last half hour: discussion
  • Outcomes
    • Use MIK to create Islandora import packages from CSV metadata
    • Use MIK to create Islandora import packages from data harvested via OAI-PMH

MIK in a nutshell

MIK is an example of an "Extract, Transform, Load" application. However, it only extracts and transforms. It delegates the loading of content into Islandora to the existing batch modules. In other words, MIK prepares and transforms your content into ingest packages ready to hand off to the batch modules for loading into Islandora:

[Figure: MIK overview]

Toolchains

MIK is designed to be flexible, configurable, and extensible. It achieves these goals by breaking down the parts of the "extract and transform" process into components. Each set of components is known as a "toolchain". You describe the components, and their configuration options, in an .ini file. When you run MIK, you tell it which .ini file to use. There is a CONTENTdm Books toolchain, a CSV Single File toolchain, and several OAI-PMH toolchains.

Internally, MIK breaks the task of converting the input data into Islandora import packages down into discrete subtasks as illustrated in the diagram below:

[Figure: MIK details]

  • Fetchers query a data source to determine how many objects are to be imported, and perform some additional setup for the subsequent tasks.
  • Fetcher manipulators filter out items from the entire set of data retrieved by the fetcher. For example, you may only want to fetch book objects from a CONTENTdm collection that also contains images.
  • Metadata parsers get the metadata for an object and convert it into a format that Islandora can use, such as MODS XML.
  • File getters retrieve the content files associated with an object to be imported.
  • File getter manipulators provide a way to configure file getters to look in specific locations for files.
  • Writers save the converted content to disk in a directory structure that can be used by the standard Islandora batch import modules. After a writer has written out its package, it can initiate one or more post-write hooks (described below) that perform actions on the content in the packages.
  • File manipulators perform some processing on the files retrieved by file getters.
  • Metadata manipulators can modify or supplement the metadata XML file generated by metadata parsers.

Manipulators

Manipulators are MIK plugins that let you perform tasks at specific times in the MIK execution lifecycle, or change how fetchers, file getters, and metadata parsers work. All the code for a manipulator is encapsulated in a single PHP class file. Manipulators are registered in the [MANIPULATORS] section of the MIK configuration file, and may take parameters. Each entry names the group the manipulator belongs to, followed by an equals sign, followed by the manipulator's parameters, which are delimited by the pipe symbol (|). The first parameter is always the name of the manipulator. In the following example, the "NormalizeDate" manipulator is registered with the parameters "Date", "dateIssued", and "m":

metadatamanipulators[] = "NormalizeDate|Date|dateIssued|m"
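To make the structure of such an entry concrete, here is a minimal Python sketch that splits a registration string on the pipe symbol. The function name parse_manipulator_entry is hypothetical and this is not MIK's own parsing code (MIK is written in PHP); it only illustrates the "name first, parameters after" convention described above.

```python
def parse_manipulator_entry(entry):
    """Split a manipulator registration string into its name and parameters."""
    parts = entry.split("|")
    # The first pipe-delimited value is the manipulator name;
    # everything after it is a parameter passed to that manipulator.
    return parts[0], parts[1:]


name, params = parse_manipulator_entry("NormalizeDate|Date|dateIssued|m")
print(name)    # NormalizeDate
print(params)  # ['Date', 'dateIssued', 'm']
```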

The most commonly used manipulators include:

| Type | Manipulator | Function | Toolchains |
| --- | --- | --- | --- |
| Fetcher | SpecificSet | Limits objects to those named in a list. | CSV, CONTENTdm |
| Fetcher | RandomSet | Limits objects to a random set of a specific size. | CSV, CONTENTdm |
| Fetcher | CdmSingleFileByExtension | Limits objects to those with files of specific extensions. | CONTENTdm |
| Metadata | SplitRepeatedValues | Splits values in a single field into separate MODS elements. | CSV, CONTENTdm |
| Metadata | NormalizeDate | Converts dates into yyyy-mm-dd or yyyy-mm. | CSV, CONTENTdm |
| Metadata | SimpleReplace | Search and replace strings in MODS elements. | CSV, CONTENTdm |
| Metadata | InsertXmlFromTemplate | Generates MODS XML fragments from external templates. | CSV, CONTENTdm |

Each manipulator has its own wiki page, which explains its function and parameters.

Some MIK use cases

  • Migrating from another repository
  • Preparing content for ingestion into Islandora
    • MIK can read a CSV metadata file and generate Islandora import packages for each object described in it. You can create packages for images/PDFs/videos, books, newspaper issues, and simple compound objects from CSV files.
  • Automated ingestion workflows
    • MIK has been scripted to convert content saved in watch folders into Islandora import packages. Running MIK as a timed (e.g., cron) job on this content allows you to automate content ingestion.

MIK's documentation

MIK's wiki is the chief source of documentation. Sections we'd like to highlight are:

  • Toolchain documentation
    • Detailed guides to configuring and using the various CSV, CONTENTdm, and OAI-PMH toolchains.
  • Manipulator documentation
    • Detailed guides to configuring and using MIK's manipulators.
  • The MIK Cookbook
    • A set of short "how to" recipes documenting how to accomplish specific tasks using MIK.
  • The MIK tutorial
    • A self-paced tutorial that takes you through the process of generating Islandora import packages for a set of five photos.
  • Migration guides
    • Several detailed guides exist describing how to use MIK to migrate from repository platforms such as CONTENTdm and Digital Commons.

Installing MIK

  • PHP 5.5.0 (or higher) command-line interface (CLI)
  • Composer
    1. curl -sS https://getcomposer.org/installer | php
    2. php composer.phar install

Configuring MIK

MIK uses .ini files to store configuration details. An MIK .ini file contains the following sections:

[SYSTEM]
; This section is used to define PHP configuration options required
; by MIK but not defined in the system's php.ini file. Not necessary
; on all systems.

[CONFIG]
; Contains information about the .ini file and toolchain.

[FETCHER]
; Contains configuration options for the fetcher, which gets a list
; of items to process.

[METADATA_PARSER]
; Contains configuration options for the metadata parser, which converts
; data in the input list to MODS.

[FILE_GETTER]
; Contains configuration options for the file getter, which retrieves the file
; (image, PDF, video, etc.) to include in the Islandora import package.

[WRITER]
; Contains configuration options for the writer, which writes the import packages
; to the output directory.

[MANIPULATORS]
; Contains entries for manipulators, which are MIK's plugins. Most manipulator
; entries contain parameters.

[LOGGING]
; Contains paths to the log files that MIK creates.

Each section contains configuration options for the components of a toolchain.
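As a quick sanity check while drafting an .ini file, you could list which of the sections above are absent. The following is a minimal sketch that uses Python's configparser as a stand-in; MIK itself reads its configuration with PHP, so this is not how MIK parses the file. The function name sections_not_present and the sample text are hypothetical. Note that an absent section is not necessarily an error; [SYSTEM], for example, is not needed on all systems.

```python
import configparser

# Section names taken from the list above.
DOCUMENTED_SECTIONS = [
    "SYSTEM", "CONFIG", "FETCHER", "METADATA_PARSER",
    "FILE_GETTER", "WRITER", "MANIPULATORS", "LOGGING",
]


def sections_not_present(ini_text):
    """Return the documented section names that do not appear in ini_text."""
    # strict=False tolerates repeated keys such as "postwritehooks[]";
    # interpolation=None leaves any "%" characters in values untouched.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ini_text)
    return [name for name in DOCUMENTED_SECTIONS if not parser.has_section(name)]


sample = """
[CONFIG]
config_id = example

[FETCHER]
class = Csv
"""
absent = sections_not_present(sample)
print(absent)
```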

MIK works well on Linux, OS X, and Windows. The one place where .ini files differ across platforms is how file paths are expressed. Most .ini files contain file system paths for the location of the input CSV file (for CSV toolchains), the temporary directory, the metadata mappings file, and the output directory. For example:

[FILE_GETTER]
; On Linux
; input_directory = "/home/mark/Downloads/mik_tutorial_data"
; On Mac
; input_directory = "/Users/mark/mik_tutorial_data"
; On Windows
; input_directory = "c:\temp\mik_tutorial_data"

Metadata mappings

The CONTENTdm and CSV toolchains use a mapping file to define what input field or column names map to specific MODS elements. Mapping files look like this:

Title,<titleInfo><title>%value%</title></titleInfo>
Author,<name type='personal'><namePart>%value%</namePart><role><roleTerm type='text'>creator</roleTerm></role></name>
Date,<originInfo><dateIssued encoding='w3cdtf'>%value%</dateIssued></originInfo>
Subjects,<subject><topic>%value%</topic></subject>
Identifier,<identifier type='local' displayLabel='Local identifier'>%value%</identifier>
null0,<genre authority='marcgt'>article</genre>
null1,<typeOfResource>text</typeOfResource>

One way to create mapping files is the Metadata Mappings Helper, a simple Google Sheets application that provides a drop-down list of common MODS snippets.

[Figure: MIK Metadata Mappings Helper screenshot]
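To illustrate how a mapping file relates input fields to MODS snippets, here is a minimal Python sketch that substitutes a record's values for the %value% placeholder. It is not MIK's implementation: the function names are hypothetical, and the choices to XML-escape values and to skip fields with no value are assumptions made for this example.

```python
import csv
import io
from xml.sax.saxutils import escape

# Three rows in the same form as the mapping file shown above.
MAPPINGS = """Title,<titleInfo><title>%value%</title></titleInfo>
Subjects,<subject><topic>%value%</topic></subject>
null0,<genre authority='marcgt'>article</genre>
"""


def load_mappings(text):
    """Return (source field, MODS snippet) pairs from mapping-file text."""
    return [(row[0], row[1]) for row in csv.reader(io.StringIO(text)) if row]


def apply_mappings(mappings, record):
    """Build the list of MODS snippets for one input record."""
    snippets = []
    for field, snippet in mappings:
        if "%value%" not in snippet:
            # Rows such as null0 carry a fixed snippet with no placeholder.
            snippets.append(snippet)
        elif record.get(field):
            snippets.append(snippet.replace("%value%", escape(record[field])))
    return snippets


record = {"Title": "One: The Documentary", "Subjects": "Dogs"}
for snippet in apply_mappings(load_mappings(MAPPINGS), record):
    print(snippet)
```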

Post-write hooks

MIK can run scripts after it has written an import package to disk. These scripts are called "post-write hooks" and are enabled in the .ini file's [WRITER] section like this:

[WRITER]
...
postwritehooks[] = "/usr/bin/php extras/scripts/postwritehooks/validate_mods.php"
postwritehooks[] = "/usr/bin/php extras/scripts/postwritehooks/generate_fits.php"
postwritehooks[] = "/usr/bin/php extras/scripts/postwritehooks/object_timer.php"
; postwritehooks[] = "/usr/bin/php extras/scripts/postwritehooks/sample.php"
; postwritehooks[] = "/usr/bin/python extras/scripts/postwritehooks/sample.py"

The scripts listed above are included in the MIK GitHub repository as examples. The ones named 'sample' illustrate some basic ways of using post-write hooks. Three complete, functional scripts, extras/scripts/postwritehooks/validate_mods.php, extras/scripts/postwritehooks/generate_fits.php, and extras/scripts/postwritehooks/object_timer.php, do useful things, as suggested by their names.

A good example of how a post-write hook script can be used is to produce FITS output for newspaper page objects. As soon as MIK finishes creating the package, the generate_fits.php script runs FITS against each of the child OBJ datastream files and writes out its output for each one to TECHMD.xml within the page folder. This file is then loaded by the Islandora Newspaper Batch module, ending up as the TECHMD datastream for each page object. Another very useful example is validate_mods.php, which validates each MODS.xml file produced by MIK and writes out the result of the validation to the MIK log file.

Post-write hook scripts run as background processes, which means that they do not need to finish running before MIK moves on to process the next object. This speeds up MIK considerably.
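The effect of running a hook in the background can be sketched in Python as follows. This is an illustration of the general technique only, not MIK's own code; the "slow hook" is a stand-in child process that sleeps for one second, and start_hook is a hypothetical name.

```python
import subprocess
import sys
import time


def start_hook(command):
    """Start a hook command without waiting for it to finish."""
    return subprocess.Popen(command)


# Stand-in for a slow hook script: a child process that sleeps for one second.
slow_hook = [sys.executable, "-c", "import time; time.sleep(1)"]

started = time.monotonic()
process = start_hook(slow_hook)
launch_seconds = time.monotonic() - started  # control returns before the child exits

process.wait()  # only for this demonstration, so the child is cleaned up
total_seconds = time.monotonic() - started
print(f"launch took {launch_seconds:.2f}s, hook finished after {total_seconds:.2f}s")
```

Because the caller does not wait, it can move on to the next object while the hook is still working, which is where the time saving comes from.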

Shutdown hooks

MIK can run scripts after it has completed processing all packages. These scripts are called "shutdown hooks" and are enabled in the .ini file's [WRITER] section like this:

[WRITER]
shutdownhooks[] = "/usr/bin/php extras/scripts/shutdownhooks/apply_xslt_with_saxon.php"
shutdownhooks[] = "/usr/bin/php extras/scripts/shutdownhooks/delete_temp_files.php"

These scripts differ from post-write hooks, which run after MIK writes the import package for each object, in that they run after all packages have been generated. They are useful for cleanup tasks, for example.
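As an example of the kind of cleanup task a shutdown hook might perform, here is a minimal Python sketch that removes a temporary directory. The function name delete_temp_directory is hypothetical, and how a real hook would learn which directory to remove (for instance, by reading the .ini file) is left out as an assumption; the demonstration uses a throwaway directory.

```python
import shutil
import tempfile
from pathlib import Path


def delete_temp_directory(path):
    """Remove a directory and its contents; return True if it existed."""
    directory = Path(path)
    if not directory.is_dir():
        return False
    shutil.rmtree(directory)
    return True


# Demonstration on a throwaway directory standing in for a temp_directory.
demo_directory = Path(tempfile.mkdtemp())
(demo_directory / "leftover.tmp").write_text("temporary data")
removed = delete_temp_directory(demo_directory)
print(removed, demo_directory.exists())
```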

MIK's log files

MIK writes out log files to record errors or warnings, and to document actions it takes while processing your content. The log files contain information that will help you track down problems, and also contain information that will let you decide if the various components of MIK, particularly its metadata manipulators, are working the way you expect.

MIK's log files are particularly useful when you are testing various configuration options, before you run MIK that one final time to generate your ingest packages for loading into Islandora. For example, the input_validators.log will point out missing or unexpected files in your input, a problem you will want to rectify before generating your packages for loading.

| Log file | Purpose |
| --- | --- |
| mik.log | All warnings and errors encountered by fetchers, file getters, metadata parsers, and writers are logged here. |
| problem_records.log | If any errors occur that block or interfere with the creation of an ingest package, the problem records log will contain entries documenting the record keys of the problematic objects. |
| input_validator.log | If MIK's input validators detect any problems, they write entries to the input validation log describing why validation failed. |
| manipulator.log | Contains entries that document what various manipulators are doing. The most common type of entry in this log is generated by metadata manipulators. |

The Interpreting MIK's log files entry in the MIK Cookbook provides detailed information on how to use log files.
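When a test run produces a long log, a quick tally of warnings and errors can help you decide where to look first. The sketch below is an assumption-laden illustration: it presumes that each log line contains its level as an upper-case word (WARNING or ERROR), and the sample lines are made up for this example rather than copied from real MIK output. Check your own log files and adjust the pattern to match.

```python
import re
from collections import Counter

# Assumption: each log line names its level as an upper-case word.
LEVEL_PATTERN = re.compile(r"\b(WARNING|ERROR)\b")


def count_problem_lines(log_text):
    """Count log lines that mention a WARNING or ERROR level."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LEVEL_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up lines for illustration only; not actual MIK log output.
SAMPLE_LOG = """[2017-03-20 10:00:00] example.INFO: Started
[2017-03-20 10:00:01] example.WARNING: Something looks odd
[2017-03-20 10:00:02] example.ERROR: Something failed
"""
counts = count_problem_lines(SAMPLE_LOG)
print(dict(counts))
```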

Running MIK

  • php mik -c test.ini -cc all
  • php mik -c test.ini -l 10
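If you want to script repeated test runs, the second command above can be assembled programmatically. This is a minimal sketch: build_mik_command is a hypothetical helper, and it only uses the -c and -l options exactly as they appear in the commands above (we read -l as limiting the number of objects processed; confirm this against MIK's documentation).

```python
def build_mik_command(ini_path, limit=None):
    """Return the argument list for running MIK against one .ini file."""
    command = ["php", "mik", "-c", ini_path]
    if limit is not None:
        command += ["-l", str(limit)]
    return command


print(" ".join(build_mik_command("test.ini", limit=10)))  # php mik -c test.ini -l 10
```

The resulting list can be passed to subprocess.run(command, check=True) from within the MIK directory, for example inside a cron-driven script.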

Workflow

  • configure, test (random set, specific set, etc.), reconfigure, retest.
    • start with metadata mappings
    • manipulators
    • post-write hooks
  • automating production workflows

Hands-on activities

Creating Islandora import packages from CSV data

Outcome

  • Generate a set of import packages from CSV metadata.

Background

The CSV input file:

Identifier,File,Title,Author,Date,Subjects
doc01,the_documentary.pdf,"One: The Documentary","Ji-Hu Maruška",2014,"Numbers;Documentaries"
doc02,case_study.pdf,"2: A Case Study","Iman Valenta",2012,"Case studies;Duology"
doc03,user_manual.pdf,"3: The User Manual","Hanne Darzi",2011,"Instructional manuals"
doc04,any_way.pdf,"4: Any Way You Want It","Arya Kovac",2014,"Quarterly studies;Journey (band)"
doc05,best_friend.pdf,"5: Everybody's Best Friend","Nuka Kratochvil",2001,"Dogs"

The mappings file:

Title,<titleInfo><title>%value%</title></titleInfo>
Author,<name type='personal'><namePart>%value%</namePart><role><roleTerm type='text'>creator</roleTerm></role></name>
Date,<originInfo><dateIssued encoding='w3cdtf'>%value%</dateIssued></originInfo>
Subjects,<subject><topic>%value%</topic></subject>
Identifier,<identifier type='local' displayLabel='Local identifier'>%value%</identifier>
null0,<genre authority='marcgt'>article</genre>
null1,<typeOfResource>text</typeOfResource>

The .ini file:

; MIK configuration file used during the workshop.

[SYSTEM]

[CONFIG]
config_id = MIK workshop
last_updated_on = "2017-03-20"
last_update_by = "mj"

[FETCHER]
class = Csv
input_file = "data/metadata.csv"
temp_directory = "/tmp/mik_workshop_temp"
record_key = Identifier

[METADATA_PARSER]
class = mods\CsvToMods
mapping_csv_path = "data/mappings.csv"

[FILE_GETTER]
class = CsvSingleFile
input_directory = data
temp_directory = "/tmp/mik_workshop_temp"
file_name_field = File

[WRITER]
class = CsvSingleFile
output_directory = "/tmp/mik_workshop_output"
; postwritehooks[] = "/usr/bin/php extras/scripts/postwritehooks/validate_mods.php"
; During testing, it's a good idea to create only the MODS XML files.
; datastreams[] = "MODS"

[MANIPULATORS]
metadatamanipulators[] = "SplitRepeatedValues|Subjects|/subject/topic|;"

[LOGGING]
path_to_log = "/tmp/mik_workshop_output/mik.log"
path_to_manipulator_log = "/tmp/mik_workshop_output/manipulator.log"
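The SplitRepeatedValues entry in the [MANIPULATORS] section above tells MIK to split the Subjects column on semicolons. The following Python sketch shows the intended effect on a simplified, namespace-free MODS fragment. It is not the manipulator's actual code: the function split_repeated_values is hypothetical and only handles a two-level path such as /subject/topic.

```python
import xml.etree.ElementTree as ET


def split_repeated_values(mods, path, delimiter):
    """Replace each parent/child element pair with one copy per delimited value."""
    parent_tag, child_tag = path.strip("/").split("/")
    for parent in list(mods.findall(parent_tag)):
        child = parent.find(child_tag)
        if child is None or child.text is None or delimiter not in child.text:
            continue
        position = list(mods).index(parent)
        mods.remove(parent)
        # Insert one new parent element per value, preserving document order.
        for offset, value in enumerate(child.text.split(delimiter)):
            new_parent = ET.Element(parent_tag, parent.attrib)
            ET.SubElement(new_parent, child_tag, child.attrib).text = value.strip()
            mods.insert(position + offset, new_parent)
    return mods


mods = ET.fromstring(
    "<mods><subject><topic>Numbers;Documentaries</topic></subject></mods>"
)
split_repeated_values(mods, "/subject/topic", ";")
print(ET.tostring(mods, encoding="unicode"))
```

With the workshop data, the value "Numbers;Documentaries" for doc01 would end up as two separate subject elements, one per topic.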

Steps required to achieve the outcome

1. Create your mappings file
2. Create your .ini file
3. Test
4. When ready, generate your ingest packages
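Before the "Test" step, it can be worth checking the CSV input yourself. The sketch below is a hypothetical pre-flight check, separate from MIK's own input validators: it assumes that record keys should be unique and that every file named in the File column should exist in the input directory. The third data row in the sample is made up to trigger the checks.

```python
import csv
import io
import tempfile
from pathlib import Path


def check_csv_input(csv_text, record_key, file_name_field, input_directory):
    """Return a list of problems found in the CSV input."""
    problems = []
    seen = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[record_key]
        if key in seen:
            problems.append(f"duplicate record key: {key}")
        seen.add(key)
        if not (Path(input_directory) / row[file_name_field]).is_file():
            problems.append(f"missing file for {key}: {row[file_name_field]}")
    return problems


SAMPLE_CSV = """Identifier,File,Title
doc01,the_documentary.pdf,"One: The Documentary"
doc02,case_study.pdf,"2: A Case Study"
doc02,case_study.pdf,"Duplicate row"
"""

# Use a throwaway directory that contains only the first file.
with tempfile.TemporaryDirectory() as input_directory:
    (Path(input_directory) / "the_documentary.pdf").write_bytes(b"")
    problems = check_csv_input(SAMPLE_CSV, "Identifier", "File", input_directory)

for problem in problems:
    print(problem)
```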

Creating Islandora import packages from data harvested from another repository via OAI-PMH

Outcome

  • Generate a set of import packages from objects in an OAI-PMH repository.

Background

OAI-PMH is commonly used for harvesting metadata for aggregated searching or similar purposes. MIK can use a repository's OAI metadata to serve as the basis for creating Islandora import packages. MIK offers a couple of OAI-PMH toolchains. In this workshop, we'll be using one that fetches objects from an Islandora instance.

Why harvest content from one Islandora instance to load into another? There are some legitimate reasons to do this, but in this workshop, we do it for illustrative purposes only. Islandora has a robust OAI-PMH provider, and since many Islandora instances implement it, it serves as a useful learning environment. Outside of this workshop, you might use MIK's OAI-PMH harvesting abilities to migrate from a Digital Commons or Vital repository, for example.

The .ini file:

; MIK configuration file for an OAI-PMH toolchain.

[CONFIG]
config_id = oaitest
last_updated_on = "2017-03-20"
last_update_by = "mj"

[SYSTEM]

[FETCHER]
class = Oaipmh
oai_endpoint = "http://digital.lib.sfu.ca/oai2"
set_spec = hiv_collection
temp_directory = "/tmp/oaitest_temp"

[METADATA_PARSER]
class = dc\OaiToDc

[FILE_GETTER]
class = OaipmhIslandoraObj
temp_directory = "/tmp/oaitest_temp"

[WRITER]
class = Oaipmh
output_directory = "/tmp/oaitest_output"
datastream_ids[] = OBJ
datastream_ids[] = PDF

[MANIPULATORS]

[LOGGING]
path_to_log = "/tmp/oaitest_output/mik.log"
path_to_manipulator_log = "/tmp/oaitest_output/manipulator.log"
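To see what the oai_endpoint and set_spec settings above refer to, it helps to look at a plain OAI-PMH request. The Python sketch below builds a standard ListRecords URL and pulls record identifiers out of a response. It is an illustration of the protocol, not of MIK's internals: the function names are hypothetical, the choice of the oai_dc metadata prefix is an assumption based on the Dublin Core parser named in the .ini file, and the sample response is a made-up, trimmed-down document.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(endpoint, set_spec, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for one set."""
    query = urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix, "set": set_spec}
    )
    return f"{endpoint}?{query}"


def record_identifiers(response_xml):
    """Return the header identifiers found in a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]


url = list_records_url("http://digital.lib.sfu.ca/oai2", "hiv_collection")
print(url)

# Made-up, trimmed response used so the example runs without network access.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:item-1</identifier></header></record>
    <record><header><identifier>oai:example.org:item-2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""
identifiers = record_identifiers(SAMPLE_RESPONSE)
print(identifiers)
```

Opening the printed URL in a browser is a simple way to confirm that a repository's OAI-PMH provider responds and that the set you chose contains records.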

Steps required to achieve the outcome

1. Create your .ini file
  • Find a small image collection in an Islandora repository that implements the OAI-PMH provider.
2. Test
3. When ready, generate your ingest packages

License

This workshop material is licensed under a Creative Commons Attribution 4.0 International License.