Command-line tool to ingest Islandora objects using Islandora's REST interface.
- On the target Islandora instance
- Islandora REST
- Islandora REST Authen
- Optionally, Islandora REST Extras (see "Generating DC XML" below for more information).
- On the system where the script is run
- PHP 5.5.0 or higher.
- Composer
git clone https://github.com/mjordan/islandora_rest_ingester.git
cd islandora_rest_ingester
php composer.phar install
(or equivalent on your system, e.g., ./composer install)
Islandora Batch, Islandora Book Batch, and Islandora Newspaper Batch are the standard go-to tools for ingesting large amounts of content into Islandora. Other batch ingest modules also exist, such as Islandora Compound Batch. The command-line interfaces to these tools enable ingestion of thousands of objects at a time and also allow for scripted ingests, for example in automated workflows. However, they all need to be run as drush commands on the Islandora server.
The Islandora REST Ingester offers the ability to ingest content from any location that has HTTP access to your Islandora server. Some use cases for this ability include:
- the content is prepared by external partners (service providers, other libraries, etc.) and you want to allow them to ingest that content
- for security policy reasons, it is problematic to have people logging into your Islandora server to run drush commands
- during batch ingest, you will need to have enough disk space on your Islandora server for both the raw input data and the copies in Islandora created during ingestion (in other words, double the disk space taken up by your content)
- in automated ingestion workflows, moving content from where it is being digitized and processed to the filesystem of your Islandora server is problematic
Secondarily, ingestion tools that use Islandora's REST interface demonstrate the potential for the creation of desktop tools with graphical user interfaces (!) for ingesting content into Islandora, and for thinking about strategies and tools for batch ingesting content into Islandora CLAW, which has a REST interface.
One significant advantage that the drush-based batch modules have over the Islandora REST Ingester is that they can ingest datastream files that exceed the Islandora server's maximum file upload setting. This setting is configurable but has practical limits. Because the Islandora REST Ingester ingests objects over HTTP, it is subject to this maximum file size.
The Islandora REST Ingester provides an option, --max_file_size, that skips any datastream file larger than the specified number of mebibytes (MiB). All datastreams skipped for this reason are logged.
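To see in advance which files would be skipped by this limit, a check along the following lines can be run against the input directory before ingesting. This is only a sketch: the function name list_oversize_files is hypothetical and not part of the REST Ingester, and it assumes the limit is compared against the file's size in bytes.

```shell
# Hypothetical helper, not part of the REST Ingester: list files under an
# input directory that are larger than a given number of MiB.
list_oversize_files() (
  input_dir=$1
  max_mib=$2
  # find's "c" suffix means bytes; +N matches files strictly larger than N.
  find "$input_dir" -type f -size +"$((max_mib * 1024 * 1024))"c
)

# Usage (prints one path per line, nothing if all files are within the limit):
#   list_oversize_files testinput 500
```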
The REST Ingester does not use Islandora Batch's batch sets.
Currently, this tool ingests single-file Islandora objects (basic and large image, PDF, video, etc.), collection objects, compound objects, book objects, and newspaper issue objects (not newspaper objects).
Single-file objects include all content models that have no child objects. To prepare your content for ingesting, within the input directory, create subdirectories for each object. Within each, put a MODS.xml file and the file intended to be the OBJ datastream. This file should be named 'OBJ' and have whichever extension is appropriate for its content. Subdirectories that do not contain a MODS.xml file are skipped:
sampleinput/
├── foo
│ ├── MODS.xml
│ └── OBJ.png
├── bar
│ ├── MODS.xml
│ └── OBJ.jpg
├── empty
└── baz
  ├── MODS.xml
  ├── TN.png
  └── OBJ.jpg
You may add whatever additional datastream files you want to the object directories. For example, if you want to pregenerate FITS output for each object, you can add TECHMD.xml and it will be ingested as the TECHMD datastream. Another common use for ingesting pregenerated datastream files is custom thumbnails.
If a datastream already exists (for example, a TN created as a derivative), and there is a datastream file in the input directory that would otherwise trigger the ingestion of the datastream, the datastream's content is updated from the file. The check for the existence of the datastream is logged (HTTP response code 200 if it exists, 404 if it does not).
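Because subdirectories without a MODS.xml file are skipped, it can help to list them before running the ingester. The sketch below does this for single-file input; the function name report_skipped_dirs is hypothetical, and it only checks for the presence of MODS.xml, not for its validity.

```shell
# Hypothetical helper, not part of the REST Ingester: report object
# directories that lack a MODS.xml file and so would be skipped.
report_skipped_dirs() (
  input_dir=$1
  for dir in "$input_dir"/*/; do
    # If the glob matched nothing, it stays literal and is not a directory.
    [ -d "$dir" ] || continue
    dir=${dir%/}
    if [ ! -f "$dir/MODS.xml" ]; then
      printf 'no MODS.xml: %s\n' "$dir"
    fi
  done
)

# Usage:
#   report_skipped_dirs sampleinput
```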
For compound objects, each parent object should be in its own directory, and within that directory, each child should be in its own subdirectory. The sequence of the children within the compound is determined by the numbering of the child subdirectories:
input/
├── foo
│ ├── 1
│ │ ├── MODS.xml
│ │ └── OBJ.jpg
│ ├── 2
│ │ ├── MODS.xml
│ │ └── OBJ.jpg
│ └── MODS.xml
└── bar
  ├── 1
  │ ├── MODS.xml
  │ └── OBJ.tif
  ├── 2
  │ ├── MODS.xml
  │ ├── cmodel.txt
  │ └── OBJ.bin
  └── MODS.xml
In this example, the file 'cmodel.txt' contains the PID of the content model to assign to the child object (see "Specifying the content model" below for more information).
Each book object should be in its own directory, and within that directory, each page should be in its own subdirectory. The sequence of the pages within the book (and the labels of page objects) is determined by the numbering of the page subdirectories:
input/
├── foo
│ ├── 1
│ │ └── OBJ.tiff
│ ├── 2
│ │ └── OBJ.tiff
│ ├── 3
│ │ └── OBJ.tiff
│ ├── 4
│ │ └── OBJ.tiff
│ └── MODS.xml
└── bar
  ├── 1
  │ └── OBJ.tiff
  ├── 2
  │ └── OBJ.tiff
  ├── 3
  │ └── OBJ.tiff
  ├── 4
  │ └── OBJ.tiff
  └── MODS.xml
Page directories can contain OCR.txt files or any other datastream files. If a MODS.xml file is present, the page's label is taken from its title element.
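If page images start out as a flat directory of files, the numbered page layout described above can be produced with a short script. The following is a sketch under stated assumptions: the function name package_pages is hypothetical, the source filenames are assumed to sort into page order (for example, zero-padded names) and to have an extension, and the book-level MODS.xml still has to be added separately.

```shell
# Hypothetical helper, not part of the REST Ingester: copy a flat directory
# of page images into numbered page subdirectories, each with an OBJ file.
package_pages() (
  src_dir=$1
  book_dir=$2
  n=0
  mkdir -p "$book_dir"
  # Glob results are sorted, so page order follows filename order.
  for f in "$src_dir"/*; do
    [ -f "$f" ] || continue
    n=$((n + 1))
    name=${f##*/}
    ext=${name##*.}
    mkdir -p "$book_dir/$n"
    cp "$f" "$book_dir/$n/OBJ.$ext"
  done
)

# Usage:
#   package_pages /path/to/scans input/foo
```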
Newspaper issues are arranged the same way as books. Each issue should be in its own directory, and within that directory, each page should be in its own subdirectory. The sequence of the pages within the issue (and the labels of page objects) is determined by the numbering of the page subdirectories:
input/
├── foo
│ ├── 1
│ │ └── OBJ.tiff
│ ├── 2
│ │ └── OBJ.tiff
│ ├── 3
│ │ └── OBJ.tiff
│ ├── 4
│ │ └── OBJ.tiff
│ └── MODS.xml
└── bar
  ├── 1
  │ └── OBJ.tiff
  ├── 2
  │ └── OBJ.tiff
  ├── 3
  │ └── OBJ.tiff
  ├── 4
  │ └── OBJ.tiff
  └── MODS.xml
Page directories can contain OCR.txt files or any other datastream files. If a MODS.xml file is present, the page's label is taken from its title element.
php ingest.php [options] INPUT_DIR
For example,
php ingest.php -l mylog.log -e http://localhost:8000/islandora/rest/v1 -m islandora:sp_basic_image -p rest:collection -n rest -o admin -u admin -t admin testinput
INPUT_DIR
Required. Absolute or relative path to a directory containing Islandora import packages. Trailing slash is optional.
-e/--endpoint <argument>
Fully qualified REST endpoint for the Islandora instance. Default is http://localhost/islandora/rest/v1.
-m/--cmodel <argument>
Required. PID of the object's content model.
-n/--namespace <argument>
Object's namespace. If you provide a full PID, it will be used. If you do not include this option, the ingester assumes that each object-level input directory encodes the object PIDs, and will ingest objects using those PIDs.
-o/--owner <argument>
Required. Object's owner.
-s/--state <argument>
Object state. Default is A (active). Other allowed values are I (inactive) and D (deleted).
-p/--parent <argument>
Required. PID of the object's parent collection, book, newspaper issue, compound object, etc.
-r/--relationship <argument>
Predicate describing relationship of object to its parent. Default is isMemberOfCollection.
-c/--checksum_type <argument>
Checksum type to apply to datastreams. Use "none" to not apply checksums. Default is SHA-1.
-z/--max_file_size <argument>
Maximum size, in MiB, of datastream files to ingest. If a file is larger than this, its datastream is not ingested. Default is 500 MiB.
-l/--log <argument>
Path to the log. Default is ./rest_ingest.log
-g/--plugins <argument>
A comma-separated list of plugin names.
-t/--token <argument>
Required. REST authentication token.
-u/--user <argument>
Required. REST user name.
-d/--delete_input
Whether or not to delete the input files for an object after they have been successfully ingested.
--help
Show the help page for this script.
Please note:
- when ingesting compound objects, the value of the --cmodel option should be "islandora:compoundCModel".
- when ingesting books, the value of the --cmodel option should be "islandora:bookCModel".
- when ingesting newspaper issues, the value of the --cmodel option should be "islandora:newspaperIssueCModel", and the value of the --parent option should be the PID of the newspaper object. You do not need to include the --relationship argument.
- operating system junk files 'Thumbs.db' and 'DS_Store' are ignored.
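In a wrapper script, the content model values from the notes above can be looked up by object type. This is a minimal sketch: the function name cmodel_for and the type keywords (compound, book, newspaper_issue) are hypothetical labels chosen for illustration, while the content model PIDs are the ones listed above.

```shell
# Hypothetical helper, not part of the REST Ingester: map an object type
# keyword to the --cmodel value given in the notes above.
cmodel_for() {
  case $1 in
    compound) echo 'islandora:compoundCModel' ;;
    book) echo 'islandora:bookCModel' ;;
    newspaper_issue) echo 'islandora:newspaperIssueCModel' ;;
    *) echo "unknown object type: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   php ingest.php -m "$(cmodel_for book)" [other options] INPUT_DIR
```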
The log file records when the Islandora REST Ingester was run, what objects and datastreams it ingested, and checksum verifications (if checksums were enabled on datastreams). It also records any exceptions thrown during REST requests:
[2017-07-17 07:12:35] Islandora REST Ingester.INFO: ingest.php (endpoint http://localhost:8000/islandora/rest/v1) started at July 17, 2017, 7:12 am [] []
[2017-07-17 07:12:35] Islandora REST Ingester.WARNING: /home/mark/Documents/hacking/islandora_rest_scripts/ingest_islandora_objects_via_rest/testinput/bar appears to be empty, skipping. [] []
[2017-07-17 07:12:35] Islandora REST Ingester.INFO: Object rest:172 ingested from /home/mark/Documents/hacking/islandora_rest_scripts/ingest_islandora_objects_via_rest/testinput/baz [] []
[2017-07-17 07:12:36] Islandora REST Ingester.INFO: Object rest:172 datastream MODS ingested from /home/mark/Documents/hacking/islandora_rest_scripts/ingest_islandora_objects_via_rest/testinput/baz/MODS.xml [] []
[2017-07-17 07:12:36] Islandora REST Ingester.INFO: SHA-1 checksum for object rest:172 datastream MODS verified. [] []
[2017-07-17 07:13:37] Islandora REST Ingester.INFO: Object rest:172 datastream OBJ ingested from /home/mark/Documents/hacking/islandora_rest_scripts/ingest_islandora_objects_via_rest/testinput/baz/OBJ.png [] []
[2017-07-17 07:13:37] Islandora REST Ingester.INFO: SHA-1 checksum for object rest:172 datastream OBJ verified. [] []
[2017-07-17 07:13:38] Islandora REST Ingester.INFO: Object rest:173 ingested from /home/mark/Documents/hacking/islandora_rest_scripts/ingest_islandora_objects_via_rest/testinput/foo [] []
[2017-07-17 07:13:38] Islandora REST Ingester.INFO: Object rest:173 datastream MODS ingested from /home/mark/Documents/hacking/islandora_rest_scripts/ingest_islandora_objects_via_rest/testinput/foo/MODS.xml [] []
[2017-07-17 07:13:38] Islandora REST Ingester.INFO: SHA-1 checksum for object rest:173 datastream MODS verified. [] []
[2017-07-17 07:13:48] Islandora REST Ingester.INFO: Object rest:173 datastream OBJ ingested from /home/mark/Documents/hacking/islandora_rest_scripts/ingest_islandora_objects_via_rest/testinput/foo/OBJ.jpg [] []
[2017-07-17 07:13:48] Islandora REST Ingester.INFO: SHA-1 checksum for object rest:173 datastream OBJ verified. [] []
[2017-07-17 07:13:48] Islandora REST Ingester.INFO: ingest.php finished at July 17, 2017, 7:13 am [] []
You can specify the location of the log file with the -l option. If there are any error entries in your log, the REST Ingester will tell you, but it won't inform you of other types of log entries.
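Since only error entries are reported, other entries such as warnings have to be looked up in the log itself. The sketch below counts entries at a given level; the function name count_log_level is hypothetical, and the pattern assumes the "Islandora REST Ingester.LEVEL:" prefix seen in the sample log above.

```shell
# Hypothetical helper, not part of the REST Ingester: count log entries at
# a given level (for example WARNING or INFO) in an ingest log.
count_log_level() (
  log_file=$1
  level=$2
  # grep -c prints the number of matching lines; it exits non-zero when
  # that number is 0, so "|| true" keeps the function's status at 0.
  grep -c "Islandora REST Ingester\.$level:" "$log_file" || true
)

# Usage:
#   count_log_level mylog.log WARNING
```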
The --cmodel option tells the ingest.php script which ingester class to invoke for each object in the input directory. A default (paged) content model is applied to pages in books and newspaper issues, and the content model for each child object in a compound object is assigned based on the OBJ datastream file's extension. If the content model cannot be assigned from the extension, the child object is not ingested.
There are situations where you may want to assign an object's content model explicitly. For example, some content models do not use OBJ datastreams, such as islandora:entityCModel and islandora:personCModel. Some solution packs do not rely on a specific set of file extensions to define their OBJ content models, such as the Binary Object Solution Pack.
The content model for any object can be overridden by the presence of a file called 'cmodel.txt' within the object directory. This file contains the PID of the desired content model. See the example in sampledata/single/binary/cmodel.txt, which contains
islandora:binaryObjectCModel
This content model is used instead of the one provided in the --cmodel option.
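When many objects need the same override, the cmodel.txt files can be written in one pass. The following sketch adds one to every object directory whose OBJ file has a given extension. The function name assign_cmodel_by_extension is hypothetical, and the PID is written without a trailing newline because how the ingester treats surrounding whitespace is not stated here.

```shell
# Hypothetical helper, not part of the REST Ingester: write cmodel.txt into
# each object directory that contains an OBJ file with the given extension.
assign_cmodel_by_extension() (
  input_dir=$1
  ext=$2
  cmodel_pid=$3
  for obj in "$input_dir"/*/OBJ."$ext"; do
    # Skip the literal pattern when no OBJ file with this extension exists.
    [ -f "$obj" ] || continue
    printf '%s' "$cmodel_pid" > "${obj%/*}/cmodel.txt"
  done
)

# Usage:
#   assign_cmodel_by_extension sampleinput bin islandora:binaryObjectCModel
```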
All Fedora objects are assigned a default DC datastream that contains only the object label and its PID. Islandora generates richer DC XML from the MODS (or other XML) datastream either via XML Forms if the object is ingested using the Web interface or via one of the batch ingest modules. Islandora REST bypasses both, so objects ingested via REST only get the default Fedora DC XML datastream.
To generate DC from MODS or another XML datastream, install and enable the Islandora REST Extras module.
All relationships defining content models, collection membership, and parent/page or parent/child relationships are added to objects automatically, but additional relationships can be added to objects by specifying them in a file named "relationships.json" within the object-level input directory. The relationships are expressed in a JSON structure like this:
{
"relationships": [
{
"uri": "info:fedora/fedora-system:def/relations-external#",
"predicate": "isMemberOfCollection",
"object": "myother:collection",
"type": "uri"
},
{
"uri": "info:fedora/fedora-system:def/relations-external#",
"predicate": "isMemberOfCollection",
"object": "yetanother:collection",
"type": "uri"
}
]
}
This relationships.json file will add the object to two additional collections, myother:collection and yetanother:collection.
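For batches where every object belongs to the same extra collections, the file can be generated rather than written by hand. This is a sketch only: the function name write_relationships is hypothetical, it produces just the isMemberOfCollection structure shown above, and it assumes the PIDs contain no characters that need escaping in JSON.

```shell
# Hypothetical helper, not part of the REST Ingester: write a
# relationships.json file that adds an object to the given collections.
write_relationships() (
  object_dir=$1
  shift
  first=1
  {
    printf '{\n  "relationships": [\n'
    for pid in "$@"; do
      # Separate entries with a comma, but not before the first one.
      if [ "$first" -eq 0 ]; then printf ',\n'; fi
      first=0
      printf '    {\n'
      printf '      "uri": "info:fedora/fedora-system:def/relations-external#",\n'
      printf '      "predicate": "isMemberOfCollection",\n'
      printf '      "object": "%s",\n' "$pid"
      printf '      "type": "uri"\n'
      printf '    }'
    done
    printf '\n  ]\n}\n'
  } > "$object_dir/relationships.json"
)

# Usage:
#   write_relationships sampleinput/foo myother:collection yetanother:collection
```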
The Islandora REST interface allows you to provide a full PID when ingesting an object, allowing us to replace/restore objects. This is not an update operation; if an object with the specified PID exists, it must be purged before the PID can be reused.
If you omit the --namespace option, the Ingester assumes that each object-level directory encodes the PID it should use when ingesting the object. Directory names should be the same as the PID, e.g. test:245. If your PIDs contain characters that may not be safe in filenames (for example, : on Windows), you can URL-encode them (e.g., test%3A245); the Ingester will automatically decode them to get the PID.
Note that this only works for top-level objects in the input directory; pages and children of compound objects cannot reuse PIDs.
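Directory names in the URL-encoded form can be derived from PIDs with a one-line substitution. The sketch below handles only the colon, which is the character used in the example above; the function name pid_to_dirname is hypothetical, and other unsafe characters would need their own encoding.

```shell
# Hypothetical helper, not part of the REST Ingester: turn a PID into a
# directory name by URL-encoding the colon (":" becomes "%3A").
pid_to_dirname() {
  printf '%s\n' "$1" | sed 's/:/%3A/g'
}

# Usage:
#   mkdir "pidsample/$(pid_to_dirname test:245)"
```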
Changing our examples above so that the object directories encode PIDs would look like this:
pidsample/
├── foo:1
│ ├── MODS.xml
│ └── OBJ.png
├── bar:1
│ ├── MODS.xml
│ ├── foxml.xml
│ └── OBJ.jpg
└── baz:1
  ├── MODS.xml
  ├── TN.png
  └── OBJ.jpg
URL-encoding the directory names as foo%3A1, bar%3A1, etc. would be valid as well.
The ingest command should omit the --namespace option. For example, the following command will ingest the three objects in the above sample directory and assign each the PID encoded in the object-level directory:
php ingest.php -l mylog.log -e http://localhost:8000/islandora/rest/v1 -m islandora:sp_basic_image -p test:collection -o admin -u admin -t admin pidsample
Note that the restored object's owner, label, and state are assigned like they are for any other ingested object. However, if a 'foxml.xml' file is present in the object's input directory (like in the 'bar:1' object above), the owner, label, and state are taken from it.
Plugins allow you to perform actions just before objects are packaged up for ingestion. Currently, the only plugin available is the CreateModsStub plugin, which generates a very basic MODS.xml file in each object directory if none is already present. This MODS file uses the object directory's name as the object title. To run this plugin, include -g CreateModsStub in your command.
Multiple plugins can be invoked by specifying their names in a comma-separated list in the -g option, for example -g Foo,Bar. In this example, if the plugin files for both Foo and Bar plugins exist, code in those files would be executed in that order.
The ingester looks for plugin files in the includes directory. Each plugin file is a PHP class file named [classname].plugin.php; if the file is found, the ingester instantiates the plugin's class. Each plugin class has one required method, execute(), and inherits the current object directory, Monolog logger, and Commando command. See includes/Example.plugin.php and includes/CreateModsStub.plugin.php for examples.
You can extend this tool to ingest objects that have content models not already represented, or override the default functionality, by doing the following:
- mapping a content model to an Ingester class
- writing a PHP class that extends islandora_rest_client\ingesters\Ingester
You can define custom mappings between content models and Ingester classes in a file named cmodel_classmap.txt in the same directory as ingest.php. This file should contain one mapping per line, and each line should have two columns separated by a tab. In the left column is the content model PID and in the right column is the class name:
islandora:foo MyIngester
islandora:bar Example
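Because the two columns must be separated by a tab rather than spaces, it can be safer to append entries with printf than to type them in an editor that may convert tabs. A minimal sketch, in which the function name add_classmap_entry is hypothetical:

```shell
# Hypothetical helper, not part of the REST Ingester: append one
# tab-separated "content model PID / class name" line to a classmap file.
add_classmap_entry() (
  classmap_file=$1
  cmodel_pid=$2
  class_name=$3
  printf '%s\t%s\n' "$cmodel_pid" "$class_name" >> "$classmap_file"
)

# Usage:
#   add_classmap_entry cmodel_classmap.txt islandora:foo MyIngester
```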
Custom Ingester class files must be placed in the includes directory. Example Ingesters are provided at includes/Example.php and includes/ESIngester.php. After you put new class files in the includes directory, be sure to run composer dump-autoload to update the application's classmap.
Move to Islandora Kit's output can be used as the REST Ingester's input, except for its output for single-file objects. However, MIK can be configured to output single-file objects in the required format as follows:
- copy extras/MIK/repackage_for_rest_ingester.php to MIK's post-write hook script directory (extras/scripts/postwritehooks)
- register the script in your MIK .ini file's [WRITER] section as you would any other post-write hook script:
postwritehooks[] = "/usr/bin/php extras/scripts/postwritehooks/repackage_for_rest_ingester.php"
If you would rather not copy the script to the MIK directory, provide a full path in the .ini file entry to its location.
The Islandora Import Package QA Tool can validate the REST Ingester's input. Since the REST Ingester's input for single-file objects differs from Islandora Batch's, iipqa uses a custom value for its --content_model option, single_rest_ingester. Also, when validating compound objects, include the --skip_structure option.
The Islandora REST Ingester works well within scripted jobs. For example, you could schedule the script below to run overnight, in order to ingest newspaper issues prepared during the previous day. In this example, the ingest packages are produced by the Move to Islandora Kit, then validated by the Islandora Import Package QA Tool, and finally ingested using the REST Ingester. If either MIK or iipqa fails, the script exits before the Ingester is run.
#!/bin/bash
#######################################################################
# Sample bash script to automate ingestion of content into Islandora. #
# using the Move to Islandora Kit, Islandora Ingest Package QA Tool, #
# and the Islandora REST Ingester. #
# #
# Usage: ./sample_scripted_workflow.sh #
#######################################################################
# 'set -e' tells the shell script to stop running if any commands
# within it exit with a non-0 value.
set -e
# Change into the MIK directory and run MIK. The .ini file
# tells MIK to write its output to /tmp/sample_packages. Also,
# we run MIK in 'realtime' input validation mode, so it skips
# packages with malformed input.
cd /path/to/mik
php mik -c sample_config.ini
# Delete log files, or better yet move them somewhere for analysis
# in case something goes wrong. The -f flag keeps 'set -e' from
# stopping the script when there are no log files to delete.
rm -f /tmp/sample_packages/*.log
# Change into the Islandora Import Package QA Tool and run it.
# We add the --strict option so it exits with 1 if any packages
# have errors. We do this so that the next step, running
# the REST Ingester to ingest the content, does not happen.
cd /path/to/iipqa
php iipqa --strict -m newspapers -l /tmp/sample_iipaq.log /tmp/sample_packages
# Change to the Islandora REST Ingester directory and run it.
cd /path/to/rest_ingester
php ingest.php -l mylog.log -e http://localhost:8000/islandora/rest/v1 -m islandora:newspaperIssueCModel -p my:newspaper -n mynamespace -o admin -u admin -t admintoken /tmp/sample_packages
The directory sampledata provides samples that are intended to illustrate how input should be arranged, and to let you try ingesting objects quickly. All objects are from Simon Fraser University's Islandora instance at http://digital.lib.sfu.ca; a few are concocted, such as the sample binary object.
- single file objects: to ingest these three objects (two editorial cartoons and one binary object), run the command
php ingest.php -l mylog.log -e http://localhost:8000/islandora/rest/v1 -m islandora:sp_large_image_cmodel -p restingester:collection -n mynamespace -o admin -u admin -t admintoken sampledata/single
- compound objects: to ingest these two objects (two postcards), run
php ingest.php -l mylog.log -e http://localhost:8000/islandora/rest/v1 -m islandora:compoundCModel -p restingester:collection -n mynamespace -o admin -u admin -t admintoken sampledata/compound
- book: to ingest the sample book (there is only one, and to reduce the size of the sample data it only contains pages 1-4 and 17-19), run
php ingest.php -l mylog.log -e http://localhost:8000/islandora/rest/v1 -m islandora:bookCModel -p restingester:collection -n mynamespace -o admin -u admin -t admintoken sampledata/book
- newspaper issues: to ingest the two sample newspaper issues, create a newspaper object and run the command
php ingest.php -l mylog.log -e http://localhost:8000/islandora/rest/v1 -m islandora:newspaperIssueCModel -p my:newspaper -n mynamespace -o admin -u admin -t admintoken sampledata/newspaper
- If you discover a bug, or have a use case not documented here, open an issue.
- If you want to open a pull request, open an issue first.
- By opening a pull request, you agree to placing your contribution into the public domain.
- Check code style with composer style
- If the work in your PR can be tested, write tests for it. Even if it can't, run composer tests to make sure you didn't break anything.
- Use the pull request template.
The Unlicense