/get_islandora_content

Script to harvest Islandora objects via OAI-PMH.

Primary LanguagePHPThe UnlicenseUnlicense

Get Islandora Content

Command-line tool to harvest Islandora objects through OAI-PMH and save them to disk ready to ingest into Drupal 8 using Migrate Plus. More of a proof of concept than anything else, but works as advertised. For background, see Islandora/documentation#452.

Requirements

  • On the target Islandora instance
  • On the system where the script is run

Installation

  1. git clone https://github.com/mjordan/get_islandora_content.git
  2. cd get_islandora_content
  3. php composer.phar install (or equivalent on your system, e.g., ./composer install)

Overview and usage

Run ./get_islandora_content --help to get help usage information:

-d/--dsid <argument>
     Datastream ID to harvest. Default is "OBJ".

-c/--collection <argument>
     A collection PID to harvest.

-h/--host <argument>
     The Islandora server's hostname, including the "http(s)://". The trailing "/" is optional.

-m/--mimetype <argument>
     A MIME type to restrict harvested objects to.

-o/--output_directory <argument>
     The full path to the output directory.

Examples of running the script include:

./get_islandora_content -h http://digital.lib.sfu.ca -c hbc:collection -o /tmp/testing ./get_islandora_content -h http://digital.lib.sfu.ca -c hbc:collection -o /tmp/testing -d PDF ./get_islandora_content -h http://digital.lib.sfu.ca -c hbc:collection -o /tmp/testing -m image/jpeg

The output will contain a metadata.xml file and and file corresponding to each retrieved objects' OBJ datastream. For example, a small collection of images results in the following output:

/tmp/test/
├── hbc_10.jpeg
├── hbc_11.jpeg
├── hbc_12.jpeg
├── hbc_13.jpeg
├── hbc_14.jpeg
├── hbc_15.jpeg
├── hbc_16.jpeg
├── hbc_17.jpeg
├── hbc_18.jpeg
├── hbc_19.jpeg
├── hbc_1.jpeg
├── hbc_20.jpeg
├── hbc_2.jpeg
├── hbc_3.jpeg
├── hbc_4.jpeg
├── hbc_5.jpeg
├── hbc_6.jpeg
├── hbc_7.jpeg
├── hbc_8.jpeg
├── hbc_9.jpeg
└── metadata.xml

The metadata.xml file contains all of the MODS datastreams retrieved from the OAI harvest, concatenated together and wrapped in a <modsCollection> element, e.g.:

<modsCollection>
<record xmlns="http://www.openarchives.org/OAI/2.0/"><header><identifier>oai:digital.lib.sfu.ca:hbc_2</identifier><datestamp>2017-08-01T11:02:59Z</datestamp><setSpec>hbc_collection</setSpec></header><metadata><mods xmlns="http://www.loc.gov/mods/v3" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
  <titleInfo>
    <title>A view of a rope bridge, (2) showing traffic</title>
  </titleInfo>
  <name>
    <namePart>Harrison Brown</namePart>
    <role>
      <roleTerm type="code" authority="marcrelator">pht</roleTerm>
      <roleTerm type="text" authority="marcrelator">Photographer</roleTerm>
    </role>
  </name>
  <originInfo>
    <dateIssued encoding="w3cdtf" keyDate="yes">1936-11-12</dateIssued>
  </originInfo>
  <abstract>Kwan Hsian</abstract>
  <genre authority="lcsh">photographs</genre>
  <accessCondition type="use and reproduction">Reproduction of the material is subject to the approval of the Special Collections and Rare Books Librarian</accessCondition>
  <identifier type="local"/>
  <typeOfResource>still image</typeOfResource>
  <identifier type="uuid">f7bc0c20-9bc6-4499-b1f3-7fda4eafeaf0</identifier>
  <identifier type="uri" invalid="yes" displayLabel="Migrated From">http://content.lib.sfu.ca/cdm/ref/collection/hbc/id/1</identifier>
</mods></metadata></record>

<!-- more MODS records here -->

</modsCollection>

To do

  • Confirm that the filenames are suitable for use by Migrate Plus
  • More, better error handling

Contributing

  • Testing, bug reporting, and questions are welcome. Please open an issue.
  • Pull requests are welcome, but if you want to open a pull request, please open an issue first.

Maintainer

License

The Unlicense