Storing metadata as rows in a CSV file is a great way to get migrated into CLAW. CSVs are easy to understand and work with, and there's good tooling available for using them to migrate. The migrate_source_csv contrib module provides a source plugin that's compatible with Drupal 8's Migrate API. And by using the migrate_plus module, you can model customized migrations using yml and package them up as features.
In fact, this module is one such feature. It even contains a data directory filled with some sample images and a CSV full of metadata. In this README, we'll inspect each migration file in detail before running it. You'll start out by migrating the images themselves, and then you'll create various Drupal entities to describe the files from the metadata in the CSV. It's not as scary as it sounds, but you will need a few things before beginning:
- An instance of Islandora CLAW. Use the CLAW playbook to spin up an environment pre-loaded with all the modules you need (except this one).
- Some basic command line skills. You won't need to know much, but you'll have to vagrant ssh into the box, navigate into Drupal, and use git and drush, etc. If you can copy/paste into a terminal, you'll survive.
A big part of this tutorial relies on the islandora_demo and controlled_access_terms_default_configuration features, which define the default metadata profile for Islandora (which we'll be migrating into). You're not required to use islandora_demo or controlled_access_terms_default_configuration for your repository, but for the purposes of demonstration, they save you a lot of UI administrivia so you can focus on learning how to migrate. By the time you are done with this exercise, you'll be able to apply your knowledge to migrate using any custom metadata profile you can build using Drupal.
In Islandora, migrations involve creating several different types of content entities in Drupal to represent a single item in a repository. Each row in the CSV must contain enough information to create
- a file, which holds the actual binary contents of an item
- a node, which holds the descriptive metadata for an item
- a media, which holds technical metadata and references the file and the node, linking the two together
However, buried in your descriptive metadata are often references to other entities which aren't repository items themselves, but records still need to be kept for them. Authors, publishers, universities, places, etc. are all their own entities, and are referenced by other entities. So there's the potential to have a lot of different entity types described in a single row in a CSV.
In this tutorial, we're working with islandora_demo and controlled_access_terms entities and will be migrating five entity types in total:
- file
- node
- media
- subject
- person
We'll do this by creating three migrations, which follow the Extract-Transform-Load pattern. You extract the information from a source, process the data to transform it into the format you need, and load it into the destination system (e.g. Drupal). Migrations are stored in Drupal as configuration, which means they can be represented in yml, transferred to and from different sites, and are compatible with Drupal's configuration synchronization tools. And the structure of each yml file is arranged to follow the Extract-Transform-Load pattern.
Now we're migrating five entity types, but we're only writing three migrations: files, nodes, and media. The other two, subjects and persons, will be generated during the node migration. This will give us a chance to show off some techniques for working with multi-valued fields, entity reference fields, and complex field types like controlled_access_terms's typed_relation field. We'll also see how the migrate framework can help de-duplicate, and at the same time, linked data-ize ™️ your data by looking up previously migrated entities. So hold on to your hats. First, let's get this puppy onto your Islandora instance.
To perform the migrations, we'll be using drush. We will be able to run each of the file, node, and media migrations separately or all at once in a group. We will also learn how to roll back a migration in case it didn't go as planned.
From your claw-playbook directory, issue the following commands to clone down this module from git:
- vagrant ssh to shell into your Islandora instance.
- cd /var/www/html/drupal/web/modules/contrib to get to your modules directory.
- git clone https://github.com/dannylamb/migrate_islandora_csv to clone down the repository from GitHub.
- drush en migrate_islandora_csv to enable the module, installing the migrations as configuration.
Now let's go migrate some files.
To ingest files from CSV, you need a column containing paths to the files you wish to ingest. These files need to be accessible from the server that's running Drupal so that the migrate framework can find them. This tutorial assumes you're working with the sample images provided in the module, which will be located at /var/www/html/drupal/web/modules/contrib/migrate_islandora_csv/data/images.
Open up the csv file at /var/www/html/drupal/web/modules/contrib/migrate_islandora_csv/data/migration.csv, and you'll see a file column in there populated with paths to the sample images.
file |
---|
/var/www/html/drupal/web/modules/contrib/migrate_islandora_csv/data/images/Nails Nails Nails.jpg |
/var/www/html/drupal/web/modules/contrib/migrate_islandora_csv/data/images/Free Smells.jpg |
/var/www/html/drupal/web/modules/contrib/migrate_islandora_csv/data/images/Nothing to See Here.jpg |
/var/www/html/drupal/web/modules/contrib/migrate_islandora_csv/data/images/Call For Champagne.jpg |
/var/www/html/drupal/web/modules/contrib/migrate_islandora_csv/data/images/This Must Be The Place.jpg |
Open up the files migration at /var/www/html/drupal/web/modules/contrib/migrate_islandora_csv/config/install/migrate_plus.migration.file.yml. You'll see the following migration config:
id: file
label: Import Image Files
migration_group: migrate_islandora_csv
source:
  plugin: csv
  path: '/var/www/html/drupal/web/modules/contrib/migrate_islandora_csv/data/migration.csv'
  delimiter: ','
  # 1 means you have a header row, 0 means you don't
  header_row_count: 1
  # Each migration needs a unique key per row in the csv. Here we're using the file path.
  keys:
    - file
  # You can't enter string literals into a process plugin, but you can give it a constant as a 'source'.
  constants:
    # Islandora uses flysystem and stream wrappers to work with files. What we're really saying here is
    # to put these files in Fedora in a 'csv_migration' folder. It doesn't matter if the directory
    # doesn't exist yet, it will get created for you automatically.
    destination_dir: 'fedora://csv_migration'
process:
  ##
  # The following two fields are temporary, and just used to generate a destination for the file.
  ##
  # Hack the file name out of the full path provided in the 'file' column.
  filename:
    -
      plugin: callback
      callable: pathinfo
      source: file
    -
      plugin: extract
      index:
        - basename
  # Construct the destination URI using the file name.
  destination:
    plugin: concat
    delimiter: /
    source:
      - constants/destination_dir
      - '@filename'
  ##
  # Here's where we copy the file over and set the uri of the file entity.
  ##
  uri:
    plugin: file_copy
    source:
      - file # The source column in the CSV
      - '@destination' # The destination entry from above
destination:
  # These are Drupal 'image' entities we're making, not just plain 'file' entities.
  plugin: 'entity:file'
  type: image
It seems like a lot to take in at first, but there's a pattern to Drupal migrations. They always contain three key sections: source, process, and destination. And these sections correspond exactly to Extract, Transform, and Load.
The source section contains the configuration needed to create a Drupal source plugin that will extract the data. A source plugin provides "rows" of data to processing plugins so that they can be worked on. In this case, we're using the csv source plugin, which very literally uses rows; however, you can have source plugins that work with other data formats like XML and JSON. Look at the config from this section.
source:
  plugin: csv
  path: '/var/www/html/drupal/web/modules/contrib/migrate_islandora_csv/data/migration.csv'
  delimiter: ','
  header_row_count: 1
  keys:
    - file
  constants:
    destination_dir: 'fedora://csv_migration'
You can see we provide a path to the CSV's location, what delimiter to use, whether it uses a header row, and which column contains a unique key for each entry. Constants can also be defined in the source section (more on those later).
The process section contains entries for a series of processing steps to transform the source data. Each step has a name and contains the configuration for one or more process plugins. Multiple plugins are executed in sequence, with the results getting passed from one to another, forming a pipeline. In this fashion, you can transform data from the CSV into a format that Drupal is expecting. There are many process plugins available, and we'll cover several throughout this tutorial.
For each row of the CSV, each of these steps will be executed. If the name of a step happens to be the same as a field or property name, the migrated entity will have that value for that field or property. This is how you can apply metadata from the CSV to an entity. If it's not named after a field or property, the migrate framework assumes it's a temporary value you're using as part of more complex logic. It won't wind up on the entity when the migration is done, but it will be available for you to use within other process plugins. When you reference the output of an earlier step as a source, you prefix its name with an @. You can also pass constants into process plugins, which are prefixed with constants/.
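The naming rules above can be sketched in a few lines. This is a rough Python model written only for illustration, not how the Migrate API is implemented: the resolve and run_steps helpers, the shout step, and the sample row are all hypothetical.

```python
from typing import Any, Callable

Row = dict[str, Any]


def resolve(source: str, row: Row, constants: Row, results: Row) -> Any:
    """Resolve a 'source' name following the prefix rules described above."""
    if source.startswith("@"):
        # Output of an earlier process step.
        return results[source[1:]]
    if source.startswith("constants/"):
        # A constant declared in the source section.
        return constants[source.split("/", 1)[1]]
    # Otherwise, a column of the current CSV row.
    return row[source]


def run_steps(row: Row, constants: Row,
              steps: dict[str, tuple[str, list[Callable[[Any], Any]]]]) -> Row:
    """Run each named step in order, piping the value through its plugins."""
    results: Row = {}
    for name, (source, plugins) in steps.items():
        value = resolve(source, row, constants, results)
        for plugin in plugins:
            value = plugin(value)
        results[name] = value
    return results


row = {"title": "Free Smells"}
constants = {"uid": 1}
steps = {
    "name": ("title", []),            # straight copy of a CSV column
    "shout": ("@name", [str.upper]),  # reuses the output of the 'name' step
    "uid": ("constants/uid", []),     # reads a constant
}
results = run_steps(row, constants, steps)
print(results)
```

In the real framework, only the steps whose names match a field or property would end up on the entity; the sketch just collects every result in one dictionary.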
The destination section contains the configuration that describes what gets loaded into Drupal.
destination:
  plugin: 'entity:file'
  type: image
You can create any type of content entity in Drupal. In this case, we're making file entities. Specifically, we're making images, which are a special type of file entity.
In the process section of the migration, we're copying the images over into a Drupal file system and setting the uri property on the corresponding File entity.
uri:
  plugin: file_copy
  source:
    - file
    - '@destination'
To do this, we're using the file_copy process plugin. But to use it, we have to know where a file is located and where we want it copied to. We know where the file resides; we have that in the CSV's file column. But we're going to have to do some string manipulation in order to generate the new location where we want the file copied. We're trying to convert something like /var/www/html/drupal/web/modules/contrib/migrate_islandora_csv/data/images/Free Smells.jpg to fedora://csv_migration/Free Smells.jpg.
The uri we're constructing is a stream wrapper of the form scheme://path/to/file. Islandora uses flysystem, which allows for integration with many different types of filesystems, both remote and local. With flysystem, the scheme part of the uri is the name of a filesystem. By default, Fedora is exposed using the scheme fedora://. So by setting uri to fedora://csv_migration/Free Smells.jpg, we're saying "put Free Smells.jpg in the csv_migration directory in Fedora."
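To make the scheme://path/to/file shape concrete, here is a small Python sketch that splits such a uri into its filesystem name and target path. The split_stream_uri helper is hypothetical and only models the string format; it is not how Drupal or flysystem resolve stream wrappers.

```python
def split_stream_uri(uri: str) -> tuple[str, str]:
    """Split 'scheme://target' into (scheme, target)."""
    scheme, separator, target = uri.partition("://")
    if not separator:
        raise ValueError(f"not a stream wrapper uri: {uri!r}")
    return scheme, target


scheme, target = split_stream_uri("fedora://csv_migration/Free Smells.jpg")
print(scheme)  # the filesystem name
print(target)  # the path within that filesystem
```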
Now, to perform this string manipulation in PHP, we'd do something like
$info = pathinfo($filepath);
$filename = $info['basename'];
$destination = "fedora://csv_migration/" . $filename;
We will mimic this exactly in the process section of our migration config. Just like we declare variables and call functions with PHP code, we can make entries in the process section to store the output of Drupal process plugins. We'll build up a destination 'variable' and pass it into the file_copy process plugin.
To start, we'll get the filename using two process plugins:
filename:
  -
    plugin: callback
    callable: pathinfo
    source: file
  -
    plugin: extract
    index:
      - basename
The first process plugin, callback, lets you execute any PHP function that takes a single input and returns an output. It's not as flexible as making your own custom process plugin, but it's still pretty useful in a lot of situations. Here we're using it to call pathinfo(), telling it to use the file column in the CSV as input. We pass the resulting array from pathinfo() to the extract process plugin, which pulls data out of arrays using the keys you provide it under index.
Now that we have the file name, we have to prepend it with fedora://csv_migration/ to make the destination uri. In our PHP code above, we used . to concatenate the strings. In the migration framework, we use the concat process plugin. You provide it with two or more strings to concatenate, as well as a delimiter.
destination:
  plugin: concat
  delimiter: /
  source:
    - constants/destination_dir
    - '@filename'
In our PHP code, we concatenated the $filename variable with a string literal. In our process plugin, we can provide the variable, e.g. the output of the filename process step, by prefixing it with an @. We can't, however, pass in fedora://csv_migration directly as a string. At first glance, you might think something like this would work, but it totally doesn't:
# Can't do this. Won't work at all.
destination:
  plugin: concat
  delimiter: /
  source:
    - 'fedora://csv_migration'
    - '@filename'
That's because the migrate framework only interprets source values as names of columns from the csv or names of other process steps, even if they're wrapped in quotes. It will never try to use the string directly as a value. To circumvent this, we declare a constant in the source section of the migration config.
constants:
  destination_dir: 'fedora://csv_migration'
This constant can be referenced as constants/destination_dir and passed into the concat process plugin as a source.
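Putting the filename and destination steps together, the pipeline behaves roughly like the Python sketch below. The three helper functions are hypothetical stand-ins that only mimic what the callback (with pathinfo), extract, and concat plugins do with this particular input; they are not Drupal's implementations.

```python
import posixpath


def callback_pathinfo(path: str) -> dict[str, str]:
    """Mimic the parts of PHP's pathinfo() that matter here."""
    dirname, basename = posixpath.split(path)
    stem, extension = posixpath.splitext(basename)
    return {
        "dirname": dirname,
        "basename": basename,
        "extension": extension.lstrip("."),
        "filename": stem,
    }


def extract(value: dict, index: list[str]):
    """Walk into a nested structure using the keys listed under 'index'."""
    for key in index:
        value = value[key]
    return value


def concat(parts: list[str], delimiter: str = "") -> str:
    """Join the resolved source values with the delimiter."""
    return delimiter.join(parts)


row = {"file": "/var/www/html/drupal/web/modules/contrib/migrate_islandora_csv/data/images/Free Smells.jpg"}
constants = {"destination_dir": "fedora://csv_migration"}

# filename: callback(pathinfo) -> extract(basename)
filename = extract(callback_pathinfo(row["file"]), ["basename"])
# destination: concat(constants/destination_dir, @filename) with '/' as delimiter
destination = concat([constants["destination_dir"], filename], delimiter="/")
print(destination)
```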
Migrations can be executed via drush using the migrate:import command. You specify which migration to run by using the id defined in its yml. To run the file migration from the command line, make sure you're within /var/www/html/drupal/web (or any subdirectory) and enter
drush migrate:import file
If you've already run the migration before, but want to re-run it for any reason, use the --update flag.
drush migrate:import file --update
You may have noticed that migrations can be grouped, and that they define a migration_group in their configuration. You can execute an entire group of migrations using the --group flag. For example, to run the entire group defined in this module
drush migrate:import --group migrate_islandora_csv
You can also use the migrate:rollback command to delete all migrated entities. Like migrate:import, it also respects the --group flag. So to roll back everything we just did:
drush migrate:rollback --group migrate_islandora_csv
If something goes wrong during development, migrations can sometimes get stuck in a bad state. Use the migrate:reset command to put a migration back to Idle. For example, with the file migration, use
drush migrate:reset file
Make sure you've run (and not rolled back) the file migration. It should tell you that it successfully created 5 files. You can confirm its success by visiting http://localhost:8000/admin/content/files. You should see 5 images of neon signs in the list.
Those five images are nice, but we need something to hold their descriptive metadata and show them off. We use nodes in Drupal to do this, and that means we have another migration file to work with. Nestled in with our nodes' descriptive metadata, though, are more Drupal entities, and we're going to generate them on the fly while we're making nodes. While we're doing it, we'll see how to use pipe delimited strings for multiple values as well as how to handle typed_relation fields that are provided by controlled_access_terms. Open up /var/www/html/drupal/web/modules/contrib/migrate_islandora_csv/config/install/migrate_plus.migration.node.yml and check it out.
# Uninstall this config when the feature is uninstalled
dependencies:
  enforced:
    module:
      - migrate_islandora_csv
id: node
label: Import Nodes from CSV
migration_group: migrate_islandora_csv
# Pull from a CSV, and use the 'file' column as an index
source:
  plugin: csv
  path: modules/contrib/migrate_islandora_csv/data/migration.csv
  header_row_count: 1
  keys:
    - file
  constants:
    model: Image
    relator: 'relators:pht'
# Set fields using values from the CSV
process:
  title: title
  # We use the skip_on_empty plugin because
  # not every row in the CSV has subtitle filled
  # in.
  field_alternative_title:
    plugin: skip_on_empty
    source: subtitle
    method: process
  field_description: description
  # Dates are EDTF strings
  field_edtf_date: issued
  # Make the object an 'Image'
  field_model:
    plugin: entity_lookup
    source: constants/model
    entity_type: taxonomy_term
    value_key: name
    bundle_key: vid
    bundle: islandora_models
  # Split up our pipe-delimited string of
  # subjects, and generate terms for each.
  field_subject:
    -
      plugin: skip_on_empty
      source: subject
      method: process
    -
      plugin: explode
      delimiter: '|'
    -
      plugin: entity_generate
      entity_type: taxonomy_term
      value_key: name
      bundle_key: vid
      bundle: subject
  # Complex fields can have their individual
  # parts set independently. Use / to denote
  # you're working with a property of a field
  # directly.
  field_linked_agent/target_id:
    plugin: entity_generate
    source: photographer
    entity_type: taxonomy_term
    value_key: name
    bundle_key: vid
    bundle: person
  # Hard-code the rel_type to photographer
  # for all the names in the photographer
  # column.
  field_linked_agent/rel_type: constants/relator
# We're making nodes
destination:
  plugin: 'entity:node'
  default_bundle: islandora_object
The source section looks mostly the same other than some different constants we're defining. If you look at the process section, you can see we're taking the title, description, and issued columns from the CSV and applying them directly to the migrated nodes without any manipulation.
title: title
field_description: description
field_edtf_date: issued
For subtitle, we're passing it through the skip_on_empty process plugin because not every row in our CSV has a subtitle entry. It's very useful when you have spotty data, and you'll end up using it a lot. The method: process bit tells the migrate framework to skip only that particular field if the value is empty, and not to skip the whole row. It's important, so don't forget it. The full yml for setting field_alternative_title from subtitle looks like this:
field_alternative_title:
  plugin: skip_on_empty
  source: subtitle
  method: process
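The difference between skipping a field and skipping a row can be modeled in a few lines of Python. This is only a sketch of the behaviour described above: the exception classes and the migrate_row helper are hypothetical, and "row" stands for the alternative to method: process.

```python
from typing import Optional


class SkipProcess(Exception):
    """Abandon the current field only."""


class SkipRow(Exception):
    """Abandon the whole row."""


def skip_on_empty(value: str, method: str) -> str:
    """Pass the value through unless it is empty."""
    if not value:
        raise SkipProcess() if method == "process" else SkipRow()
    return value


def migrate_row(row: dict[str, str], method: str) -> Optional[dict[str, str]]:
    """Build a node-like dict, honouring the two skip behaviours."""
    entity = {"title": row["title"]}
    try:
        entity["field_alternative_title"] = skip_on_empty(row.get("subtitle", ""), method)
    except SkipProcess:
        pass          # leave the field unset, keep the rest of the entity
    except SkipRow:
        return None   # nothing is created for this row
    return entity


print(migrate_row({"title": "Free Smells", "subtitle": ""}, "process"))
print(migrate_row({"title": "Free Smells", "subtitle": ""}, "row"))
```

With method: process the node is still created, just without an alternative title; with the row behaviour, nothing would be created for that row at all.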
Now here's where things get interesting. We can look up other entities to populate entity reference fields. For example, all Repository Items have an entity reference field that holds a taxonomy term from the islandora_models vocabulary. All of our examples are images, so we'll look up the Image model in the vocabulary since it already exists (it gets made for you when you use claw-playbook). We use the entity_lookup process plugin to do this.
field_model:
  plugin: entity_lookup
  source: constants/model
  entity_type: taxonomy_term
  # 'name' is the string value of the term, e.g. 'Original file', 'Thumbnail'.
  value_key: name
  bundle_key: vid
  bundle: islandora_models
The entity_lookup process plugin looks up an entity based on the configuration you give it. You use the entity_type, bundle_key, and bundle configurations to limit which entities you search through. entity_type is, as you'd suspect, the type of entity: node, media, file, taxonomy_term, etc. bundle_key tells the migrate framework which property holds the bundle of the entity, and bundle is the actual bundle id you want to restrict by. The search value you're looking for is the source configuration. In this case we're looking for the string "Image", which we've defined as a constant. And we're comparing it to the name field on each term by setting the value_key config.
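How those four settings narrow the search can be sketched in Python. The in-memory list of terms, their tid values, and the entity_lookup function below are hypothetical; the real plugin queries Drupal's entity storage rather than a list.

```python
from typing import Optional

# Hypothetical taxonomy terms; the tids are made up for illustration.
terms = [
    {"tid": 1, "vid": "islandora_models", "name": "Image"},
    {"tid": 2, "vid": "islandora_models", "name": "Audio"},
    {"tid": 3, "vid": "subject", "name": "Image"},
]


def entity_lookup(entities: list[dict], value: str, value_key: str,
                  bundle_key: str, bundle: str) -> Optional[int]:
    """Return the id of the first entity in the bundle whose value_key matches."""
    for entity in entities:
        if entity[bundle_key] == bundle and entity[value_key] == value:
            return entity["tid"]
    return None


model_id = entity_lookup(terms, "Image", value_key="name",
                         bundle_key="vid", bundle="islandora_models")
print(model_id)
```

Note that the subject term that happens to be named "Image" is ignored, because the bundle restriction excludes it.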
If you're not sure that the entities you're looking up already exist, you can use the entity_generate plugin, which takes the same config, but will create a new entity if the lookup fails. We use this plugin to create subject taxonomy terms that we tag our nodes with. A node can have multiple subjects, so we've encoded them in the CSV as pipe delimited strings.
subject |
---|
Neon signs|Night |
Neon signs|Night|Funny |
Neon signs|Night |
Drinking|Neon signs |
Neon signs |
We can hack those apart easily enough. In PHP we'd do something like
// explode() takes the delimiter first, then the string to split.
$subjects = explode('|', $string);
$terms = [];
foreach ($subjects as $name) {
    $terms[] = \Drupal::service('entity_type.manager')->getStorage('taxonomy_term')->create([
        ...
        'vid' => 'subject',
        'name' => $name,
        ...
    ]);
}
$node->set('field_subject', $terms);
With process plugins, that logic looks like
field_subject:
  -
    plugin: skip_on_empty
    source: subject
    method: process
  -
    plugin: explode
    delimiter: '|'
  -
    plugin: entity_generate
    entity_type: taxonomy_term
    value_key: name
    bundle_key: vid
    bundle: subject
Here we've got a small pipeline that uses the skip_on_empty process plugin, which we've already seen, followed by explode. The explode process plugin operates much like its PHP counterpart, taking a string and a delimiter as input and returning an array. The combination of skip_on_empty and explode behaves like a foreach loop on the explode results. If we have an empty string, nothing happens. If there are one or more pipe delimited subject names in the string, then entity_generate gets called for each name that's found. The entity_generate process plugin will try to look up a subject by name, and if that fails, it creates one using the name and saves a reference to it in the node. So entity_generate is actually smarter than our pseudo-code above, because it can be run over and over again and it won't duplicate entities 🍾
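The look-up-then-create behaviour, and why it avoids duplicates, can be sketched in Python. The TermStore class and subjects_for helper are hypothetical stand-ins for Drupal's term storage and the three-plugin pipeline; the ids are made up.

```python
from typing import Optional


class TermStore:
    """A toy term storage keyed by (vocabulary, name)."""

    def __init__(self) -> None:
        self._terms: dict[tuple[str, str], int] = {}
        self._next_id = 1

    def lookup(self, vid: str, name: str) -> Optional[int]:
        return self._terms.get((vid, name))

    def generate(self, vid: str, name: str) -> int:
        """Return the existing term id, creating the term only if the lookup fails."""
        tid = self.lookup(vid, name)
        if tid is None:
            tid = self._next_id
            self._next_id += 1
            self._terms[(vid, name)] = tid
        return tid

    def __len__(self) -> int:
        return len(self._terms)


def subjects_for(value: str, store: TermStore) -> list[int]:
    """skip_on_empty -> explode('|') -> entity_generate for each name."""
    if not value:
        return []
    return [store.generate("subject", name) for name in value.split("|")]


store = TermStore()
first = subjects_for("Neon signs|Night", store)
second = subjects_for("Neon signs|Night|Funny", store)
print(first, second, len(store))
```

The second row reuses the ids created for the first, so only one new term ("Funny") is added to the store.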
Some fields don't hold just a single type of value. In other words, not everything is just text, numbers, or references. Using the Typed Data API, fields can hold groups of named values with different types. Consider a field that holds an RGB color. You could set it with PHP like so:
$node->set('field_color', ['R' => 255, 'G' => 255, 'B' => 255]);
You could even have a multi-valued color field, and do something like this
$node->set('field_color', [
['R' => 0, 'G' => 0, 'B' => 0],
['R' => 255, 'G' => 255, 'B' => 255],
]);
In the migrate framework, you have two options for handling these types of fields. You can build up the full array they're expecting, which is difficult and often impossible to do without writing a custom process plugin. Or you can set each named value in the field with separate process pipelines.
In controlled_access_terms, we have a notion of a typed_relation, which is an entity reference coupled with a MARC relator. It expects an associative array that looks like this:
[ 'target_id' => 1, 'rel_type' => 'relators:ctb']
The target_id portion takes an entity id, and rel_type takes the predicate for the MARC relator we want to use to describe the relationship the entity has with the repository item. This example would reference taxonomy_term 1 and give it the relator for "Contributor".
If we want to set those values in yml, we can access target_id and rel_type independently by separating the field name and the property name with a /.
field_linked_agent/target_id:
  plugin: entity_generate
  source: photographer
  entity_type: taxonomy_term
  value_key: name
  bundle_key: vid
  bundle: person
field_linked_agent/rel_type: constants/relator
Here we're looking at the photographer column in the CSV, which contains the names of the photographers that captured these images. Since we know these are photographers, and not publishers or editors, we can bake in the relator constant we set to relators:pht in the source section of the migration. So all that's left to do is to set the taxonomy term's id via entity_generate. If the lookup succeeds, the id is returned. If it fails, a term is created and its id is returned. In the end, by using the / syntax to set properties on complex fields, everything gets wrapped up into that nice associative array structure for you automatically. Now let's run that migration.
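The way separately processed field/property values end up as one associative array can be sketched in Python. The assemble function is a hypothetical illustration of the idea, not the Migrate API's own code, and the term id 1 is made up.

```python
from typing import Any


def assemble(values: dict[str, Any]) -> dict[str, Any]:
    """Fold 'field/property' keys into nested dicts; leave plain keys alone."""
    entity: dict[str, Any] = {}
    for key, value in values.items():
        field, separator, prop = key.partition("/")
        if separator:
            entity.setdefault(field, {})[prop] = value
        else:
            entity[field] = value
    return entity


# Results of the individual process steps for one row.
processed = {
    "title": "Free Smells",
    "field_linked_agent/target_id": 1,           # id returned by entity_generate
    "field_linked_agent/rel_type": "relators:pht",
}
node = assemble(processed)
print(node["field_linked_agent"])
```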
As with the file migration, running
drush migrate:import node
from anywhere within the Drupal installation directory will fire off the migration. Go to http://localhost:8000/admin/content and you should see five new nodes. Click on one, though, and you'll see it's just a stub with metadata. The csv metadata is there, links to other entities like subjects and photographers are there, but there's no trace of the corresponding files. Here's where media entities come into play.
Media entities are Drupal's solution for fieldable files. Since you can't put fields on a file, what you can do is wrap the file with a Media entity. In addition to a file reference, technical and structural metadata for the file go on the Media entity. For example, mimetype, file size, resolution, etc. all belong on a Media entity. Media also have a few special fields that are required for Islandora, field_media_of and field_media_use, which denote what node owns the media and what role the media serves, respectively. Since the Media entity references both the file it wraps and the node that owns it, Media entities act as a bridge between files and nodes, tying them together. And to do this, we make use of one last process plugin, migration_lookup. Open up /var/www/html/drupal/web/modules/contrib/migrate_islandora_csv/config/install/migrate_plus.migration.media.yml and give it a look.
# Uninstall this config when the feature is uninstalled
dependencies:
  enforced:
    module:
      - migrate_islandora_csv
id: media
label: Import Media from CSV
migration_group: migrate_islandora_csv
source:
  plugin: csv
  path: modules/contrib/migrate_islandora_csv/data/migration.csv
  # 1 means you have a header row, 0 means you don't
  header_row_count: 1
  # Each migration needs a unique key per row in the csv. Here we're using the file path.
  keys:
    - file
  # You can't enter string literals into a process plugin, but you can give it a constant as a 'source'.
  constants:
    # We're tagging our media as Original Files
    use: Original File
    # Everything gets created as admin
    uid: 1
process:
  name: title
  uid: constants/uid
  # Make the media an 'Original File'
  field_media_use:
    plugin: entity_lookup
    source: constants/use
    entity_type: taxonomy_term
    value_key: name
    bundle_key: vid
    bundle: islandora_media_use
  # Lookup the migrated file in the file migration.
  field_media_image:
    plugin: migration_lookup
    source: file
    migration: file
    no_stub: true
  # Lookup the migrated node in the node migration
  field_media_of:
    plugin: migration_lookup
    source: file
    migration: node
    no_stub: true
destination:
  # These are 'image' media we're making.
  plugin: 'entity:media'
  default_bundle: image
migration_dependencies:
  required:
    - migrate_plus.migration.file
    - migrate_plus.migration.node
  optional: { }
Compared to the other migrations, this one is very straightforward. There's no string or array manipulation in yml, and at most there's only one process plugin per field. Title and user are set directly, with no processing required
name: title
uid: constants/uid
The field_media_use field is a tag that's used to denote the purpose of a file with regard to the node it belongs to. E.g. is this the original file? A lower quality derivative? A thumbnail? In many ways it bears a resemblance to DSID in Islandora 7.x. Like field_model with nodes, the vocabulary already exists in your Islandora install, so all you have to do is look it up with the entity_lookup plugin.
# Make the media an 'Original File'
field_media_use:
  plugin: entity_lookup
  source: constants/use
  entity_type: taxonomy_term
  value_key: name
  bundle_key: vid
  bundle: islandora_media_use
The field_media_image and field_media_of fields are how the media binds a file to a node. You could use entity_lookup or entity_generate, but we've already migrated them and can very easily look them up by the id assigned to them during migration. But what's the benefit of doing so? The entity_lookup and entity_generate process plugins do the job fine, right?
The main advantage of using migration_lookup, and of defining migrations whenever possible, is that migrated entities can be rolled back. If you were to hop into your console and execute
drush migrate:rollback --group migrate_islandora_csv
your nodes, media, and files would all be gone. But your subjects and photographers would remain. If you want to truly and cleanly roll back every entity in a migration, you need to define those migrations and use migration_lookup to set entity reference fields.
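Why rollback leaves generated terms behind can be seen in a toy Python model. It assumes each migration keeps a map from its source key to the id of the entity it created, and that rollback only deletes entities recorded in that map. The Migration class, the create helper, and the site dictionary are hypothetical, not Drupal code.

```python
import itertools
from typing import Optional

site: dict[int, dict] = {}    # entity id -> entity, standing in for Drupal's storage
_ids = itertools.count(1)


def create(entity: dict) -> int:
    """Save an entity and return its new id."""
    entity_id = next(_ids)
    site[entity_id] = entity
    return entity_id


class Migration:
    """Tracks which source key produced which entity."""

    def __init__(self) -> None:
        self.id_map: dict[str, int] = {}

    def import_row(self, source_key: str, entity: dict) -> None:
        self.id_map[source_key] = create(entity)

    def lookup(self, source_key: str) -> Optional[int]:
        """What a migration_lookup-style plugin would consult."""
        return self.id_map.get(source_key)

    def rollback(self) -> None:
        """Delete only the entities this migration knows it created."""
        for entity_id in self.id_map.values():
            site.pop(entity_id, None)
        self.id_map.clear()


nodes = Migration()
# A subject generated on the fly: created, but not tracked by any migration.
subject_id = create({"type": "term", "name": "Neon signs"})
nodes.import_row("/images/Free Smells.jpg",
                 {"type": "node", "title": "Free Smells", "field_subject": [subject_id]})

looked_up = nodes.lookup("/images/Free Smells.jpg")
nodes.rollback()
remaining = [entity["type"] for entity in site.values()]
print(looked_up, remaining)
```

After the rollback only the untracked term is left, which mirrors the subjects and photographers that survive in the tutorial.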
If you've made it all the way to the end here, then you've learned that you can migrate files and CSV metadata into Islandora using only yml files. You've seen how to transform data with pipelines of processing plugins and can handle numeric, text, and entity reference fields. You can handle multiple values for fields, and even more complicated things like typed_relation fields. And as big as this walkthrough was, we're only scratching the surface of what can be done with the Migrate API.
There's certainly more you can do with Drupal 8's Migrate API. There's a plethora of source and processing plugins out there that can handle pretty much anything you throw at them. XML and JSON are fair game. You can also request sources using HTTP, so you can always point it at an existing system's REST API and go from there. Eventually, you may also have to write your own process plugin if you can't make your logic work with the Migrate API's sometimes awkward workflow.
But really the best thing to do is try and get your data into Islandora! You can use the boilerplate branch of this repository to clone down an empty migration, ready for you to customize to fit your data. And as you assemble it into CSV format, keep in mind that if you have more than just names for things like subjects and authors, you can always make more CSVs. Think of it like maintaining tables in an SQL database. Each CSV has unique keys, so you can lookup/join entities between CSVs using those keys. And you can still pipe delimit the keys like we did in our example to handle multi-valued fields.
The best part is, if you maintain these CSVs, they can be used to make bulk updates to metadata. Just make your changes, then run the migration(s) again with the --update flag. It might not be the most efficient way to do it, as you'll update every entity, even if it didn't change. But you can manage that by breaking things down per collection or object type. If you can keep the CSVs sufficiently small, it's a viable solution for a small repository.
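The idea of joining CSVs on their keys, with pipe delimited keys for multi-valued fields, can be sketched in Python. The two CSVs, their column names, and the example.org URIs are all made up for illustration; in an actual migration, the join would be done with a separate migration per CSV and migration_lookup.

```python
import csv
import io

# A hypothetical subjects CSV with more than just names, keyed by 'id'.
subjects_csv = (
    "id,name,uri\n"
    "s1,Neon signs,http://example.org/subjects/neon-signs\n"
    "s2,Night,http://example.org/subjects/night\n"
)
# A hypothetical items CSV whose 'subjects' column holds pipe delimited keys.
items_csv = (
    "file,title,subjects\n"
    "free_smells.jpg,Free Smells,s1|s2\n"
)

subjects = {row["id"]: row for row in csv.DictReader(io.StringIO(subjects_csv))}
items = list(csv.DictReader(io.StringIO(items_csv)))


def subject_names(item: dict[str, str]) -> list[str]:
    """Resolve the pipe delimited subject keys of an item to subject names."""
    keys = item["subjects"].split("|") if item["subjects"] else []
    return [subjects[key]["name"] for key in keys]


names = subject_names(items[0])
print(names)
```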