/dwc-archive-plugin

A collectory plugin that allows users to check Darwin Core Archives before loading them.

Primary LanguageGroovy

#DwC Archive Plugin

A plugin for the ala-collectory that provides tools for checking Darwin Core Archives (DwCA) for various useful features that the ALA needs.

Functions

Functions are accessible from the /archive path for a user interface and the /ws/archive path for a web service.

Start page

/archive

Provides the root of the user interface pages.

Archive Validation

/archive/validateArchive (UI) or /ws/archive/validateArchive (WS)

Checks a DwCA for suitability using the following parameters:

Name Description Default
source A URL pointing to the DwCA (required)
checkRecords Check individual records for validity true
checkUniqueTerms Check occurrence records to ensure that they have a unique key for loading into the ALA true
uniqueTermList A comma-separated list of DwC terms for key building catalogNumber
checkImages Check to see if there is a usable image extension true
checkPresence Check image presence in the archive if it is listed in the image extension true

These parameters can be either part of a GET parameter list or a POST form.

The user interface returns a HTML report. The web service returns a JSON report by default, using a .json, .xml or .html extension will return a suitably formatted report. Eg. /ws/archive/check.xml?source=http://somewhere.com/archive.zip will return an XML report.

Flatten Archive

/archive/flattenMeasurementArchive (UI) or /ws/archive/flatten (WS)

Converts a DWCA with a MeasurementOrFact extension file into a single, flattened file, with the measurements converted into extra columns. The UI version provides an initial analysis and then allows editing of terms. The web-service version provides a straight-through service, possibly with user-defined mappings provided in a mapping file.

Name Description Default
source A URL pointing to the DWCA
sourceFile An uploaded DWCA file (via multipart/form-data)
mappingFile A JSON configuration and mapping file (via multipart/form-data) -- see below for content
filter A filter to apply to the entries -- see below for content
values Additional values to add to the entries -- see below for content
format Either csv or dwca for results in either CSV or DWCA form csv
allowNewTerms If true, then the flattening will continue of unrecognised terms are encountered, with auto-generated mapping. If false, an error is returned if an unrecognised term is encountered true
valueSeparator The string to use when separating multiple values that all map onto the same term |

Term Names

If a term from the mapping page can be found in either Darwin Core or Dublin Core then the URI of that term is used. This means that a term such as location can be mapped onto locality and share the same values as existing DwC terms.

Mapping File

The mapping file is a JSON file that contains a description of how to map measurement types onto terms suitable for loading into the ALA. An example mapping file is:

{
  "terms": [
    {
      "term":"availablePhosphate",
      "uri":"http://vocabulary.ala.org.au/availablePhosphate",
      "measurementType":"Available Phosphate (mg/L)",
    }
  ],
  "filter":"basisOfRecord == \"HumanObservation\"",
  "values": {
    "parentEventID": "Project-234",
    "country": "Australia"
  }
}

The terms list contrains a mapping from a measurementType value to a term that becomes a column in the flattened file. In this case a measurement with a measurementType of Available Phosphate (mg/L) will appear as a column labelled availablePhosphate. If a term is not in the terms list and allowNewTerms is true, a term will be generated from the measurement type. The generated term will try and camel-case the measurement type and spell out the unit. For example, Dissolved Oxygen (mg/L) would become dissolvedOxygenInMilligramPerLitre

If you are using the UI, the terms will be collected and displayed before the archive is flattened, so that you can massage the term names and map accidentally different but the same measurement types onto a single term. You can save the mapping for future use.

Filters

Filters are fairly restrictive expressions that allow you to select occurrence records based on their content. The basic elements of the filters, in order of precedence, are:

Element Syntax Description Example
term term A term (column name) preferentially from the core but also from an extension basisOfRecord
string "string" A literal string "HumanObservation"
equality test == Test to see if a term has a value basisOfRecord == "HumanObservation"
negation not Logical negation not basisOfRecord == "HumanObservation"
conjunction and Logical conjunction basisOfRecord == "HumanObservation" and catalogNumber == "1234"
disjunction or Logical disjunction basisOfRecord == "HumanObservation" or basisOfRecord == "Observation"
grouping ( ) Expression grouping not (basisOfRecord == "HumanObservation" or basisOfRecord == "Observation")

Note that there needs to be spaces between tokens such as identifiers or strings. That's what you get for using a trivial tokenizer.

Values

Values allow you to add or override terms in the resulting output. The syntax for additional values is term1 = "value1", term2 = "value2", ... For example parentEventID = "Project-234", occurrenceStatus = "present"

Values can also be set in the mapping file, where a name: value dictionaty is used.

Curl usage

Flattening can be invoked via curl with the following curl command:

curl -o flattened.zip -F sourceFile=@orginal.zip -F mappingFile=@mapping.json -F format=dwca http://host/ws/archive/flatten

Dependencies

To use the plugin, add runtime ":dwc-archive:0.1-SNAPSHOT" to the plugins list in BuildConfig.groovy

Configuration

The checker uses a a work directory for unzipping archives. This is specified by the workDir configuration property. The plugin cleans up after itself, so data in the work directory is removed after analysis.