Lodmill

Blend, grind, and enjoy LOD – fresh from the mill!

This software implements the backend of http://lobid.org, the LOD service offered by hbz.

Build

See the .travis.yml file for details on the CI config used by Travis.

Technology Stack

Raw data is transformed with Metafacture and Hadoop, indexed in Elasticsearch, and exposed via an HTTP API implemented with the Play framework:

The lodmill-rd folder contains code for working with raw data and its transformation to LOD, based on the Culturegraph Metafacture toolkit. The lodmill-ld folder contains code for processing RDF with Hadoop and for indexing it in Elasticsearch. The hbz/lobid repo contains a web app based on the Play framework that interacts with the resulting Elasticsearch index. It provides an HTTP API to the index and a UI for documentation and basic sample usage. See below for details on the data workflow.

Setup

Prerequisites: Java 7 and Maven 3 (check mvn -version to make sure Maven is using Java 7)

To set up a local build of the lodmill components follow these steps:

Clone lodmill and run the Maven build

git clone https://github.com/lobid/lodmill.git
cd lodmill
umask 0022 (required for MRUnit tests)
mvn clean install --settings settings.xml (or copy ‘settings.xml’ to ‘~/.m2’)

Build Hadoop Jar

cd .. ; cd lodmill-ld
mvn clean assembly:assembly --settings ../settings.xml

See the convert.sh and index.sh scripts in lodmill-ld/doc/scripts on how to use the resulting Jar with your Hadoop and Elasticsearch clusters.

Clusters

For information on how we have set up our Hadoop and Elasticsearch backend clusters see README.textile in lodmill-ld.

Data Workflow

The complete data workflow transforms the original raw data to Linked Data served via an HTTP API:

Raw Data to Linked Data Triples

The raw data is transformed to linked data triples in the N-Triples RDF serialization. Being a line based format, N-Triples are well suited for batch processing with Hadoop. Every triple, both those generated from the raw data, and the enrichment data from external sources (like GND and Dewey), are stored in the Hadoop distributed file system (HDFS) and made available to the two Hadoop jobs that process and convert the data.

Linked Data Triples to Records

The first Hadoop job (implemented in CollectSubjects.java) selects triples that need to be resolved to build meaningful records to be indexed later. One example would be the creator triples for lobid-resources, e.g.

<http://lobid.org/resource/HT002189125> <http://purl.org/dc/elements/1.1/creator> <http://d-nb.info/gnd/118580604> .

The creator of the resource is identified via its GND ID. To allow searching by the actual name of the creator, we want to resolve the name literals, so we declare that the creator property needs to be resolved in the resolve.properties file:

resolve = \ http://purl.org/dc/elements/1.1/creator; \ [...]

Having the property declared to need resolution, when the first Hadoop job encounters the triple above, it will map the GND ID to the resource ID:

<http://d-nb.info/gnd/118580604> : <http://lobid.org/resource/HT002189125>

This information is written to a zipped Hadoop map file (a persistent lookup mechanism) after the first job is complete.

The second Hadoop job (implemented in NTriplesToJsonLd.java) collects all triples with the same subject (i.e. all statements about a resource), and converts them to a JSON-LD record. In addition to the selected triples, we also need details about the entities defined as needing resolution in the first job. For instance, we want the preferredNameForThePerson of creator in our records, so we declare it as a resolution property in resolve.properties:

predicates = \ http://d-nb.info/standards/elementset/gnd#preferredNameForThePerson; \ [...]

This will cause the second Hadoop job to perform a lookup on the subject of a triple containing that property, e.g.:

<http://d-nb.info/gnd/118580604> <http://d-nb.info/standards/elementset/gnd#preferredNameForThePerson> "Melville, Herman" .

Since the subject is mapped to the <http://lobid.org/resource/HT002189125> resource ID, we add that triple to the triples of that resource, which yields a record for <http://lobid.org/resource/HT002189125> that contains not only the triples with that subject, but also the enrichment triples defined in the resolve.properties file. That way, we effectively define our records to be subgraphs of the complete triple set.

The same mechanism is used to resolve information modeled using blank nodes. See our current resolve.properties file for details.

Linked Data Records to Index

The second Hadoop job writes the records as expanded JSON-LD in the Elasticsearch bulk import format, which is then indexed in Elasticsearch using the Elasticsearch bulk Java API. We use expanded JSON-LD in the index to have consistent types for each field. In compact JSON-LD, if a field has just a single value, the type of that field will simply be the type of the value. If the same field has multiple values, the type will be an array, etc. Elasticsearch learns the index schema from the data, so we need to use consistent types for a given field. The expanded JSON-LD serialization does exactly this. For instance, it always uses arrays, even if there is only a single value.

Linked Data Index to HTTP API

Finally our Play frontend accesses the Elasticsearch index and serves the records as JSON-LD. There are multiple options for queries and different supported results formats, see documentation at http://api.lobid.org (implemented in the app/views/index.scala.html template in the hbz/lobid repo). Since the expanded JSON-LD described above is cumbersome, the API serves compact JSON-LD. It also uses an external JSON-LD context document to allow shorter keys instead of full URI properties, and to encapsulate the actual properties used in the expanded form. That way, we can change the properties, without requiring API clients to change how they process the JSON responses.

License

Eclipse Public License: http://www.eclipse.org/legal/epl-v10.html

fsteeg/lodmill