bib-rdf-pipeline

This repository contains scripts and configuration for converting MARC bibliographic records into RDF, for use at the National Library of Finland.

The main component is a conversion pipeline driven by a Makefile, whose rules carry out each conversion step using command line tools.

The steps of the conversion are:

Start with a file of MARC records in Aleph sequential format
Split the file into smaller batches
Preprocess using Unix tools such as grep and sed to remove some local peculiarities
Convert to MARCXML and enrich the MARC records, using Catmandu
Run the Library of Congress marc2bibframe XQuery conversion from MARC to BIBFRAME RDF, using marc2bibframe-wrapper
Calculate work keys (e.g. author+title combination) used later for merging data about the same creative work
Convert the BIBFRAME data into Schema.org RDF in N-Triples format
Merge the Schema.org data about the same works
Convert the raw Schema.org data to HDT format so the full data set can be queried with SPARQL from the command line
Consolidate the data, for example by rewriting URIs and moving subjects into the original work
Convert the consolidated data to HDT
??? (TBD)
Profit!
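
The work key step exists so that records describing the same creative work can be merged later. As a rough illustration of the idea only, the shell sketch below builds an author+title key by lowercasing, dropping punctuation and collapsing whitespace. The function names normalize and work_key and the normalization rules are assumptions made for this example, not the rules the pipeline itself applies, and the sketch handles ASCII input only.

```shell
#!/bin/sh
# Hypothetical helper: reduce a string to lowercase ASCII letters, digits
# and single spaces. Non-ASCII characters are simply dropped here; real
# bibliographic data would need proper Unicode handling instead.
normalize() {
    printf '%s\n' "$1" |
        LC_ALL=C tr '[:upper:]' '[:lower:]' |
        LC_ALL=C sed -e 's/[^a-z0-9 ]//g' -e 's/  */ /g' -e 's/^ //' -e 's/ $//'
}

# Hypothetical helper: join the normalized author and title into one key.
work_key() {
    printf '%s/%s\n' "$(normalize "$1")" "$(normalize "$2")"
}

# Two differently formatted descriptions of the same work
work_key 'Tolkien, J.R.R.' 'The Hobbit'
work_key 'TOLKIEN, J.R.R' 'The  Hobbit.'
```

Both calls print tolkien jrr/the hobbit, so the two records would be grouped under the same work.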

Dependencies

Command line tools are assumed to be available in $PATH, but the paths can be overridden on the make command line, e.g. make CATMANDU=/opt/catmandu

For running the main suite

Apache Jena command line utilities sparql and rsparql
Catmandu utility catmandu
uconv utility from Ubuntu package icu-devtools
marc2bibframe-wrapper and marc2bibframe
hdt-cpp command line utilities rdf2hdt and hdtSearch
hdt-java command line utility hdtsparql.sh
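
Before a long run it can help to confirm that the command line tools listed above resolve in $PATH. The following is a minimal sketch and not part of the repository; the function name missing_tools is hypothetical, and the tool names are taken from the list above.

```shell
#!/bin/sh
# Hypothetical helper: print every given command that cannot be found
# in $PATH, one per line, and return non-zero if any are missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Check the tools needed for the main suite
missing_tools sparql rsparql catmandu uconv rdf2hdt hdtSearch hdtsparql.sh ||
    echo "Some tools are missing from PATH (listed above)." >&2
```

If a tool is installed outside $PATH, its location can instead be passed on the make command line, as in make CATMANDU=/opt/catmandu.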

For running the unit tests

In addition to the above:

bats in $PATH
xmllint utility from Ubuntu package libxml2-utils in $PATH
