/mabxml-elasticsearch

Raw hbz union catalog data exposed via a web API

Primary LanguageJava

About

Index MAB-XML into Elasticsearch using Metafacture an serve it with Playframework.

Setup

Prerequisites: Java 8; verify with java -version

Create and change into a folder where you want to store the project:

  • mkdir ~/git ; cd ~/git

Get and change into the mabxml-elasticsearch repo:

  • git clone https://github.com/hbz/mabxml-elasticsearch.git
  • cd mabxml-elasticsearch

See the .github/workflows/build.yml file for details on the CI config used by Github Actions.

Index server setup

See also: Elasticsearch installation steps.

Download the latest 5.6.x Elasticsearch release, e.g. on Linux:

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.16.zip

Unzip it and change into the new directory:

unzip elasticsearch-5.6.16.zip ; cd elasticsearch-5.6.16

Run the elasticsearch application in the bin/ folder in daemon mode (output is logged to logs/elasticsearch.log), and record the process id:

bin/elasticsearch -d -p pid

Access your local Elasticsearch server:

curl -X GET http://localhost:9200/

To shut down the Elasticsearch server, kill the process recorded in the pid file on startup:

kill `cat pid`

To continue with the setup and usage below, leave the server running or restart it, and change back to the project root directory:

cd ..

Web server setup

Download the minimal activator application (optionally, there’s an offline version available, see Playframework downloads documentation) to run the Play server:

wget https://downloads.typesafe.com/typesafe-activator/1.3.9/typesafe-activator-1.3.9-minimal.zip

Unzip it:

unzip typesafe-activator-1.3.9-minimal.zip

Start the Play server from the project root in background production mode (output is logged to console and logs/application.log, for development mode replace start with run):

activator-1.3.9-minimal/bin/activator start

The web applications index page can now be accessed at http://localhost:9000/hbz01.

Press Ctrl+D to return to the shell (since we called start, the server remains in background).

Transformation

To transform and index the data, POST to the transform/ route and pass arguments as query parameters.

Pass a directory with the data to transform (full local path, change sample below for your system), the file suffix, your Elasticsearch cluster name, node IP number, and index name, e.g.:

curl -XPOST "http://localhost:9000/hbz01/transform?dir=/home/fsteeg/git/mabxml-elasticsearch/test/&suffix=bz2&cluster=elasticsearch&hostname=127.0.0.1&index=hbz01"

This will index the data from the specified location to the cluster ‘elasticsearch’, using node ‘127.0.0.1’, into an index called ‘hbz01’.

Access

Index server data access

You can then GET a specific record in the index by hbz ID:

curl -XGET 'http://127.0.0.1:9200/hbz01/mabxml/HT012786619'; echo

You can also exclude the Elasticsearch metadata:

curl -XGET 'http://127.0.0.1:9200/hbz01/mabxml/HT012786619/_source'; echo

For details on the various options see the GET API documentation.

Web server data access

You can also GET data by ID using the Play server:

curl http://localhost:9000/hbz01/HT017665866

Unlike the Elasticsearch index queries above (which serve JSON), this serves XML:

curl http://localhost:9000/hbz01/HT017665866 | xmllint --format -

To shut down the server, kill the process recorded in the RUNNING_PID file:

kill `cat target/universal/stage/RUNNING_PID`

When running in foreground development mode (activator run), hitting CTRL+D stops the server.

Deployment

We run this transformation daily using a cron job that calls the cron.sh script. Internal documentation: to fully understand what is done when, trace the entries in crontab of hduser@weywot1.

The final index data is served at http://lobid.org/hbz01, with individual resource URLs like http://lobid.org/hbz01/HT012786619. Internal documentation: the application is deployed at sol@quaoar1:~/git/mabxml-elasticsearch, an Apache proxy is set up at emphytos:/etc/apache2/vhosts.d/lobid.org.conf.

License

Eclipse Public License: http://www.eclipse.org/legal/epl-v10.html