Index MAB-XML into Elasticsearch using Metafacture and serve it with Playframework.
Prerequisites: Java 8; verify with java -version
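If you want to script the prerequisite check, a minimal sketch could look like the following. The function names (java_major, installed_java_version, check_java8) are made up for illustration and are not part of this repo; the sketch assumes that `java -version` prints a quoted version string on its first line, as common JDK builds do.

```shell
#!/bin/sh
# Extract the major Java version from a version string such as
# "1.8.0_292" (Java 8) or "11.0.2" (Java 11).
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;   # legacy scheme: 1.8.x means Java 8
    *)   echo "$1" | cut -d. -f1 ;;   # newer scheme: 11.x means Java 11
  esac
}

# `java -version` prints to stderr; the first line usually looks like:
#   openjdk version "1.8.0_292"
# cut -s prints nothing if the line contains no double quote.
installed_java_version() {
  java -version 2>&1 | head -n 1 | cut -s -d'"' -f2
}

# Succeeds only if the installed Java reports major version 8.
check_java8() {
  version=$(installed_java_version)
  if [ "$(java_major "$version")" = "8" ]; then
    echo "Java 8 found ($version)"
  else
    echo "Expected Java 8, found: ${version:-none}" >&2
    return 1
  fi
}
```

Calling `check_java8 || exit 1` at the top of a setup script would stop early when a different Java version is on the PATH.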
Create and change into a folder where you want to store the project:
mkdir ~/git ; cd ~/git
Get and change into the mabxml-elasticsearch repo:
git clone https://github.com/hbz/mabxml-elasticsearch.git
cd mabxml-elasticsearch
See the .github/workflows/build.yml file for details on the CI config used by GitHub Actions.
See also: Elasticsearch installation steps.
Download the latest 5.6.x Elasticsearch release, e.g. on Linux:
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.16.zip
Unzip it and change into the new directory:
unzip elasticsearch-5.6.16.zip ; cd elasticsearch-5.6.16
Run the elasticsearch application in the bin/ folder in daemon mode (output is logged to logs/elasticsearch.log), and record the process ID:
bin/elasticsearch -d -p pid
Access your local Elasticsearch server:
curl -X GET http://localhost:9200/
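Because the server is started in daemon mode, it may take a few seconds before it answers requests. As a rough sketch (the wait_for_url helper and its default values are assumptions for illustration, not part of this repo), a script could poll the URL until the server responds:

```shell
#!/bin/sh
# Poll a URL until it responds successfully or the attempts run out.
# Arguments: URL, max attempts (default 30), seconds between attempts (default 1)
wait_for_url() {
  url=$1
  attempts=${2:-30}
  delay=${3:-1}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s: silent, -f: treat HTTP errors as failure, -o /dev/null: discard the body
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "No response from $url after $attempts attempts" >&2
  return 1
}
```

For example, `wait_for_url http://localhost:9200/ && curl -X GET http://localhost:9200/` only queries the server once it is reachable.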
To shut down the Elasticsearch server, kill the process recorded in the pid file on startup:
kill `cat pid`
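The plain kill command prints an error if the pid file is missing or the process has already exited. A slightly more defensive variant, sketched here with a hypothetical helper name (stop_by_pidfile is not part of this repo), checks both conditions first:

```shell
#!/bin/sh
# Stop the process whose ID is stored in a pid file, if it is still running.
stop_by_pidfile() {
  pidfile=$1
  if [ ! -f "$pidfile" ]; then
    echo "No pid file at $pidfile; nothing to stop" >&2
    return 1
  fi
  pid=$(cat "$pidfile")
  # kill -0 sends no signal; it only checks that the process exists
  if kill -0 "$pid" 2>/dev/null; then
    kill "$pid"
  else
    echo "Process $pid is not running" >&2
    return 1
  fi
}
```

From the Elasticsearch directory, `stop_by_pidfile pid` then does the same as the command above, but returns a non-zero status with a message when there is nothing to stop.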
To continue with the setup and usage below, leave the server running or restart it, and change back to the project root directory:
cd ..
Download the minimal activator application to run the Play server (an offline version is also available, see the Playframework downloads documentation):
wget https://downloads.typesafe.com/typesafe-activator/1.3.9/typesafe-activator-1.3.9-minimal.zip
Unzip it:
unzip typesafe-activator-1.3.9-minimal.zip
Start the Play server from the project root in background production mode (output is logged to the console and to logs/application.log; for development mode, replace start with run):
activator-1.3.9-minimal/bin/activator start
The web application's index page can now be accessed at http://localhost:9000/hbz01.
Press Ctrl+D to return to the shell (since we called start, the server remains in the background).
To transform and index the data, POST to the transform/ route and pass arguments as query parameters.
Pass the directory containing the data to transform (full local path; change the sample below for your system), the file suffix, your Elasticsearch cluster name, the node IP address, and the index name, e.g.:
curl -XPOST "http://localhost:9000/hbz01/transform?dir=/home/fsteeg/git/mabxml-elasticsearch/test/&suffix=bz2&cluster=elasticsearch&hostname=127.0.0.1&index=hbz01"
This will index the data from the specified location to the cluster ‘elasticsearch’, using node ‘127.0.0.1’, into an index called ‘hbz01’.
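To avoid editing the long URL by hand, the five query parameters can be assembled by a small helper. This is only a sketch: transform_url and the BASE_URL variable are made-up names, and the values are inserted without URL encoding, so it assumes paths without spaces or other special characters.

```shell
#!/bin/sh
# Build the transform URL from its five query parameters.
# BASE_URL is a hypothetical override; it defaults to the local Play server.
# Note: values are not URL-encoded.
transform_url() {
  dir=$1
  suffix=$2
  cluster=$3
  hostname=$4
  index=$5
  base=${BASE_URL:-http://localhost:9000/hbz01}
  printf '%s/transform?dir=%s&suffix=%s&cluster=%s&hostname=%s&index=%s\n' \
    "$base" "$dir" "$suffix" "$cluster" "$hostname" "$index"
}

# Example: index the data in the repo's test/ directory.
# Run from the project root so that $PWD/test/ is a full local path:
#   curl -XPOST "$(transform_url "$PWD/test/" bz2 elasticsearch 127.0.0.1 hbz01)"
```

Using $PWD keeps the call independent of the user-specific home directory in the sample above.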
You can then GET a specific record in the index by hbz ID:
curl -XGET 'http://127.0.0.1:9200/hbz01/mabxml/HT012786619'; echo
You can also exclude the Elasticsearch metadata:
curl -XGET 'http://127.0.0.1:9200/hbz01/mabxml/HT012786619/_source'; echo
For details on the various options see the GET API documentation.
You can also GET data by ID using the Play server:
curl http://localhost:9000/hbz01/HT017665866
Unlike the Elasticsearch index queries above (which serve JSON), this serves XML:
curl http://localhost:9000/hbz01/HT017665866 | xmllint --format -
To shut down the server, kill the process recorded in the RUNNING_PID file:
kill `cat target/universal/stage/RUNNING_PID`
When running in foreground development mode (activator run), pressing Ctrl+D stops the server.
We run this transformation daily using a cron job that calls the cron.sh script. Internal documentation: to fully understand what is done when, trace the entries in the crontab of hduser@weywot1.
The final index data is served at http://lobid.org/hbz01, with individual resource URLs like http://lobid.org/hbz01/HT012786619. Internal documentation: the application is deployed at sol@quaoar1:~/git/mabxml-elasticsearch; an Apache proxy is set up at emphytos:/etc/apache2/vhosts.d/lobid.org.conf.
Eclipse Public License: http://www.eclipse.org/legal/epl-v10.html