These Docker images can be used to either download and convert metadata files or to build prepopulated images containing data. they can be used to provide Docker volumes for enrichment or Kubernetes side cars to populate persistent volumes.
docker buildx build -f docker/harvester/Dockerfile .
docker buildx build -f docker/converter/Dockerfile .
Just edit the file harvester/config.yaml
and mount it readonly for the run. Also provide a data directory:
docker run -it --mount type=bind,source="$(pwd)"/data,target=/data --mount type=bind,source="$(pwd)"/harvester/config.yaml,target=/opt/harvester/config.yaml,readonly ghcr.io/cmahnke/oai-harvester-docker/harvester
Makes sure to change the /data
mount target to the path in your config.yaml
and that the local directory exits.
Pass configuration via environment, also make sure too provide a data directory:
docker run -it -e OAI_URL=https://www.kenom.de/oai/ -e METADATA_PREFIX=lido -e SET=institution:DE-MUS-062622 -e TARGET_DIR=/data --mount type=bind,source="$(pwd)"/data,target=/data ghcr.io/cmahnke/oai-harvester-docker/harvester
See table below for possible values.
docker run -e OAI_USER=$OAI_USER -e OAI_PASS=$OAI_PASS --mount type=bind,source="$(pwd)"/harvester/config.yaml,target=/opt/harvester/config.yaml,readonly --mount type=bind,source="$(pwd)"/data,target=/data ghcr.io/cmahnke/oai-harvester-docker/converter
If a file /opt/xslt/dafault.xsl
is provided, the container will use it to transform data.
docker run -i --mount type=bind,source="$(pwd)"/conf/config.yaml,target=/opt/harvester/config.yaml,readonly --mount type=bind,source="$(pwd)"/data,target=/data --mount type=bind,source="$(pwd)"/xslt/stylesheet.xsl,target=/opt/xslt/default.xsl ghcr.io/cmahnke/oai-harvester-docker/converter
Make sure to ommit -t
to seperate stderr
and stdout
. This way one can pipe the results directly into a file or another command for post precessing. For example jq -r '.[] | .link'
, to extract the array of results and then the property link
as a raw string.
Name | Value |
---|---|
oai_url | URL of the OAI PMH endpoint of a repository |
metadata_prefix | The matadata prefix (format) to use |
set | The requested metadata set |
target_dir | The direcory inside the container to save files to, can be mounted |
user | Username for authentification |
pass | Password for authentification |
mode | Change to 'ListIdentifiers' if you want to get each Record by listing the identifiers first |
These configuration settings can also be passed via env in upper case.
The esiest way to use the images to provide prepopulated volumes is to build your own Docker images and referencing the base images in FROM
.
The converter
image provides saxon
, jq
and curl
for transformations and downloading. These can also be used during the build of an image. The script /usr/local/sbin/cleanup.sh
removes these tools to keep the image small.
A possible Workflow would be:
- Extend the
converter
image - Provide a
config.yaml
for the desired OAI interface - Start the harvester
- Convert all metadata files
- Clean the image from tools using
cleanup.sh
The following variables are provided
Name | Value |
---|---|
HARVESTER_HOME | /opt/harvester |
SAXON_DIR | /opt/saxon |
XSLT_DIR | /opt/xslt |
DATA_DIR | /data |
Running Saxon from within the Dockerfile
is quite easy, make sure you pass the class path (-cp
) right, including $SAXON_DIR/saxon.jar
and $SAXON_DIR/xmlresolver.jar
. The following example uses an empty input XML file (empty.xml
), since the transformation works directly on a collection (DATA_DIR
).
java -Xmx1024m -cp $SAXON_DIR/saxon.jar:$SAXON_DIR/xmlresolver.jar net.sf.saxon.Transform -xsl:"/opt/xslt/extract-image-identifiers.xsl" -s:"/opt/xslt/empty.xml" collection=$DATA_DIR > /data/identifiers.lst && \
You can also use curl
to download files that might have been extracted from the metadata into a file list:
URLPREFIX=https://imageurl
while read IDENTIFIER; do
echo "Requesting $URLPREFIX$IDENTIFIER"
curl -sS -o /dev/null "$URLPREFIX$IDENTIFIER"
done < /data/identifiers.lst