Living Atlas Pipelines extensions
This project is proof-of-concept quality code aimed at identifying the work required to use pipelines as a replacement for biocache-store for data ingress.
Architecture
For details on the GBIF implementation, see the pipelines github repository. This project is focused on extending that architecture to support use by the living atlases.
Above is a representation of the data flow from source data in Darwin Core archives supplied by data providers, through to API access to these data via the biocache-service component.
Within the "Interpreted AVRO" box is a list of "transforms", each of which takes the source data and produces an isolated output in an AVRO-formatted file.
GBIF's pipelines project already supports a number of core transforms for handling biodiversity occurrence data. The intention is to make use of these transforms "as-is", since they provide very similar functionality to that supported by the biocache-store project (ALA's current ingress library for occurrence data).
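As a purely illustrative sketch (the directory and file names here are placeholders; the real layout is determined by the pipelines code and configuration), the interpreted output for a dataset ends up as one AVRO output per transform, along the lines of:
<pipelines-data>/<dataset-id>/interpreted/basic/interpret-*.avro
<pipelines-data>/<dataset-id>/interpreted/location/interpret-*.avro
<pipelines-data>/<dataset-id>/interpreted/temporal/interpret-*.avro
<pipelines-data>/<dataset-id>/interpreted/ala_taxonomy/interpret-*.avro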
The following transforms will need to be added to backfill some of the ingress requirements of the ALA. These transforms will make use of existing ALA services:
- ALA Taxonomy transform - will make use of the existing ala-name-matching library
- Sensitive data - will make use of existing services in https://lists.ala.org.au to retrieve sensitive species rules.
- Spatial layers - will make use of existing services in https://spatial.ala.org.au/ws/ to retrieve sampled environmental and contextual values for geospatial points
- Species lists - will make use of existing services in https://lists.ala.org.au to retrieve species lists.
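As an illustration of the kind of existing services these transforms will call, the services can be queried directly with curl. The endpoints and layer ids below are examples only and may change; consult each service's documentation:
curl "https://spatial.ala.org.au/ws/intersect/cl22/-29.91/132.76"
curl "https://lists.ala.org.au/ws/speciesList"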
In addition, pipelines for the following will need to be developed:
- Duplicate detection
- Environmental outlier detection
- Expert distribution outliers
Prototyped so far:
- Pipeline extension to add the ALA taxonomy to the interpreted data
- Extension with Sampling information (Environmental & Contextual)
- Generation of a searchable SOLR index compatible with biocache-service
To be done:
- Sensible use of GBIF's key/value store framework (backend storage to be identified)
- Dealing with sensitive data
- Integration with Collectory - ALA's current production metadata registry
- Integration with Lists tool
- Extensions with separate taxonomies e.g. NZOR
- Handling of images with ALA's image-service as storage
Dependent projects
The pipelines work will necessitate some minor API additions and changes to the following components:
biocache-service
See the experimental/pipelines branch. The aim for this proof of concept is to make very minimal changes to biocache-service, maintain the existing API, and have no impact on existing services and applications.
ala-namematching-service
A simple Dropwizard wrapper around the ala-name-matching library has been prototyped to support integration with pipelines.
Getting started
In the absence of ansible scripts, here are some instructions for setting up a local development environment for pipelines. These steps will load a dataset into SOLR.
Software requirements:
- Java 8 - this is mandatory (see GBIF pipelines documentation)
- Maven needs to run on OpenJDK 1.8. To configure this, edit ~/.mavenrc and add export JAVA_HOME=[JDK1.8 PATH] (see the example after this list).
- Docker Desktop
- The Lombok plugin for IntelliJ needs to be installed for the slf4j annotations.
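For example, the following appends the required line to ~/.mavenrc; the JDK path is a placeholder, and on macOS /usr/libexec/java_home -v 1.8 prints a suitable value to substitute:
echo 'export JAVA_HOME=[JDK1.8 PATH]' >> ~/.mavenrc
mvn -version
The mvn -version output should then report Java 1.8.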
Prerequisite services
- Run ala-namematching-service on port 9179 using the docker-compose file like so:
docker-compose -f ala-nameservice.yml up -d
You can test it by checking this URL: http://localhost:9179/api/search?q=Acacia
- Run SOLR on port 8983 using the docker-compose file like so:
docker-compose -f solr8.yml up -d
and then set up the collection using the following script:
./update-solr-config.sh
You can test it by checking this URL: http://localhost:8983
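From a shell, both prerequisite services can be checked with curl (the /solr/ path below is the standard SOLR admin endpoint, assumed here rather than taken from the project docs):
curl "http://localhost:9179/api/search?q=Acacia"
curl -s -o /dev/null -w "%{http_code}\n" "http://localhost:8983/solr/"
The second command should print 200 once SOLR is up.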
Run la-pipeline
- Download the shape files from here and expand them into the /data/pipelines-shp directory.
- Download a Darwin Core archive (e.g. https://archives.ala.org.au/archives/gbif/dr893/dr893.zip) and expand it into /data/biocache-load, e.g. /data/biocache-load/dr893.
- Create the /data/pipelines-data directory (a scripted version of these preparation steps is shown after this list).
- Build with maven
mvn clean install
- To convert DwCA to AVRO, run
./dwca-avro.sh dr893
- To interpret, run
./interpret-spark-embedded.sh dr893
- To mint UUIDs, run
./uuid-spark-embedded.sh dr893
- To sample, run
./export-latlng.sh dr893
./sample.sh dr893
./sample-avro-embedded.sh dr893
- To index, run
./index.sh dr893
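The directory preparation and archive download steps above can be scripted; a minimal sketch using the example dr893 archive (substitute your own data resource id and archive URL):
mkdir -p /data/pipelines-shp /data/biocache-load /data/pipelines-data
curl -L -o /tmp/dr893.zip https://archives.ala.org.au/archives/gbif/dr893/dr893.zip
unzip /tmp/dr893.zip -d /data/biocache-load/dr893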
Integration Tests
Integration testing is supported using docker containers; the tests rely on these containers being available. To start the required containers, run the following:
docker-compose -f ala-nameservice.yml up -d
docker-compose -f solr8.yml up -d
To shut down, run the following:
docker-compose -f ala-nameservice.yml kill
docker-compose -f solr8.yml kill
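With the containers running, the tests themselves are executed through Maven. A minimal sketch, assuming the integration tests are bound to the standard Maven verify phase (check the project POMs for the exact setup):
mvn verify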
Code style and tools
For code style and tools, see the recommendations on the GBIF pipelines project. In particular, note that the project uses Project Lombok; please install the Lombok plugin for IntelliJ IDEA.
avro-tools
is recommended to aid development by providing quick views of AVRO outputs. It can be installed on macOS with brew:
brew install avro-tools
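For example, an interpreted AVRO output can be dumped to JSON or have its schema printed (the file path below is illustrative; point avro-tools at an actual file under /data/pipelines-data):
avro-tools tojson /data/pipelines-data/dr893/1/interpreted/basic/interpret-00000-of-00001.avro | head
avro-tools getschema /data/pipelines-data/dr893/1/interpreted/basic/interpret-00000-of-00001.avro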