Customization of https://github.com/biocodellc/ontology-data-pipeline to work with the FuTRES Ontology for Vertebrate Traits

fovt-data-pipeline

The fovt-data-pipeline contains scripts to process, reason over, and load data for the FuTRES project. Processed data is loaded into an Elasticsearch document store and made accessible to the FuTRES query interface and the FuTRES R package; it also populates the summary statistics that drive the FuTRES website dashboard. This repository aggregates FuTRES trait data that has been loaded into GEOME as well as VertNet. Detailed instructions on loading data into GEOME, using the FuTRES team, are provided on the FuTRES website. Please note that this repository is designed to process millions of records from multiple repositories and is fairly complex. To give interested users an idea of how the reasoning steps work, we have provided a Simple Start section below demonstrating this crucial part of the process.

Credits: This codebase draws on the Ontology Data Pipeline for triplifying and reasoning, the FuTRES Ontology for Vertebrate Traits as the source ontology, and ROBOT as a contributing library for the reasoning steps. Data processing scripts for assembling VertNet data extracts and getting legacy data ready for ingest into GEOME are stored at fovt-data-mapping.

Simple Start

To quickly test the validation, triplifying, and reasoning steps, start here. You must first check out Ontology Data Pipeline at the same level as this repository. The following command runs the pipeline on a limited set of data and should finish in a minute or two.

python ../ontology-data-pipeline/pipeline.py -v --drop_invalid  sample_data_processed.csv sample_data/output https://raw.githubusercontent.com/futres/fovt/master/fovt.owl config
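If you prefer to drive the pipeline from a Python script, the command above can be assembled as an argument list. This is only a minimal sketch: the function names build_pipeline_command and run_pipeline are hypothetical helpers, not part of this repository, and the arguments simply mirror the command shown above.

```python
import subprocess
import sys

# Ontology URL taken from the command above.
FOVT_OWL = "https://raw.githubusercontent.com/futres/fovt/master/fovt.owl"


def build_pipeline_command(input_csv, output_dir, ontology=FOVT_OWL,
                           config_dir="config",
                           pipeline="../ontology-data-pipeline/pipeline.py"):
    """Return the argument list for the pipeline invocation shown above."""
    return [sys.executable, pipeline, "-v", "--drop_invalid",
            input_csv, output_dir, ontology, config_dir]


def run_pipeline(input_csv, output_dir):
    """Run the pipeline and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_pipeline_command(input_csv, output_dir), check=True)
```

Calling run_pipeline("sample_data_processed.csv", "sample_data/output") from the root of this repository would then be equivalent to the command above, assuming Ontology Data Pipeline is checked out alongside it.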

Complete Process

Here we follow the complete process for processing FuTRES data. The steps below are completed sequentially with outputs from earlier steps being used as input to later steps.

STEP 0: Updating files dependent on the ontology

Some helper files used in the FuTRES environment must be regenerated after the FOVT ontology is updated. This step is only necessary after an ontology update and is not meant to be run every time data is processed; skip it if you are only re-processing data.

# checkout biscicol-server
cd biscicol-server/scripts
# the following script writes JSON files which drive lookup lists on the 
# FuTRES query interface as well as a GEOME output file: all_geome.json
node futresapi.ontology.sh

The next step is updating the GEOME controlled vocabulary list for measurementType. Use the all_geome.json file to update the GEOME team environment. Complete instructions are at geome-configurations; however, here is an abbreviated method to be run in the geome-configurations repo:

curl https://api.geome-db.org/projects/configs/70 | gunzip - | python -m json.tool > 70.json
# edit 70.json file, inserting all_geome.json file into the measurementType field list
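# replace YOUR_ACCESS_TOKEN with your own GEOME access token; the URL is quoted so the shell does not interpret the "?"
curl -X PUT -H 'Content-Type: application/json' --data "@70.json" "https://api.geome-db.org/projects/configs/70?access_token=YOUR_ACCESS_TOKEN"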
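The manual edit of 70.json could also be scripted. The sketch below is illustrative only and rests on assumptions that are not confirmed by this README: that the project configuration has a top-level "lists" array whose entries carry "alias" and "fields" keys, and that all_geome.json holds the replacement field entries. Inspect your own 70.json before relying on it. The function name replace_list_fields is hypothetical.

```python
import json


def replace_list_fields(config, list_alias, new_fields):
    """Replace the 'fields' of the controlled-vocabulary list named list_alias.

    ASSUMPTION: the config has a top-level "lists" array whose items carry
    "alias" and "fields" keys. Check this against your 70.json first.
    """
    for vocab in config.get("lists", []):
        if vocab.get("alias") == list_alias:
            vocab["fields"] = new_fields
            return config
    raise KeyError(f"no list with alias {list_alias!r} found")


def update_config_file(config_path, fields_path, list_alias="measurementType"):
    """Load both JSON files, swap in the new fields, and rewrite the config."""
    with open(config_path, encoding="utf-8") as handle:
        config = json.load(handle)
    with open(fields_path, encoding="utf-8") as handle:
        new_fields = json.load(handle)
    replace_list_fields(config, list_alias, new_fields)
    with open(config_path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=4)
```

Under those assumptions, update_config_file("70.json", "all_geome.json") would replace the hand edit; the KeyError makes it obvious when the assumed layout does not match.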

STEP 1: Pre-processing

The pre-processing step obtains data from remote sources and populates data tables which are then used in the reasoning step. It produces summary statistics for the FuTRES website and assembles all data sources into a single file, data/futres_data_processed.csv.

Installation

First, we need to set up our environment to be able to connect to remote data stores and set up our Python working environment:

  • Copy dbtemp.ini to db.ini and update credentials locally
  • Ensure you are running Python 3.6.8 or later. We recommend using pyenv to manage your environment; see https://github.com/pyenv/pyenv-virtualenv
  • pip install -r requirements.txt
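A quick way to confirm the interpreter requirement from the list above is to compare sys.version_info against the stated minimum. This is a small sketch; the helper name python_is_supported is hypothetical and not part of this repository.

```python
import sys

MINIMUM_PYTHON = (3, 6, 8)  # minimum version stated in the list above


def python_is_supported(version_info=None, minimum=MINIMUM_PYTHON):
    """Return True if the given (or running) interpreter meets the minimum."""
    current = tuple(version_info or sys.version_info)[:3]
    return current >= minimum
```

Calling python_is_supported() with no arguments checks the interpreter that is currently running, which is useful right after activating a new virtual environment.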

Here is how to create a virtual environment specific to FuTRES (assuming you have already set up pyenv):

# install a python version
pyenv install 3.7.2

# create a virtual environment named futres-api
pyenv virtualenv 3.7.2 futres-api

# automatically activate futres-api whenever you navigate to this directory
pyenv local futres-api

Fetching VertNet data

VertNet data extracts are stored in a directory called vertnet immediately under the root directory of this repository. This directory is ignored in the .gitignore file. You will first need to copy the VertNet data extracts from the CyVerse Discovery Environment. See getDiscoveryEnvironmentData.md for instructions on copying the VertNet data. The script will copy any files with a CSV extension under the vertnet directory.
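To check which extracts are in place before running the fetch script, the CSV files under the vertnet directory can be listed with a few lines of Python. This is a sketch under assumptions: it searches subdirectories recursively and matches the extension case-insensitively, which may or may not match what the repository's own scripts do. The function name find_vertnet_csvs is hypothetical.

```python
from pathlib import Path


def find_vertnet_csvs(vertnet_dir="vertnet"):
    """Return sorted paths of all .csv files under vertnet_dir, recursively."""
    root = Path(vertnet_dir)
    if not root.is_dir():
        return []  # nothing has been copied yet
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".csv"
    )
```

An empty result usually means the extracts have not yet been copied from the CyVerse Discovery Environment.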

Running the Script

The fetch.py script fetches data from GEOME, looks in the vertnet directory for processed VertNet data, populates summary statistics as JSON files, and finally writes all processed data to a single file, data/futres_data_processed.csv. This file is used by the reasoning pipeline in Step 2 below. The fetch script is run using:

python fetch.py

The above script writes the processed data to data/futres_data_processed.csv and reports any rows removed from the data set during processing in an error log, data/futres_data_with_errors.csv. Take a look at the errors file and examine the "reason" column to see why a particular row was removed from the output. You may choose to go back and fix these records in the input data files and re-run the fetch.py command above.
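When the errors file is large, tallying the "reason" column gives a quick overview of what to fix first. The following is a minimal sketch that assumes the errors file is a comma-delimited, UTF-8 CSV with a header row; the function name count_removal_reasons is hypothetical.

```python
import csv
from collections import Counter


def count_removal_reasons(errors_csv="data/futres_data_with_errors.csv"):
    """Count how many rows were removed for each value in the "reason" column."""
    with open(errors_csv, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Rows with an empty or missing reason are grouped under one label.
        return Counter(row.get("reason") or "(no reason given)" for row in reader)
```

For example, count_removal_reasons().most_common(5) lists the five most frequent reasons together with their row counts.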

STEP 2: Running the Reasoner

First test the environment by following the instructions under 'Simple Start' above. This will verify that things are set up correctly. Then run the ontology-data-pipeline using data/futres_data_processed.csv as the input data file, data/output as the output directory, and the configuration files stored in the config directory. The following command uses our configuration files to first create a triplified view of the data in data/output/output_unreasoned. These files serve as the source for the reasoning step, whose results are stored in data/output/output_reasoned. The output files from the reasoning step are then processed using SPARQL to write files into data/output/output_reasoned_csv.

python ../ontology-data-pipeline/pipeline.py -v --drop_invalid  data/futres_data_processed.csv data/output https://raw.githubusercontent.com/futres/fovt/master/fovt.owl config

*NOTE 1: the input data file to reason over must be located within the root-level hierarchy of this repository. We have provided the data/ directory for input and output data files, although you can use any directory under the root.

*NOTE 2: examine the log output from running the above command to see if any files failed to execute. You may get a message that one or more files failed at a particular step. If this happens, you can often re-run the reasoner by copying the failed command and re-running it (making sure to also re-run the subsequent steps for the failed file, which are listed in the log output). If you still get an error, you can add the --vvv option to the reasoning command to get detailed log output, which often provides clues as to why reasoning failed. This looks, for example, like:

java -Xmx2048m -jar /Users/jdeck/IdeaProjects/ontology-data-pipeline/process/../lib/robot.jar reason -r elk --vvv --axiom-generators "\"InverseObjectProperties ClassAssertion\"" -i data/output/output_unreasoned/data_7.ttl --include-indirect true --exclude-tautologies structural reduce -o data/output/output_reasoned/data_7.ttl

Most often, failed reasoning has to do with unescaped special characters in data properties. These special characters should either be escaped or removed, which requires the whole process to be re-run.
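Besides reading the log, one way to spot files that failed the reasoning step is to compare the two output directories. The sketch below assumes, as in the example command above, that each file in output_unreasoned produces a file with the same name in output_reasoned; the function name missing_reasoned_files is hypothetical.

```python
from pathlib import Path


def missing_reasoned_files(output_dir="data/output"):
    """List unreasoned .ttl files that have no same-named reasoned output."""
    unreasoned = Path(output_dir) / "output_unreasoned"
    reasoned = Path(output_dir) / "output_reasoned"
    return sorted(
        source.name for source in unreasoned.glob("*.ttl")
        if not (reasoned / source.name).is_file()
    )
```

Any file names returned are candidates for re-running the reasoning command by hand. A file that exists but is incomplete would not be caught by this check, so the log output remains the authoritative source.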

STEP 3: Loading Data Into Document Store

The loader.py script populates the Elasticsearch document store. The host, index, and directory to search for files are set directly in the script. If this repository is forked, these values can be changed directly in the code.

OPTIONAL: Since the data can be quite large and the loader.py script sends uncompressed data, you may first want to send the files from your desktop machine to a remote server that has excellent bandwidth. This command would look like:

# replace server IP and exouser with your server and username
tar zcvf - data/output/output_reasoned_csv/* | ssh exouser@149.165.159.216  "cd /home/exouser/data/futres; tar xvzf -"

Once your data is transferred to the server that you wish to load from, you can execute the following command, which looks for data in data/output/output_reasoned_csv/data*.csv. Note that if you copied your data to another server, as we did in the previous command, you will also need to check out fovt-data-pipeline on that server to run the next command. You will first want to edit loader.py and change the data_dir variable near the end of the script to the directory on your computer where the output is stored. This command requires access to your remote document store.

python loader.py
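Before loading, it can help to confirm that the expected files are present and roughly how many records they hold. This is a sketch only and does not reflect how loader.py itself works: it assumes every data*.csv file begins with a single header row, and the function name summarize_load_files is hypothetical.

```python
import csv
from pathlib import Path


def summarize_load_files(data_dir="data/output/output_reasoned_csv"):
    """Return {filename: data_row_count} for every data*.csv in data_dir."""
    summary = {}
    for path in sorted(Path(data_dir).glob("data*.csv")):
        with open(path, newline="", encoding="utf-8") as handle:
            rows = sum(1 for _ in csv.reader(handle))
        summary[path.name] = max(rows - 1, 0)  # subtract the assumed header row
    return summary
```

An empty dictionary means no files matched, which usually indicates that data_dir points at the wrong directory.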

STEP 4: API Proxy updates

The repository biscicol-server has additional functions for serving the loaded FuTRES data at the https://futres.org/ website, including:

  • updating FOVT ontology lookups (with links to updating GEOME controlled vocabularies) and dynamic links for generating ontology lookup lists for the FuTRES website
  • a nodejs script, scripts/futres.fetchall.js, for bundling all FuTRES data into a single zip archive, which is handy for R work where you want to look at all of the FuTRES data. You will first need to clone biscicol-server; the script is then run like:
cd biscicol-server  
cd scripts
node futres.fetchall.js

Application Programming Interface

This repository generates files in the pre-processing step which serve as an API. These files are referenced at https://github.com/futres/fovt-data-pipeline/blob/master/api.md. In addition to this data source, there is a dynamic data service which references the files that were loaded into Elasticsearch in the "Loading Data" step above. The FuTRES dynamic data is hosted by the plantphenology nodejs proxy service, documented at https://github.com/biocodellc/ppo-data-server/blob/master/docs/es_futres_proxy.md. The endpoints to that datastore are: