Modular, Scalable Phenomic Data Processing Pipelines

PhytoOracle | Modular, Scalable Phenomic Data Processing Pipeline

PhytoOracle Automation (POA) is a general-use, distributed computing pipeline for phenomic data. POA can be run on local or HPC resources and is capable of processing large phenomic datasets such as those collected by the Field Scanner at the University of Arizona's Maricopa Agricultural Center (pictured below, Photo: Jesse Rieser for The Wall Street Journal).

POA's distributed framework is built on CCTools' Makeflow and Work Queue, which lets users spread large data processing tasks across hundreds to thousands of computing cores in parallel. The pipeline is configured with a YAML file that specifies the processing steps run by the pipeline wrapper script (distributed_pipeline_wrapper.py).

Comprehensive instructions for gantry field operations, from field preparation to phenotype information extraction, can be found here.

Required Dependencies

YAML File

For more information on YAML file key/value pairs, click here.

Arguments/Flags

For more information on arguments/flags, click here.

Required setup

iRODS

The POA workflow requires iRODS. Follow the documentation here to install iRODS.

If you are running POA on the UA HPC, iRODS is already installed, so there is no need to reinstall it. Skip to the section "Linux & Windows Subsystem for Linux 2 (WSL2) users", bullet 3.

Data transfer node

If you are running POA on the UA HPC, you will need to set up SSH keys to gain access to data transfer nodes (DTNs). To set up SSH keys, follow the steps here.

Running POA

The script distributed_pipeline_wrapper.py is used to run the pipeline. This script downloads and extracts bundled test data, runs containers, and bundles output data.

Local computer

On your computer/server, run the following command:

./distributed_pipeline_wrapper.py -d 2020-02-14 -y yaml_files/example_machinelearning_workflow.yaml
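
To process more than one scan date locally, the command above can be wrapped in a loop. The following is a minimal sketch, assuming the wrapper is invoked once per date as shown above; the `DRY_RUN` switch, the `run_dates` function, and the second date are hypothetical and exist only for this example.

```shell
# Sketch: process several scan dates one after another on a local machine.
# Assumption: the wrapper handles one date per invocation (-d) with a YAML file (-y).
# DRY_RUN is a hypothetical switch for this example: 1 = print only, 0 = run.
YAML_FILE="yaml_files/example_machinelearning_workflow.yaml"
DRY_RUN="${DRY_RUN:-1}"

run_dates() {
  for scan_date in "$@"; do
    if [ "$DRY_RUN" = "1" ]; then
      # Show the command that would be executed for this date.
      echo "./distributed_pipeline_wrapper.py -d $scan_date -y $YAML_FILE"
    else
      # Stop at the first date that fails so errors are not silently skipped.
      ./distributed_pipeline_wrapper.py -d "$scan_date" -y "$YAML_FILE" || return 1
    fi
  done
}

run_dates 2020-02-14 2020-02-15
```

With `DRY_RUN` left at its default, the loop only prints the commands, which is a cheap way to check the date list before committing compute time.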

HPC cluster

There are three options when running POA on HPC clusters: interactive, non-interactive, and Cron.

Interactive

The pipeline can use a data transfer node to download data, which speeds up processing.

Interactive jobs should be run inside a tmux session so that the connection persists if you are disconnected. To install tmux on the UA HPC head node, follow the directions here.

You must first launch an interactive node using the following command on UA HPC Puma:

./shell_scripts/interactive_node.sh

Once the resources are allocated, run the following command to process data:

./distributed_pipeline_wrapper.py -hpc -d 2020-02-14 -y yaml_files/example_machinelearning_workflow.yaml

Data will be downloaded and workflows will be launched. You can view progress information for a specific workflow using the mf_monitor.sh script. For example, to view progress information for the first workflow, run:

./shell_scripts/mf_monitor.sh 1

Non-interactive

To submit a date for processing as a non-interactive job, run:

sbatch shell_scripts/slurm_submission.sh <yaml_file>

For example:

sbatch shell_scripts/slurm_submission.sh yaml_files/example_machinelearning_workflow.yaml

Make sure to change the account and partition values in the YAML file as needed. For modules that require a larger number of cores (e.g., Megastitch for stereoTop and flirIrCamera, and ps2Top), use slurm_submission_large.sh instead.
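
The choice between the two submission scripts can be captured in a small helper. This is only a sketch: the `large`/`standard` labels and both function names are invented for illustration, and it assumes slurm_submission_large.sh lives in shell_scripts/ and takes the YAML file as its argument in the same way as slurm_submission.sh.

```shell
# Sketch: pick the submission script by module size and print the sbatch command.
# Assumption: slurm_submission_large.sh sits next to slurm_submission.sh
# and accepts the YAML file as its only argument.
submission_script() {
  # $1 = size label (hypothetical): "large" for core-heavy modules, anything else otherwise
  case "$1" in
    large) echo "shell_scripts/slurm_submission_large.sh" ;;
    *)     echo "shell_scripts/slurm_submission.sh" ;;
  esac
}

submit_command() {
  # $1 = size label, $2 = YAML file; prints the command instead of running it
  echo "sbatch $(submission_script "$1") $2"
}

submit_command large yaml_files/example_machinelearning_workflow.yaml
```

Printing the command rather than running it makes it easy to confirm which script will be used before the job is queued.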

Cron

To schedule Cron jobs, follow the directions here.
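
For orientation, a crontab entry for a recurring submission might look like the line printed by the sketch below. The 02:00 daily schedule, the `REPO_DIR` location, and the `cron_entry` function are assumptions made for this example, as is the choice to have cron call the Slurm submission script; the linked directions remain the reference for the actual setup.

```shell
# Sketch: build a crontab entry that submits the workflow every day at 02:00.
# REPO_DIR is a hypothetical checkout location; change it to your own path.
REPO_DIR="${REPO_DIR:-$HOME/automation}"

cron_entry() {
  # $1 = YAML file, relative to REPO_DIR
  # Fields: minute hour day-of-month month day-of-week command
  echo "0 2 * * * cd $REPO_DIR && sbatch shell_scripts/slurm_submission.sh $1"
}

cron_entry yaml_files/example_machinelearning_workflow.yaml
```

The printed line can be added to your crontab with `crontab -e`.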