
A Nextflow pipeline demonstrating how to train graph neural networks for gene regulatory network reconstruction using DREAM5 data.

Primary language: Python. License: MIT.

Nextflow Graph Machine Learning


Website: Nextflow Graph Machine Learning



Introduction

The purpose of this project is to provide a simple demonstration of how to construct a Nextflow pipeline, with MLOps integration, for performing gene regulatory network (GRN) reconstruction using graph neural networks (GNNs). In practice, GRN reconstruction is an unsupervised link prediction problem.
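To make the link prediction framing concrete, here is a minimal sketch of how candidate regulatory edges could be scored from node embeddings. It assumes an inner-product decoder, which is a common choice for graph autoencoders but is not confirmed as this project's method. The gene names and embedding values are made up for illustration and are not derived from DREAM5 data.

```python
import math


def sigmoid(x: float) -> float:
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def score_link(z_u, z_v):
    """Inner-product decoder: a probability-like score for the edge (u, v)."""
    dot = sum(a * b for a, b in zip(z_u, z_v))
    return sigmoid(dot)


def rank_candidates(source, embeddings):
    """Rank every other node by its link score with `source`, best first."""
    scores = {
        target: score_link(embeddings[source], z)
        for target, z in embeddings.items()
        if target != source
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical 2-dimensional embeddings (illustrative values only).
embeddings = {
    "TF1": [1.0, 0.5],
    "geneA": [0.9, 0.4],
    "geneB": [-1.0, -0.2],
}

ranked = rank_candidates("TF1", embeddings)
for target, score in ranked:
    print(f"TF1 -> {target}: {score:.3f}")
```

In a trained model the embeddings would come from the GNN encoder, and the highest-scoring pairs would be the predicted regulatory links.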

For developing GNNs, we use PyTorch Geometric.

The Nextflow pipeline

Nextflow orchestrates the GRN reconstruction pipeline.

The pipeline is composed of the following steps:

  1. Exploratory data analysis: View the GRN and calculate some summary statistics.
  2. Processing: Process the graph feature matrix and edge list. Remove the disconnected subgraph.
  3. ArangoDB Importing: Import the graph into ArangoDB.
  4. GNN training: Train a GNN using SAGE convolutional layers.
  5. GNN training: Train a variational autoencoder GNN, and save the neural embeddings.
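As an illustration of the processing step, the sketch below keeps only the largest connected component of an edge list. It assumes that "remove the disconnected subgraph" means discarding nodes outside the main component. The function names and the toy edge list are hypothetical and are not taken from the pipeline's code.

```python
from collections import defaultdict, deque


def largest_component(edges):
    """Return the node set of the largest connected component (undirected)."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def filter_edges(edges, keep):
    """Keep only edges whose endpoints are both in `keep`."""
    return [(u, v) for u, v in edges if u in keep and v in keep]


# Toy edge list: a four-gene chain plus one disconnected pair.
edges = [("G1", "G2"), ("G2", "G3"), ("G3", "G4"), ("G8", "G9")]
keep = largest_component(edges)
filtered = filter_edges(edges, keep)
print(sorted(keep))
print(filtered)
```

The same node set would also be used to drop the matching rows from the feature matrix, so that features and edges stay aligned.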

Run nextflow.sh to execute the full pipeline.

Run clean_nf.sh to clean up the output and logging files from the Nextflow run.

Python Environment

Python dependencies are specified in the requirements.txt file.

These dependencies are installed during the build process for the following Docker image: ghcr.io/jbris/nextflow-graph-machine-learning:1.0.0

Execute the following command to pull the image: docker pull ghcr.io/jbris/nextflow-graph-machine-learning:1.0.0

MLOps

ArangoDB

This pipeline provides a simple demonstration of saving graph data to ArangoDB and retrieving it again, integrated with NetworkX.
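The sketch below shows roughly what the data looks like on its way into and out of ArangoDB. ArangoDB stores vertices as documents identified by `_key` and edges as documents with `_from` and `_to` fields of the form `collection/key`. The collection name `genes` and both helper functions are assumptions made for illustration. The sketch builds plain dictionaries only and does not connect to a database, so it should not be read as the pipeline's actual import code.

```python
def to_arango_documents(edges, vertex_collection="genes"):
    """Convert an edge list into ArangoDB-style vertex and edge documents."""
    nodes = sorted({node for edge in edges for node in edge})
    vertex_docs = [{"_key": node} for node in nodes]
    edge_docs = [
        {
            "_from": f"{vertex_collection}/{u}",
            "_to": f"{vertex_collection}/{v}",
        }
        for u, v in edges
    ]
    return vertex_docs, edge_docs


def from_arango_documents(edge_docs):
    """Recover the edge list by stripping the collection prefix from each id."""
    return [
        (doc["_from"].split("/", 1)[1], doc["_to"].split("/", 1)[1])
        for doc in edge_docs
    ]


# Toy regulatory edges (hypothetical identifiers).
edges = [("G1", "G2"), ("G1", "G3")]
vertex_docs, edge_docs = to_arango_documents(edges)
restored = from_arango_documents(edge_docs)
print(vertex_docs)
print(edge_docs)
```

The recovered edge list can be passed to a NetworkX constructor such as `networkx.DiGraph(restored)` to rebuild the graph in memory.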