/awesome-pipeline

A curated list of awesome pipeline toolkits inspired by Awesome Sysadmin

Awesome Pipeline

A curated list of awesome pipeline toolkits inspired by Awesome Sysadmin

Pipeline frameworks & libraries

  • ActionChain - A workflow system for simple linear success/failure workflows.
  • Adage - Small package to describe workflows that are not completely known at definition time.
  • Airflow - Python-based workflow system created by AirBnb.
  • Anduril - Component-based workflow framework for scientific data analysis.
  • Antha - High-level language for biology.
  • Bds - Scripting language for data pipelines.
  • BioMake - GNU-Make-like utility for managing builds and complex workflows.
  • BioQueue - Explicit framework with web monitoring and resource estimation.
  • Bioshake - Haskell DSL built on shake with strong typing and EDAM support
  • Bistro - Library to build and execute typed scientific workflows.
  • Bpipe - Tool for running and managing bioinformatics pipelines.
  • Briefly - Python Meta-programming Library for Job Flow Control.
  • Cluster Flow - Command-line tool which uses common cluster managers to run bioinformatics pipelines.
  • Clusterjob - Automated reproducibility, and hassle-free submission of computational jobs to clusters.
  • Compss - Programming model for distributed infrastructures.
  • Conan2 - Light-weight workflow management application.
  • Consecution - A Python pipeline abstraction inspired by Apache Storm topologies.
  • Cosmos - Python library for massively parallel workflows.
  • Cromwell - Workflow Management System geared towards scientific workflows from the Broad Institute.
  • Cuneiform - Advanced functional workflow language and framework, implemented in Erlang.
  • Dagobah - Simple DAG-based job scheduler in Python.
  • Dagr - A scala based DSL and framework for writing and executing bioinformatics pipelines as Directed Acyclic GRaphs.
  • Dask - Dask is a flexible parallel computing library for analytics.
  • Dockerflow - Workflow runner that uses Dataflow to run a series of tasks in Docker.
  • Doit - Task management & automation tool.
  • Drake - Robust DSL akin to Make, implemented in Clojure.
  • Drake R package - Reproducibility and high-performance computing with an easy R-focused interface. Unrelated to Factual's Drake.
  • Dray - An engine for managing the execution of container-based workflows.
  • Fission Workflows - A fast, lightweight workflow engine for serverless/FaaS functions.
  • Flex - Language agnostic framework for building flexible data science pipelines (Python/Shell/Gnuplot).
  • Flowr - Robust and efficient workflows using a simple language agnostic approach (R package).
  • Gc3pie - Python libraries and tools for running applications on diverse Grids and clusters.
  • Gwf - Make-like utility for submitting workflows via qsub.
  • Hive - System for creating and running pipelines on a distributed compute resource.
  • HyperLoom - Platform for defining and executing workflow pipelines in large-scale distributed environments.
  • Joblib - Set of tools to provide lightweight pipelining in Python.
  • Jug - A task Based parallelization framework for Python.
  • Ketrew - Embedded DSL in the OCAML language alongside a client-server management application.
  • Kronos - Workflow assembler for cancer genome analytics and informatics.
  • Loom - Tool for running bioinformatics workflows locally or in the cloud.
  • Longbow - Job proxying tool for biomolecular simulations.
  • Luigi - Python module that helps you build complex pipelines of batch jobs.
  • Makeflow - Workflow engine for executing large complex workflows on clusters.
  • Mara - A lightweight, opinionated ETL framework, halfway between plain scripts and Apache Airflow
  • Mario - Scala library for defining data pipelines.
  • Martian - A language and framework for developing and executing complex computational pipelines.
  • MD Studio - Microservice based workflow engine.
  • Mistral - Python based workflow engine by the Open Stack project.
  • Moa - Lightweight workflows in bioinformatics.
  • Nextflow - Flow-based computational toolkit for reproducible and scalable bioinformatics pipelines.
  • NiPype - Workflows and interfaces for neuroimaging packages.
  • OpenGE - Accelerated framework for manipulating and interpreting high-throughput sequencing data.
  • Pachyderm - Distributed and reproducible data pipelining and data management, built on the container ecosystem.
  • Parsl - Parallel Scripting Library.
  • PipEngine Ruby based launcher for complex biological pipelines.
  • Pinball - Python based workflow engine by Pinterest.
  • PyFlow - Lightweight parallel task engine.
  • PypeFlow - Lightweight workflow engine for data analysis scripting.
  • pyperator - Simple push-based python workflow framework using asyncio, supporting recursive networks.
  • pyppl - A python lightweight pipeline framework.
  • pypyr - Simple task runner for sequential steps defined in a pipeline yaml, with AWS and Slack plug-ins.
  • Pwrake - Parallel workflow extension for Rake.
  • Qdo - Lightweight high-throughput queuing system for workflows with many small tasks to perform.
  • Qsubsec - Simple tokenised template system for SGE.
  • Rabix - Python-based workflow toolkit based on the Common Workflow Language and Docker.
  • Rain - Framework for large distributed task-based pipelines, written in Rust with Python API.
  • Ray - Flexible, high-performance distributed Python execution framework.
  • Reflow - Language and runtime for distributed, incremental data processing in the cloud.
  • Remake - Make-like declarative workflows in R.
  • Rmake - Wrapper for the creation of Makefiles, enabling massive parallelization.
  • Rubra - Pipeline system for bioinformatics workflows.
  • Ruffus - Computation Pipeline library for Python.
  • Ruigi - Pipeline tool for R, inspired by Luigi.
  • Sake - Self-documenting build automation tool.
  • SciLuigi - Helper library for writing flexible scientific workflows in Luigi.
  • SciPipe - Library for writing Scientific Workflows in Go.
  • Scoop - Scalable Concurrent Operations in Python.
  • Seqtools - Python library for lazy evaluation of pipelined transformations on indexable containers.
  • Snakemake - Tool for running and managing bioinformatics pipelines.
  • Spiff - Based on the Workflow Patterns initiative and implemented in Python.
  • Stolos - Directed Acyclic Graph task dependency scheduler that simplify distributed pipelines.
  • Stpipe - File processing pipelines as a Python library.
  • Sundial - Jobsystem on AWS ECS or AWS Batch managing dependencies and scheduling.
  • Suro - Java-based distributed pipeline from Netflix.
  • Swift - Fast easy parallel scripting - on multicores, clusters, clouds and supercomputers.
  • Tibanna Tool that helps you run genomic pipelines on Amazon cloud.
  • Toil - Distributed pipeline workflow manager (mostly for genomics).
  • Yap - Extensible parallel framework, written in Python using OpenMPI libraries.
  • Wallaroo - Framework for streaming data applications and algorithms that react to real-time events.
  • WorldMake - Easy Collaborative Reproducible Computing.

Workflow platforms

  • ActivePapers - Computational science made reproducible and publishable.
  • Apache Iravata - Framework for executing and managing computational workflows on distributed computing resources.
  • Arteria - Event-driven automation for sequencing centers. Initiates workflows based on events.
  • Arvados - A container based workflow platform.
  • Biokepler - Bioinformatics Scientific Workflow for Distributed Analysis of Large-Scale Biological Data.
  • Butler - Framework for running scientific workflows on public and academic clouds.
  • Chipster - Open source platform for data analysis.
  • Clubber - Cluster Load Balancer for Bioinformatics e-Resources.
  • Digdag - Workflow manager designed for simplicity, extensibility and collaboration.
  • Fireworks - Centralized workflow server for dynamic workflows of high-throughput computations.
  • Galaxy - Web-based platform for biomedical research.
  • Kepler - Kepler scientific workflow application from University of California.
  • KNIME Analytics Platform - General-purpose platform with many specialized domain extensions.
  • NextflowWorkbench - Integrated development environment for Nextflow, Docker and Reusable Workflows.
  • OpenMOLE - Workflow Management System for exploration of models and parameter optimization.
  • Ophidia - Data-analytics platform with declarative workflows of distributed operations.
  • Pegasus - Workflow Management System.
  • Pentaho Kettle - Workflow platform with a graphical design environment.
  • Piper - Distributed workflow engine designed to be dead simple.
  • Polyaxon - A platform for machine learning experimentation workflow.
  • Reana - Platform for reusable research data analyses developed by CERN.
  • Sushi - Supporting User for SHell script Integration.
  • Yabi - Online research environment for grid, HPC and cloud computing.
  • Taverna - Domain independent workflow system.
  • VisTrails - Scientific workflow and provenance management system.
  • Wings - Semantic workflow system utilizing Pegasus as execution system.
  • Watchdog - Workflow management system for the automated and distributed analysis of large-scale experimental data.

Workflow languages

Workflow standardization initiatives

Literate programming (aka interactive notebooks)

  • Beaker Notebook-style development environment.
  • Binder - Turn a GitHub repo into a collection of interactive notebooks powered by Jupyter and Kubernetes
  • IPython A rich architecture for interactive computing.
  • Jupyter Language-agnostic notebook literate programming environment.
  • Pathomx - Interactive data workflows built on Python.
  • R Notebooks - R Markdown notebook literate programming environment.
  • SoS - Readable, interactive, cross-platform and cross-language data science workflow system.
  • Zeppelin - Web-based notebook that enables interactive data analytics.

Extract, transform, load (ETL)

  • Cadence Distributed, scalable, durable, and highly available orchestration engine developed by Uber.
  • LinkedPipes ETL - Linked Data publishing and consumption ETL tool.
  • Kiba ETL - A data processing & ETL framework for Ruby.

Application management workflows

  • Argo - Get stuff done with container-native workflows for Kubernetes.
  • Deis - Workflow system to create and manage applications on Kubernetes.

Build automation tools

  • Bazel - Build software just as engineers do at Google.
  • DoIt - Highly generalized task-management and automation in Python.
  • Gradle - Unified cross platforms builds.
  • Scons - Python library focused on C/C++ builds.
  • Shake - Define robust build systems akin to GNU Make using Haskell.
  • Make - The GNU Make build system.

Other projects

  • HPC Grid Runner
  • noWorkflow - Supporting infrastructure to run scientific experiments without a scientific workflow management system, and still get things like provenance.
  • Reprozip - Simplifies the process of creating reproducible experiments from command-line executions.

Related lists