Awesome Pipeline

A curated list of awesome pipeline toolkits inspired by Awesome Sysadmin

Pipeline frameworks & libraries

  • ActionChain - A workflow system for simple linear success/failure workflows.
  • Adage - Small package to describe workflows that are not completely known at definition time.
  • AiiDA - workflow manager with a strong focus on provenance, performance and extensibility.
  • Airflow - Python-based workflow system created by Airbnb.
  • Anduril - Component-based workflow framework for scientific data analysis.
  • Antha - High-level language for biology.
  • AWE - Workflow and resource management system with CWL support.
  • Balsam - Python-based high throughput task and workflow engine.
  • Bds - Scripting language for data pipelines.
  • BioMake - GNU-Make-like utility for managing builds and complex workflows.
  • BioQueue - Explicit framework with web monitoring and resource estimation.
  • Bioshake - Haskell DSL built on shake with strong typing and EDAM support.
  • Bistro - Library to build and execute typed scientific workflows.
  • Bpipe - Tool for running and managing bioinformatics pipelines.
  • Briefly - Python Meta-programming Library for Job Flow Control.
  • Cluster Flow - Command-line tool which uses common cluster managers to run bioinformatics pipelines.
  • Clusterjob - Automated reproducibility, and hassle-free submission of computational jobs to clusters.
  • Compi - Application framework for portable computational pipelines.
  • Compss - Programming model for distributed infrastructures.
  • Conan2 - Light-weight workflow management application.
  • Consecution - A Python pipeline abstraction inspired by Apache Storm topologies.
  • Cosmos - Python library for massively parallel workflows.
  • Couler - Unified interface for constructing and managing workflows on different workflow engines, such as Argo Workflows, Tekton Pipelines, and Apache Airflow.
  • Covalent - Workflow orchestration toolkit for high-performance and quantum computing research and development.
  • Cromwell - Workflow Management System geared towards scientific workflows from the Broad Institute.
  • Cuneiform - Advanced functional workflow language and framework, implemented in Erlang.
  • Cylc - A workflow engine for cycling systems, originally developed for operational environmental forecasting.
  • Dagobah - Simple DAG-based job scheduler in Python.
  • Dagr - A Scala-based DSL and framework for writing and executing bioinformatics pipelines as Directed Acyclic Graphs.
  • Dagster - Python-based API for defining DAGs that interfaces with popular workflow managers for building data applications.
  • DataJoint - An open-source relational framework for scientific data pipelines.
  • Dask - Flexible parallel computing library for analytics.
  • Dbt - Framework for writing analytics workflows entirely in SQL. The T part of ETL, focuses on analytics engineering.
  • Dockerflow - Workflow runner that uses Dataflow to run a series of tasks in Docker.
  • Doit - Task management & automation tool.
  • Drake - Robust DSL akin to Make, implemented in Clojure.
  • Drake R package - Reproducibility and high-performance computing with an easy R-focused interface. Unrelated to Factual's Drake. Succeeded by Targets.
  • Dray - An engine for managing the execution of container-based workflows.
  • ecFlow - Workflow manager.
  • eHive - System for creating and running pipelines on a distributed compute resource.
  • Fission Workflows - A fast, lightweight workflow engine for serverless/FaaS functions.
  • Flex - Language agnostic framework for building flexible data science pipelines (Python/Shell/Gnuplot).
  • Flowr - Robust and efficient workflows using a simple language agnostic approach (R package).
  • Gc3pie - Python libraries and tools for running applications on diverse Grids and clusters.
  • Guix Workflow Language - A workflow management language extension for GNU Guix.
  • Gwf - Make-like utility for submitting workflows via qsub.
  • Hamilton - A Python micro-framework for describing dataflows; runs anywhere Python runs.
  • HyperLoom - Platform for defining and executing workflow pipelines in large-scale distributed environments.
  • Joblib - Set of tools to provide lightweight pipelining in Python.
  • Jug - A task-based parallelization framework for Python.
  • Kedro - Workflow development tool that helps you build data pipelines.
  • Kestra - Open source data orchestration and scheduling platform with declarative syntax.
  • Ketrew - Embedded DSL in the OCaml language alongside a client-server management application.
  • Kronos - Workflow assembler for cancer genome analytics and informatics.
  • Loom - Tool for running bioinformatics workflows locally or in the cloud.
  • Longbow - Job proxying tool for biomolecular simulations.
  • Luigi - Python module that helps you build complex pipelines of batch jobs.
  • Maestro - YAML based HPC workflow execution tool.
  • Makeflow - Workflow engine for executing large complex workflows on clusters.
  • Mara - A lightweight, opinionated ETL framework, halfway between plain scripts and Apache Airflow.
  • Mario - Scala library for defining data pipelines.
  • Martian - A language and framework for developing and executing complex computational pipelines.
  • MD Studio - Microservice based workflow engine.
  • MetaFlow - Open-source framework from Netflix for DAG generation, aimed at data scientists. Python and R APIs.
  • Mistral - Python-based workflow engine by the OpenStack project.
  • Moa - Lightweight workflows in bioinformatics.
  • Nextflow - Flow-based computational toolkit for reproducible and scalable bioinformatics pipelines.
  • NiPype - Workflows and interfaces for neuroimaging packages.
  • OpenGE - Accelerated framework for manipulating and interpreting high-throughput sequencing data.
  • Pachyderm - Distributed and reproducible data pipelining and data management, built on the container ecosystem.
  • Parsl - Parallel Scripting Library.
  • PipEngine - Ruby based launcher for complex biological pipelines.
  • Pinball - Python based workflow engine by Pinterest.
  • Popper - YAML based container-native workflow engine supporting Docker, Singularity, Vagrant VMs with Docker daemon in VM, and local host.
  • Porcupine - Haskell workflow tool to express and compose tasks (optionally cached) whose datasources and sinks are known ahead of time and rebindable, and which can expose arbitrary sets of parameters to the outside world.
  • Prefect Core - Python based workflow engine powering Prefect.
  • Pydra - Lightweight, DAG-based Python dataflow engine for reproducible and scalable scientific pipelines.
  • PyFlow - Lightweight parallel task engine.
  • pyperator - Simple push-based Python workflow framework using asyncio, supporting recursive networks.
  • pyppl - A lightweight Python pipeline framework.
  • pypyr - Automation task-runner for sequential steps defined in a pipeline yaml, with AWS and Slack plug-ins.
  • Pwrake - Parallel workflow extension for Rake.
  • Qdo - Lightweight high-throughput queuing system for workflows with many small tasks to perform.
  • Qsubsec - Simple tokenised template system for SGE.
  • Rabix - Python-based workflow toolkit based on the Common Workflow Language and Docker.
  • Rain - Framework for large distributed task-based pipelines, written in Rust with Python API.
  • Ray - Flexible, high-performance distributed Python execution framework.
  • Redun - Yet another redundant workflow engine.
  • Reflow - Language and runtime for distributed, incremental data processing in the cloud.
  • Remake - Make-like declarative workflows in R.
  • Rmake - Wrapper for the creation of Makefiles, enabling massive parallelization.
  • Rubra - Pipeline system for bioinformatics workflows.
  • Ruffus - Computation Pipeline library for Python.
  • Ruigi - Pipeline tool for R, inspired by Luigi.
  • Sake - Self-documenting build automation tool.
  • SciLuigi - Helper library for writing flexible scientific workflows in Luigi.
  • SciPipe - Library for writing Scientific Workflows in Go.
  • Signac - Lightweight, but scalable framework for file-driven workflows to be run locally and on HPC systems.
  • Scoop - Scalable Concurrent Operations in Python.
  • Seqtools - Python library for lazy evaluation of pipelined transformations on indexable containers.
  • Snakemake - Tool for running and managing bioinformatics pipelines.
  • Spiff - Based on the Workflow Patterns initiative and implemented in Python.
  • Stolos - Directed Acyclic Graph task dependency scheduler that simplifies distributed pipelines.
  • Steppy - Lightweight, open-source, Python 3 library for fast and reproducible experimentation.
  • Stpipe - File processing pipelines as a Python library.
  • StreamFlow - Container native workflow management system focused on hybrid workflows.
  • StreamPipes - A self-service IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams.
  • Sundial - Jobsystem on AWS ECS or AWS Batch managing dependencies and scheduling.
  • Suro - Java-based distributed pipeline from Netflix.
  • Swift - Fast easy parallel scripting - on multicores, clusters, clouds and supercomputers.
  • Targets - Dynamic, function-oriented Make-like reproducible pipelines at scale in R.
  • TaskGraph - A library to help manage complicated computational software pipelines consisting of long running individual tasks.
  • Temporal - Microservice orchestration platform that enables developers to build scalable applications without sacrificing productivity or reliability.
  • Tibanna - Tool that helps you run genomic pipelines on Amazon cloud.
  • Toil - Distributed pipeline workflow manager (mostly for genomics).
  • Yap - Extensible parallel framework, written in Python using OpenMPI libraries.
  • Yapp - A C++ parallel pipeline library for stream processing.
  • Wallaroo - Framework for streaming data applications and algorithms that react to real-time events.
  • WorldMake - Easy Collaborative Reproducible Computing.
  • Zenaton - Workflow engine for orchestrating jobs, data and events across your applications and third party services.
  • ZenML - Extensible open-source MLOps framework to create reproducible pipelines for data scientists.

Workflow platforms

  • ActivePapers - Computational science made reproducible and publishable.
  • Apache Airavata - Framework for executing and managing computational workflows on distributed computing resources.
  • Arteria - Event-driven automation for sequencing centers. Initiates workflows based on events.
  • Arvados - A container based workflow platform.
  • Biokepler - Bioinformatics Scientific Workflow for Distributed Analysis of Large-Scale Biological Data.
  • Butler - Framework for running scientific workflows on public and academic clouds.
  • Chipster - Open source platform for data analysis.
  • Clubber - Cluster Load Balancer for Bioinformatics e-Resources.
  • Digdag - Workflow manager designed for simplicity, extensibility and collaboration.
  • Domino - User friendly and open source visual workflow management platform.
  • Fireworks - Centralized workflow server for dynamic workflows of high-throughput computations.
  • Flyte - Container-native, type-safe workflow and pipelines platform for large scale processing and ML.
  • Galaxy - Web-based platform for biomedical research.
  • Kepler - Kepler scientific workflow application from University of California.
  • KNIME Analytics Platform - General-purpose platform with many specialized domain extensions.
  • NextflowWorkbench - Integrated development environment for Nextflow, Docker and Reusable Workflows.
  • omega|ml DataOps Platform - Data & model pipeline deployment for humans - integrated, scalable, extensible.
  • OpenMOLE - Workflow Management System for exploration of models and parameter optimization.
  • Ophidia - Data-analytics platform with declarative workflows of distributed operations.
  • Orchest - An IDE for Data Science.
  • Pegasus - Workflow Management System.
  • Piper - Distributed workflow engine designed to be dead simple.
  • Polyaxon - A platform for machine learning experimentation workflow.
  • Reana - Platform for reusable research data analyses developed by CERN.
  • Sushi - Supporting User for SHell script Integration.
  • Yabi - Online research environment for grid, HPC and cloud computing.
  • Taverna - Domain independent workflow system.
  • Temporal - Highly scalable developer oriented Workflow as Code engine.
  • VisTrails - Scientific workflow and provenance management system.
  • Wings - Semantic workflow system utilizing Pegasus as execution system.
  • Watchdog - Workflow management system for the automated and distributed analysis of large-scale experimental data.
  • FlowHub - Workflow cloud platform.

Workflow languages

Workflow standardization initiatives

ETL & Data orchestration

  • DataLad - git and git-annex based data version control system with lightweight provenance capture/re-execution support.
  • DVC - Data version control system for ML project with lightweight pipeline support.
  • lakeFS - Repeatable, atomic and versioned data lake on top of object storage.
  • Nessie - Provides Git-like capability & version control for Iceberg Tables, Delta Lake Tables & SQL Views.

Literate programming (aka interactive notebooks)

  • Beaker - Notebook-style development environment.
  • Binder - Turn a GitHub repo into a collection of interactive notebooks powered by Jupyter and Kubernetes.
  • IPython - A rich architecture for interactive computing.
  • Jupyter - Language-agnostic notebook literate programming environment.
  • Pathomx - Interactive data workflows built on Python.
  • Polynote - A better notebook for Scala (and more). Built by Netflix.
  • Ploomber - Consolidate your notebooks and scripts in a reproducible pipeline using a pipeline.yaml file.
  • R Notebooks - R Markdown notebook literate programming environment.
  • RedPoint Notebooks - Web-native computational notebook for programmers supporting multiple languages, APIs and webhooks.
  • SoS - Readable, interactive, cross-platform and cross-language data science workflow system.
  • Zeppelin - Web-based notebook that enables interactive data analytics.

Extract, transform, load (ETL)

  • Cadence - Distributed, scalable, durable, and highly available orchestration engine developed by Uber.
  • Dataform - Framework for managing SQL-based operations in your data warehouse.
  • Kiba ETL - A data processing & ETL framework for Ruby.
  • LinkedPipes ETL - Linked Data publishing and consumption ETL tool.
  • Pentaho Kettle - A platform that delivers powerful ETL capabilities, using a metadata-driven approach.
  • Substation - Cloud-native data pipeline and transformation toolkit written in Go.

Continuous Delivery workflows

  • Argo - Get stuff done with container-native workflows for Kubernetes.
  • CDS - A pipeline based Continuous Delivery Service written in Golang.

Build automation tools

  • Bazel - Build software just as engineers do at Google.
  • DoIt - Highly generalized task-management and automation in Python.
  • Gradle - Unified cross-platform builds.
  • SCons - Python library focused on C/C++ builds.
  • Shake - Define robust build systems akin to GNU Make using Haskell.
  • Make - The GNU Make build system.
  • Prodmodel - Build system for data science pipelines.

Automated workflow composition

  • APE - A tool for the automated exploration of possible computational workflows based on semantic annotations.

Other projects

  • HPC Grid Runner
  • NiFi - Powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
  • noWorkflow - Supporting infrastructure to run scientific experiments without a scientific workflow management system, and still get things like provenance.
  • Reprozip - Simplifies the process of creating reproducible experiments from command-line executions.

Related lists