This is a curated list of resources about Apache Airflow (incubating). Please feel free to contribute any items that should be included. Items are generally added at the top of each section so that newer items are featured more prominently. Maintained by Jakob Homan and anyone who wants to help; pull requests are welcome, or ping me on Twitter.
- Vital links
- Airflow deployment solutions
- Introductions and tutorials
- Best practices, lessons learned and cool use cases
- Blogs, etc.
- Slide deck presentations and online videos
- Libraries, Hooks, Utilities
- Meetups
- Commercial Airflow-as-a-service providers
- Non-English resources
- Official website: Apache Airflow
- Latest release: 1.9.0-incubating
- Official Twitter account: Apache Airflow
- Puckel's Docker Image - @Puckel_'s well-crafted Docker image has become the base for many Airflow installations. It is regularly updated and closely tracks the official Apache releases.
- airflow-pipeline - Airflow Docker container that comes preconfigured for Spark and Hadoop. It can be docker pulled at `datagovsg/airflow-pipeline`.
- kube-airflow - This repository contains both an Airflow Docker image (that appears to have been based on Puckel's work) and Kubernetes service definition. mumoshu's repository has not been recently updated, but there are numerous forks that may be based on more recent releases.
- airflow-cookbook - Chef cookbook for deploying Airflow.
- Running Airflow on top of Apache Mesos - Blog post describing how to configure Mesos to run all of the Airflow components.
- Remote spark-submit to YARN running on EMR - Azhaguselvan walks through submitting Spark jobs to existing EMR clusters with Airflow.
- Running Airflow on top of Apache Mesos and its follow-up, Mesos, Airflow & Docker - Agraj Mangal gives a quick overview of running Airflow atop Apache Mesos.
- Dustin Stansbury of Quizlet has written a four-part series that covers what workflow managers do in general, how Quizlet picked Airflow, a tour of Airflow's key concepts, and how Quizlet is now using Airflow in practice.
- Apache Airflow for the confused - This short tutorial by Jonathan Pichot takes a different tack than most by using airplane and airport operations as an analogy for Airflow.
- Integrating Apache Airflow with Databricks - While this tutorial is focused specifically on Databricks' Spark solutions, it does have a reasonable overview of Airflow basics and demonstrates how a third-party solution can quickly integrate into Airflow.
- Apache Airflow as an External scheduler for distributed systems - Arunkumar suggests using Airflow as a simple external scheduler for a distributed system.
- How Sift Trains Thousands of Models using Apache Airflow - Summary of Sift Science's deployment strategy for its machine learning model pipelines.
- Apache Airflow at Pandora - Ace Haidrey discusses why Pandora chose Airflow and provides a detailed breakdown of their deployment and the infrastructure behind it.
- Airflow Lessons from the Data Engineering Front in Chicago - Alison Stanton provides a list of tips to avoid gotchas in Airflow jobs.
- Data’s Inferno: 7 Circles of Data Testing Hell with Airflow - The Wholesale Banking Advanced Analytics team at ING details how they torture test their Airflow DAGs before deployment.
- Data quality checkers - Antoine Augusti describes the framework Drivy has built atop Airflow to test their datasets for completeness, consistency, timeliness, uniqueness, validity and accuracy.
- Building WePay's data warehouse using BigQuery and Airflow - The inestimable Chris Riccomini describes how WePay, one of the first adopters of Airflow, integrated it into their Google Cloud Compute environment.
- Using Apache Airflow to Create Data Infrastructure in the Public Sector - Despite an unfortunately very heavy sales pitch tone, this blog post describes how ARGO Labs, a non-profit data organization, uses Airflow for ETL of public sector data.
- ETL with airflow - ETL core principles and several end-to-end docker-based examples including Kimball, Data Vault on Hive and some simpler examples.
- The Airflow Podcast - A semiregular podcast discussing all things Airflow.
- Maxime Beauchemin - Maxime's blog on Medium gives insight into the philosophy behind Apache Airflow.
- Robert Chang - Blog posts about data engineering with Apache Airflow that explain the reasoning behind it and include code examples.
- Data Pipeline Management - Ben Goldberg walks the Chicago Kubernetes Meetup through how SpotHero uses Airflow. Additionally, Ben has a very complete slide deck on how Airflow fits within Kubernetes.
- How I learned to time travel, or, data pipelining and scheduling with Airflow - Comprehensive deck by Laura Lorenz on why Airflow is necessary and how Industry Dive uses it.
- Introduction to Apache Airflow - Data Day Seattle 2016 - Sid Anand gives a thorough introduction to Airflow and how it was used at Agari.
- Airflow plugins - Central collection of repositories of various plugins for Airflow, including mailchimp, trello, sftp, github, etc.
- fileflow - Collection of modules to support large data transfers between Airflow operators through either the local file system or S3. This addresses a gap where data is too large for XComs but too small or inconvenient for loading directly in the operator. Built by Industry Dive.
- fairflow - Library to abstract away Airflow's Operators with functional pieces that transform the data from one operator to another.
- airflow-maintenance-dags - Clairvoyant has a repo of Airflow DAGs that operate on Airflow itself, clearing out various bits of the backing metadata store.
- Qubole - Qubole is mainly known as a service-and-support company for Apache Hive, but also provides Airflow as a component of its platform.
- Astronomer.io - Astronomer provides complete ETL lifecycle solutions and appears to be entirely focused on providing Airflow-based products.
- Apache Airflow – Kaikki Mitä Meillä On, Lähtee Dageista [Finnish] by Olli Iivonen - Overview of Airflow, its concepts, and Airflow's usage at Solita.