Fast iterative local development and testing of Apache Airflow workflows.
The idea of whirl is pretty simple: use Docker containers to start up Apache Airflow and the other components used in your workflow. This gives you a copy of your production environment that runs on your local machine. You can run your DAG locally from start to finish - with the same code as in production. Seeing your pipeline succeed gives you more confidence about the logic you are creating/refactoring and how it integrates with other components. It also gives new developers an isolated environment for experimenting with your workflows.
whirl connects the code of your DAG and your (mock) data to the Apache Airflow container that it spins up. Using volume mounts you are able to make changes to your code in your favorite IDE and immediately see the effect in the running Apache Airflow UI on your machine. This also works with custom Python modules that you are developing and using in your DAGs.
NOTE: whirl is not intended to replace proper (unit) testing of the logic you are orchestrating with Apache Airflow.
whirl relies on Docker and Docker Compose. Make sure you have both installed. If you use Docker for Mac or Windows, ensure that you have configured it with sufficient RAM (8GB or more recommended) to run all your containers.
When you want to use whirl in your CI pipeline (currently work in progress), you need to have jq installed. For example, with Homebrew:
brew install jq
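Before starting whirl it can help to confirm that the prerequisites are available. The following is a minimal sketch, not part of whirl itself: the check_tool helper is hypothetical, and the tool list simply mirrors the prerequisites named above (it assumes Docker Compose is installed as the docker-compose command).

```shell
#!/bin/sh
# check_tool NAME
# Print whether NAME can be found on PATH; return non-zero when it is missing.
# (Hypothetical helper, not part of whirl.)
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# Count the missing prerequisites instead of stopping at the first one.
missing=0
for tool in docker docker-compose jq; do
    check_tool "$tool" || missing=$((missing + 1))
done
echo "missing tools: $missing"
```

Counting instead of exiting early means a single run lists everything that still needs to be installed.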
The current implementation was developed on macOS but is intended to work with any platform supported by Docker. In our experience, Linux and macOS are fine. You can run it on native Windows 10 using WSL. Unfortunately, Docker on Windows 10 (version 1809) is hamstrung because it relies on Windows File Sharing (CIFS) to establish the volume mounts. Airflow hammers the volume a little harder than CIFS can handle, and you'll see intermittent FileNotFound errors in the volume mount. This may improve in the future. For now, running whirl inside a Linux VM in Hyper-V gives more reliable results.
Clone this repository:
git clone https://github.com/godatadriven/whirl.git <target directory of whirl>
For ease of use you can add the base directory to your PATH environment variable:
export PATH=<target directory of whirl>:${PATH}
The whirl script is used to perform all actions.
$ whirl -h
$ whirl --help
The default action is to start the DAG in your current directory. It expects an environment to be configured. You can pass this as a command line argument or you can configure it in a .whirl.env file. (See the section on configuring environment variables.) The environment refers to a directory with the same name in the envs directory located near the whirl script.
$ whirl [start] [-d <directory>] [-e <environment>]
Specifying the start command line argument is a more explicit way to start whirl.
$ whirl stop [-d <directory>] [-e <environment>]
Stops the configured environment.
If you want to stop all containers from a specific environment you can add the -e or --environment command line argument with the name of the environment. This name corresponds with a directory in the envs directory.
We do not currently have a complete example of how to use whirl as part of a CI pipeline. However, the first step in doing this involves starting whirl in ci mode. This will:
- run the Docker containers daemonized in the background;
- ensure the DAG(s) are unpaused; and
- wait for the pipeline to either succeed or fail.
Upon success the containers will be stopped and exit successfully.
At present we don't exit upon failure because it can be useful to inspect the environment to see what happened. In the future we plan to print out the logs of the failed task and clean up before indicating the pipeline has failed.
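The "wait for the pipeline to either succeed or fail" step can be pictured as a polling loop. This is only a sketch of that idea and not whirl's actual implementation: wait_for_result is a hypothetical helper, and the command it polls is assumed to print success, failed, or anything else while the pipeline is still running.

```shell
#!/bin/sh
# wait_for_result CHECK_CMD [MAX_TRIES] [PAUSE_SECONDS]
# CHECK_CMD is a hypothetical command that prints "success", "failed",
# or any other text while the pipeline is still running.
# Returns 0 on success, 1 on failure, 2 when MAX_TRIES is exhausted.
wait_for_result() {
    check_cmd=$1
    max_tries=${2:-60}
    pause=${3:-5}
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        state=$($check_cmd) || state=unknown
        case "$state" in
            success) return 0 ;;
            failed)  return 1 ;;
        esac
        tries=$((tries + 1))
        sleep "$pause"
    done
    return 2
}

# Stand-in status command so the sketch runs without Airflow.
fake_status() { echo success; }

if wait_for_result fake_status 3 0; then
    echo "pipeline succeeded"
fi
```

Distinct return codes for failure and timeout let a CI job decide whether to leave the containers running for inspection.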
Instead of using the environment option each time you run whirl, you can also configure your environment in a .whirl.env file. This can be in three places. They are applied in order:
- A .whirl.env file in the root of this repository. This can also specify a default environment to be used when starting whirl. You do this by setting WHIRL_ENVIRONMENT, which references a directory in the envs folder. This repository contains an example you can modify. It specifies the default PYTHON_VERSION to be used in any environment.
- A .whirl.env file in your envs/{your-env} subdirectory. The environment directory to use can be set by any of the other .whirl.env files or specified on the command line. This is helpful for setting environment-specific variables. Of course it doesn't make much sense to set WHIRL_ENVIRONMENT here.
- A .whirl.env in your DAG directory to override any environment variables. This can be useful, for example, to overwrite the (default) WHIRL_ENVIRONMENT.
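The layering of the three .whirl.env files can be illustrated with a small sketch. It is a simplification, not the logic of the whirl script: the helper names are hypothetical, the files are assumed to contain plain KEY=VALUE lines that a POSIX shell can source, and the directory names and values in the demo are made up.

```shell
#!/bin/sh
# Source DIR/.whirl.env if it exists, exporting everything it assigns.
# Assumption: the file holds plain KEY=VALUE lines.
load_env_file() {
    if [ -f "$1/.whirl.env" ]; then
        set -a            # auto-export variables assigned while sourcing
        . "$1/.whirl.env"
        set +a
    fi
}

# Apply the files in the order given; later assignments override earlier ones.
load_whirl_env() {
    for dir in "$@"; do
        load_env_file "$dir"
    done
}

# Demo with throwaway directories and hypothetical values.
demo=$(mktemp -d)
mkdir -p "$demo/repo" "$demo/envs/my-env" "$demo/dag"
printf 'WHIRL_ENVIRONMENT=default-env\nPYTHON_VERSION=3.6\n' > "$demo/repo/.whirl.env"
printf 'MOCK_DATA_FOLDER=/mock-data\n' > "$demo/envs/my-env/.whirl.env"
printf 'WHIRL_ENVIRONMENT=my-env\n' > "$demo/dag/.whirl.env"

load_whirl_env "$demo/repo" "$demo/envs/my-env" "$demo/dag"
echo "environment: $WHIRL_ENVIRONMENT, python: $PYTHON_VERSION"
rm -rf "$demo"
```

In the demo the DAG directory's file is applied last, so its WHIRL_ENVIRONMENT replaces the repository default while PYTHON_VERSION is inherited unchanged.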
Inside the whirl script the following environment variables are set:
Environment Variable | Value | Description |
---|---|---|
DOCKER_CONTEXT_FOLDER | ${SCRIPT_DIR}/docker | Base build context folder for Docker builds referenced in Docker Compose |
ENVIRONMENT_FOLDER | ${SCRIPT_DIR}/envs/<environment> | Base folder for environment to start. Contains docker-compose.yml and environment specific preparation scripts. |
DAG_FOLDER | $(pwd) | Current working directory. Used as Airflow DAG folder. Can contain preparation scripts to prepare for this specific DAG. |
PROJECTNAME | $(basename ${DAG_FOLDER}) | |
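As a rough illustration of how the values in the table relate to each other, the sketch below derives them from a script directory and an environment name. The set_whirl_vars function and the /opt/whirl path are hypothetical; the whirl script sets these variables itself.

```shell
#!/bin/sh
# set_whirl_vars SCRIPT_DIR ENVIRONMENT
# Derive the variables from the table above (hypothetical helper).
set_whirl_vars() {
    SCRIPT_DIR=$1
    DOCKER_CONTEXT_FOLDER="${SCRIPT_DIR}/docker"
    ENVIRONMENT_FOLDER="${SCRIPT_DIR}/envs/$2"
    DAG_FOLDER=$(pwd)                          # the directory whirl is run from
    PROJECTNAME=$(basename "${DAG_FOLDER}")    # last path component of DAG_FOLDER
    export DOCKER_CONTEXT_FOLDER ENVIRONMENT_FOLDER DAG_FOLDER PROJECTNAME
}

# /opt/whirl is a made-up install location used only for this demo.
set_whirl_vars /opt/whirl local-ssh
echo "DOCKER_CONTEXT_FOLDER=$DOCKER_CONTEXT_FOLDER"
echo "ENVIRONMENT_FOLDER=$ENVIRONMENT_FOLDER"
echo "DAG_FOLDER=$DAG_FOLDER"
echo "PROJECTNAME=$PROJECTNAME"
```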
This project is based on docker-compose and the notion of different environments where Airflow is a central part. The rest of the environment depends on the tools/setup of the production environment used in your situation.
The whirl script combines the DAG and the environment to make a fully functional setup.
To accommodate different examples:
- The environments are split up into separate environment-specific directories inside the envs/ directory.
- The DAGs are split into sub-directories in the examples/ directory.
Environments use Docker Compose to start containers which together mimic your production environment. The basis of the environment is the docker-compose.yml file, which at a minimum declares the Airflow container to run. Extra tools (e.g. s3, sftp) can be linked together in the docker-compose file to form your specific environment.
Each environment also contains some setup code needed for Airflow to understand the environment, for example Connections and Variables. Each environment has a whirl.setup.d/ directory which is mounted in the Airflow container. On startup all scripts in this directory are executed. This is a location for installing and configuring extra client libraries that are needed to make the environment function correctly; for example awscli if S3 access is required.
The DAGs in this project are inside the examples/ directory. In your own project you can have your code in its own location outside this repository.
Each example directory consists of at least one example DAG. Project-specific code can also be placed there. As with the environment, the DAG directory can contain a whirl.setup.d/ directory which is also mounted inside the Airflow container. Upon startup all scripts in this directory are executed. The environment-specific whirl.setup.d/ is executed first, followed by the DAG one.
This is also a location for installing and configuring extra client libraries that are needed to make the DAG function correctly; for example a mock API endpoint.
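The execution order described above (environment scripts first, then the DAG's scripts) can be sketched as follows. This is an illustration under assumptions, not the code that runs in the Airflow container: run_setup_dir is a hypothetical helper, scripts are assumed to run in sorted filename order, and the demo scripts only record their own names.

```shell
#!/bin/sh
# run_setup_dir DIR
# Run every *.sh script in DIR. The shell expands the glob in sorted order,
# so 01_... runs before 02_... (hypothetical helper).
run_setup_dir() {
    [ -d "$1" ] || return 0
    for script in "$1"/*.sh; do
        [ -f "$script" ] || continue   # the glob matched nothing
        sh "$script"
    done
}

# Demo with throwaway directories; each script appends its name to a log.
demo=$(mktemp -d)
mkdir -p "$demo/env/whirl.setup.d" "$demo/dag/whirl.setup.d"
log="$demo/order.log"
echo "echo env-02 >> '$log'" > "$demo/env/whirl.setup.d/02_second.sh"
echo "echo env-01 >> '$log'" > "$demo/env/whirl.setup.d/01_first.sh"
echo "echo dag-01 >> '$log'" > "$demo/dag/whirl.setup.d/01_first.sh"

run_setup_dir "$demo/env/whirl.setup.d"   # environment scripts first
run_setup_dir "$demo/dag/whirl.setup.d"   # then the DAG's scripts

order=$(tr '\n' ' ' < "$log")
echo "$order"
rm -rf "$demo"
```

Numeric prefixes such as 01_ and 02_ are what make the order within one directory predictable.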
This repository contains some example environments and workflows. The components used might serve as a starting point for your own environment. If you have a good example you'd like to add, please submit a merge request!
The first example environment only involves one component, the Apache Airflow docker container itself. The environment contains one preparation script called 01_enable_local_ssh.sh which makes it possible in that container to SSH to localhost. The script also adds a new connection called ssh_local to the Airflow connections.
To run this example:
$ cd ./examples/localhost-ssh-example
$ whirl -e local-ssh
Open your browser to http://localhost:5000 to access the Airflow UI. Manually enable the DAG and watch the pipeline run to successful completion.
In this example we are going to:
- Consume a REST API;
- Convert the JSON data to Parquet;
- Store the result in an S3 bucket.
The environment includes containers for:
- An S3 server;
- A MockServer instance;
- The core Airflow component.
The environment contains a setup script in the whirl.setup.d/ folder:
- 01_add_connection_api.sh which:
  - Adds an S3 connection to Airflow;
  - Installs the awscli Python libraries and configures them to connect to the S3 server;
  - Creates a bucket (with a /etc/hosts entry to support the virtual host style method).
To run this example:
$ cd ./examples/api-python-s3
$ whirl
Open your browser to http://localhost:5000 to access the Airflow UI. Manually enable the DAG and watch the pipeline run to successful completion.
This example includes a .whirl.env configuration file in the DAG directory. In the environment folder there is also a .whirl.env which specifies S3-specific variables. The example folder also contains a whirl.setup.d/ directory which contains an initialization script (01_add_connection_api_and_mockdata.sh). This script is executed in the container after the environment-specific scripts have run and will:
- Add a connection to the API endpoint;
- Add an expectation for the MockServer to know which response needs to be sent for which requested path;
- Install Pandas and PyArrow to support transforming the JSON into a Parquet file;
- Create a local directory where the intermediate file is stored before being uploaded to S3.
This example includes containers for:
- An SFTP server;
- A MySQL instance;
- The core Airflow component.
The environment contains two startup scripts in the whirl.setup.d/ folder:
- 01_prepare_sftp.sh which adds an SFTP connection to Airflow;
- 02_prepare_mysql.sh which adds a MySQL connection to Airflow.
To run this example:
$ cd ./examples/sftp-mysql-example
$ whirl
Open your browser to http://localhost:5000 to access the Airflow UI. Manually enable the DAG and watch the pipeline run to successful completion.
The environment to be used is set in the .whirl.env in the DAG directory. In the environment folder there is also a .whirl.env which specifies how MOCK_DATA_FOLDER is set. The DAG folder also contains a whirl.setup.d/ directory which contains the script 01_cp_mock_data_to_sftp.sh. This script gets executed in the container after the environment-specific scripts have run and will do a couple of things:
- It will rename the file mocked-data-#ds_nodash#.csv that is in the ./mock-data/ folder. It will replace #ds_nodash# with the same value that Apache Airflow will use when templating ds_nodash in the Python files. This means we have a file available for our specific DAG run. (The logic to rename these files is located in /etc/airflow/functions/date_replacement.sh in the Airflow container.)
- It will copy this file to the SFTP server, where the DAG expects to find it. When the DAG starts it will try to copy that file from the SFTP server to the local filesystem.
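The placeholder replacement in the file name can be sketched as below. This is not the content of date_replacement.sh, only an illustration of the idea: rename_mock_file is a hypothetical helper, and the date 20190123 is an arbitrary example of the YYYYMMDD form that Airflow renders for ds_nodash.

```shell
#!/bin/sh
# rename_mock_file FILE DS_NODASH
# Replace the literal placeholder #ds_nodash# in FILE's name with DS_NODASH
# and print the new path (hypothetical helper).
rename_mock_file() {
    src=$1
    new_name=$(basename "$src" | sed "s/#ds_nodash#/$2/")
    target="$(dirname "$src")/$new_name"
    mv "$src" "$target"
    echo "$target"
}

# Demo in a throwaway directory with an empty mock file.
demo=$(mktemp -d)
mkdir -p "$demo/mock-data"
: > "$demo/mock-data/mocked-data-#ds_nodash#.csv"

renamed=$(rename_mock_file "$demo/mock-data/mocked-data-#ds_nodash#.csv" 20190123)
echo "renamed to: $(basename "$renamed")"
rm -rf "$demo"
```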
In this example the DAG is not the most important part. This example is all about how to configure Airflow to log to S3.
We have created an environment that spins up an S3 server together with the Airflow one. The environment contains two setup scripts in the whirl.setup.d/ folder:
- 01_add_connection_s3.sh which:
  - Adds an S3 connection to Airflow;
  - Installs the awscli Python libraries and configures them to connect to the S3 server;
  - Creates a bucket (adding a /etc/hosts entry to support the virtual host style method).
- 02_configue_logging_to_s3.sh which:
  - Exports environment variables which Airflow uses to override the default config. For example:
    export AIRFLOW__CORE__REMOTE_LOGGING=True
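Airflow reads configuration overrides from environment variables named AIRFLOW__<SECTION>__<KEY>. The sketch below builds such names and sets the remote logging options as they are named in the core section of Airflow 1.10 (Airflow 2 moved them to the logging section). The airflow_env_name helper, the bucket name and the connection id are assumptions for illustration, not values taken from the example environment.

```shell
#!/bin/sh
# airflow_env_name SECTION KEY
# Build the environment variable name Airflow reads for a config option:
# AIRFLOW__<SECTION>__<KEY>, upper-cased (hypothetical helper).
airflow_env_name() {
    printf 'AIRFLOW__%s__%s\n' "$1" "$2" | tr '[:lower:]' '[:upper:]'
}

# Hypothetical bucket and connection id; adjust to your environment.
export "$(airflow_env_name core remote_logging)=True"
export "$(airflow_env_name core remote_base_log_folder)=s3://my-log-bucket/logs"
export "$(airflow_env_name core remote_log_conn_id)=my_s3_conn"

# Show what was set.
env | grep '^AIRFLOW__CORE__REMOTE' | sort
```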
To run the corresponding example DAG, perform the following (assuming you have added whirl to your PATH):
$ cd ./examples/logging-to-s3
$ whirl
Open your browser to http://localhost:5000 to see the Airflow UI appear. Manually enable the DAG and see the pipeline get marked success. If you open one of the logs, the first line shows that the log is retrieved from S3.
The environment to be used is set in the .whirl.env in the DAG directory. In the environment folder there is also a .whirl.env which specifies S3-specific variables.
In this example the DAG is not the most important part. This example is all about how to configure Airflow to use an external database. We have created an environment that spins up a Postgres database server together with the Airflow one.
To run the corresponding example DAG, perform the following (assuming you have added whirl to your PATH):
$ cd ./examples/external-airflow-db
$ whirl
Open your browser to http://localhost:5000 to see the Airflow UI appear. Manually enable the DAG and see the pipeline get marked success.
The environment to be used is set in the .whirl.env in the DAG directory. In the environment folder there is also a .whirl.env which specifies Postgres-specific variables.
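Pointing Airflow at an external metadata database comes down to overriding its SQLAlchemy connection string. The sketch below assembles one for Postgres using the option name from Airflow 1.10's core section (newer Airflow versions read it from the database section). The DB_* variable names and all of their values are hypothetical and are not taken from the example's .whirl.env.

```shell
#!/bin/sh
# Hypothetical connection details; replace with those of your environment.
DB_USER=airflow
DB_PASSWORD=change-me
DB_HOST=postgresdb
DB_PORT=5432
DB_NAME=airflow

# SQLAlchemy URI form: dialect+driver://user:password@host:port/database
AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}"
export AIRFLOW__CORE__SQL_ALCHEMY_CONN

echo "$AIRFLOW__CORE__SQL_ALCHEMY_CONN"
```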
In this example the DAG is set to fail. This example is all about how to configure Airflow to use an external SMTP server for sending the failure emails. We have created an environment that spins up an SMTP server together with the Airflow one.
To run the corresponding example DAG, perform the following (assuming you have added whirl to your PATH):
$ cd ./examples/external-smtp-for-failure-emails
$ whirl
Open your browser to http://localhost:5000 to see the Airflow UI appear. Manually enable the DAG and see the pipeline get marked failed. Also open your browser at http://localhost:1080 for the email client where the emails should show up.
The environment to be used is set in the .whirl.env in the DAG directory. In the environment folder there is also a .whirl.env which specifies the relevant Airflow configuration variables.
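Airflow's mail settings live in the smtp section of its configuration, so they can be overridden with AIRFLOW__SMTP__ variables. The following is a sketch with made-up values: the host name, port and sender address are assumptions and may differ from what the example environment actually configures.

```shell
#!/bin/sh
# Hypothetical values for a local test mail server without TLS.
export AIRFLOW__SMTP__SMTP_HOST=smtp-server
export AIRFLOW__SMTP__SMTP_PORT=1025
export AIRFLOW__SMTP__SMTP_STARTTLS=False
export AIRFLOW__SMTP__SMTP_SSL=False
export AIRFLOW__SMTP__SMTP_MAIL_FROM=airflow@example.com

# List the overrides that are now in effect.
env | grep '^AIRFLOW__SMTP__' | sort
```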
An early version of whirl was brought to life at ING. Bas Beelen gave a presentation describing how whirl was helpful in their infrastructure during the 2nd Apache Airflow Meetup, January 23 2019, hosted at Google London HQ.