A Python tool to run Docker containers in sequence, for reproducible computational analysis pipelines.
Installing and running via Docker is recommended.
- Pull the docker-pipeline image (optional; `docker run` will pull the image automatically):

  ```
  docker pull dukegcb/docker-pipeline
  ```

- Clone this GitHub repository:

  ```
  git clone git@github.com:Duke-GCB/docker-pipeline.git
  ```

- Install dependencies with pip:

  ```
  cd docker-pipeline
  pip install -r requirements.txt
  ```
Pipelines are defined by YAML files. They specify which Docker images to run, and what files/parameters to provide to containers at runtime. Each step must specify an image, which can be a name (e.g. `dleehr/filesize`) or an image ID (`10baa1fc6046`). If the image accepts environment variables, specify these as `infiles`, `outfiles`, and `parameters`. More on that below.
```yaml
name: Total Size
steps:
  - name: Get size of file 1
    image: dleehr/filesize
    infiles:
      CONT_INPUT_FILE: /data/raw/file1
    outfiles:
      CONT_OUTPUT_FILE: /data/step1/size
  - name: Get size of file 2
    image: dleehr/filesize
    infiles:
      CONT_INPUT_FILE: /data/raw/file2
    outfiles:
      CONT_OUTPUT_FILE: /data/step2/size
  - name: Add sizes of files
    image: dleehr/add
    infiles:
      CONT_INPUT_FILE1: /data/step1/size
      CONT_INPUT_FILE2: /data/step2/size
    outfiles:
      CONT_OUTPUT_FILE: /data/step3/total_size
```
Paths to files should be specified as absolute paths on the Docker host. See Files and Volumes for more details.
Note: You do not need to clone this repository to run docker-pipeline. You only need a working Docker installation, and the docker-pipeline image will be pulled automatically:
```
docker run \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /path/to/total_size.yaml:/pipeline.yaml:ro \
  dukegcb/docker-pipeline /pipeline.yaml
```
Or using `docker-pipeline.sh`:

```
docker-pipeline.sh /path/to/total_size.yaml
```
Since docker-pipeline creates and starts Docker containers, it must have access to `/var/run/docker.sock` on the host. It also must have access to the YAML file, which is mounted as `/pipeline.yaml` in this example.
Alternatively, from a cloned copy of this repository:

```
python pipeline.py /path/to/total_size.yaml
```
docker-pipeline extends YAML with custom tags, allowing simple file name operations and access to variables specified at runtime. This is handy with file names, which make little sense to hard-code in a config file.
For example, file names in the above pipeline can be replaced like so:
```yaml
- infiles:
    CONT_INPUT_FILE: !var FILE1
- infiles:
    CONT_INPUT_FILE: !var FILE2
- outfiles:
    CONT_OUTPUT_FILE: !var RESULTS
```
And this pipeline can be run with:
```
python pipeline.py total_size.yaml \
  FILE1=/data/raw/file1 \
  FILE2=/data/raw/file2 \
  RESULTS=/data/step3/total_size
```
The following tags are available; see tag_handlers.py for details:

- `!var`: replace a variable specified on the command-line
- `!join`: concatenate multiple values into a single string. Useful for building up file paths
- `!change_ext`: change the extension of a filename
- `!base`: get the base file name (remove parent directories) from a path
Tags can in some cases be chained together. See test_tag_handlers.py for examples.
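As an illustration of the file-name operations these tags perform, the logic behind `!change_ext` and `!base` might boil down to string manipulations like the following (a sketch of the underlying operations, not the actual tag_handlers.py code):

```python
import os

def change_ext(path, new_ext):
    # Replace a filename's extension, as the !change_ext tag does,
    # e.g. /data/raw/reads.fastq -> /data/raw/reads.bam
    root, _ = os.path.splitext(path)
    return root + "." + new_ext

def base(path):
    # Strip parent directories from a path, as the !base tag does.
    return os.path.basename(path)

print(change_ext("/data/raw/reads.fastq", "bam"))  # /data/raw/reads.bam
print(base("/data/raw/reads.fastq"))               # reads.fastq
```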
pipeline.py uses docker-py to communicate with Docker. It has been tested with Boot2Docker on OS X, provided `$(boot2docker shellinit)` has been executed. It also works when connecting locally to a Docker host.
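Conceptually, running the steps amounts to a loop like the one below. This is a sketch, not pipeline.py's actual code: the helper names are illustrative, and it assumes a docker-py style client (e.g. `docker.from_env()`) passed in by the caller.

```python
def step_environment(step):
    # infiles, outfiles, and parameters are all handed to the container
    # as environment variables, so merge them into one dict.
    env = {}
    for key in ("infiles", "outfiles", "parameters"):
        env.update(step.get(key, {}))
    return env

def run_pipeline(steps, client):
    # client is expected to behave like docker.DockerClient.
    # Each step's container runs to completion before the next starts.
    for step in steps:
        client.containers.run(step["image"], environment=step_environment(step))
```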
Without explicit access provided at runtime, Docker containers cannot access filesystems or paths outside the container. docker-pipeline handles this transparently, allowing you to reference files named in your pipeline, with some added safety features.

First, each container only gets volume access for directories specified within `infiles` and `outfiles`. docker-pipeline will mount only the innermost subdirectory when specifying volume mounts. Keep in mind that the containers do get root access to these directories, so do NOT place the data you analyze in `/` or other sensitive directories.
Second, volumes created for `infiles` are mounted read-only, and volumes for `outfiles` are mounted read-write. This prevents containers from modifying the source data for their step. Of course, one step's `outfile` may be another container's `infile`. This mechanism allows the file to be written when it is an `outfile`. Ideally, raw data would only ever be passed as an `infile`, so it remains protected.
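In docker-py terms, that mounting policy could be expressed as a binds mapping like the one sketched below. This is illustrative, not pipeline.py's actual code; it assumes each directory is mounted at the same path inside the container, and that read-write wins when a directory holds both infiles and outfiles.

```python
import os

def volume_binds(infiles, outfiles):
    # Build a docker-py style binds mapping: the directory holding each
    # infile is mounted read-only, each outfile directory read-write.
    binds = {}
    for path in infiles.values():
        d = os.path.dirname(path)
        binds[d] = {"bind": d, "mode": "ro"}
    for path in outfiles.values():
        # Assumption: rw overrides ro if a directory appears in both.
        d = os.path.dirname(path)
        binds[d] = {"bind": d, "mode": "rw"}
    return binds

binds = volume_binds(
    infiles={"CONT_INPUT_FILE1": "/data/step1/size"},
    outfiles={"CONT_OUTPUT_FILE": "/data/step3/total_size"},
)
# /data/step1 is mounted ro, /data/step3 rw
```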