Stateless Docker Machine Learning Experiments

Here we want to explore how we can use docker as a thin wrapper to run machine learning experiments in a stateless way. What do I mean by stateless? Let's look at a stateful solution first: We pick some base image and start an interactive container based on it. We then attach a shell to this container and set up our system dependencies and project dependencies (e.g. pip requirements). To run our experiment we start some command python --foo --bar. We get some results somewhere, which we need to copy to some volume-mapped directory on the host. We further modify something here and there in the container. Now the running container has a certain state. Sure, we can detach from and reattach to the container and hope that the server on which it runs is never rebooted, or that we never have to switch servers and set everything up again. In most of these cases, we end up losing our state, and the actual experiment (running python ...) becomes harder to reproduce.

A stateless setup: In a stateless docker setup, we want docker to act as a virtual environment for everything that is not (a) our code, (b) our data, and (c) our results. That is, we want docker to define the operating system, the system dependencies, the Python version and environment, and the Python dependencies. In docker, this can be done by defining a custom docker image in a Dockerfile:

# Select the base image

# Select the working directory

# Setup image: install system dependencies etc.
# RUN apt install ...

# Install Python requirements
COPY ./requirements.txt ./requirements.txt
RUN pip install -r requirements.txt

You can follow these steps by cloning this repository:

git clone
cd docker-stateless-ml

We start by building the docker image:

docker build -t tutorial .

To test if everything works, we can now start a container using this image:

docker run --gpus all -it --rm tutorial python -c "print('Hello World from docker')"


  • --gpus all: Give the container access to all GPUs on the host machine. Note that this flag requires the nvidia-container-runtime on the host.
  • --rm: Remove the container after it is stopped since we do not care about the container state.

To make use of our project code and the required data, and to store our results, we need to mount volumes into the container. We can do this using the --volume flag. The following will run the example script in src/ of the repository, using data in data/ and storing results in results/:

docker run --gpus all --rm \
    --volume "$(pwd)"/src:/app/src \
    --volume "$(pwd)"/data:/data \
    --volume "$(pwd)"/results:/results \
    tutorial \
    python /app/src/


[...]  # truncated

PyTorch Version: 2.0.1+cu117
CUDA Available: True
CUDA Version: 11.7
CuDNN Version: 8500
CUDA Device Name: Tesla V100-SXM3-32GB-H
Number of CUDA Devices Available: 16
Current CUDA Device Index: 0
Reading /data/samples.csv
Writing /results/sums.csv
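
The repository's example script is not reproduced here, so the following is only a minimal sketch of what a script producing output like the above could look like. The function names report_environment and sum_rows are hypothetical, and the CSV layout (one sample per row, numeric values only, no header) is an assumption. The point is that the script only ever touches the mounted paths /data and /results, so nothing of value lives inside the container.

```python
"""Hypothetical stand-in for the example script; the real one may differ."""
import csv
from pathlib import Path


def report_environment() -> None:
    """Print PyTorch/CUDA details if PyTorch is installed in the image."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; skipping the CUDA report")
        return
    print(f"PyTorch Version: {torch.__version__}")
    print(f"CUDA Available: {torch.cuda.is_available()}")
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"CuDNN Version: {torch.backends.cudnn.version()}")
    if torch.cuda.is_available():
        print(f"CUDA Device Name: {torch.cuda.get_device_name(0)}")
        print(f"Number of CUDA Devices Available: {torch.cuda.device_count()}")
        print(f"Current CUDA Device Index: {torch.cuda.current_device()}")


def sum_rows(in_path: Path, out_path: Path) -> None:
    """Read numeric rows from in_path and write one sum per row to out_path.

    Assumed layout: comma-separated numbers, one sample per row, no header.
    """
    print(f"Reading {in_path}")
    with in_path.open(newline="") as f:
        sums = [sum(float(value) for value in row) for row in csv.reader(f) if row]
    # Create the results directory if the mount point is still empty.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    print(f"Writing {out_path}")
    with out_path.open("w", newline="") as f:
        csv.writer(f).writerows([total] for total in sums)
```

Inside the container, such a script would end with calls like report_environment() followed by sum_rows(Path("/data/samples.csv"), Path("/results/sums.csv")). Because both paths are volume mounts, the results survive the --rm cleanup of the container.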

We can now investigate the saved result on the host (i.e., not in the docker container) in results/sums.csv:

$ cat results/sums.csv

With this, we have successfully used docker to thinly wrap our project in an environment defined by our Dockerfile. To be able to reproduce our experiment, we only need to ensure that a different server has the project code, the data, and a build of the docker image (docker build -t tutorial .). Bonus points if we are able to synchronize or symlink the project, data, and result directories across servers via some shared storage.
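
To avoid retyping the long docker run line on every server, one option is a small launcher that assembles the command from the project root. This is a sketch, not part of the repository: the function name docker_run_command is hypothetical, and it assumes the same layout as above (src/, data/, and results/ next to each other) and the image tag tutorial. It also checks that the three directories exist, which is exactly the precondition for reproducing the experiment elsewhere.

```python
from pathlib import Path

# Host directory (relative to the project root) -> mount point in the container.
# These mirror the --volume flags used earlier.
MOUNTS = {"src": "/app/src", "data": "/data", "results": "/results"}


def docker_run_command(project_root, command, image="tutorial"):
    """Build the argv list for a stateless `docker run` of `command`.

    Raises FileNotFoundError if code, data, or results directories are missing.
    """
    root = Path(project_root).resolve()
    missing = [name for name in MOUNTS if not (root / name).is_dir()]
    if missing:
        raise FileNotFoundError(f"missing project directories: {missing}")

    argv = ["docker", "run", "--gpus", "all", "--rm"]
    for host_dir, container_dir in MOUNTS.items():
        argv += ["--volume", f"{root / host_dir}:{container_dir}"]
    return argv + [image, *command]
```

A call such as docker_run_command(".", ["python", "/app/src/your_script.py"]) (your_script.py is a placeholder name) returns a list that can be passed to subprocess.run(argv, check=True). Returning the list instead of running it directly keeps the function easy to inspect and test.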