fastmachinelearning/gw-iaas

Simplify and organize container builds

Opened this issue · 2 comments

Right now, the structure for container builds of individual projects is to keep each project's Dockerfile in the project's root directory, then use the root of the whole repository as the build context in order to copy local dependencies into the image for installation via Poetry. These dependencies need to be added explicitly in multiple COPY statements, including the code for the project itself. The advantages here are that

  • Dockerfiles get to live with the applications they're intended to execute, keeping things organized
  • Rebuilding is only required when one of the dependency directories changes
  • Dependency code can be volume mapped from the host into the container at runtime for easier development
  • Each project can use the Poetry/Python version required for its purposes (also potentially a disadvantage, see below)

However, the disadvantages are that

  • Individual Dockerfiles are less clear, since the COPY statements are relative to the build context root and not to the directory containing the Dockerfile (which is not obvious unless you inspect the CI YAMLs)
  • Specifying each project as a dependency to itself is redundant. Even having to specify the local dependencies is redundant since technically we should already know these from the project's pyproject.toml or poetry.lock.
  • Images are needlessly bloated by requiring that all the source code be added and live in the container forever, even though in production all we need are the built library wheels
  • As the code base grows, sending the entire repo as the build context to the Docker engine could become really onerous
  • No guarantees that projects are built against the same Poetry and Python versions

Possible Solutions

Use Makefiles as outlined here

Advantages

  • Necessary dependencies and applications themselves are installed into containers automatically
  • make ensures that rebuilds only happen when the relevant libraries change
  • Build contexts are isolated to project directories, reducing the size of the context
  • Only copying built libraries reduces the size of the image
  • Projects are all built against the same (local to build) Python and Poetry versions

Disadvantages

  • Extra dependency on make and familiarity with Makefile syntax
  • Makefile syntax in tools makes certain assumptions about the relative directory depths of applications and libraries
  • Dockerfiles depend on products of local builds, defeating the purpose of isolated container environments
  • Dockerfiles provide almost no clarity about what's going into them

Global base image, project-specific base and build images

The build begins with a global build image, which copies in the shared libraries and installs the desired Poetry version:

ARG PYTHON_TAG
FROM python:${PYTHON_TAG}
ARG POETRY_VERSION
RUN python -m pip install poetry==${POETRY_VERSION}
COPY libs /opt/gw-iaas/libs

built by

docker build -t build .
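Note that since the ARG instructions for PYTHON_TAG and POETRY_VERSION carry no defaults, the build command above would also need --build-arg flags to supply them. A sketch of assembling that invocation (the version values here are only placeholders):

```python
# pinned versions shared by every project build; the values are only examples
build_args = {"PYTHON_TAG": "3.9-slim", "POETRY_VERSION": "1.1.13"}

# each ARG in the Dockerfile is supplied via a --build-arg flag
cmd = ["docker", "build", "-t", "build"]
for name, value in build_args.items():
    cmd += ["--build-arg", f"{name}={value}"]
cmd.append(".")

print(" ".join(cmd))
# docker build -t build --build-arg PYTHON_TAG=3.9-slim --build-arg POETRY_VERSION=1.1.13 .
```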

Then for individual projects, the build starts with a Python script that builds all dependency wheels via something like the following (making docker a dependency in the root pyproject.toml):

import argparse
import pathlib
import re

import docker


parser = argparse.ArgumentParser()
parser.add_argument("--project", required=True, type=str, help="Path to project")
args = parser.parse_args()
project = pathlib.Path(args.project)

# Dockerfile for the intermediate build image; the trailing backslashes
# continue a single RUN instruction that gets extended below
dockerfile = r"""
FROM build
COPY . /opt/build
RUN set +x \
        \
        && mkdir /opt/lib \
""".rstrip("\n")

with open(project / "poetry.lock", "r") as f:
    lockfile = f.read()

start = "\n" + " " * 8
root = "/opt/gw-iaas/libs"


def add(line):
    # chain another command onto the generated RUN instruction
    global dockerfile
    dockerfile += start + "\\"
    dockerfile += start + f"&& {line} \\"


for dep in re.findall("<regex for local deps>", lockfile):
    add(f"cd {root}/{dep}")
    add("poetry build")
    add("cp dist/*.whl /opt/lib")

add("cd /opt/build")
add("poetry build")
add("cp dist/*.whl /opt/lib")
dockerfile = dockerfile[:-2]  # strip the final " \"

# write the generated Dockerfile into the project directory so that
# the project itself can serve as the build context for its COPY
build_dockerfile = project / "build.Dockerfile"
build_dockerfile.write_text(dockerfile)

client = docker.from_env()
try:
    build_image, _ = client.images.build(
        path=str(project),
        dockerfile=build_dockerfile.name,
        tag=f"{project.name}:build",
    )
finally:
    build_dockerfile.unlink()

client.images.build(
    path=str(project),
    tag=project.name,
)

client.images.remove(build_image.id)
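For reference, here's one way the "<regex for local deps>" placeholder above could be filled in, assuming path dependencies appear in poetry.lock with a [package.source] table of type "directory" (the lockfile excerpt and the hermes name are hypothetical):

```python
import re

# hypothetical poetry.lock excerpt for a local path dependency
lockfile = """
[[package]]
name = "hermes"
version = "0.1.0"

[package.source]
type = "directory"
url = "../../libs/hermes"
"""

# one possible regex: grab the directory name from any source url
# that points back into the repo's libs/ directory
deps = re.findall(r'url = "(?:\.\./)+libs/([^"]+)"', lockfile)
print(deps)  # -> ['hermes']
```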

then individual project Dockerfiles would include lines like

COPY --from=<project>:build /opt/lib/*.whl .
RUN pip install *.whl && rm *.whl

Advantages

  • Unifies and isolates Poetry and Python environments used for builds
  • Automates addition of dependencies and project code
  • Project builds have local contexts and COPY paths are relative to Dockerfile location
  • Only installing wheels reduces the size of images

Disadvantages

  • Addition of extra host dependencies
  • Easy for CI, but local builds become more complicated (could solve with a Makefile?)
  • Haven't tested this so no idea if it will actually work
  • Python script obscures what's going into container, makes builds less reproducible (Python script dependent on host environment)

The draft PR tries to create a hybrid of these solutions: it performs the Makefile-style install in an intermediate image at the top of each project-specific Dockerfile using a global build container, adding three lines of boilerplate that I can live with (the monorepo example mentioned above actually does something like this in tools/cloudbuild.yaml).

Currently running into final wheel install issues stemming from poetry/issues#1168. I'll keep monitoring this and trying to come up with workarounds as time permits. The monorepo code manages to get around this, but it's not clear to me how.

Note that while the draft PR contains Python code for reference, this likely won't be part of the final PR, both because it's unnecessary given the current build structure and because it's probably bad Docker practice.