ml-containers

Repository for building ML images at CoreWeave

Index

See the list of all published images.

Special PyTorch Images: PyTorch Base Images, PyTorch Extras, and PyTorch Nightly.

PyTorch Base Images

CoreWeave provides custom builds of PyTorch, torchvision and torchaudio tuned for our platform in a single container image, ml-containers/torch.

Versions compiled against CUDA 11.8.0, 12.0.1, 12.1.1, and 12.2.2 are available in this repository, with two variants:

  1. base: Tagged as ml-containers/torch:a1b2c3d-base-....
    1. Built from nvidia/cuda:...-base-ubuntu22.04 as a base.
    2. Only includes essentials (CUDA, torch, torchvision, torchaudio), so it has a small image size, making it fast to launch.
  2. nccl: Tagged as ml-containers/torch:a1b2c3d-nccl-....
    1. Built from ghcr.io/coreweave/nccl-tests as a base.
    2. Ultimately inherits from nvidia/cuda:...-cudnn8-devel-ubuntu22.04.
    3. Larger, but includes development libraries and build tools, such as nvcc, that are needed to compile other PyTorch extensions.
    4. These PyTorch builds use component libraries optimized for the CoreWeave cloud; see coreweave/nccl-tests.
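
As a rough illustration of the tag convention above, the following sketch classifies an image reference by its variant marker. The function name `torch_variant` is hypothetical and not part of this repository. It assumes only what is stated above: that the variant name (base or nccl) appears as a hyphen-separated component of the tag. The remainder of the tag is elided in this README, so the examples reuse the elided tags verbatim.

```python
def torch_variant(image_ref):
    """Return "base" or "nccl" for an ml-containers/torch image reference.

    Hypothetical helper: it only relies on the variant name appearing as a
    hyphen-separated component of the tag, as described in the list above.
    """
    # The tag is whatever follows the last ":" in the reference.
    _, sep, tag = image_ref.rpartition(":")
    if not sep or "/" in tag:
        raise ValueError(f"no tag found in image reference: {image_ref!r}")

    # Return the first component that names a known variant.
    for part in tag.split("-"):
        if part in ("base", "nccl"):
            return part
    raise ValueError(f"tag does not name a base or nccl variant: {tag!r}")
```

For example, `torch_variant("ml-containers/torch:a1b2c3d-nccl-...")` returns `"nccl"`, which tells you the image includes nvcc and the development libraries.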

Note

Most torch images have both a variant built on Ubuntu 22.04 and a variant built on Ubuntu 20.04.

  • CUDA 11.8.0 is an exception, and is only available on Ubuntu 20.04.
  • Ubuntu 22.04 images use Python 3.10.
  • Ubuntu 20.04 images use Python 3.8.
  • The base distribution is indicated in the container image tag.
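
The note above can be summarized as a small lookup. This is a sketch only: the names `PYTHON_BY_UBUNTU`, `CUDA_VERSIONS`, and `python_versions_for` are hypothetical. It also assumes that every listed CUDA version other than 11.8.0 has both Ubuntu variants, whereas the note only promises this for "most" images. Check the published image list before relying on a specific combination.

```python
# Python version shipped with each Ubuntu base, as listed in the note above.
PYTHON_BY_UBUNTU = {"20.04": "3.8", "22.04": "3.10"}

# CUDA versions this README lists for the torch images.
CUDA_VERSIONS = ("11.8.0", "12.0.1", "12.1.1", "12.2.2")


def python_versions_for(cuda):
    """Map each Ubuntu base assumed to exist for `cuda` to its Python version."""
    if cuda not in CUDA_VERSIONS:
        raise ValueError(f"CUDA version not listed in the README: {cuda!r}")
    if cuda == "11.8.0":
        # The documented exception: only available on Ubuntu 20.04.
        bases = ["20.04"]
    else:
        # Assumption: all other listed versions have both variants.
        bases = ["20.04", "22.04"]
    return {ubuntu: PYTHON_BY_UBUNTU[ubuntu] for ubuntu in bases}
```

Under these assumptions, `python_versions_for("11.8.0")` yields only the Ubuntu 20.04 / Python 3.8 pairing, while the other CUDA versions yield both pairings.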

PyTorch Extras

ml-containers/torch-extras extends the ml-containers/torch images with a set of common PyTorch extensions:

  1. DeepSpeed
  2. FlashAttention
  3. NVIDIA Apex

Each one is compiled specially against the custom PyTorch builds in ml-containers/torch.

Both base and nccl editions of ml-containers/torch-extras are available, matching those of ml-containers/torch. The base edition stays small because a multi-stage build keeps the CUDA development libraries out of the final image, even though those libraries are required to build the extensions themselves.

PyTorch Nightly

ml-containers/nightly-torch is an experimental, nightly release channel of the PyTorch Base Images in the style of PyTorch's own nightly preview builds, featuring the latest development versions of torch, torchvision, and torchaudio pulled daily from GitHub and compiled from source.

ml-containers/nightly-torch-extras is a version of PyTorch Extras built on top of the ml-containers/nightly-torch container images. These are not nightly versions of the extensions themselves, but rather match the extension versions in the regular PyTorch Extras containers.

⚠ The PyTorch Nightly containers are based on unstable, experimental preview builds of PyTorch, and should be expected to contain bugs and other issues. For more stable containers use the PyTorch Base Images and PyTorch Extras containers.

Organization

This repository contains multiple container image Dockerfiles. Each is expected to live in its own folder, along with any other files needed for the build.

CI Builds (Actions)

The current CI builds run when changes to files in the respective folders are detected, so that only the changed container images are built. There is one action per image, and each calls a reusable base action, build.yml. The reusable action accepts several inputs:

  • folder - the folder containing the Dockerfile for the image
  • image-name - the name to use for the image
  • build-args - arguments to pass to the docker build

Images built from the same source can share one action, since the main reason for having multiple actions is to build only the images that changed. A build matrix can be helpful in these cases: https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs.
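
The idea of building only the images that changed can be illustrated outside of GitHub Actions. The sketch below is not the repository's CI code. The function `images_to_build` and the folder names in the usage note are hypothetical, and it assumes each image lives in its own top-level folder, as described under Organization. Given a list of changed file paths, it returns the image folders whose builds would be triggered.

```python
from pathlib import PurePosixPath


def images_to_build(changed_files, image_folders):
    """Return the sorted image folders that contain at least one changed file."""
    known = set(image_folders)
    triggered = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # A file triggers a build only if it sits inside a known image folder;
        # top-level files such as a README trigger nothing.
        if len(parts) > 1 and parts[0] in known:
            triggered.add(parts[0])
    return sorted(triggered)
```

With the hypothetical folders `torch`, `torch-extras`, and `nightly-torch`, a change set of `torch/Dockerfile`, `README.md`, and `torch-extras/scripts/build.sh` would trigger only the `torch` and `torch-extras` builds.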