Azure Machine Learning base images

This repository contains Dockerfiles for the base images used in Azure Machine Learning.

Introduction
Base image dependencies
How to get Azure ML images
- Featured tags
Using Azure ML base images for training
Using your own custom Docker image or Dockerfile for training
Resources

Introduction

These Docker images serve as base images for training and inference in Azure ML. While submitting a training job on AmlCompute or any other target with Docker enabled, Azure ML runs your job in a conda environment within a Docker container.

You can also use these Docker images as base images for your custom Azure ML Environments. If you specify any conda dependencies in your Environment, the extra dependencies are installed on top of the dependencies in the Docker image.

Note that these base images do not come with Python packages, notably the Azure ML Python SDK, installed. If you require the Azure ML SDK package for your job, make sure you also install the appropriate package.

Please note that images supporting Ubuntu 16.04 are now deprecated. We recommend using images supporting Ubuntu 18.04 for the timebeing as we transition towards providing 20.04 images.

Base image dependencies

Currently Azure ML supports cuda9, cuda10 and cuda11 base images. The major dependencies installed in the base images are Miniconda, OpenMPI, CUDA, cuDNN, NCCL, and git. For more detailed information, please view the dockerfiles.

The CPU images are built from ubuntu18.04 and ubuntu20.04.

The GPU images for cuda9 are built from nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04.

The GPU images for cuda10 are built from:

nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04
nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04

The GPU images for cuda11 are built from:

nvidia/cuda:11.0.3-cudnn8-devel-ubuntu18.04
nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
nvidia/cuda:11.1.1-cudnn8-devel-ubuntu20.04

How to get the Azure ML images

All images in this repository are published to Microsoft Container Registry (MCR).

You can pull these images from MCR using the following command:

CPU image example: docker pull mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
GPU image example: docker pull mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04

If you observe the naming convention, image tag maps to the folder name that contains the corresponding Dockerfile.

GPU images pulled from MCR can only be used with Azure Services. Take a look at LICENSE.txt file inside the docker container for more information. GPU images are built from nvidia images. For NVIDIA CUDA and cuDNN take a look at the ThirdPartyNotices.txt file inside the docker container for more information about NVIDIA’s license terms

Featured tags

Below is the list of tags:

OpenMPI CPU - Ubuntu 20.04
- openmpi4.1.0-ubuntu20.04
OpenMPI CPU - Ubuntu 22.04
- openmpi4.1.0-ubuntu22.04
OpenMPI GPU - cuda11.1 - Ubuntu 20.04
- openmpi4.1.0-cuda11.1-cudnn8-ubuntu20.04
OpenMPI GPU - cuda11.2 - Ubuntu 20.04
- openmpi4.1.0-cuda11.2-cudnn8-ubuntu20.04
OpenMPI GPU - cuda11.3 - Ubuntu 20.04
- openmpi4.1.0-cuda11.3-cudnn8-ubuntu20.04
OpenMPI GPU - cuda11.6 - Ubuntu 20.04
- openmpi4.1.0-cuda11.6-cudnn8-ubuntu20.04
OpenMPI GPU - cuda11.8 - Ubuntu 22.04
- openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04

Using Azure ML base images for training

In some cases, the Azure ML base images will be used by default:

By default, if no base image is explicitly set by the user for a training run, Azure ML will use the image corresponding to azureml.core.environment.DEFAULT_CPU_IMAGE.

If you are using an Azure ML curated environment, those are already configured with one of the Azure ML base images. To see which base image is used by a specific curated environment, you can run the following:

from azureml.core import Environment

curated_env_name = 'AzureML-pytorch-1.7-ubuntu18.04-py37-cuda11-gpu'
pytorch_env = Environment.get(workspace=ws, name=curated_env_name)
print(pytorch_env.docker.base_image)

If you want to instead explicitly use one of the Azure ML base images for your job, you can follow the steps below.

Prerequisites

Install Azure ML SDK and setup environment
Quickstarts, end-to-end tutorials, and how-tos on the official documentation site for Azure Machine Learning service.
Python SDK reference

Create an Azure ML Environment

If your training script requires additional dependencies, create a YAML file that defines the conda dependencies. In the below example, the file is named conda_dependencies.yml:

channels:
- conda-forge
dependencies:
- python=3.6.2
- pip:
  - azureml-defaults
  - tensorflow-gpu==2.2.0

Then, create an Azure ML environment from this conda environment specification.

from azureml.core import Environment

env = Environment.from_conda_specification(name='my-env', file_path='./conda_dependencies.yml')

If your script does not require any additional dependencies and you would just like to use the base image directly, just instantiate an Environment object with the following:

from azureml.core import Environment

env = Environment(name='my-env')

Then, for both of the above cases, set the base image you would like to use. For example, here we will specify the openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04 image:

env.docker.enabled = True
env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'

Configure and submit the training job

Create a ScriptRunConfig object to specify the configuration details of your training job, including your training script, environment to use, and the compute target to run on.

from azureml.core import Workspace, Experiment
from azureml.core import ScriptRunConfig

ws = Workspace.from_config()
compute_target = ws.compute_targets['my-cluster-name']

src = ScriptRunConfig(source_directory='.',
                      script='train.py',
                      compute_target=compute_target,
                      environment=env)
                      
run = Experiment(workspace=ws, name='my-experiment').submit(src)
run.wait_for_completion(show_output=True)

What happens during job execution

As the job is executed, it goes through the following stages:

Preparing: A docker image is created according to the environment defined. The image is uploaded to the workspace's Azure Container Registry and cached for later runs. A new Docker image is built if this is the first time a combination of dependencies are used in a workspace. If not, a cached Docker image is used. Logs are also streamed to the run history and can be viewed to monitor progress. If a curated environment is specified instead, the cached image backing that curated environment will be used.
Scaling: The cluster attempts to scale up if the cluster requires more nodes to execute the run than are currently available.
Running: All scripts in the script folder are uploaded to the compute target, any datasets specified are mounted or downloaded, and the script is executed. Outputs from stdout and the ./logs folder are streamed to the run history and can be used to monitor the run.
Post-Processing: The ./outputs folder of the run is copied over to the run history.

Using your own custom Docker image or Dockerfile for training

If you instead want to use your own custom Docker image or Dockerfile for your training job instead of the Azure ML base images, you can refer to the documentation Train using a custom image.

Resources

For additional documentation and tutorials, see the following:

Projects using Azure Machine Learning

Visit following repositories to see the projects contributed by Azure ML users:

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Azure/AzureML-Containers