This is not an official AWS repository. The code is provided "as is".
This repository implements a port of the latest Detectron2 ("D2") to Amazon Sagemaker Components for Kubeflow Pipelines. The scope includes:
- fine-tuning a D2 model on a custom dataset using Sagemaker distributed training and hosting via Kubeflow Pipelines.
Amazon Sagemaker uses Docker containers for both training and inference:
- `Dockerfile` is the training container; sources from the `container_training` directory will be added at training time;
- `Dockerfile.serving` is the serving container; the `container_serving` directory will be added at inference time;
- `Dockerfile.dronetraining` is a custom training container for the custom dataset.
Note: by default, the training container compiles Detectron2 for the Volta architecture (Tesla V100 GPUs). If you'd like to run training on other GPU architectures, update the TORCH_CUDA_ARCH_LIST environment variable. Here is an example of how to compile Detectron2 for all supported architectures:
ENV TORCH_CUDA_ARCH_LIST="Kepler;Kepler+Tesla;Maxwell;Maxwell+Tegra;Pascal;Volta;Turing"
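If you only need a few architectures, a narrower value keeps the build shorter. The sketch below builds such a value from numeric compute capabilities, which this variable also accepts alongside architecture names. The helper `arch_list_for` and the GPU table are illustrative assumptions and are not part of this repository.

```python
# Hypothetical helper (not part of this repository): build a
# TORCH_CUDA_ARCH_LIST value for the GPUs you plan to train on.

# Compute capabilities of a few NVIDIA GPUs.
GPU_COMPUTE_CAPABILITY = {
    "K80": "3.7",
    "V100": "7.0",
    "T4": "7.5",
}


def arch_list_for(gpus):
    """Return a semicolon-separated, de-duplicated, ascending arch list."""
    capabilities = {GPU_COMPUTE_CAPABILITY[gpu] for gpu in gpus}
    return ";".join(sorted(capabilities, key=float))


print(arch_list_for(["V100", "T4"]))  # 7.0;7.5
```

The resulting string can be used as the value of `TORCH_CUDA_ARCH_LIST` in the Dockerfile.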
Custom container: for this demo, we extended one of the existing, predefined SageMaker deep learning framework containers.
Amazon SageMaker provides prebuilt Docker images that include deep learning framework libraries and other dependencies needed for training and inference. With the Amazon SageMaker Python SDK, you can train and deploy models using these popular deep learning frameworks. These prebuilt Docker images are stored in Amazon Elastic Container Registry (Amazon ECR).
The AWS Deep Learning Containers for PyTorch include containers for training on CPU and GPU, optimized for performance and scale on AWS. These Docker images have been tested with Amazon SageMaker, EC2, ECS, and EKS, and provide stable versions of NVIDIA CUDA, cuDNN, Intel MKL, and other required software components to provide a seamless user experience for deep learning workloads. All software components in these images are scanned for security vulnerabilities and updated or patched in accordance with AWS Security best practices.
You can customize these prebuilt containers or extend them to handle any additional functional requirements for your algorithm or model that the prebuilt Amazon SageMaker Docker image doesn't support.
See `d2_byoc_coco2017_training.ipynb` for an end-to-end example of how to train your Detectron2 model on Sagemaker. The current implementation supports both multi-node and multi-GPU training on a Sagemaker distributed cluster.
- To define the parameters of your distributed training cluster, use the Sagemaker Estimator configuration:
d2 = sagemaker.estimator.Estimator(...
    train_instance_count=2,
    train_instance_type='ml.p3.16xlarge',
    train_volume_size=100,
    ...
)
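The instance type and count above determine how many training processes take part in the job. The sketch below computes that number under the assumption of one process per GPU. The function `world_size` and the lookup table are illustrative and are not part of the SageMaker SDK.

```python
# GPUs per instance for the instance types mentioned in this README.
GPUS_PER_INSTANCE = {
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
}


def world_size(instance_type, instance_count):
    """Total number of training processes, assuming one process per GPU."""
    return GPUS_PER_INSTANCE[instance_type] * instance_count


# Two ml.p3.16xlarge instances, as in the Estimator configuration above.
print(world_size("ml.p3.16xlarge", 2))  # 16
```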
The Detectron2 config is defined in the Sagemaker hyperparameters dict:
hyperparameters = {
    "config-file": "COCO-InstanceSegmentation/mask_rcnn_R_101_C4_3x.yaml",
    # "local-config-file": "config.yaml",  # to supply a custom config file, add it to the container_training folder and provide its name here
    "resume": "True",  # whether to re-use weights from the pre-trained model
    "eval-only": "False",  # whether to perform only D2 model evaluation
    # opts are D2 model configuration options as defined here: https://detectron2.readthedocs.io/modules/config.html#config-references
    # this is a way to override individual parameters in the D2 configuration from the Sagemaker API
    "opts": "SOLVER.MAX_ITER 20000",
}
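Sagemaker passes each hyperparameter to the training script as a `--key value` command-line argument. A minimal sketch of how a script could parse the dict above is shown here; the parser layout and the `str2bool` helper are assumptions for illustration and may differ from the actual training script in this repository.

```python
import argparse


def str2bool(value):
    """Hyperparameters arrive as strings, so "True"/"False" need converting."""
    return str(value).lower() in ("true", "1", "yes")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--config-file", default=None)
    parser.add_argument("--local-config-file", default=None)
    parser.add_argument("--resume", type=str2bool, default=True)
    parser.add_argument("--eval-only", type=str2bool, default=False)
    # Space-separated "KEY VALUE" pairs, kept as a single string here.
    parser.add_argument("--opts", default="")
    return parser.parse_args(argv)


# Simulate the arguments produced from the hyperparameters dict above.
args = parse_args([
    "--config-file", "COCO-InstanceSegmentation/mask_rcnn_R_101_C4_3x.yaml",
    "--resume", "True",
    "--eval-only", "False",
    "--opts", "SOLVER.MAX_ITER 20000",
])
print(args.config_file, args.resume, args.eval_only, args.opts.split())
```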
There are three ways to fine-tune your Detectron2 configuration:
- you can use one of the config files authored by Detectron2 (e.g. `"config-file":"COCO-InstanceSegmentation/mask_rcnn_R_101_C4_3x.yaml"`);
- you can define your own config file and store it in the `container_training` folder. In this case, set the `local-config-file` parameter to the name of the desired config file. Note that you can choose either `config-file` or `local-config-file`;
- you can modify individual parameters of the Detectron2 configuration via the `opts` list (e.g. `"opts": "SOLVER.MAX_ITER 20000"` above).
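The selection logic described in the three options above can be sketched as follows. This is an illustration only: the function `resolve_config` and its return shape are hypothetical and are not taken from the training script.

```python
def resolve_config(hyperparameters):
    """Pick the config source and split the overrides into a flat list.

    Returns (source, name, overrides), where source is "detectron2" for a
    Detectron2-authored config or "local" for a file in container_training.
    """
    config_file = hyperparameters.get("config-file")
    local_config_file = hyperparameters.get("local-config-file")

    if config_file and local_config_file:
        raise ValueError("Set either config-file or local-config-file, not both")
    if not config_file and not local_config_file:
        raise ValueError("One of config-file or local-config-file is required")

    source = "local" if local_config_file else "detectron2"
    name = local_config_file or config_file

    # "SOLVER.MAX_ITER 20000" -> ["SOLVER.MAX_ITER", "20000"], i.e. a flat
    # KEY, VALUE, KEY, VALUE, ... list of configuration overrides.
    overrides = hyperparameters.get("opts", "").split()
    if len(overrides) % 2 != 0:
        raise ValueError("opts must contain KEY VALUE pairs")

    return source, name, overrides


print(resolve_config({
    "config-file": "COCO-InstanceSegmentation/mask_rcnn_R_101_C4_3x.yaml",
    "opts": "SOLVER.MAX_ITER 20000",
}))
```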
See the `d2_custom_drone_dataset.ipynb` notebook for details.
- Dockerfile – for our training we use a custom container, which is extended from the Sagemaker PyTorch 1.5.1 container. We do so because we need to build Detectron2 from source. The Dockerfile:
  - defines the base container;
  - installs the required dependencies for Detectron2;
  - copies the training script and utilities to the container;
  - builds Detectron2 from source.
- A few highlights in the container:
  - L32 – copies the directory with the training script into the container;
  - L38 – defines the env variable for the directory where the training code will be uploaded;
  - L39 – defines the training script via a pre-defined Sagemaker variable.
  - Note: Sagemaker will start the training job by running "python SAGEMAKER_PROGRAM --arg1 --arg2 …".
- Train_drone.py – the main training script file. Highlights:
  - #L199-L215 – argument parsing. These arguments will be defined by the K8S YAML file.
    - Line #207 – we use the Detectron2 launch utility to start training on multiple nodes.
  - #L217-L225 – starts the distributed training job. We use the Detectron2 launch utility, which in turn uses PyTorch's native distributed.launch. As the communication backend we use NCCL, as it gives the best performance for GPU-based communication.
  - #L125-L147 – parameters of the training world. We use environment variables defined by Sagemaker to identify the number of hosts, the number of processes, and the host rank, and pass these parameters to the Detectron2 launch utility.
    - Line #143 – in order to coordinate communication between different nodes, we need to gather details about our training world, such as the number of nodes, the number of devices, etc.
  - #L181-L193 – the main method for training. Once training is done, we save the model in one of the processes. The model artifact will be stored in the SM_OUTPUT_MODEL_DATA dir. The content of this directory is automatically uploaded by Sagemaker to S3.
    - Line #186 – Sagemaker will start training by running the code in the main guard on all training nodes;
    - Line #8 / #169 – the actual training script, which will be executed on all processes and all GPU devices. Under the hood, the PyTorch DistributedDataParallel class ensures coordination of tasks between GPU devices.
      - DistributedDataParallel is a model wrapper that implements multiprocessing across multiple nodes.
        - DistributedDataParallel (DDP) implements data parallelism at the module level and can run across multiple machines. Applications using DDP should spawn multiple processes and create a single DDP instance per process. DDP uses collective communications in the torch.distributed package to synchronize gradients and buffers. More specifically, DDP registers an autograd hook for each parameter given by model.parameters(), and the hook fires when the corresponding gradient is computed in the backward pass. DDP then uses that signal to trigger gradient synchronization across processes.
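The "parameters of the training world" step described above can be sketched as follows. It reads the `SM_HOSTS`, `SM_CURRENT_HOST`, and `SM_NUM_GPUS` environment variables that Sagemaker sets inside training containers and derives values matching the arguments of the Detectron2 launch utility. The function name `training_world` and the port number are assumptions for illustration; the actual script may differ.

```python
import json


def training_world(env, port=55555):
    """Derive launch parameters from Sagemaker environment variables.

    `port` is an arbitrary, hypothetical choice for the rendezvous port.
    """
    # SM_HOSTS is a JSON list such as '["algo-1", "algo-2"]'. Sorting gives
    # every host the same ordering, so ranks are consistent across the cluster.
    hosts = sorted(json.loads(env["SM_HOSTS"]))
    current_host = env["SM_CURRENT_HOST"]
    num_gpus = int(env["SM_NUM_GPUS"])

    return {
        "num_machines": len(hosts),
        "machine_rank": hosts.index(current_host),
        "num_gpus_per_machine": num_gpus,
        "world_size": len(hosts) * num_gpus,
        # The first host in the sorted list acts as the rendezvous point.
        "dist_url": f"tcp://{hosts[0]}:{port}",
    }


# Example values for the second host of a two-node, 8-GPU-per-node cluster.
example_env = {
    "SM_HOSTS": '["algo-1", "algo-2"]',
    "SM_CURRENT_HOST": "algo-2",
    "SM_NUM_GPUS": "8",
}
print(training_world(example_env))
```

In a real training job, `os.environ` would be passed in place of `example_env`.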
Training heavy-weight DNNs such as Mask R-CNN requires high per-GPU memory so that you can pump one or more high-resolution images through the training pipeline. It also requires a high-speed GPU-to-GPU interconnect and high-speed networking between machines so that the synchronized allreduce of gradients can be done efficiently. Amazon SageMaker ml.p3.16xlarge and ml.p3dn.24xlarge instance types meet all these requirements. For more information, see Amazon SageMaker ML Instance Types. With eight Nvidia Tesla V100 GPUs, 128–256 GB GPU memory, 25–100 Gbps networking interconnect, and high-speed Nvidia NVLink GPU-to-GPU interconnect, they are ideally suited for distributed training on Amazon SageMaker.
- Amazon SageMaker requires the training algorithm and frameworks packaged in a Docker image.
- The Docker image must be enabled for Amazon SageMaker training. This enablement is simplified through the use of Amazon SageMaker containers, which is a library that helps create Amazon SageMaker-enabled Docker images.
- You need to provide an entry point script (typically a Python script) in the Amazon SageMaker training image to act as an intermediary between Amazon SageMaker and your algorithm code.
- To start training on a given host, Amazon SageMaker runs a Docker container from the training image and invokes the entry point script with entry point environment variables that provide information such as hyperparameters and the location of input data.
- The entry point script uses the information passed to it in the entry point environment variables to start your algorithm program with the correct args and polls the running algorithm process.
- When the algorithm process exits, the entry point script exits with the exit code of the algorithm process. Amazon SageMaker uses this exit code to determine the success or failure of the training job.
- The entry point script redirects the output of the algorithm process’ stdout and stderr to its own stdout. In turn, Amazon SageMaker captures the stdout from the entry point script and sends it to Amazon CloudWatch Logs. Amazon SageMaker parses the stdout output for algorithm metrics defined in the training job and sends the metrics to Amazon CloudWatch metrics.
- When Amazon SageMaker starts a training job that requests multiple training instances, it creates a set of hosts and logically names each host as algo-k, where k is the global rank of the host. For example, if a training job requests four training instances, Amazon SageMaker names the hosts as algo-1, algo-2, algo-3, and algo-4. The hosts can connect on the network using these hostnames.
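The entry point behavior described in the list above (start the algorithm process, forward its output to stdout, and propagate its exit code) can be sketched as follows. This is a simplified illustration and not the code used by Amazon SageMaker containers; `run_algorithm` is a hypothetical name.

```python
import subprocess
import sys


def run_algorithm(command):
    """Run the algorithm process, forward its output, and return its exit code."""
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        text=True,
    )
    # Stream the child's output line by line to our own stdout, which is
    # what Sagemaker captures and sends to CloudWatch Logs.
    for line in process.stdout:
        sys.stdout.write(line)
    return process.wait()


# Stand-in for a training command: prints one line and exits with code 3.
exit_code = run_algorithm(
    [sys.executable, "-c", "print('training step'); raise SystemExit(3)"]
)
print("algorithm exit code:", exit_code)
# A real entry point would finish with sys.exit(exit_code) so that
# Sagemaker sees the algorithm's success or failure.
```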