Comparing Amazon S3, Amazon S3 FastFile mode, Amazon EFS, and Amazon FSx for Lustre.
- Manage your SageMaker service limits. You will need a minimum service limit of 2 `ml.p3.16xlarge` and 2 `ml.p3dn.24xlarge` instances, but a service limit of 4 for each instance type is recommended. Keep in mind that the service limit is specific to each AWS region. We recommend using the `us-west-2` region for this tutorial.
- Create an Amazon S3 bucket in the AWS region where you would like to execute this tutorial. Save the S3 bucket name; you will need it later.
In this tutorial, our focus is distributed TensorFlow training using Amazon SageMaker.
Concretely, we will discuss distributed TensorFlow training for the TensorPack Mask/Faster-RCNN and AWS Samples Mask R-CNN algorithms using the COCO 2017 dataset.
This tutorial has two key steps:
- We use AWS CloudFormation to create a new SageMaker notebook instance in an Amazon Virtual Private Cloud (VPC).
- We use the SageMaker notebook instance to launch distributed training jobs in the VPC, using Amazon S3, Amazon EFS, or Amazon FSx for Lustre as the data source for the training data pipeline.
If you are viewing this page from a SageMaker notebook instance and wondering why you need a new SageMaker notebook instance: your current SageMaker notebook instance may not be running in a VPC, may not have an IAM role attached that provides access to the required AWS resources, or may not have access to the EFS mount targets that we need for this tutorial.
Our objective in this step is to create a SageMaker notebook instance in a VPC. We have two options. We can create a SageMaker notebook instance in a new VPC, or we can create the notebook instance in an existing VPC. We cover both options below.
The AWS IAM user or AWS IAM role executing this step requires AWS IAM permissions consistent with the Network Administrator job function.
The CloudFormation template `cfn-sm.yaml` can be used to create a CloudFormation stack that creates a SageMaker notebook instance in a new VPC.
You can create the CloudFormation stack using `cfn-sm.yaml` directly in the CloudFormation service console.
Alternatively, you can customize the variables in the `stack-sm.sh` script and execute the script anywhere you have the AWS Command Line Interface (CLI) installed. The CLI option is detailed below:
- Install the AWS CLI.
- In `stack-sm.sh`, set `AWS_REGION` to your AWS region and `S3_BUCKET` to your S3 bucket name. These two variables are required.
- Optionally, set the `EFS_ID` variable if you want to use an existing EFS file-system. If you leave `EFS_ID` blank, a new EFS file-system is created. If you choose to use an existing EFS file-system, make sure the existing file-system does not have any existing mount targets.
- Optionally, specify `GIT_URL` to add a GitHub repository to the SageMaker notebook instance. If the GitHub repository is private, you can specify the `GIT_USER` and `GIT_TOKEN` variables.
- Execute the customized `stack-sm.sh` script to create the CloudFormation stack using the AWS CLI.
The estimated time for creating this CloudFormation stack is 9 minutes. The stack will create the following AWS resources:
- A SageMaker execution role
- A Virtual Private Cloud (VPC) with an Internet Gateway (IGW), 1 public subnet, 3 private subnets, a NAT gateway, a Security Group, and a VPC Gateway Endpoint to S3
- An optional Amazon EFS file system with mount targets in each private subnet in the VPC.
- A SageMaker Notebook instance in the VPC:
- The EFS file-system is mounted on the SageMaker notebook instance
- The SageMaker execution role attached to the notebook instance provides appropriate IAM access to AWS resources
Save the summary output of the script. You will need it later. You can also view the output under CloudFormation Stack Outputs tab in AWS Management Console.
This option is only recommended for advanced AWS users. Make sure your existing VPC has the following:
- One or more security groups
- One or more private subnets with NAT Gateway access and existing EFS file-system mount targets
- A VPC gateway endpoint to S3
Create a SageMaker notebook instance in a VPC using the AWS SageMaker console. When you are creating the SageMaker notebook instance, add at least 100 GB of local EBS volume under the advanced configuration options. You will also need to mount your EFS file-system on the SageMaker notebook instance.
In the SageMaker console, open the Jupyter Lab notebook server you created in the previous step. In this Jupyter Lab instance, there are four Jupyter notebooks for training Mask R-CNN. All four notebooks use the SageMaker TensorFlow Estimator in Script Mode, whereby we can keep the SageMaker entry point script outside the Docker container and pass it as a parameter to the SageMaker TensorFlow Estimator. The SageMaker TensorFlow Estimator also allows us to specify the distribution type, which means we don't have to write code in the entry point script to manage SageMaker distributed training, which greatly simplifies the entry point script.
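As a minimal sketch, the Estimator arguments the notebooks build up look roughly like the following. The entry point name, instance count, and MPI settings here are illustrative assumptions, not values copied from the actual notebooks; in the notebooks these are passed to `sagemaker.tensorflow.TensorFlow`.

```python
# Hypothetical sketch of the arguments passed to the SageMaker
# TensorFlow Estimator (sagemaker.tensorflow.TensorFlow) in Script Mode.
# The entry point script name and MPI settings are placeholders.
estimator_args = {
    # Script Mode: the entry point script lives outside the Docker
    # container and is handed to the Estimator as a parameter.
    "entry_point": "train.py",            # hypothetical entry point script
    "instance_type": "ml.p3.16xlarge",
    "instance_count": 2,
    # The `distribution` argument tells SageMaker how to launch
    # distributed training, so the entry point script needs no
    # cluster-management code of its own.
    "distribution": {
        "mpi": {
            "enabled": True,
            "processes_per_host": 8,      # one process per GPU on ml.p3.16xlarge
        }
    },
}

print(estimator_args["distribution"]["mpi"]["processes_per_host"])  # → 8
```

Because distribution is declared here rather than in the training script, switching between single-node and multi-node training is a configuration change, not a code change.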
The four SageMaker Script Mode notebooks for training Mask R-CNN are listed below:
- Mask R-CNN notebook that uses an S3 bucket as the data source: `mask-rcnn-scriptmode-s3.ipynb`
- Mask R-CNN notebook that uses an S3 bucket with FastFile mode as the data source: `mask-rcnn-scriptmode-s3-ffm.ipynb`
- Mask R-CNN notebook that uses an EFS file-system as the data source: `mask-rcnn-scriptmode-efs.ipynb`
- Mask R-CNN notebook that uses an FSx for Lustre file-system as the data source: `mask-rcnn-scriptmode-fsx.ipynb`
Below, we compare the four options: Amazon S3, Amazon S3 FastFile mode, Amazon EFS, and Amazon FSx for Lustre:
| Data Source | Description |
| --- | --- |
| Amazon S3 | Each time the SageMaker training job is launched, it takes approximately 28 minutes to download the COCO 2017 dataset from your S3 bucket to the Amazon EBS volume attached to each training instance. During training, data is input to the training data pipeline from the EBS volume attached to each training instance. |
| Amazon S3 FastFile mode | Each time the SageMaker training job is launched, it takes approximately 3 minutes to download the COCO 2017 dataset from your S3 bucket to the Amazon EBS volume attached to each training instance using FastFile mode. During training, data is input to the training data pipeline from the EBS volume attached to each training instance. |
| Amazon EFS | It takes approximately 46 minutes to copy the COCO 2017 dataset from your S3 bucket to your EFS file-system. You only need to copy this data once. During training, data is input from the shared Amazon EFS file-system mounted on all the training instances. |
| Amazon FSx for Lustre | It takes approximately 10 minutes to create a new FSx for Lustre file-system and import the COCO 2017 dataset from your S3 bucket into the new FSx for Lustre file-system. You only need to do this once. During training, data is input from the shared Amazon FSx for Lustre file-system mounted on all the training instances. |
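The differences in the table come down to how each data channel is described to the training job. The sketch below shows the shape of those channel definitions as plain dictionaries; the bucket name, file-system IDs, and directory paths are placeholders, not values from the notebooks. For the file-system options, these fields correspond to the parameters of `sagemaker.inputs.FileSystemInput`.

```python
# Hypothetical sketch of the training-data channels for each data source.
# All identifiers below (bucket, file-system ID, paths) are placeholders.

# S3 (default File mode) and S3 FastFile mode differ only in input mode:
s3_channel = {"s3_uri": "s3://my-bucket/coco2017", "input_mode": "File"}
s3_fastfile_channel = {"s3_uri": "s3://my-bucket/coco2017", "input_mode": "FastFile"}

# EFS and FSx for Lustre mount a shared file-system on every training
# instance; the channel names the file-system rather than an S3 URI.
efs_channel = {
    "file_system_id": "fs-0123456789abcdef0",   # placeholder EFS ID
    "file_system_type": "EFS",
    "directory_path": "/mask-rcnn/coco2017",    # placeholder path
    "file_system_access_mode": "ro",            # read-only for training data
}

# The FSx for Lustre channel has the same shape with a different type
# (and, in practice, a different file-system ID).
fsx_channel = dict(efs_channel, file_system_type="FSxLustre")
```

Note the trade-off the table describes: the S3 options pay a per-job download cost, while the EFS and FSx options pay a one-time copy or import cost and then share one file-system across all training instances and jobs.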
In all four cases, the logs and model checkpoints output during training are written to the EBS volume attached to each training instance, and uploaded to your S3 bucket when training completes. The logs are also fed into CloudWatch as training progresses and can be reviewed during training.
System and model training metrics are fed into Amazon CloudWatch metrics during training and can be visualized in SageMaker console.
Below are sample experiment results for the two algorithms, after training for 24 epochs on the COCO 2017 dataset:
- TensorPack Mask/Faster-RCNN algorithm
  - coco_val2017-mAP(bbox)/IoU=0.5: 0.59231
  - coco_val2017-mAP(bbox)/IoU=0.5:0.95: 0.3844
  - coco_val2017-mAP(bbox)/IoU=0.75: 0.41564
  - coco_val2017-mAP(bbox)/large: 0.51084
  - coco_val2017-mAP(bbox)/medium: 0.41643
  - coco_val2017-mAP(bbox)/small: 0.21634
  - coco_val2017-mAP(segm)/IoU=0.5: 0.56011
  - coco_val2017-mAP(segm)/IoU=0.5:0.95: 0.34917
  - coco_val2017-mAP(segm)/IoU=0.75: 0.37312
  - coco_val2017-mAP(segm)/large: 0.48118
  - coco_val2017-mAP(segm)/medium: 0.37815
  - coco_val2017-mAP(segm)/small: 0.18192
- AWS Samples Mask R-CNN algorithm
  - mAP(bbox)/IoU=0.5: 0.5983
  - mAP(bbox)/IoU=0.5:0.95: 0.38296
  - mAP(bbox)/IoU=0.75: 0.41296
  - mAP(bbox)/large: 0.50688
  - mAP(bbox)/medium: 0.41901
  - mAP(bbox)/small: 0.21421
  - mAP(segm)/IoU=0.5: 0.56733
  - mAP(segm)/IoU=0.5:0.95: 0.35262
  - mAP(segm)/IoU=0.75: 0.37365
  - mAP(segm)/large: 0.48337
  - mAP(segm)/medium: 0.38459
  - mAP(segm)/small: 0.18244