Distributed training using the Amazon SageMaker Distributed Data Parallel library and debugging using Amazon SageMaker Debugger
This repository contains an example for performing distributed training on Amazon SageMaker using SageMaker's Distributed Data Parallel library and debugging using Amazon SageMaker Debugger. The training scripts cover both zero-script-change and with-script-change scenarios for the Debugger.
Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. With SageMaker, you can use the built-in algorithms or bring your own algorithms and frameworks, such as TensorFlow 2.x. When performing distributed training with this framework, you can use SageMaker's Distributed Data Parallel or Distributed Model Parallel libraries. Amazon SageMaker Debugger debugs, monitors, and profiles training jobs in real time, helping you detect non-converging conditions, optimize resource utilization by eliminating bottlenecks, improve training time, and reduce the cost of your machine learning models.
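As a sketch of how such a training job can be launched, the snippet below uses the SageMaker Python SDK to enable the Distributed Data Parallel library and attach a built-in Debugger rule. The entry point, instance settings, framework versions, and rule choice are illustrative assumptions, not the exact configuration used in the notebook.

```python
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, rule_configs

estimator = TensorFlow(
    entry_point="train.py",              # hypothetical script name
    role=sagemaker.get_execution_role(),
    instance_count=2,
    instance_type="ml.p3.16xlarge",      # SMDDP supports a limited set of GPU instance types
    framework_version="2.4.1",
    py_version="py37",
    # Enable the SageMaker Distributed Data Parallel library
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    # Attach a built-in Debugger rule; with a zero-script-change container,
    # tensors are emitted without modifying the training script
    rules=[Rule.sagemaker(rule_configs.loss_not_decreasing())],
)

# Assumes the training script downloads Fashion MNIST itself,
# so no input channels are passed here
estimator.fit()
```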
This example contains a Jupyter Notebook that demonstrates how to use a SageMaker optimized TensorFlow 2.x container to perform distributed training on the Fashion MNIST dataset using the SageMaker Distributed Data Parallel library, and how to debug with SageMaker Debugger. It also implements a custom training loop, i.e., it customizes what goes on inside the fit() loop. Finally, the Debugger output is analyzed. The notebook takes your training script and runs it with SageMaker in script mode.
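Below is a minimal sketch of what such a custom training loop can look like with the Distributed Data Parallel library's TensorFlow 2.x API; the model architecture and hyperparameters are illustrative assumptions rather than the notebook's exact code.

```python
import tensorflow as tf
import smdistributed.dataparallel.tensorflow as sdp

sdp.init()  # initialize the SMDDP process group

# Pin each worker process to a single GPU based on its local rank
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], "GPU")

# Hypothetical model sized for Fashion MNIST (28x28 grayscale, 10 classes)
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# Scale the learning rate by the number of workers
optimizer = tf.keras.optimizers.Adam(0.001 * sdp.size())

@tf.function
def training_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    # Wrap the tape so gradients are all-reduced across workers
    tape = sdp.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:
        # After the first step, sync weights and optimizer state from rank 0
        sdp.broadcast_variables(model.variables, root_rank=0)
        sdp.broadcast_variables(optimizer.variables(), root_rank=0)
    return loss
```

Scaling the learning rate by sdp.size() and broadcasting variables after the first step follow the pattern commonly used for data-parallel training loops.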
This repository contains:
- A Jupyter Notebook to get started
- A training script in Python for the zero-script-change scenario that is passed to the training job
- A training script in Python for the with-script-change scenario that is passed to the training job (a sketch of this pattern follows the list)
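For the with-script-change scenario, the training script creates an smdebug hook itself and wraps the gradient tape so Debugger can capture tensors from a custom training loop. The sketch below makes illustrative assumptions about the model and the saved tensor name:

```python
import tensorflow as tf
import smdebug.tensorflow as smd

# Build the hook from the JSON configuration that SageMaker writes
# into the training container
hook = smd.KerasHook.create_from_json_file()

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

def training_step(images, labels):
    # Wrapping the tape lets the hook capture gradients
    with hook.wrap_tape(tf.GradientTape()) as tape:
        loss = loss_fn(labels, model(images, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    # Record the scalar loss so it shows up in the Debugger output
    hook.record_tensor_value(tensor_name="loss", tensor_value=loss)
    return loss
```

After the job completes, the captured tensors can be loaded for analysis, for example with smdebug's create_trial pointed at the job's Debugger artifacts path.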
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.