/amazon-sagemaker-dist-data-parallel-with-debugger

How to perform distributed training on Amazon SageMaker using SageMaker's Distributed Data Parallel library and debug using Amazon SageMaker Debugger.

Primary LanguageJupyter NotebookMIT No AttributionMIT-0

Distributed training using Amazon SageMaker Distributed Data Parallel library and debugging using Amazon SageMaker Debugger

This repository contains an example for performing distributed training on Amazon SageMaker using SageMaker's Distributed Data Parallel library and debugging using Amazon SageMaker Debugger. The training scripts cover both zero-script-change and with-script-change scenarios for the Debugger.

Overview

Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train and deploy machine learning (ML) models quickly. With SageMaker, you have the option of using the built-in algorithms as well as bringing your own algorithms and frameworks. One such framework is TensorFlow 2.x. When performing distributed training with this framework, you can use SageMaker's Distributed Data Parallel or Distributed Model Parallel libraries. Amazon SageMaker Debugger debugs, monitors and profiles training jobs in real time thereby helping with detecting non-converging conditions, optimizing resource utilization by eliminating bottlenecks, improving training time and reducing costs of your machine learning models.

This example contains a Jupyter Notebook that demonstrates how to use a SageMaker optimized TensorFlow 2.x container to perform distributed training on the Fashion MNIST dataset using the SageMaker Distributed Data Parallel library and debug using SageMaker Debugger. It also implements a custom training loop i.e. customizes what goes on in the fit() loop. Finally the debugger's output is analyzed. This notebook will take your training script and use SageMaker in script mode.

Repository structure

This repository contains

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.