Distributed Machine Learning on AWS Cloud: Computing with CPUs and GPUs

Introduction

Instead of buying, owning, and maintaining physical data centers and servers, you can access technology services (such as computing power, storage, and databases) from a cloud provider like Amazon Web Services (AWS), Microsoft Azure, Google Cloud, ...

This git help you achieve single machine computation and distributed (multiple) machine computation on AWS, with both CPU and GPU execution. Here is a table of contents including all applications we have.

Notes of Caution

  • Because cloud computing is charges by usage, please make sure to terminate or stop your virtual instances after you are done with them.
  • Because a cloud computing resource is often shared among many users, please include your name in your key file name, such as jianwu-key, so we know the creator of each virtual instance.
  • See different versions of this repository in Tags.

Table of Contents

The following is an overall instruction for all our implementations. For detailed instructions of CPU executions and GPU exectutions, please go to folders cpu-example and gpu-example.

Web based

  1. Launch instances on EC2 console:

  1. Choose an Amazon Machine Image (AMI)
    An AMI is a template that contains the software configuration (operating system, application server, and applications) required to launch your instance. For CPU applications, we use Ubuntu Server 20.04 LTS (HVM), SSD Volume Type; for GPU case, we use Deep Learning Base AMI (Ubuntu 16.04) Version 40.0.

  1. Choose an Instance Type
    Based on your purpose, AWS provides various instance types on https://aws.amazon.com/ec2/instance-types/. For CPU application, we recommand to use c5.2xlarge instance; For GPU application, we recommand to use p3.2xlarge instance.

  1. Configure Number of instances
    We use 1 instance for single machine computation, and 2 instances for distributed computation.

  1. Configure Security Group

  1. Review, Create your SSH key pair, and Launch

  1. View your Instance and wait for Initialing

  1. SSH into your instance

  1. Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo service docker start
sudo usermod -a -G docker ubuntu
sudo chmod 666 /var/run/docker.sock
  1. Download Docker imagesor build images by Dockerfile.
  • CPU example:
docker pull starlyxxx/dask-decision-tree-example
  • GPU example:
docker pull starlyxxx/horovod-pytorch-cuda10.1-cudnn7
  • or, build from Dockerfile:
docker build -t <your-image-name> .
  1. Download ML applications and data on AWS S3.
  • For privacy, we store the application code and data on AWS S3. Install aws cli and set aws credentials.
curl 'https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip' -o 'awscliv2.zip'
unzip awscliv2.zip
sudo ./aws/install
aws configure set aws_access_key_id your-access-key
aws configure set aws_secret_access_key your-secret-key
  • Download ML applications and data on AWS S3.

    • CPU example:

      Download:

    aws s3 cp s3://kddworkshop/ML_based_Cloud_Retrieval_Use_Case.zip ./

    or

    (wget https://kddworkshop.s3.us-west-.amazonaws.com/ML_based_Cloud_Retrieval_Use_Case.zip)

    Extract the files:

    unzip ML_based_Cloud_Retrieval_Use_Case.zip
    • GPU example:

      Download:

    aws s3 cp s3://kddworkshop/MultiGpus-Domain-Adaptation-main.zip ./
    aws s3 cp s3://kddworkshop/office31.tar.gz ./

    or

    wget https://kddworkshop.s3.us-west-2.amazonaws.com/MultiGpus-Domain-Adaptation-main.zip
    wget https://kddworkshop.s3.us-west-2.amazonaws.com/office31.tar.gz

    Extract the files:

    unzip MultiGpus-Domain-Adaptation-main.zip
    tar -xzvf office31.tar.gz
  1. Run docker containers for CPU applications.
  • Single CPU:
docker run -it -v /home/ubuntu/ML_based_Cloud_Retrieval_Use_Case:/root/ML_based_Cloud_Retrieval_Use_Case starlyxxx/dask-decision-tree-example:latest /bin/bash
  • Multi-CPUs:
docker run -it --network host -v /home/ubuntu/ML_based_Cloud_Retrieval_Use_Case:/root/ML_based_Cloud_Retrieval_Use_Case starlyxxx/dask-decision-tree-example:latest /bin/bash
  1. Run docker containers for GPU applications
  • Single GPU:
nvidia-docker run -it -v /home/ubuntu/MultiGpus-Domain-Adaptation-main:/root/MultiGpus-Domain-Adaptation-main -v /home/ubuntu/office31:/root/office31 starlyxxx/horovod-pytorch-cuda10.1-cudnn7:latest /bin/bash
  • Multi-GPUs:

    • Add primary worker’s public key to all secondary workers’ <~/.ssh/authorized_keys>
    sudo mkdir -p /mnt/share/ssh && sudo cp ~/.ssh/* /mnt/share/ssh
    • Primary worker VM:
    nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh -v /home/ubuntu/MultiGpus-Domain-Adaptation-main:/root/MultiGpus-Domain-Adaptation-main -v /home/ubuntu/office31:/root/office31 starlyxxx/horovod-pytorch-cuda10.1-cudnn7:latest /bin/bash
    • Secondary workers VM:
    nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh -v /home/ubuntu/MultiGpus-Domain-Adaptation-main:/root/MultiGpus-Domain-Adaptation-main -v /home/ubuntu/office31:/root/office31 starlyxxx/horovod-pytorch-cuda10.1-cudnn7:latest bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
  1. Run ML CPU application:

    • Single CPU:
    cd ML_based_Cloud_Retrieval_Use_Case/Code && /usr/bin/python3.6 ml_based_cloud_retrieval_with_data_preprocessing.py
    • Multi-CPUs:

      • Run dask cluster on both VMs in background:

        • VM 1:
        dask-scheduler & dask-worker <your-dask-scheduler-address> &
        • VM 2:
        dask-worker <your-dask-scheduler-address> &
    • One of VMs:

    cd ML_based_Cloud_Retrieval_Use_Case/Code && /usr/bin/python3.6 dask_ml_based_cloud_retrieval_with_data_preprocessing.py <your-dask-scheduler-address>
  2. Run ML GPU application

    • Single GPU:
    cd MultiGpus-Domain-Adaptation-main
    horovodrun --verbose -np 1 -H localhost:1 /usr/bin/python3.6 main.py --config DeepCoral/DeepCoral.yaml --data_dir ../office31 --src_domain webcam --tgt_domain amazon
    • Multi-GPUs:

      • Primary worker VM:
      cd MultiGpus-Domain-Adaptation-main
      horovodrun --verbose -np 2 -H <machine1-address>:1,<machine2-address>:1 -p 12345 /usr/bin/python3.6 main.py --config DeepCoral/DeepCoral.yaml --data_dir ../office31 --src_domain webcam --tgt_domain amazon
  3. Terminate all VMs on EC2 when finishing experiments.

Command line automation via Boto

Follow steps below for automating single machine computation. For distributed machine computation, see README on each example's sub-folder.

Prerequisites:

pip3 install boto fabric2 scanf IPython invoke
pip3 install Werkzeug --upgrade

Run single machine computation:

  1. Configuration

Use your customized configurations. Replace default values in <./config/config.ini>

  1. Start IPython
python3 run_interface.py
  1. Launch VMs on EC2 and wait for initializing
LaunchInstances()
  1. Install required packages on VMs
InstallDeps()
  1. Automatically run Single VM ML Computing
RunSingleVMComputing()
  1. Terminate all VMs on EC2 when finishing experiments.
TerminateAll()

For a closer look, please refer to our slides or presentation.