Local Installation of Repository

See the instructions here. Then clone and enter the repo.

git clone https://github.com/wesselb/aws
cd aws

Finally, make a virtual environment and install the requirements.

virtualenv -p python3 venv
source venv/bin/activate
pip install -r requirements.txt -e .

Sample Experiment

In the following, values that you need to set are bash variables (like $REPO) or Python constants (like KEY).
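
As an illustration of this convention, the Python constants could be filled in from environment variables of the same name. This is only a sketch under assumptions: the helper `read_setting` and the placeholder defaults `my-key` and `my-project` are hypothetical and not part of the repository. You can equally well hard-code the constants at the top of your script.

```python
import os


def read_setting(name, default=None):
    """Read a setting from the environment, falling back to `default`."""
    value = os.environ.get(name, default)
    if value is None:
        raise KeyError(f"Please set the environment variable {name}.")
    return value


# Hypothetical placeholder defaults; replace them with your own values.
KEY = read_setting("KEY", "my-key")
REPO = read_setting("REPO", "my-project")
```

With this in place, `export REPO=...` in the shell and the constant `REPO` in Python refer to the same value.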

Set Up AWS

  • Install and configure the Amazon CLI.

  • Create a new EC2 instance with the Deep Learning Base AMI (Amazon Linux 2). Make sure that you allocate enough disk space.

  • Create and name an appropriate security group.

  • Launch the instance.

Create an Image

  • Log into the instance.

  • Create a key for GitHub.

ssh-keygen -f ~/.ssh/github -t ed25519 -C "email@gmail.com" \
    && (echo "Host github.com"                 > ~/.ssh/config) \
    && (echo "    IdentityFile ~/.ssh/github" >> ~/.ssh/config) \
    && chmod 644 ~/.ssh/config \
    && echo "Public key:" \
    && cat ~/.ssh/github.pub
  • Add the public key to your GitHub account.

  • Configure the instance:

sudo amazon-linux-extras install python3.8 \
    && sudo yum install -y tmux htop python38-devel \
    && sudo pip3.8 install --upgrade pip setuptools Cython numpy virtualenv
  • Set up the AWS repository. Note: if the path to the repository is ~/aws, then ~/aws/venv must be a virtual environment that has the repository installed in editable mode.
cd ~
git clone git@github.com:wesselb/aws.git \
    && cd aws \
    && virtualenv venv -p python3.8 \
    && source venv/bin/activate \
    && pip install -r requirements.txt -e . \
    && deactivate \
    && cd ..
  • Set up the project repository:
cd ~
git clone git@github.com:$USER/$REPO.git \
    && cd $REPO \
    && virtualenv venv -p python3.8 \
    && source venv/bin/activate \
    && pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html \
    && pip install -r requirements.txt -e . \
    && deactivate \
    && cd ..
  • If necessary, transfer data to the instance:
rsync -e "ssh -i ~/.ssh/$KEY.pem" -Pav $DATA_DIR ec2-user@$IP:/home/ec2-user/$REPO
  • Stop the instance and create an image.

  • Once the image is ready, terminate the instance.
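
If you prefer to script the data transfer, the rsync call from the list above can be assembled in Python. This is a minimal sketch: the function `build_rsync_command` and the example values (`my-key`, `data`, `203.0.113.10`, `my-project`) are hypothetical, and the command is only printed, not executed.

```python
import os
import shlex


def build_rsync_command(key, data_dir, ip, repo, user="ec2-user"):
    """Build the rsync argument list for copying `data_dir` to the instance."""
    # Expand ~ ourselves, because no shell does it when passing a list.
    key_path = os.path.expanduser(f"~/.ssh/{key}.pem")
    return [
        "rsync",
        "-e", f"ssh -i {shlex.quote(key_path)}",
        "-Pav",
        data_dir,
        f"{user}@{ip}:/home/{user}/{repo}",
    ]


# Hypothetical example values; substitute your own key, data, IP, and repo.
command = build_rsync_command("my-key", "data", "203.0.113.10", "my-project")
print(shlex.join(command))
```

To actually run the transfer, pass the list to `subprocess.run(command, check=True)`.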

Test the Cluster

  • Create a file cluster.py:
import aws

aws.config["ssh_user"] = "ec2-user"
aws.config["ssh_key"] = f"~/.ssh/{KEY}.pem"
aws.config["setup_commands"] = [
    f"cd /home/ec2-user/{REPO}",
    "ssh-keygen -F github.com || ssh-keyscan github.com >> ~/.ssh/known_hosts",
    "git pull",
]

commands = [
    ["mkdir -p results", "touch results/one.txt"],
    ["mkdir -p results", "touch results/two.txt"],
    ["mkdir -p results", "touch results/three.txt"],
]

  • The script provides the following command-line interface:
usage: cluster.py [-h] [--spawn SPAWN] [--start] [--terminate] [--kill]
                  [--stop] [--sync-stopped] [--sync-sleep SYNC_SLEEP]

optional arguments:
  -h, --help            show this help message and exit
  --spawn SPAWN         Spawn instances.
  --start               Start experiments.
  --terminate           Terminate all instances. This is a kill switch.
  --kill                Kill all running experiments, but keep the instances
  --stop                Stop all running instances
  --sync-stopped        Synchronise all stopped instances.
  --sync-sleep SYNC_SLEEP
                        Number of seconds to sleep before syncing again.
  • Make an empty directory to synchronise to:
mkdir sync
  • Test that no instances are running:
$ python cluster.py
Instances still running: 0
Sleeping for two minutes...
  • Kill the script. Now spawn two instances and start the experiment:
$ python cluster.py --spawn 2 --start
  • Wait for the instances to have booted and the experiments to have started. The local folder sync/results should eventually contain the files one.txt, two.txt, and three.txt.

  • Wait a bit longer to ensure that the instances eventually shut themselves down.

  • Kill the script. Now that all instances are stopped, remove everything in sync and attempt to sync the stopped instances:

$ python cluster.py --sync-stopped
  • If the contents of sync are restored, then we're golden! Terminate all instances of the cluster.
$ python cluster.py --terminate
Terminating all instances:
Instances still running: 0
Sleeping for two minutes...

Kill the script. You're now good to run your big experiment!
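
The manual checks in this section can also be scripted. The following is a minimal sketch and not part of the aws package: it builds the same list of commands as the test above from a list of names, and it reports which of the expected result files are still missing from the local sync folder. The names `NAMES` and `missing_results` are hypothetical.

```python
from pathlib import Path

# Names of the result files, matching the example commands in this section.
NAMES = ["one", "two", "three"]

# One inner list of shell commands per experiment.
commands = [["mkdir -p results", f"touch results/{name}.txt"] for name in NAMES]


def missing_results(sync_dir, names=NAMES):
    """Return the names whose result file is absent from `sync_dir`/results."""
    results = Path(sync_dir) / "results"
    return [name for name in names if not (results / f"{name}.txt").is_file()]


if __name__ == "__main__":
    missing = missing_results("sync")
    print("All results synced." if not missing else f"Missing: {missing}")
```

Generating the commands from a list keeps the experiments and the check in agreement when you later swap in your real experiments.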


Synchronise to a Remote Host

        aws.Remote(user="user", host="host", key=f"~/.ssh/{KEY}"),

