Bash scripts for deploying the agent trainer to a remote machine. More details about the training process can be found in this blogpost.
Two flavors of deployments are available:
- AWS EC2 instances, be it GPU enabled or not
- Generic Linux remote machine, be it GPU enabled or not (only tested on CentOS7 as of now)
Before proceeding, make sure the root folder's scripts are executable after cloning this repository to your local machine. Example of how to make a script executable:

```bash
$ chmod u+x aws_ec2_train_new.sh
```
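If you prefer to cover every script in one go (assuming all deployment scripts sit in the repository root):

```bash
# Make every top-level shell script executable
$ chmod u+x ./*.sh
```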
## Setup

### AWS EC2
The scripts are built to support external EBS volumes by default, in order to persist the training results after the instance is terminated and to allow finer control over disk performance. GPU-enabled g2.2xlarge spot instances are used by default for training.
Pre-requisites for running the AWS EC2 deployment scripts:
- AWS CLI: install guide
- jq JSON processor: more info here
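A quick sanity check that both tools are installed and on your `PATH`:

```bash
$ aws --version
$ jq --version
```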
Global AWS EC2 setup:
1. Select the region where you would like to deploy. For example, set `region = ap-northeast-1` on your `~/.aws/config` if you select the Tokyo region<sup>1</sup>
2. On `aws_ec2/launch_instance.sh`, set `SUBNET_ID` to the subnet ID you want to use. To find out which subnets are available on a given region, you can run `$ aws ec2 describe-subnets` (see the sketch after this list). For example, `ap-northeast-1a` corresponds to the `subnet-f4269e9c` value
3. On `aws_ec2/launch_instance.sh`, set `IMAGE_ID` to the base AMI ID used on all instances. The scripts were tested using an HVM SSD EBS-Backed 64-bit image (ami-374db956 on the Tokyo region). To see which base AMIs are available for a given region, consult this link
4. On `aws_ec2/launch_instance.sh`, set `SECURITY_GROUP` to the security group ID you want to use. Make sure the security group accepts SSH inbound connections
5. On `aws_ec2/launch_instance.sh`, set `KEY_PAIR_NAME` to the key pair name used to access the instance
6. On `common/constants.sh`, set `SSH_KEY_PATH` to the path where the SSH authentication key (the same used to create the AWS key pair referenced in step 5) is stored on your local machine. For example: `SSH_KEY_PATH=/Users/my-username/.ssh/my-ssh-key`
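As an illustration of steps 1 and 2, both can also be done from the terminal; the jq filter below is a sketch that trims the `DescribeSubnets` response down to the fields you need:

```bash
# Set the default region (writes to ~/.aws/config)
$ aws configure set region ap-northeast-1

# Print availability zone / subnet ID pairs for the configured region
$ aws ec2 describe-subnets | jq -r '.Subnets[] | "\(.AvailabilityZone)\t\(.SubnetId)"'
```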
## Usage: Train a new agent

### AWS EC2
1. On `aws_ec2_train_new.sh`, set `YOUR_OUTRUN_ROMS_PATH` to the local folder where you have your Out Run game roms
2. Create a new EBS volume on the AWS console, on the same subnet as the one chosen above. For example, create a 200 GB General Purpose SSD (GP2, 600 IOPS)<sup>2</sup>
3. On `common/constants.sh`, set `EXTERNAL_VOLUME_AWS_VOLUME_ID` to the newly created volume ID
4. Format the newly created volume by running:

   ```bash
   $ ./aws_ec2_format_external_volume.sh
   ```

5. Check the spot instance bid prices (see the sketch after this section) and, if needed, change the maximum bid parameter that `aws_ec2_train_new.sh` passes to `aws_ec2/launch_instance.sh`<sup>3</sup>
6. Run:

   ```bash
   $ ./aws_ec2_train_new.sh
   ```

Note: by default the g2.2xlarge instance is used, and the `USE_GPU` parameter is set to `true` on `common/constants.sh`, in order to take full advantage of the instance's GPU.
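For the bid check in step 5, one way to inspect current g2.2xlarge spot prices from the terminal (a sketch; `date -u` supplies the current time so only the latest price per availability zone is returned):

```bash
# Current spot prices for g2.2xlarge Linux instances in the configured region
$ aws ec2 describe-spot-price-history \
    --instance-types g2.2xlarge \
    --product-descriptions "Linux/UNIX" \
    --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  | jq -r '.SpotPriceHistory[] | "\(.AvailabilityZone)\t\(.SpotPrice)"'
```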
### Generic Linux (CentOS7) Machine

If you already have a remote machine available, make sure it is accessible through SSH and follow these configuration steps:
1. On `common/constants.sh`, set `HOST_USERNAME` to the remote username which will execute the remote actions
2. Make sure the remote user's SSH login authentication can be made via SSH key, not through password (a quick check is shown after this list)
3. On `common/constants.sh`, set `SSH_KEY_PATH` to the path where the SSH authentication key is stored on your local machine. For example: `SSH_KEY_PATH=/Users/my-username/.ssh/my-ssh-key`
4. On `common/constants.sh`, set `USE_GPU` to `true` or `false`, depending on whether you want to enable or disable GPU support on your build (for GPU-enabled training sessions, the remote machine will need a CUDA-enabled NVidia card with NVidia Compute Capability >= 3.0)
5. Find the remote machine's IP address and set the `ip_address` variable on `generic_train_new.sh`
6. On `generic_train_new.sh`, set `YOUR_OUTRUN_ROMS_PATH` to the local folder where you have your Out Run game roms
7. Run:

   ```bash
   $ ./generic_train_new.sh
   ```
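To confirm that key-based login works before kicking off a training session (the host address and key path below are placeholders):

```bash
# Should print "ok" without ever prompting for a password
$ ssh -i /Users/my-username/.ssh/my-ssh-key -o PasswordAuthentication=no my-remote-user@192.0.2.10 'echo ok'
```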
### Customize the deployed code

The code deployed by the guide above is the one used originally for the agent trainer, docker image and cannonball Out Run emulator. If you want to change it, you can simply fork the repositories, change them to your liking, and then edit the script `common/fetch_source_repositories.sh` to point to your custom repositories.
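As a sketch of that change, the clone line in `common/fetch_source_repositories.sh` would point at your fork instead (the variable name matches the clone line shown in the next subsection; the fork URL is a placeholder):

```bash
# Fetch the agent trainer from your own fork instead of the original repository
git clone https://github.com/your-username/agent-trainer.git ${HOST_PATH_AGENT_TRAINER}
```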
#### Customize the deployed code: using private repositories

If you want to use your private GitHub repositories:

1. On `support/github_ssh_key_constants.sh`, set `GITHUB_SSH_KEY_NAME` to the key's file name and `LOCAL_FOLDER_SSH_KEY` to the local folder where it's placed. For example, if the SSH key is placed on `/Users/your-user-name/.ssh/your-github-ssh-key-name`, then `LOCAL_FOLDER_SSH_KEY="/Users/your-user-name/.ssh"` and `GITHUB_SSH_KEY_NAME="your-github-ssh-key-name"`
2. Make sure the repositories changed in `common/fetch_source_repositories.sh` are cloned via SSH. That is, they should have this structure:

   ```bash
   git clone git@github.com:username/custom-agent-trainer.git ${HOST_PATH_AGENT_TRAINER}
   ```

3. Replace the repositories fetch line in `<generic/aws_ec2>_train_new.sh`:

   ```bash
   # NEW
   (...)
   . ./support/github_ssh_key_copy.sh
   cat common/constants.sh support/github_ssh_key_constants.sh support/github_ssh_key_add.sh common/fetch_source_repositories.sh | ssh -i ${SSH_KEY_PATH} ${HOST_USERNAME}@${ip_address}
   (...)

   # OLD
   (...)
   cat common/constants.sh common/fetch_source_repositories.sh | ssh -i ${SSH_KEY_PATH} ${HOST_USERNAME}@${ip_address}
   (...)
   ```
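If you don't have a dedicated GitHub key yet, a minimal sketch for creating one and verifying that GitHub accepts it (key file name matching the example above):

```bash
# Generate a dedicated SSH key pair for GitHub access
$ ssh-keygen -t rsa -b 4096 -f ~/.ssh/your-github-ssh-key-name

# After adding the public key to your GitHub account, this should
# answer with a greeting that includes your GitHub username
$ ssh -T -i ~/.ssh/your-github-ssh-key-name git@github.com
```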
## Usage: Resume training, retrieve results and debug
### Resume training

You can resume a training session if for some reason the training is halted<sup>4</sup>. Setup:

- Set `SESSION_ID` on `<generic/aws_ec2>_train_resume.sh` to the session ID you want to resume
#### AWS EC2

A spot instance will be created by default. Check the `aws_ec2/launch_instance.sh` maximum bid parameter used on `aws_ec2_train_resume.sh`.

```bash
$ ./aws_ec2_train_resume.sh
```
#### Generic Linux (CentOS7) Machine

Set `ip_address` on `generic_train_resume.sh` to the remote machine's IP.

```bash
$ ./generic_train_resume.sh
```
### Retrieve results

Retrieve a session's training results to your local machine. Setup:

- Set `SESSION_ID` on `<generic/aws_ec2>_retrieve_results.sh` to the session ID whose results you want to retrieve
- Set `RETRIEVED_TRAINING_RESULTS_PATH` on `<generic/aws_ec2>_retrieve_results.sh` to the local path where the results will be downloaded
#### AWS EC2

You have two alternatives available in the `aws_ec2_retrieve_results.sh` script, via the `CREATE_NEW_INSTANCE` variable:

- If set to `false`, the results will be retrieved directly from the training instance. Set the `ip_address` to the training instance's public IP address (one way to look it up is shown after this list)
- If set to `true`, a new on-demand instance will be created, which will mount the external volume. An EBS volume can only be mounted by one instance at a time, so if the training instance is still running when you perform this kind of retrieval, the script will wait until the external volume is made available
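A sketch for looking up public IPs from the terminal, with jq trimming the `DescribeInstances` response (an instance without a public IP will show `null`):

```bash
# Instance ID / public IP pairs for all running instances in the configured region
$ aws ec2 describe-instances \
    --filters Name=instance-state-name,Values=running \
  | jq -r '.Reservations[].Instances[] | "\(.InstanceId)\t\(.PublicIpAddress)"'
```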
```bash
$ ./aws_ec2_retrieve_results.sh
```
#### Generic Linux (CentOS7) Machine

Set `ip_address` on `generic_retrieve_results.sh` to the remote machine's IP.

```bash
$ ./generic_retrieve_results.sh
```
### Debug for AWS EC2

Launches a shell on a new on-demand instance attached to the external EBS volume:

```bash
$ ./aws_ec2_debug_ssh.sh
```
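Once the debug shell is up, a quick way to confirm that the EBS volume is attached and mounted (device names vary with the instance's virtualization type):

```bash
# List block devices and their mount points
$ lsblk

# Show mounted filesystems and available space
$ df -h
```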
---
<sup>1</sup> The Tokyo region was chosen to perform the trainings described in this blogpost due to its consistently low g2.2xlarge spot instance bid prices

<sup>2</sup> Using the default agent-trainer configuration, each training run can reach up to 25GB worth of replay memories, which need to be accessed randomly during the training process. Since these cannot fit into the g2.2xlarge instance's 16GB of RAM, about 600 IOPS are required to keep the training performance acceptable. GP2 volumes provide more IOPS as you increase their size, hence the allocation of a 200 GB General Purpose SSD (GP2, 600 IOPS), which turns out to be more cost effective than a smaller 30GB, 600 IOPS Provisioned IOPS SSD (IO1).

<sup>3</sup> In order to acquire a spot instance and keep it, you need to place a bid that is not lower than the instance type's current spot price

<sup>4</sup> For example, on AWS EC2 the spot instance can be terminated if someone outbids you.