Ever have a script or data pipeline you need to run on a remote machine? Ever wish you could just run your existing code on that machine, data and all? Don't want to do the "right thing" and use Docker? Well hey, you've come to the right place.
This will find the cheapest availability zone on AWS for an instance type you give it, launch a specified AMI in that availability zone, copy your current code and data directory to the machine, and automatically put you in an ssh session and browser session.
I most often use it with this AMI:
It ranks them by price and price variance, so your instance is less likely to get shut down by changes in demand.
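Roughly, the scoring idea looks like this (a boto3 sketch, not Labbox's actual code -- the exact mean-plus-variance weighting here is just illustrative):

```python
# Sketch: rank availability zones by mean spot price plus price variance,
# so the "best" zone is both cheap and stable. Weighting is illustrative.
import statistics
from datetime import datetime, timedelta, timezone

import boto3


def score_availability_zones(region, instance_type, hours=24):
    """Return {availability_zone: score}; lower scores are better."""
    ec2 = boto3.client("ec2", region_name=region)
    history = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=hours),
    )["SpotPriceHistory"]

    prices_by_az = {}
    for record in history:
        prices_by_az.setdefault(record["AvailabilityZone"], []).append(
            float(record["SpotPrice"])
        )

    return {
        az: statistics.mean(prices) + statistics.pvariance(prices)
        for az, prices in prices_by_az.items()
    }


# e.g. pick the best zone in us-east-2 for an r4.8xlarge
scores = score_availability_zones("us-east-2", "r4.8xlarge")
best_az = min(scores, key=scores.get)
```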
There are a few things you need to do to get started, but thankfully I've made a bunch of scripts to make it easy.
To install, run:
python setup.py
- First, make sure your default AWS credentials work ok by running:
aws configure
- Pick an AMI. Use a public one, make your own, whatever. Then, to make sure your AMI is available in all regions, run the following:
aws-ami-copy -aws-access-key [...] -aws-secret-key [...] -ami [YOUR AMI ID HERE] -image_type [ubuntu, etc] -region [region ID]
- To make sure you have key access in all regions, run this to spread your key around:
key-pairs-all-regions.sh /path/to/private/key.pem
This will generate your public key, copy it to a temporary file, and import the key pair to all available regions.
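For reference, the effect of that script is roughly the following (a boto3 sketch rather than the script itself; the key name "dan" is just an example):

```python
# Sketch of what key-pairs-all-regions.sh does: derive the public key from
# your private key and import it into every available region.
import subprocess

import boto3
from botocore.exceptions import ClientError

private_key_path = "/path/to/private/key.pem"
public_key = subprocess.check_output(["ssh-keygen", "-y", "-f", private_key_path])

for region in boto3.client("ec2").describe_regions()["Regions"]:
    ec2 = boto3.client("ec2", region_name=region["RegionName"])
    try:
        ec2.import_key_pair(KeyName="dan", PublicKeyMaterial=public_key)
    except ClientError:
        pass  # key pair already exists in this region (or region not enabled)
```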
All Labbox configuration files are Python files stored in `~/.lab_config`. The config for a lab named `my-lab` would be stored at `~/.lab_config/my-lab.py`.
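Because configs are plain Python modules, loading one is simple; Labbox presumably does something along these lines (an illustrative sketch, not necessarily its actual loader):

```python
# Load a config such as ~/.lab_config/my-lab.py as a Python module.
import importlib.util
from pathlib import Path


def load_lab_config(name):
    path = Path.home() / ".lab_config" / f"{name}.py"
    spec = importlib.util.spec_from_file_location(name, path)
    config = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(config)
    return config


config = load_lab_config("my-lab")
print(config.INSTANCE_TYPES)
```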
You can list all available configs by running:
labbox config ls
To create a new configuration named `my-lab`, run:
labbox config new my-lab --from_existing default
The `--from_existing` flag specifies an existing config to copy. Running this command without the flag automatically copies the default config.
To edit your new config in your editor, run:
labbox config edit my-lab
Go through the example configuration below, and make sure your new configuration has sensible values for all the variables.
Here is an example configuration file, annotated. All of these variables must be defined for each Labbox config!
# ================= Project Specific Config ===========
GROUP_NAME = "default" # string, name for the lab
# list of directories to exclude when making a code archive
ARCHIVE_EXCLUDES = ["R-drake"]
# URL of the S3 bucket to use when creating a code archive
S3_BUCKET = "s3://mixing"
# SSH username for the AWS instance. Almost always `ubuntu`
SSH_USER_NAME = "ubuntu"
# Local path to your AWS key
PATH_TO_KEY = "~/attic/aws_2/dan.pem"
# Relative path from your project root (where you will run the labbox commands)
# to your "data directory"
DATA_DIR = "../data"
# Relative path for directory where data processing output goes
OUTPUT_DIR = "output"
# =============== Default configs ==================
# Regions to examine
AWS_REGIONS = ['us-east-2']
# Labbox caches availability zone scans. This is the expiration time of that
# cache. Set it lower to keep your availability zone info up to date
AZ_PICKLE_EXPIRE_TIME_DAYS = 30
# Expiration time for spot pricing information scraped from AWS. Set lower to
# always have more up to date spot pricing info.
SPOT_PRICING_PICKLE_EXPIRE_SEC = 30 * 60
# =============== Personal config ==================
KEY_NAME = "dan"
# Your AMI must be available in every region and must have a name.
# Labbox uses this name to find and launch your AMI.
AMI_NAME = 'lab-in-a-box'
# List of instance types to use for calculating spot pricing.
# Labbox will use the first element of this list when choosing what instance
# type to launch.
INSTANCE_TYPES = ['r4.8xlarge']
# maximum bid for your spot request
BID = 0.8
# how many instances to launch. Right now usage is only defined for 1.
QTY_INSTANCES = 1
# how long to wait when requesting a server before Labbox decides it's timed
# out.
SERVER_TIMEOUT = 60 * 5
# Wait for HTTP / SSH to come up on the new instance before continuing
WAIT_FOR_HTTP = True
WAIT_FOR_SSH = True
# Automatically open a browser session and an ssh session after launch
OPEN_IN_BROWSER = True
OPEN_SSH = True
# Add the instance to the Ansible hosts file and run Ansible against it after launch
ADD_TO_ANSIBLE_HOSTS = True
RUN_ANSIBLE = True
# Copy your current directory's code to the instance after launch
COPY_CODE = True
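The two expiration settings above control how long Labbox trusts its cached scans. A time-based pickle cache in that spirit might look like this (an illustrative sketch, not Labbox's actual caching code):

```python
# Return a cached pickle if it is younger than max_age_sec, else recompute it.
import os
import pickle
import time


def load_cached(path, max_age_sec, recompute):
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < max_age_sec:
        with open(path, "rb") as f:
            return pickle.load(f)
    value = recompute()
    with open(path, "wb") as f:
        pickle.dump(value, f)
    return value
```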
After creating a configuration, you can request a spot instance with your specified AMI using:
labbox launch --from-config my-lab
This will:
- Find the availability zone & region with the best "score" (based on mean price and variance) for the instance type specified in your config.
- Upload your current directory's code to the S3 bucket specified by `S3_BUCKET`
- Download that code archive to the remote machine
- Move it to the `rstudio` user's home directory
- Set the security group to accept ports `80` and `22`
- Launch an SSH session into the instance
- Launch a web browser into the instance
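Under the hood, the spot request itself amounts to a boto3 call like the one below (a sketch: Labbox resolves the AMI ID, availability zone, and security group from your config, and the values here just mirror the example config above):

```python
# Request a single spot instance in the best-scoring availability zone.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")
response = ec2.request_spot_instances(
    SpotPrice="0.8",                   # BID
    InstanceCount=1,                   # QTY_INSTANCES
    LaunchSpecification={
        "ImageId": "ami-xxxxxxxx",     # looked up from AMI_NAME
        "InstanceType": "r4.8xlarge",  # INSTANCE_TYPES[0]
        "KeyName": "dan",              # KEY_NAME
        "Placement": {"AvailabilityZone": "us-east-2a"},  # best-scoring AZ
    },
)
request_id = response["SpotInstanceRequests"][0]["SpotInstanceRequestId"]
```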
To make sure your instance has your code uploaded, you can run:
labbox run my-lab 0 rsync
The `0` is a placeholder right now. In the future I hope Labbox will support multiple instances, but for now it just handles one -- so you must specify `0` as the "instance ID".
Syncing data is a command that's run separately (since it often takes a while). To do this, run:
labbox data my-lab 0 sync data
If you change data on the remote machine and want to sync back to your local box, you can run the sync command with the `--pull` flag:
labbox data my-lab 0 sync data --pull
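Both directions boil down to an rsync over ssh with your configured key and user; roughly (an illustrative sketch -- the host and remote path here are made up):

```python
# Push or pull the data directory with rsync over ssh.
import os
import subprocess

key = os.path.expanduser("~/attic/aws_2/dan.pem")     # PATH_TO_KEY
user = "ubuntu"                                       # SSH_USER_NAME
host = "ec2-x-x-x-x.us-east-2.compute.amazonaws.com"  # instance public DNS
local_data = "../data/"                               # DATA_DIR
remote_data = f"{user}@{host}:~/data/"                # assumed remote location

push = ["rsync", "-avz", "-e", f"ssh -i {key}", local_data, remote_data]
pull = ["rsync", "-avz", "-e", f"ssh -i {key}", remote_data, local_data]

subprocess.run(push, check=True)    # labbox data my-lab 0 sync data
# subprocess.run(pull, check=True)  # labbox data my-lab 0 sync data --pull
```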
I often have a separate directory for "output" of my analysis, different from my raw data directory. This directory's location is specified by the `OUTPUT_DIR` config variable. You sync it in the same way you sync the data directory:
labbox data my-lab 0 sync output
# and to pull data
labbox data my-lab 0 sync output --pull
To ssh into your instance, you can run:
labbox run my-lab 0 ssh
To open your instance in a browser, you can run:
labbox run my-lab 0 browser
So far, working with labbox looks like this:
- I realize there's a command or analysis I need to run that takes a lot of RAM
- I run:
cd path/to/project/root
labbox launch --from-config my-lab
labbox run my-lab 0 rsync
labbox data my-lab 0 sync data
labbox data my-lab 0 sync output
labbox run my-lab 0 ssh
I do whatever processing I need to do. If I'm running RStudio or Jupyter on the instance, I run `labbox run my-lab 0 browser`.
After the processing is done, I sync data changes. My projects are structured in such a way that I usually never alter the raw data (the "data" directory), so I usually just have to pull output:
labbox data my-lab 0 sync output --pull
If I've made any code changes on the server, I also run `labbox run my-lab 0 rsync`.
For convenience, Ansible is integrated into this tool. This allows one to automatically run tasks on the servers after they are launched, which saves you from needing to rebuild AMIs every time a change is required. See `user_config.py` and `main.py` for more details. Be warned that hosts are not automatically removed from the Ansible hosts file.
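For reference, adding a host usually just means appending a line to the inventory file, something like this (the inventory path and group name here are assumptions):

```python
# Append a newly launched instance to the Ansible inventory.
host = "ec2-x-x-x-x.us-east-2.compute.amazonaws.com"
with open("/etc/ansible/hosts", "a") as inventory:
    inventory.write(f"\n[labbox]\n{host}\n")
```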
This library is something I threw together for my personal use. The code is not well tested and is in no way production worthy. Feel free to contribute.
MIT