SpotManager
The SpotManager is a state-less program meant to be run periodically. It finds the cheapest spot instance prices, bids, sets up the machines, and tears them down when done.
Assumptions
The module assumes your workload is long running and has many save-points.
In my case each machine is setup to pull small tasks off a queue and execute them. These machines can be shutdown at any time; with the most recent task simply placed back on the queue for some other machine to run.
Overview
This library works on a concept of utility, which is an abstract value you assign to each EC2 instance type; the required utility is the primary input used to scale the number and type of instances.
For each instance type (and zone), the SpotManager
uses the historical
pricing record to figure out a competitive bid (defined by uptime
, below).
It combines that bid with the utility
score for that instance type to get
an estimated_value
(measured in utility per dollar). The instance types
with the best estimated_value
, are bid on first.
Requirements
- Python 2.7
- boto
- requests
- ecdsa (required by fabric, but not installed by pip)
- fabric2
Installation
For now, you must clone the repo
git clone https://github.com/klahnakoski/SpotManager.git
Branches
There are three main branches
- dev - development done here (unstable)
- manager-etl - multithreaded management, not ready for ES node management (Oct 2018)
- manager - used to manage the staging clusters
- master - proven stable on manager for at least a few days
Configuration
Each SpotManager instance requires a settings.json
file that controls the
SpotManager behaviour. We will use the ActiveData ETL settings file
as an example to explain the parameters
budget
- Acts as an absolute spend limit for this SpotManager. Be sure you know your limits.max_utility_price
- Whatever you decide a unit of utility is, you should set the highest price you are willing to pay for one. This can ensure you do not go over the on-demand price, and prevents the SpotManager from bidding when everything is too expensive.max_new_utility
- The most utility that will be requested per run. Used to prevent spikes in instance count on light loads.max_requests_per_type
- Limit the number of requests per type. Prevents all requests going to the cheapest instance type, consuming all available instances, and gettingaz-constraint
on the remainder. In the event of low availability, SpotManager will move on to the other types.max_percent_per_type
- Limit the total number of instances, as a percent, per availability zone. Some workloads benefit from not loosing all instances at once. Distributing load over many instance types reduces the number of instances lost from any one price fluctuation. Default = 1.0 (100%, no limit)uptime
- Parameters that help you balance expected uptime with cost (see below).availability_zone
- List of availability zones the SpotManager can work inproduct
- For price lookup. Default 'Linux/UNIX (Amazon VPC)'price_file
- To minimize AWS calls, the previous price data is stored in a file for retrieval next time.run_interval
- So the SpotManager knows how long before the next run will happen (Used to determine time remaining in the hour for an instance)aws
- a structure containing the parameters to connect to AWS using botoutility
- a list of objects declaring the utility of each instance type. Instance types not mentioned are assumed to have zero utility and will not be bid on, and will be terminated if any exist.ec2.request
- template for making a spot request using boto. This is where you declare the machine image, private keys, networking interfaces, etc.ec2.instance.name
- Name that will be assigned to an instance (and to the spot requests). It is important that no other machines under the AWS user have this prefix. Any machines with this prefix will be under the control of SpotManager.instance
- The parameters that will be sent to the constructor for yourInstanceManager
.instance.class
- An additional property ininstance
: The full name of the class you are using to setup/teardown an instance.debug
- Settings for the logging module
utility
More about The utility list is a declaration of how much utility each instance type can
provide, and additional configuration that the InstanceManager can use for
setup()
.
uptime
More about In order to make a good bid, the historical pricing record for each instance- type and region is used. All these settings have defaults designed for quick- setup tasks. If your setup takes longer, or the value of your machine increases as it sticks around, you may want to set these values. Here are the settings we use for ElasticSearch nodes:
"uptime":{
"history": "week",
"duration": "day",
"bid_percentile": 0.95
}
history
- How much history to use: Too little history and the bids can be terminated earlier than expected, too much history will make the algorithm unresponsive to lowering prices.duration
- The window of time used to find the max price. The intent is to figure out what the bids should have been over thehistory
so that you do not get terminated for theduration
. If your workload is quick to setup, then you can set it to zero (0
).- bid_percentile - With a
history
's worth of max pricing, the question remains which price to pick: Thebid_percentile
is used to make that selection: Use0.50
(median) to make aggressively low bids: Most of the bids will fail, and there will be long hours when nothing new spins up. Use higher numbers to increase the chance of uptime. It is never wise to set this to1.00
because there is often some fool willing to bid more than on-demand.
No matter your uptime
settings, your bids will never go beyond your
budget
, and never go beyond max_utility_price
.
Configuring Volumes
Some workloads require large amounts of storage, but not all instances come with enough. The SpotManager will map the ephemeral and EBS volumes or you.
As an example, the c3.4xlarge
comes with two ephemeral drives, which can
be found at /dev/sdb
and two new EBS volumes, which will be assigned
device
properties at runtime.
{
"instance_type": "c3.4xlarge",
"utility": 15,
"drives": [
{"path":"/data1", "device":"/dev/sdb"},
{"path":"/data2", "device":"/dev/sdc"},
{"path":"/data3", "size":1000, "volume_type":"standard"},
{"path":"/data4", "size":1000, "volume_type":"standard"}
]
},
Some caveats:
- All volumes will be removed on termination - This is obvious for ephemeral drives, but the EBS will be removed too. If you want the volume to be permanent, you must map the block device yourself.
- block devices will not be formatted nor mounted. The
path
is provided only so theInstanceManger.setup()
routine can perform themkfs
andmount
commands.
Writing a InstanceManager
Conceptually, an instance manager is very simple, with only three methods
you need to implement. This repo has an example ./examples/etl.py
that you can review.
required_utility()
- function to determine how much utility is needed. Since you are the one defining utility, the amount you need is also up to you. Theexamples
uses the size of the pending queue to determine, roughly, how much utility is required.setup()
- function is called to setup an instance. It is passed both a boto ec2 instance object, and the utility this instance is expected to provide. This is run in its own thread, and multiple can be called at the same time; ensure your code is threadsafe.teardown()
- When the machine is no longer required, this will be called before SpotManager terminates the EC2 instance. This method is not called when AWS terminates the instance.
Benefits
The benefit of an bid_percentile
price point is we want a reasonable up-time with a low
price. We do not want a price set too high: we desire Amazon-initiated
termination so we get the last partial hour free. Also, some of instance
types have unpredictable and extreme price swings; SpotManager
allows you
to utilize those valleys at minimal price exposure.
The more instance types your workload can run on, the more advantage you have finding minimal pricing: Anecdotally, there is always an opportunity to be found: There is always an instance type going for significantly less than its utility would indicate.