/condor_annex

There are many clouds like it, but this one is (not) mine.

Primary LanguagePerl

OVERVIEW

	We use a CloudFormation template to help track all the moving parts
(and clean them up when we're done).  The moving parts include:

The lease:
	* A CloudWatch metric, specific to each annex, whose namespace includes
	  the stack's name.
	* A CloudWatch alarm, specific to each annex, which triggers when the
	  custom metric becomes 0 or misses "enough" updates.
	* An SNS topic, specific to each annex, which ties to the preceding trigger
	  to the following Lambda function.
	* A Lambda function, the same for all annexes, triggered by the alarm. The
	  Lambda function checks if the alarm was properly triggered (to work
	  around an AWS bug where the alarm is evaluated on an update before the
	  update is evaluated): if so, it deletes the stack, otherwise it
	  resets the alarm.

The instances:
	* One or more AutoScaling Groups.  (Up to eight, given the current
	  limitations of the template engine.)
	* One LaunchConfiguration per AutoScaling Group, each created in its
	  own substack (the substack is only used as a template engine).
	* A Security Group...

Some plumbing:
	* A Lambda function, the same for all annexes, invoked as a custom
      resource, that adjusts the permissions on the previous Lambda function
      so that the SNS topic can invoke it when the alarm triggers and calls it.
    * A Lambda function, the same for all annexes, invoked as a custom
      resource, that inserts a data point into the lease metric.  (The alarm
      will otherwise never leave its initial INSUFFICIENT (data) state.)
	* Three IAM roles, one for each Lambda function, each the same for all
	  annexes.
    * An IAM Role and IAM Role Profile to grant instances permission to
      read from the private S3 bucket we created to share security tokens.

LAUNCH CONFIGURATION

Amazon does some things that make it very difficult to to automatically
generate launch configurations:

* Some instance types don't exist in some availability zones.
* Some instance types may require VPC.  You'd think we could work around this,
  but we need subnet IDs... and a subnet is availability-zone specific.  Of
  course, we don't know the names of the availability zones -- or even the
  count -- since they vary from region to region.  We could create mappings,
  but how do we create loops?  The subnet resource doesn't allow a list in the
  availability zone parameter.  Could we create another custom resource whose
  Lambda function creates a subnet in each AZ for the region it runs in?  Even
  if we could, how would we create the launch group(s)?
* AMIs vary by availability zone.  This can be worked around using mappings --
  see Amazon's standard example templates -- but it makes everything harder.

In the glorious future, when we have our own AMIs, we can use the Amazon trick
to choose one, but it's wildly unclear how to deal with the VPC subnet
requirements and launch group variability.

[Update: what we do now is list all selected VPC subnets in each launch
configuration and let Amazon figure it out.  This works as far as it goes,
but of course requires VPC.  (Do any instance types require EC2 Classic?)
This doesn't help with the problem of some instance types not being available
in some AZs, but since VPC subnets are tied to AZs, each AutoScaling Group
is at least "trying" to launch its instance types in each AZ.]

DEPENDENCIES

	Note that the "DependsOn" attributes exist to force the LeaseFunctionRole
to be deleted last, because once it's deleted, CloudFormation no longer has
the permissions to delete anything else.

FIXME

[] We need to look into which users on the machine (current set up has
   user nobody running jobs) can act as the role associated with the machine,
   e.g., download the credentials from S3.

[] If condor_annex fails to create the stack, it should check once more if
   the stack already exists (to help catch simultaneous uses).

FEATURE REQUESTS

* When creating an AWS::SNS::Topic resource with a Lamba-protocol subscription,
  automatically add permissions on the referenced function to allow that topic
  to Invoke([.*]?) it.
* The Cloud Formation web console should check parameter validity earlier
  and write out constraint failure messages next to the corresponding parameter.
* Add an intrinsic function to refer to arbitrary parameters in other resources.
* An AWS::Lambda::Function should be able to specify the duration of its logs.
  It's really hard to do this right now, because the log won't actually exist
  until the Lambda function runs, which could happen at any time.  This may
  actually be a feature request for CloudWatch -- I'd like to set the default
  retention policy for a log group based on its name (and/or source).
* Delete generated subscriptions with the rest of the stack.
* Access to the ARN of a security group.  (Using Fn::Join is awkward.)
* Deletion order dependency.  In particular, if I create an IAM role for a
  Lambda function, it has to be the last thing deleted, because otherwise
  CloudFormation loses the privileges to finish the deletion.  However, that
  IAM role should have its resources restricted based on resources created
  for the stack...

- I can't grant an instance permission to detach only itself, or even any
  instance only from its own autoscaling group.  :(

* [2016-02-18]  CloudFormation doesn't (yet) allow per-instance type bids when
  creating a Spot Fleet.
* [2016-02-18]  It would be nice to be able to create Spot Fleets with an
  initial size of 0.
* [2016-02-18]  It would be really nice to be able to specify defaults with
  functions (Fn::Join, Fn::Ref, and Fn::FindInMap in particular).  This
  would be doubly nice with a function to look up arbitrary values from
  any resource in the template (see above).
* [2016-02-18]  foreach() would be nice for use with comma-separated lists.
* [2016-02-18]  It would be nice if AWS::NoValue disappeared from inside
  object lists, instead of giving Spot Fleet fits.
* [2016-02-18]  Some of these issues may be better resolved by allowing me
  to add (existing) resources to a stack, rather than predeclare them all.
* [2016-02-18]  Being able to specify a periodic timer (CloudWatch Event)
  in CloudFormation would be really nice.
? [2016-02-18]  Specify propogated user tag(s) for the stack.  (Currently
  not propogated to EBS volumes.)  Unclear how or if possible to tag the
  whole stack.
* [2016-03-11]  In particular, it looks like a tagged autoScaling Group
  or Spot Fleet that propogates tags to its instances doesn't tag the
  Spot instance requests by which it acquires those instances.

NOTES

* I could possibly use code to create the log group each Lambda function
  will by convention create, and set the retention policy there?
* I would like to narrow the resource permissions granted to the IAM roles,
  but some actions don't support that, and others cause circular dependency
  problems -- see above.