giantswarm/aws-operator

CloudFormation for Workers

asymmetric opened this issue

We decided to try out CloudFormation for the workers ASG. The initial implementation will deal with scaling the workers up and down.

Benefits:

  • Declarative
  • Easier to do rolling updates when the CC (cloud-config) changes
  • Allows us to simplify the operator

Cons:

  • Rolling updates are only triggered on an LC (LaunchConfiguration) change. How do we tie the two CCs together, so that the LC changes when the big CC changes?
  • Passing in variable-sized parameters (e.g. ELB listeners) is very cumbersome. We might want to use a Go template with proper for loops, instead of relying on AWS' broken system.

Limits:

http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cloudformation-limits.html

Reading the docs, it seems like rolling updates only get triggered on a few events, notably when the LC changes.

This is not what will happen in our case, though: what will change is the big CC, not the small one we pass as UserData.

Next step is investigating whether we can trigger a rolling update anyway.

/cc @rossf7

For upgrading clusters we may need to version the large cloudconfig. If so, can we change the small cloudconfig so it's different each time? If we can, I'd like to trigger the rolling update the standard way.

Good to know about the limits. There are bound to be some downsides compared to using the Go SDK directly.

Versioning sounds like a good idea.

By the way, won't the work in https://github.com/giantswarm/giantswarm/pull/1541 obviate our need to have 2 CCs? If certs are delivered through some other means (i.e. not in the CC), then maybe we won't hit the 16KB limit in the first place.

Putting the large cloudconfig hash in the small one would make sense.

@rossf7 as you mentioned in https://github.com/giantswarm/giantswarm/pull/1541#discussion_r122764100, yes, I think we could simply download the certs from S3 from within the small CC.
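
A small helper along these lines could do the fetching (a rough sketch with aws-sdk-go; the bucket, key, and target path are made up, and credentials would come from the instance profile rather than the CC):

```go
package main

import (
	"log"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	// Credentials come from the IAM instance profile, so no secrets
	// need to live in the small CC itself.
	sess := session.Must(session.NewSession(&aws.Config{
		Region: aws.String("eu-central-1"), // hypothetical region
	}))
	downloader := s3manager.NewDownloader(sess)

	f, err := os.Create("/etc/kubernetes/ssl/worker.pem") // hypothetical path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Stream the cert from S3 straight into the target file.
	_, err = downloader.Download(f, &s3.GetObjectInput{
		Bucket: aws.String("my-cluster-certs"), // hypothetical bucket
		Key:    aws.String("worker/worker.pem"),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```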

I think this change to how we fetch certs should be done as part of this ticket. Either that, or versioning both CCs.

That's because IMO the most meaningful way to test whether CF works for us is to use it for rolling updates in our ASG.

WDYT?

Totally agree on needing to give CF a meaningful test. I think we should version both CCs rather than downloading the certs separately.

Downloading the certs is only needed for rolling certs, which has lower priority. It also has a bigger impact on KVM.

Status Update

We create a template that contains the LC and the ASG. Other resources (the LB, the SG rules, etc.) are still created via the API, as I haven't found an easy way to pass in variable-length arguments (e.g. the SG rules).

To work around that, we might want to run the AWS template through Go's templating system, which would allow us to use iterators.
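
A minimal sketch of that idea, assuming made-up SecurityGroupRule and TemplateParams types: the CF template becomes a Go text/template, and variable-length inputs are expanded with a range loop before the result is handed to the API.

```go
package main

import (
	"os"
	"text/template"
)

// SecurityGroupRule and TemplateParams are hypothetical types used only to
// illustrate the approach; the operator's real types may look different.
type SecurityGroupRule struct {
	Protocol string
	Port     int
	CIDR     string
}

type TemplateParams struct {
	Rules []SecurityGroupRule
}

// The CF template is itself a Go template, so a variable number of
// security group rules can be expanded with a range loop.
const cfTemplate = `Resources:
  WorkerSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: worker security group
      SecurityGroupIngress:
{{- range .Rules }}
      - IpProtocol: {{ .Protocol }}
        FromPort: {{ .Port }}
        ToPort: {{ .Port }}
        CidrIp: {{ .CIDR }}
{{- end }}
`

func main() {
	t := template.Must(template.New("cf").Parse(cfTemplate))
	params := TemplateParams{
		Rules: []SecurityGroupRule{
			{Protocol: "tcp", Port: 443, CIDR: "0.0.0.0/0"},
			{Protocol: "tcp", Port: 22, CIDR: "10.0.0.0/8"},
		},
	}
	if err := t.Execute(os.Stdout, params); err != nil {
		panic(err)
	}
}
```

The rendered output would then be passed as the TemplateBody of the CreateStack call.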

The creation of the stack via the template completes successfully.

Issue

How does the aws-operator know when a cluster is running an old CC, and therefore needs to be updated?

One option is using annotations (sketched after this list). Basically, in the onAdd handler (which also runs on each resyncPeriod) we:

  • produce the big CC
  • calculate its checksum
  • check if the cluster already exists
  • if no:
    • annotate the cluster with the checksum
  • if yes:
    • compare the checksums
    • proceed with updating the ASG if needed
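
A rough sketch of that flow (the annotation key is made up, and the wiring around it is elided; the hashing and comparison are the point):

```go
package operator

import (
	"crypto/sha256"
	"encoding/hex"
)

// Cluster is a minimal stand-in for the cluster object; only the fields
// needed for this sketch are shown.
type Cluster struct {
	Annotations map[string]string
}

const checksumAnnotation = "giantswarm.io/cloudconfig-checksum" // hypothetical key

// reconcileCloudConfig sketches the onAdd flow: hash the rendered big CC
// and compare against the checksum stored on the cluster.
func reconcileCloudConfig(cluster *Cluster, bigCC []byte, updateASG func() error) error {
	sum := sha256.Sum256(bigCC)
	checksum := hex.EncodeToString(sum[:])

	previous, exists := cluster.Annotations[checksumAnnotation]
	switch {
	case !exists:
		// New cluster: just record the checksum.
		cluster.Annotations[checksumAnnotation] = checksum
	case previous != checksum:
		// The CC changed: roll the ASG, then record the new checksum.
		if err := updateASG(); err != nil {
			return err
		}
		cluster.Annotations[checksumAnnotation] = checksum
	}
	return nil
}
```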

We want to go with the checksum approach for detecting if the CC has changed. For storage there are two options: add the checksum to the TPO itself, or use an annotation.

The benefit of adding it to the cluster TPO is that it's part of the state of the cluster. This means the operator is updating TPOs it manages. We discussed this with @xh3b4sd and we don't think this is a problem.

So I'd rather go with this instead of the annotation. Let me know if this causes major problems or is going to be a lot more work.

@rossf7 updating the TPO might be a problem for "consistent deletes". We want to store the content of the TPO in a configmap. Maybe handling update events can help, but I'm not sure yet. Just keep that in mind.

Can the name of a TPO be updated? That's the only case I see breaking the story. Also, the configmap tracking TPOs to be deleted can be updated on update events. So we should be good.

The name of the TPO shouldn't change, just the checksum. Yes, we should update the configmap tracking TPOs when we implement the consistent deletes.

OK.

I think it could make sense later on to split the whole creation of the CC out of the operator, and into its own service. That way, the CC could simply be checksummed and uploaded to an S3 bucket, and the aws-operator would just fetch it and be done with it.
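
In such a setup the CC service could upload under a checksum-addressed key, roughly like this (the bucket name and key scheme are assumptions):

```go
package ccservice

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// uploadCloudConfig uploads the rendered big CC under a checksum-addressed
// key and returns that key, so the aws-operator can fetch (or just compare)
// it without rendering the CC itself.
func uploadCloudConfig(sess *session.Session, bigCC []byte) (string, error) {
	sum := sha256.Sum256(bigCC)
	key := fmt.Sprintf("cloudconfig/%s.yaml", hex.EncodeToString(sum[:]))

	uploader := s3manager.NewUploader(sess)
	_, err := uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("my-cluster-cloudconfigs"), // hypothetical bucket
		Key:    aws.String(key),
		Body:   bytes.NewReader(bigCC),
	})
	return key, err
}
```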

I think a pre-condition for this is making the operator compliant with operatorkit.

I say this because we have to start reconciling the state in the event handlers (e.g. looking at existing clusters, comparing the checksums, making necessary changes) and doing this without first moving to the operatorkit model would be a waste of resources IMO.

So if you have nothing against it, I'm going to focus the rest of the week on other issues in the backlog.

Yes I think we should use the reconciler support in operatorkit. But I think upgrading aws-operator to use the latest operatorkit needs to be a separate change.

We still need to evaluate how much CloudFormation helps, but also whether it introduces new problems. We should do a handover on that. In the meantime, working on backlog issues makes sense.

Status Update

Verifying that the stack was created

We use the builtin WaitUntilStackCreateComplete function. In a previous commit, I had implemented it myself; the advantage of that was more control over the timeout and logging.
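
For reference, the builtin waiter is used roughly like this (the stack name is made up):

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudformation"
)

func main() {
	sess := session.Must(session.NewSession())
	cf := cloudformation.New(sess)

	// Block until the stack reaches CREATE_COMPLETE (or the waiter gives up).
	err := cf.WaitUntilStackCreateComplete(&cloudformation.DescribeStacksInput{
		StackName: aws.String("worker-stack"), // hypothetical stack name
	})
	if err != nil {
		log.Fatalf("stack creation did not complete: %v", err)
	}
}
```

Newer aws-sdk-go releases also expose WaitUntilStackCreateCompleteWithContext, which takes waiter options for delay and max attempts; if the version we pin has it, that would recover some of the control the hand-rolled version offered.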

Rolling update of the ASG

Rolling updates in an ASG are only triggered when there's a change to any of the parameters of its LaunchConfiguration, and only if the ASG has the UpdatePolicy: AutoScalingRollingUpdate property set (see the snippet below).
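
For reference, the relevant stanza looks roughly like this (values are illustrative and other required ASG properties are elided):

```yaml
WorkerAutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  UpdatePolicy:
    AutoScalingRollingUpdate:
      MinInstancesInService: "1"
      MaxBatchSize: "1"
      PauseTime: PT5M
  Properties:
    LaunchConfigurationName: !Ref WorkerLaunchConfiguration
    MinSize: "1"
    MaxSize: "3"
    # ...other required properties (subnets etc.) elided
```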

One simple way to test that this works in the current branch is to go to the console page of the stack created by the operator, and update it by changing one of its parameters (e.g. AssociatePublicIPAddress).

The parameter we are most interested in is UserData, but updating that from the operator is too big of a task for the time I have left.
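
For whoever picks this up: triggering the update from Go would probably look something like the following sketch (the stack and parameter names are assumptions about what the template declares).

```go
package operator

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/cloudformation"
)

// updateUserData changes only the UserData parameter, reusing the previous
// template, which should be enough to kick off the rolling update.
func updateUserData(cf *cloudformation.CloudFormation, newUserData string) error {
	_, err := cf.UpdateStack(&cloudformation.UpdateStackInput{
		StackName:           aws.String("worker-stack"), // hypothetical
		UsePreviousTemplate: aws.Bool(true),
		Parameters: []*cloudformation.Parameter{
			{
				ParameterKey:   aws.String("UserData"),
				ParameterValue: aws.String(newUserData),
			},
			{
				// Keep every other parameter as-is.
				ParameterKey:     aws.String("AssociatePublicIPAddress"),
				UsePreviousValue: aws.Bool(true),
			},
		},
	})
	return err
}
```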

General feedback on using CloudFormation

Right now, we only use the template for the ASG and the LC. This is because of the drawbacks explained below (mainly, that the templating is too rigid for our use case).

Pros

  • out-of-the box support for rolling updates of the instances in the ASGs
  • declarative instead of imperative
  • less code in the operator

Cons

  • the templating system (i.e. the functions AWS provides on top of the bare YAML) is barebones. For example, it does not support passing in a list of security group rules, which we need if we want to make a template generic.
    • this could be obviated by using Go's template package
    • the other possibility is to keep creating some of the resources via the SDK, and the rest via CF (which is what's happening in my branch)

Status Update

By using Go templates, I was able (1, 2) to easily pass Listeners and Security Group rules to the CF template.

It is definitely possible to go further this way, eliminating code from the operator.

I decided to use Go templates only where AWS's parameter system was too cumbersome. In all other scenarios, I stuck to the default way of doing things.

I didn't remove the old, manual code from the operator's service/create/service.go for lack of time, but there's plenty that can be removed.