keikoproj/instance-manager

Automated AMI upgrades

preflightsiren opened this issue · 10 comments

Is this a BUG REPORT or FEATURE REQUEST?: Feature Request

Amazon publish a supported, EKS optimised AMI and an associated SSM parameter that is updated to point to the latest AMI for a given kubernetes version.

aws ssm get-parameter --name "/aws/service/eks/optimized-ami/1.18/amazon-linux-2/recommended/image_id" --region us-east-1 --query "Parameter.Value" --output text

We could have instance-manager query the EKS API to determine the kubernetes version of the cluster, query the latest AMI and replace the image field.

This could then be coupled with an automated upgrade process like upgrade-manager, so that nodes are automatically replaced as a new AMI.

Interesting idea @preflightsiren
Maybe we can support a ‘latest’ value for image field.
We already have the DescribeCluster payload available.
Just need the ssm call to get the latest image.

A few thoughts on use cases / edge cases for this feature:

  • Probably want to support a configurable worker k8s version - a 'surprise' upgrade during a master node version upgrade is probably not what folks will want.
  • It'd be nice to have some kind of approval or delay mechanism, so operators could choose when to roll out a new change instead of it happening right after ssm is updated.
  • On a similar note, it'd be useful to track version information about what AMI was used for operators to rollback if they need to.

yeah, thinking about what I'd like to know, operating a cluster with this feature:

  • What is the current AMI?
  • What was the previous AMI?
  • When did they change?
  • How many nodes are running the current AMI?
  • How many nodes are running an AMI other than the current?

A few thoughts on use cases / edge cases for this feature:

* Probably want to support a configurable worker k8s version - a 'surprise' upgrade during a master node version upgrade is probably not what folks will want.

* It'd be nice to have some kind of approval or delay mechanism, so operators could choose when to roll out a new change instead of it happening right after ssm is updated.

* On a similar note, it'd be useful to track version information about what AMI was used for operators to rollback if they need to.

That might be a good feature even separately @backjo.
This can be useful in cases you don't want an upgrade to be triggered immediately when you make a change e.g. when new controller version includes user-data change.

Perhaps we can introduce a generic mechanism to allow users to stagger upgrades.
example, existence of annotation instancemgr.keikoproj.io/lock-upgrades: true would avoid entering an upgrade state, when the annotation is removed that will trigger another reconcile which can then process the upgrade.

This can allow users to now chose the behavior they want - automatic upgrades without notification (with the case of latest AMI), or with use of a 'lock' that allows to trigger the upgrade by removing/adding an annotation.

Regarding tracking which AMI is used - we add a label now to nodes with the AMI in use, so even if the IG says "latest", the nodes will show the actual AMI via labels, we can also add a status field with the resolved AMI.

WDYT

Just getting back to this - I really like what cert-manager did with the concept of CertificateRequests and approval (https://cert-manager.io/docs/concepts/certificaterequest/#approval)

We could do a similar concept - have a concept of an UpgradeRequest where an IG that wishes to upgrade is gated by the approval of such a request. We could similarly have an internal controller (enabled by default) that auto-approves UpgradeRequest objects, but advanced users could choose to disable this controller and implement custom logic - could be a slack webhook, a custom controller etc - it'd really be up to the user.

Also could work with lock-upgrades as well, but in general I'm a fan of this approach where upgrades are more of an 'approval' then a manual update to the ami

do you mind if we split this into the 2 parts?:

  1. AMI ID updating from aws ssm params
  2. UpgradeRequest approval gate

@backjo
Can you clarify the suggestion a bit?
Are we talking about having the controller create a CR InstanceGroupUpgradeRequest whenever we detect a drift, and then ending the requeue.
Then a secondary reconciler in the controller that watches InstanceGroupUpgradeRequest objects, can reconcile this object when someone modifies the CR to Approved: true or Denied: true?

Should the point in time should be before creation of new Launch Config or after?

Approval mechanism can then be configurable in the spec of InstanceGroup to either be manual (setting the CR to approved), or slack / email / webhook / etc.

I think this makes a nicer interface, but there are some considerations/questions and tradeoffs here:

  • Primary tradeoff is that with the annotation approach you can actually use the configmap to lock/unlock all IGs if you choose to (which is a more relevant use case), if you have 100 IGs, you would have to otherwise change 100 InstanceGroupUpgradeRequest - we should consider what a global option would be here. Annotation /w configmap gives you both single and global option.

  • Instead of automatic approval, should we consider when this mechanism is not enabled to keep the existing behavior? creating objects that get auto approved seems redundant. This might be problematic since adding a reconciler is a controller level option - meaning we would be forced to always submit/reconcile these objects with the auto-approval.

  • What happens when an upgrade request is denied? Would the user need to make another IG change? or delete the denied InstanceGroupUpgradeRequest so that a new one can be created?

  • What happens to old InstanceGroupUpgradeRequest objects? does the controller delete these objects at some point? if not and we use unique names, what is the mechanism for cleaning those up?

@eytan-avisror how you describe it is largely how I imagine it working.

I wonder if we can combine these two approaches - maybe a hybrid approach here works the best.

  • An update to an IG happens
  • If the IG is not annotated as 'locked', update proceeds as implemented today.
  • If the IG is annotated as locked, a CR is created and is in a pending state.
  • If the annotation for lock is removed from the IG, all CRs for the IG are removed, and the IG will reconcile as normal today.
  • If the request is denied, a change will be needed to the IG to trigger a new upgrade request.
  • Once the IG transitions back to ready state after an update, the CR gets removed.

how do you guys feel about: #327 (comment)

Just looking through https://docs.aws.amazon.com/systems-manager/latest/userguide/parameter-store-public-parameters-eks.html

The ssm GetParameter response contains LastModifiedDate. This might be useful to create policies to allow teams to set >different delays updating. Eg. Dev clusters immediately get new AMIs, staging clusters after 48h, prod after 1week.

What do you think?

we could include a delay after which an upgrade can occur

That’s a pretty good idea.. I think especially interesting in the case of dev clusters.
Feel free to open a separate issue to explore that idea and propose a design.. I think may be appropriate to add this field as a global on the configmap somewhere such as a “schedules” object.