Monitor for outdated ECS/EKS AMIs on the MP and alert the team
Closed this issue · 5 comments
User Story
As an MP engineer
I want to monitor the versions of ECS/EKS-optimised AMIs in use by members' clusters
So that I can warn them they are out of date
Value / Purpose
Follow on from #2413
This story would involve writing a script that scans member environments using ECS/EKS to see which AMI they are running, and then compares it against the latest available optimised AMIs. This would allow MP to see the drift and contact members whose infrastructure is out of date.
This could be a Lambda function rolled out to the baseline that alerts MP (or the member directly).
Useful Contacts
Additional Information
See Spike: Scalability - member information #6317 for how we could do this
Proposal / Unknowns
No response
Definition of Done
- Solution agreed for how to monitor
- Solution/script developed
- Solution/script tested
- Another team member has reviewed
- Tests are green
As part of this ticket, can it be confirmed that one of the proposed options to eliminate hardcoded ECS AMIs has been successfully implemented for the apex, mlra and maat applications? Please see this spreadsheet, which is part of ticket #7188, for more information.
To monitor outdated AMIs, I think Lambda + SSM Parameter Store could be a suitable approach. A Lambda would run on a schedule, retrieve the latest recommended AMI IDs from SSM Parameter Store, compare them with the AMIs in use, and then possibly send notifications via SNS if any are found to be outdated. I've started working on the script, but it still needs tweaking - please see here:
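As a rough illustration of the Lambda + SSM + SNS flow described above (not the actual script linked in this comment), a minimal Python sketch could look like the following. It only covers ECS, skips pagination, and assumes the SNS topic ARN is supplied through an environment variable named `ALERT_TOPIC_ARN`, which is a hypothetical name. The AWS clients are passed in so the logic can be exercised without AWS access.

```python
"""Sketch of a scheduled Lambda that flags ECS container instances on an outdated AMI."""
import os

# Public SSM parameter that AWS publishes for the recommended ECS-optimised AL2 AMI.
ECS_AMI_PARAM = "/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id"


def ecs_instance_ids(ecs):
    """Return the EC2 instance IDs backing every ECS cluster in the account.

    Pagination is omitted for brevity, so large accounts would need paginators.
    """
    ids = []
    for cluster in ecs.list_clusters()["clusterArns"]:
        arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
        if not arns:  # describe_container_instances rejects an empty list
            continue
        described = ecs.describe_container_instances(
            cluster=cluster, containerInstances=arns
        )
        ids += [
            ci["ec2InstanceId"]
            for ci in described["containerInstances"]
            if "ec2InstanceId" in ci
        ]
    return ids


def find_outdated(ec2, instance_ids, latest_ami):
    """Map instance ID -> AMI ID for instances not running the latest AMI."""
    if not instance_ids:
        return {}
    outdated = {}
    response = ec2.describe_instances(InstanceIds=instance_ids)
    for reservation in response["Reservations"]:
        for instance in reservation["Instances"]:
            if instance["ImageId"] != latest_ami:
                outdated[instance["InstanceId"]] = instance["ImageId"]
    return outdated


def check(ssm, ecs, ec2, sns, topic_arn):
    """Compare running ECS AMIs with the recommended one and alert on drift."""
    latest = ssm.get_parameter(Name=ECS_AMI_PARAM)["Parameter"]["Value"]
    outdated = find_outdated(ec2, ecs_instance_ids(ecs), latest)
    if outdated:
        lines = [
            f"{iid}: {ami} (latest is {latest})"
            for iid, ami in sorted(outdated.items())
        ]
        sns.publish(
            TopicArn=topic_arn, Subject="Outdated ECS AMIs", Message="\n".join(lines)
        )
    return outdated


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above can be tested without AWS

    return check(
        boto3.client("ssm"),
        boto3.client("ecs"),
        boto3.client("ec2"),
        boto3.client("sns"),
        os.environ["ALERT_TOPIC_ARN"],  # hypothetical variable name
    )
```

An EKS variant would need a different way of discovering node instances and a per-Kubernetes-version parameter, so it is left out of this sketch.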
Rich suggested that we could perhaps do something similar to what we've done for the bastion AMIs. I looked into it and it's possible to use the SSM parameter resolve syntax for ECS and EKS AMIs, just as you would with the Amazon Linux 2 AMI. This ensures that it always references the latest ECS- or EKS-optimised AMI.
So something like this:
for ECS: image_id = "resolve:ssm:/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id"
for EKS: image_id = "resolve:ssm:/aws/service/eks/optimized-ami/1.23/amazon-linux-2/recommended/image_id"
This would provide automatic updates, and can be a better method than a data call: the parameter is resolved when an instance is launched, so new instances pick up the latest AMI without having to re-apply Terraform.
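To show how the two parameter paths above are put together, here is a small Python sketch that builds them and, optionally, looks up the AMI IDs they currently point to. The helper names are made up for illustration. The lookup uses the SSM `get_parameters` call, which accepts at most ten names per request.

```python
ECS_PREFIX = "/aws/service/ecs/optimized-ami"
EKS_PREFIX = "/aws/service/eks/optimized-ami"


def ecs_ami_parameter(os_variant="amazon-linux-2"):
    """Path of the public parameter for the recommended ECS-optimised AMI."""
    return f"{ECS_PREFIX}/{os_variant}/recommended/image_id"


def eks_ami_parameter(kubernetes_version, os_variant="amazon-linux-2"):
    """Path of the public parameter for the recommended EKS-optimised AMI.

    EKS AMIs are published per Kubernetes version, e.g. "1.23".
    """
    return f"{EKS_PREFIX}/{kubernetes_version}/{os_variant}/recommended/image_id"


def as_resolve_reference(parameter_name):
    """Format a parameter path as the image_id value shown above."""
    return f"resolve:ssm:{parameter_name}"


def fetch_recommended_amis(ssm, parameter_names):
    """Return {parameter name: AMI ID} using an injected boto3 SSM client."""
    response = ssm.get_parameters(Names=list(parameter_names))
    return {p["Name"]: p["Value"] for p in response["Parameters"]}


if __name__ == "__main__":
    print(as_resolve_reference(ecs_ami_parameter()))
    print(as_resolve_reference(eks_ami_parameter("1.23")))
```

Run directly, the script prints the same two `resolve:ssm:` strings as in the comment above; `fetch_recommended_amis` would additionally need a `boto3.client("ssm")` and AWS credentials.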
Contacted application owners that were shown to have outdated AMIs in this csv file: outdated-amis.csv
- Cdpt-chaps & cdpt-ifs - Alistair Curtis has updated the image ID to use the resolve ssm parameter instead of a data call (example: ministryofjustice/modernisation-platform-environments#8367)
- Apex - contacted Vincent Cheung, and they will look into implementing the resolve ssm parameter, but informed them the issue can be resolved by rerunning Terraform, as it uses a data call to pull in the latest AMI.
- Performance-hub - contacted Jeremy Collins, and their AMI is pretty recent (September release), but they already have a ticket in the backlog to update the code to use the ssm parameter.
- Sent out emails to tribunals, mlra and maat owners and awaiting responses.
The script for monitoring outdated AMIs has been implemented and can be seen here. It fetches the latest AMI details from AWS Systems Manager and compares them with the AMIs of running ECS and EKS instances. Accounts with outdated AMIs are logged in a CSV file, and I've contacted the application owners. I've created a follow-on ticket here for the script to be run again, so this ticket is ready for review.
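For anyone picking up the follow-on ticket, the compare-and-log step could be sketched roughly as below. This is not the linked script: the column names and the shape of the input records are assumptions made for illustration, and the real outdated-amis.csv may be laid out differently.

```python
import csv
import io

# Hypothetical column layout for the report.
FIELDS = ["account", "service", "instance_id", "current_ami", "latest_ami"]


def outdated_rows(running, latest_by_service):
    """Return one report row per running instance whose AMI is not the latest.

    `running` is a list of dicts with account, service, instance_id and ami keys.
    `latest_by_service` maps a service name ("ecs" or "eks") to the latest AMI ID;
    a real EKS check would also need to key on the Kubernetes version.
    """
    rows = []
    for item in running:
        latest = latest_by_service[item["service"]]
        if item["ami"] != latest:
            rows.append(
                {
                    "account": item["account"],
                    "service": item["service"],
                    "instance_id": item["instance_id"],
                    "current_ami": item["ami"],
                    "latest_ami": latest,
                }
            )
    return rows


def write_csv(rows, handle):
    """Write the rows, with a header line, to any text file-like object."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Made-up sample data purely to show the output shape.
    sample = [
        {"account": "example-dev", "service": "ecs", "instance_id": "i-1", "ami": "ami-old"},
        {"account": "example-dev", "service": "ecs", "instance_id": "i-2", "ami": "ami-new"},
    ]
    buffer = io.StringIO()
    write_csv(outdated_rows(sample, {"ecs": "ami-new"}), buffer)
    print(buffer.getvalue())
```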