taskcluster/taskcluster-tools

Support provisioning health and error reporting in aws-provisioner tools pages

Closed this issue · 10 comments

We've talked about this in an email thread, but I wanted to get the issue on file. The supporting work in the EC2-Manager service has been completed and is awaiting deployment to our production instance. All URLs point at our staging instance. I'm also working on figuring out why the API references for the EC2-Manager aren't being published. These are the changes that I think we should make:

  1. Split worker type details into their own page.
  2. On the overall view of the AWS-Provisioner page, have a table that shows the latest errors encountered by this provisioner. The supporting API is this.
  3. On the overall view of the AWS-Provisioner page, have a table that shows the 'health' of each region/az/instanceType. The supporting API is this. The values in this API response are approximate stats on the configuration. How they're displayed isn't critical to me, but I suspect highlighting rows which have non-zero error values as warnings and rows where those counts are >40% of the success rate as errors would be useful. Ideally, it could be viewed by Region, AZ and Instance Type. The idea is to be able to get an overview of an entire Region, AZ or Instance Type.
  4. On the worker type detail, add a new tab to the view called 'Health' that contains the errors for that worker type. The supporting API is this. Basically, a worker-type-specific view of the other errors page.
  5. On the worker type 'Health' tab, add a worker-type-specific overview of Region, AZ and Instance Type. The supporting API is this.
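
As a rough illustration of the highlighting rule in item 3, here is a minimal sketch of how rows could be rolled up by Region, AZ or Instance Type and then classified. The `HealthRow` field names, the `classify` and `summarize` helpers, and the reading of ">40% of the success rate" as "error count greater than 0.4 times the success count" are all assumptions made for this example; the real EC2-Manager response may be shaped differently.

```typescript
// Hypothetical shape of one health row; the real API fields may differ.
interface HealthRow {
  region: string;
  az: string;
  instanceType: string;
  successes: number;
  errors: number;
}

type Severity = "ok" | "warning" | "error";
type GroupKey = "region" | "az" | "instanceType";

interface GroupSummary {
  key: string;
  successes: number;
  errors: number;
  severity: Severity;
}

// Assumed threshold: errors above 40% of the success count are flagged as errors.
const ERROR_RATIO = 0.4;

// Any non-zero error count is at least a warning.
function classify(successes: number, errors: number): Severity {
  if (errors <= 0) return "ok";
  if (errors > ERROR_RATIO * successes) return "error";
  return "warning";
}

// Sum the counts for every distinct value of the chosen dimension,
// then classify each total so a whole Region, AZ or Instance Type
// can be highlighted as a single row.
function summarize(rows: HealthRow[], by: GroupKey): GroupSummary[] {
  const totals = new Map<string, { successes: number; errors: number }>();
  rows.forEach((row) => {
    const current = totals.get(row[by]) || { successes: 0, errors: 0 };
    current.successes += row.successes;
    current.errors += row.errors;
    totals.set(row[by], current);
  });

  const summaries: GroupSummary[] = [];
  totals.forEach((t, key) => {
    summaries.push({
      key,
      successes: t.successes,
      errors: t.errors,
      severity: classify(t.successes, t.errors),
    });
  });
  return summaries;
}
```

Calling `summarize(rows, "region")`, `summarize(rows, "az")` and `summarize(rows, "instanceType")` on the same data would give the three views, and the `severity` field could drive the row styling. The same helpers would work for the worker-type-specific view in item 5, since only the input rows change.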

This will allow Sheriffs and others interested to see what is going on inside the provisioner. This should reduce the impact on the Taskcluster team when there are problems inside the EC2 system as well as highlight when worker types are incorrectly configured.

Those details would be great to see. We just had a situation where lots of workers didn't get provisioned, and it was not clear what was going on from just looking at https://tools.taskcluster.net/aws-provisioner/. So links to health and error pages would be very helpful. Thanks!

+1 yes please! this will be super useful for diagnosing windows/occ ami issues too.

This will indeed be useful for me too - and hopefully allows sheriffs / others to better self-serve, which will also free up a lot of the time we currently spend supporting and troubleshooting on behalf of others. +++

great \o/

Also, I wanted to mention that the EC2-Manager API references are now being published, so you should be OK to generate a new tc-client-web package to include it. It's in the standard manifest.json file.

Oh, and one thing I wanted to mention is that there are fields which have a ----- HIDDEN ----- placeholder. Please render them in the UI; they might change into full error messages later on.

Given the "Health" tab, is it worth getting rid of the empty "Status" tab?

hmm, that shouldn't be empty :(

I'll investigate.