WIPACrepo/pyglidein

glidein monitoring (post-startd)

Closed · 1 comment

Goal/scope: gather any relevant monitoring information for a particular glidein slot once the condor STARTD has started. I am assuming that, if condor is running, we will use the ClassAd mechanism to distribute this monitoring information.

Some ideas for things to monitor here:

  • GPU benchmark: the result of a short (1 min) GPU benchmark run at glidein startup. It could enable normalized accounting and also let users filter out very slow GPUs.

  • GPU utilization: might be tricky, but it would be nice to have a measurement that reflects the average GPU utilization over the job duration. Maybe we will need to poll nvidia-smi via STARTD_CRON and compute an average; not sure yet.

  • ... others?
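
To make the GPU utilization idea more concrete, here is a minimal sketch of what a STARTD_CRON probe could do: sample nvidia-smi, fold each sample into a running mean, and print `Name = value` lines for the startd to merge into the slot ad. This is only an illustration under assumptions, not pyglidein code. The helper names and the attribute name `GlideinGPUUtilizationAvg` are made up, and persisting the running mean between probe invocations (and resetting it when a new job starts) is left out.

```python
import subprocess

# utilization.gpu is a standard nvidia-smi query field; with this format
# the output is one bare number per GPU, e.g. "37".
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(output):
    """Turn nvidia-smi CSV output into a list of per-GPU percentages."""
    values = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            values.append(float(line))
        except ValueError:
            # Skip non-numeric entries such as "[N/A]".
            continue
    return values


def sample_gpus():
    """Run nvidia-smi once; return [] if it is missing, fails, or hangs."""
    try:
        result = subprocess.run(
            NVIDIA_SMI_CMD,
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return []
    return parse_utilization(result.stdout)


def update_mean(count, mean, sample):
    """Incremental mean: fold one new sample into (count, mean)."""
    count += 1
    mean += (sample - mean) / count
    return count, mean


def format_classad(attrs):
    """Render attributes as 'Name = value' lines, sorted by name."""
    return "\n".join(
        "%s = %s" % (name, value) for name, value in sorted(attrs.items())
    )
```

A probe would call `sample_gpus()` on each run, update the stored `(count, mean)` with `update_mean`, and print something like `format_classad({"GlideinGPUUtilizationAvg": round(mean, 1)})`. The incremental mean avoids keeping every sample around, which matters if the probe runs for the whole lifetime of a long job.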

Using #104 instead, because all checks will happen after the startd starts.