WIPACrepo/pyglidein

monitoring for glidein startup (before the startd connects)

Closed this issue · 5 comments

We should monitor the glidein startup, before the startd connects.

I do not know much about designing scalable distributed systems, but it sounds to me like just opening the Graphite port to the world, and having potentially thousands of processes send data to it from anywhere, is going to cause problems.

Isn't this a classic use case for a messaging system? Can we consider using something like Apache Kafka to send information to Logstash/Elasticsearch reliably?

After talking with @dsschult and @gonzalomerino we decided on a scope for this issue:

  1. For GPU slots, ensure the CLSIM benchmark runs successfully.
  2. Ensure CVMFS is accessible from the node.
  3. Ensure the gridftp service port is accessible from the node.

If possible, condor should run these scripts between jobs and add their outputs as classads. Jobs shouldn't start if any of the three checks fails.
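As a rough illustration of checks 2 and 3, here is a minimal Python sketch that prints one `Attribute = Value` line per check, which is the general shape condor expects from a script whose output is merged into the machine classad. Everything specific in it is an assumption for illustration and not taken from pyglidein: the attribute names `GLIDEIN_CVMFS_OK` and `GLIDEIN_GRIDFTP_OK`, the CVMFS path, the host `gridftp.example.org`, and the helper names. Port 2811 is the conventional GridFTP control port. The CLSIM benchmark (check 1) is left out because it depends on the site software.

```python
import os
import socket

# Hypothetical values, for illustration only.
CVMFS_PATH = "/cvmfs/icecube.opensciencegrid.org"
GRIDFTP_HOST = "gridftp.example.org"
GRIDFTP_PORT = 2811  # conventional GridFTP control port


def check_dir_readable(path):
    """Return True if path is a directory that can be listed and is not empty."""
    try:
        return len(os.listdir(path)) > 0
    except OSError:
        return False


def check_tcp_port(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def format_classads(results):
    """Render a dict of attribute name -> bool as 'Name = True/False' lines."""
    return "\n".join(
        "{} = {}".format(name, "True" if ok else "False")
        for name, ok in results.items()
    )


def main():
    # Run both checks and print them in classad attribute form.
    results = {
        "GLIDEIN_CVMFS_OK": check_dir_readable(CVMFS_PATH),
        "GLIDEIN_GRIDFTP_OK": check_tcp_port(GRIDFTP_HOST, GRIDFTP_PORT),
    }
    print(format_classads(results))
```

Calling `main()` from a script that condor runs periodically would publish both attributes. A START expression could then require each of them to be `True`, so that jobs do not start on a node where a check fails; how that is wired into the glidein configuration is not covered here.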

Adding #111 to this ticket.