WIPACrepo/pyglidein

catch exceptions so we don't die

dsschult opened this issue · 3 comments

@philippeller says:

I'm running the pyglidein client as a standalone process with the delay = ... seconds option. Every now and then a qsub command fails and raises an exception:

Traceback (most recent call last):
  File "./client.py", line 198, in <module>
    main()
  File "./client.py", line 174, in main
    scheduler.submit(s, partition)
  File "/storage/home/pde3/pyglidein/submit.py", line 336, in submit
    raise Exception('failed to launch glidein')
Exception: failed to launch glidein

Could we change the behaviour of the exception handling so that the client stays alive?

Just put a general try/except around the entire client so it can loop properly.
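
A minimal sketch of that idea, not the actual client.py code. The names run_forever, run_once, max_cycles and sleep are hypothetical; run_once stands in for whatever one pass of the client does (including the scheduler.submit call from the traceback), and max_cycles exists only so the loop can be bounded when testing.

```python
import logging
import time

logger = logging.getLogger("client")


def run_forever(run_once, delay, max_cycles=None, sleep=time.sleep):
    """Call run_once() repeatedly, logging failures instead of exiting.

    run_once, max_cycles and sleep are illustrative names, not pyglidein API.
    Returns the number of cycles that raised an exception.
    """
    failures = 0
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        try:
            run_once()
        except Exception:
            # Exception does not cover KeyboardInterrupt or SystemExit,
            # so Ctrl-C and sys.exit() still stop the client.
            failures += 1
            logger.exception("client cycle failed; retrying in %s seconds", delay)
        cycles += 1
        sleep(delay)
    return failures
```

Catching Exception rather than using a bare except keeps the client interruptible, and logger.exception records the full traceback so a failed qsub is still visible in the log.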

At my site I explicitly turn warnings into errors with qsub -w e, as this usually means that I need to intervene because the client is submitting jobs that will never be serviced. You can also make qsub fail gracefully with qsub -w w (if you want to know about it) or qsub -w n (if you don't).

I'd argue that the client should know that it didn't launch a glidein, but we should continue running in the hope that it's transient.

Ideally we'd just pass the failure info to monitoring, which would alert the human.
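
Roughly what that could look like, as a sketch only. submit_and_report and the report callback are hypothetical names; how failures would actually reach monitoring is not decided here, so report is just any callable that accepts a dict.

```python
import time


def submit_and_report(submit, report, *args):
    """Run submit(*args); on failure, hand the details to report() and carry on.

    Both submit and report are placeholders: submit stands in for a call like
    scheduler.submit, report for whatever forwards events to monitoring.
    Returns True on success and False on failure, so the caller knows that
    no glidein was launched.
    """
    try:
        submit(*args)
    except Exception as exc:
        report({
            "time": time.time(),
            "error": str(exc),
            "args": [repr(a) for a in args],
        })
        return False
    return True
```

The boolean return value lets the client know the launch failed while the loop keeps running, and the report callback is where a human alert could be hooked in.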