DOWN vertices not reconsidered after coming back UP
Closed this issue · 6 comments
On rzadams I marked a rabbit vertex as DOWN and submitted a job that required a compute node on the same rack as that rabbit. As expected, the job was stuck in SCHED. I then marked the rabbit vertex as UP, but the job remained stuck in SCHED. A new job went through fine.
It seems to me that the original job should have had its resource request reconsidered at some point?
Can you provide some of the job details and resource requests? Did the job have constraints? Also, which resource reader is Fluxion using on rzadams?
> Also, which resource reader is Fluxion using on rzadams?
After a bit more thought, it has to be JGF given the use of rabbits.
A better question is what is the configured match policy?
I'll check the match policy, but I'll also see if I can reproduce locally.
Having just gone through this, my best guess is that the resource somehow never got marked as UP in resource. We definitely reconsider jobs when resources change state; in fact, we reconsider all jobs even when a resource is set DOWN. So it's probably a failure to propagate that state or an issue with the matching.
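For illustration, the behavior described here (pending jobs get another match attempt whenever a resource changes state) can be sketched as a toy model. This is not Fluxion's code: `ToyScheduler`, `Job`, the `needs` field, and the vertex name `rabbit0` are all hypothetical, and real matching is far more involved than a single status lookup.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    jobid: int
    needs: str            # hypothetical: name of the one vertex the job requires
    state: str = "SCHED"  # stays SCHED until a match succeeds


@dataclass
class ToyScheduler:
    status: dict = field(default_factory=dict)   # vertex name -> "UP" or "DOWN"
    pending: list = field(default_factory=list)  # jobs still waiting in SCHED

    def submit(self, job):
        self.pending.append(job)
        self._reconsider()

    def set_status(self, vertex, status):
        # Any status change, UP or DOWN, triggers a full reconsideration pass.
        self.status[vertex] = status
        self._reconsider()

    def _reconsider(self):
        # Retry every pending job against the current vertex statuses.
        still_pending = []
        for job in self.pending:
            if self.status.get(job.needs) == "UP":
                job.state = "RUN"
            else:
                still_pending.append(job)
        self.pending = still_pending


sched = ToyScheduler(status={"rabbit0": "DOWN"})
job = Job(jobid=1, needs="rabbit0")
sched.submit(job)                  # vertex is DOWN, so the job waits in SCHED
state_while_down = job.state
sched.set_status("rabbit0", "UP")  # state change triggers reconsideration
state_after_up = job.state
```

In this model the job leaves SCHED only if the UP status actually reaches the `status` table, which matches the two suspects named above: the state change never propagating, or the match itself failing.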
Is this still reproducible @jameshcorbett? We've made the reconsideration much more robust in the last month or so.