SMART-Lab/smartdispatch

Mismatch between PBS and Moab job IDs on some clusters

Closed this issue · 4 comments

On Calcul Québec's Helios cluster, the PBS job ID assigned by Torque and the Moab job ID do not match. Right now, smart-dispatch displays and writes into job_id.txt the job ID returned by qsub (from Moab), while the workers log the PBS_ID from Torque, which is available on all servers.

Smart-dispatch should output consistent job IDs. When running on clusters like Helios, that would require using the ID returned by qsub to find the PBS ID and display it.
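As a rough sketch of the lookup step (not what smart-dispatch does today): assuming the Torque job can be recognized by its job name, the output of `qstat -f` could be parsed to recover the PBS ID. The function name `pbs_ids_by_job_name` is hypothetical, and matching by `Job_Name` is an assumption; whether the Moab ID can be correlated this way on Helios has not been verified.

```python
import re


def pbs_ids_by_job_name(qstat_full_output):
    """Map each Job_Name to the PBS job IDs found in `qstat -f` text.

    Hypothetical helper: assumes the usual `qstat -f` layout, where each
    record starts with "Job Id: <id>" followed by indented "key = value" lines.
    """
    ids_by_name = {}
    current_id = None
    for line in qstat_full_output.splitlines():
        header = re.match(r"Job Id:\s*(\S+)", line)
        if header:
            # Start of a new job record; remember its PBS ID.
            current_id = header.group(1)
            continue
        name = re.match(r"\s+Job_Name = (.+)$", line)
        if name and current_id is not None:
            ids_by_name.setdefault(name.group(1).strip(), []).append(current_id)
    return ids_by_name
```

The text passed in would come from running `qstat -f` (for example via `subprocess.check_output`), and the resulting PBS ID could then be displayed and written to job_id.txt instead of the ID returned by qsub.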

Right, I've also observed that on Helios. The IDs in job_id.txt can't be used with the qdel command, which is annoying.

I totally agree.
Before that, we should investigate whether there is a direct link between particular tools (msub vs. qsub, qstat, showq, etc.) and which job ID each one uses.
This will tell us whether we should always report both IDs or "hide" the mismatch from the user and report only one.

I did a quick check on Colosse, which also uses msub. There, qstat -f and the rest of the system output a single job ID in PBS style, as expected. I haven't found any trace of a separate Moab ID.

So from my very representative two data points, I get the feeling it's a config quirk in Helios more than a feature of msub.

Can we close this one now that #139 is merged?