daimh/sge

Info Request: SGE Implementations

Closed this issue · 4 comments

Hi Some Grid Engine Maintainers,

I am a system administrator at the University of California, San Francisco.

We run an HPC cluster based on the latest/last version of Son of Grid Engine.

https://wynton.ucsf.edu/hpc/

We've been surveying scheduler options other than Slurm and came across your GitHub project.

Our group has a couple of questions about your SGE fork.

  1. SystemD support.

"If they're running the sge_execd on the nodes via SystemD, that actually has some unintended consequences. In our version of SGE (and Altair's), you can run "service sgeexecd softstop" and that stops the currently running execd without killing the jobs on the node. That's handy if you need to restart the execd for some reason (which happens)."

  2. We have run into a problem with resource complexes, particularly as they relate to GPUs. This bug forced us to use a less-than-ideal workaround for GPU scheduling. Is this anything you have seen in your SGE cluster?

Described in this email thread from 2018:

http://gridengine.org/pipermail/users/2018-April/010116.html

More information here:

http://gridengine.org/pipermail/users/2018-April/010127.html

Thanks for any insight!

daimh commented
  1. The systemd unit file in this SGE repo doesn't change KillMode, so the command 'systemctl stop sgeexecd' will kill all running jobs. You might want to add the line 'KillMode=process' to the [Service] section of /etc/systemd/system/sgeexecd.service to see if it works for you, although the systemd documentation notes that this mode is not recommended.
  2. I would enable schedd_job_info as described in the link below, and then, if it happens again, run 'qstat -j' on the affected job to check the scheduling information.

http://www.softpanorama.org/HPC/Grid_engine/Troubleshooting/enabling_scheduling_infomation_about_jobs_in_sge_execd.shtml
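For the KillMode change in item 1, a drop-in override is one way to try the setting without editing the shipped unit file. This is only a sketch: the drop-in path and the file name override.conf are assumptions based on the unit name sgeexecd.service mentioned above, and I have not tested it against this repo's unit file.

```ini
# Assumed location: /etc/systemd/system/sgeexecd.service.d/override.conf
# (hypothetical file name; any *.conf file in that directory is read)

[Service]
# Only the main sge_execd process is signalled on stop; job processes
# in the same unit are left running. The systemd documentation advises
# against this mode, so treat it as an experiment.
KillMode=process
```

After adding the file, 'systemctl daemon-reload' makes systemd pick it up, and 'systemctl show sgeexecd.service -p KillMode' should report the value that is actually in effect.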

daimh commented

In terms of softstop, I forgot to mention that you can still run something like '/opt/sge/default/common/sgeexecd softstop' instead of using systemctl.

It is also worth mentioning that a similar result can be obtained by running jobs via SystemD (i.e. use_cgroups=systemd): each job is then started as an independent SystemD unit that has no dependency on sgeexecd.
