allow arbitrary header for submission script in DispatcherExecutor
zezhong-zhang opened this issue · 6 comments
Summary
I know this is strange, but our cluster puts a PBS layer on top of a Slurm system. That PBS layer is not exactly the same as genuine PBS, which brings several issues:
- The submission-script header and the `qstat` output have some keys missing or different, so I cannot simply use `PBS` as the `batch_type`.
- Submitting a CPU/GPU job requires loading the corresponding cluster module rather than specifying a queue partition as on a normal cluster. This means that before job submission the shell has to run something like `module swap cluster/dodrio/gpu_rome_a100`.
To address the problem, I am wondering:
- Is it possible to allow an arbitrary header in DispatcherExecutor, as in SlurmRemoteExecutor?
- Where in the code can I find the job submission and status query, so that I can change them accordingly? I modified those in `pbs.py` in dpdispatcher so that remote submission and retrieval work correctly, but I am not sure where to find the equivalents for DispatcherExecutor.
Refer to DPDispatcher's documentation:
- `custom_flags` in resources passes extra lines to the job submission script header. Specify it by `DispatcherExecutor(resources_dict={"custom_flags": ["xxx"]})`.
- `module_list` assigns modules to be loaded on the HPC system before the job command runs. Specify it by `DispatcherExecutor(resources_dict={"module_list": ["xxx"]})`.
- For a modification of dpdispatcher, the standard procedure to make it work is:
  - Build a new docker image with the modified code using the Dockerfile in dpdispatcher's repo.
  - Push the docker image to a registry, e.g. a repository on Docker Hub such as yourname/dpdispatcher:patched.
  - Specify the dispatcher image by `DispatcherExecutor(image="yourname/dpdispatcher:patched")`.
A sketch combining these options is given after this list.
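A minimal sketch of how these options could be combined; the host, account, image name and module name below are placeholders (the header lines are taken from the resources shown later in this thread), not a tested configuration:

```python
from dflow.plugins.dispatcher import DispatcherExecutor

# Sketch only: extra submission-script header lines go through "custom_flags",
# module loads through "module_list", and a patched dpdispatcher is selected
# via the image argument. All concrete values here are assumptions.
dispatcher_executor = DispatcherExecutor(
    host="login.example.org",               # placeholder login node
    username="myaccount",                   # placeholder account
    private_key_file="path_to_rsa",
    image="yourname/dpdispatcher:patched",  # patched dpdispatcher image
    resources_dict={
        "custom_flags": ["#PBS -l walltime=24:00:00", "#PBS -A project"],
        "module_list": ["gpu_module"],      # loaded in the job script before the command runs
    },
)
```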
I will close this issue. Feel free to reopen it if you still have problems.
Thanks for your advice! My apologies for the delayed follow-up.
- I patched dpdispatcher to adapt it to our cluster (https://github.com/zezhong-zhang/dpdispatcher.git@hortense) and verified that it can submit and retrieve jobs correctly. I then built the docker image and pushed it to Docker Hub.
- I initialize the DispatcherExecutor with

dispatcher_executor = DispatcherExecutor(
    host="host",
    username="account",
    private_key_file="path_to_rsa",
    resources_dict=resources_dict,
    machine_dict=machine_dict,
)
- I used the same machine file and resources file as in step 1. Since I already provide the account and private key when initializing DispatcherExecutor as above, do I still need to put them in remote_profile in the machine_dict?
resources_dict
{
    "batch_type": "PBS",
    "number_node": 1,
    "cpu_per_node": 1,
    "gpu_per_node": 0,
    "group_size": 0,
    "custom_flags": [
        "#PBS -l walltime=24:00:00",
        "#PBS -A project"
    ],
    "prepend_script": ["load_gpu_module"],
    "module_list": ["load_relevant_module"],
    "envs": {
        "OMP_NUM_THREADS": "12",
        "TF_INTRA_OP_PARALLELISM_THREADS": "1",
        "TF_INTER_OP_PARALLELISM_THREADS": "12"
    }
}
machine_dict
{
    "batch_type": "PBS",
    "context_type": "SSH",
    "remote_root": "path_to_benchmark",
    "remote_profile": {
        "hostname": "host",
        "username": "account1",
        "key_filename": "path_to_rsa",
        "port": 22
    }
}
import json
from pathlib import Path

from dflow import upload_artifact, Workflow, Step
from dflow.python import PythonOPTemplate, OP, OPIO, OPIOSign, Artifact
from dflow.plugins.dispatcher import DispatcherExecutor


class Duplicate(OP):
    def __init__(self):
        pass

    @classmethod
    def get_input_sign(cls):
        return OPIOSign({
            "msg": str,
            "num": int,
            "foo": Artifact(Path),
        })

    @classmethod
    def get_output_sign(cls):
        return OPIOSign({
            "msg": str,
            "bar": Artifact(Path),
        })

    @OP.exec_sign_check
    def execute(
        self,
        op_in: OPIO,
    ) -> OPIO:
        # Duplicate the message and the file content "num" times
        with open(op_in["foo"], "r") as f:
            content = f.read()
        with open("bar.txt", "w") as f:
            f.write(content * op_in["num"])
        op_out = OPIO({
            "msg": op_in["msg"] * op_in["num"],
            "bar": Path("bar.txt"),
        })
        return op_out


# Load the resources and machine configurations shown above
with open("resources.json") as json_file:
    resources_dict = json.load(json_file)
with open("machine.json") as json_file:
    machine_dict = json.load(json_file)

dispatcher_executor = DispatcherExecutor(
    host="host",
    username="account",
    private_key_file="path_to_rsa",
    resources_dict=resources_dict,
    machine_dict=machine_dict,
)

with open("foo.txt", "w") as f:
    f.write("Hello world!")

step = Step(
    "duplicate",
    PythonOPTemplate(Duplicate, image="dreamleadsz/dpdispatcher:hortense"),
    parameters={"msg": "Hello", "num": 2},
    artifacts={"foo": upload_artifact("foo.txt")},
    executor=dispatcher_executor,
)
wf = Workflow(name="slurm")
wf.add(step)
wf.submit()
error:
Traceback (most recent call last):
  File "/argo/staging/script", line 96, in <module>
    machine = Machine.load_from_dict(json.loads(r'{"batch_type": "PBS", "context_type": "SSH", "local_root": "/", "remote_profile": {"hostname": "tier1.hpc.ugent.be", "username": "account", "port": 22, "timeout": 10, "key_filename": "path_to_rsa"}, "remote_root": "/dodrio/scratch/users/account/2023/benchmark/md"}'))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dpdispatcher/machine.py", line 142, in load_from_dict
    context = BaseContext.load_from_dict(machine_dict)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dpdispatcher/base_context.py", line 45, in load_from_dict
    context = context_class.load_from_dict(context_dict)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dpdispatcher/ssh_context.py", line 454, in load_from_dict
    ssh_context = cls(
                  ^^^^
  File "/usr/local/lib/python3.11/site-packages/dpdispatcher/ssh_context.py", line 427, in __init__
    self.ssh_session = SSHSession(**remote_profile)  # type: ignore
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dpdispatcher/ssh_context.py", line 51, in __init__
    self._setup_ssh()
  File "/usr/local/lib/python3.11/site-packages/dpdispatcher/utils.py", line 179, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dpdispatcher/ssh_context.py", line 213, in _setup_ssh
    raise RuntimeError("Please provide at least one form of authentication")
RuntimeError: Please provide at least one form of authentication
There is no need to specify hostname, password, key_filename, port, etc. again in the machine dict. In particular, the path of the private key file inside the container is generally different from the local path; key_filename is set by dflow automatically, so specifying it in the machine dict will point to the wrong path for the key file. A minimal sketch of the resulting configuration is given below.
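A minimal sketch, assuming the placeholder values from this thread: the SSH connection details live only in the DispatcherExecutor arguments, and the machine dict keeps just the scheduler and context settings.

```python
from dflow.plugins.dispatcher import DispatcherExecutor

# Sketch only: remote_profile is dropped entirely; dflow fills in the SSH
# connection (including the in-container key path) from the arguments below.
machine_dict = {
    "batch_type": "PBS",
    "context_type": "SSH",
    "remote_root": "path_to_benchmark",
}

# Trimmed version of the resources shown earlier in this thread
resources_dict = {
    "number_node": 1,
    "cpu_per_node": 1,
    "group_size": 0,
}

dispatcher_executor = DispatcherExecutor(
    host="host",                     # placeholder from this thread
    username="account",
    private_key_file="path_to_rsa",  # local path; dflow handles the path inside the container
    machine_dict=machine_dict,
    resources_dict=resources_dict,
)
```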
Solved: the patched dispatcher image should be specified in the DispatcherExecutor, not in the PythonOPTemplate (see the sketch below). We can close this now.
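For reference, a rough sketch of the corrected placement, reusing Duplicate, resources_dict and machine_dict from the full example above (the OP image below is a placeholder assumption, not a value from this thread):

```python
from dflow import Step, upload_artifact
from dflow.python import PythonOPTemplate
from dflow.plugins.dispatcher import DispatcherExecutor

# The patched dpdispatcher image goes to the executor, which runs the
# submission logic; the OP template keeps an ordinary Python image.
dispatcher_executor = DispatcherExecutor(
    host="host",
    username="account",
    private_key_file="path_to_rsa",
    image="dreamleadsz/dpdispatcher:hortense",  # patched dpdispatcher
    resources_dict=resources_dict,
    machine_dict=machine_dict,
)

step = Step(
    "duplicate",
    PythonOPTemplate(Duplicate, image="python:3.8"),  # placeholder OP image, not the patched dispatcher
    parameters={"msg": "Hello", "num": 2},
    artifacts={"foo": upload_artifact("foo.txt")},
    executor=dispatcher_executor,
)
```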