radical-collaboration/hpc-workflows

EnTK hangs on Traverse when using multiple Nodes

lsawade opened this issue · 69 comments

Hi,

I don't know whether this is related to #135. It is weird because I got everything running on a single node, but as soon as I use more than one, EnTK seems to hang. I checked the submission script and it looks fine to me; so does the node list.

The workflow already hangs on submission of the first task, which is a single-core, single-thread task.

EnTK session: re.session.traverse.princeton.edu.lsawade.018666.0003
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.traverse.princeton.edu.lsawade.018666.0003]           \
database   : [mongodb://specfm:****@129.114.17.185/specfm]                    ok
create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   princeton.traverse       90 cores      12 gpus           ok
All components created
create unit managerUpdate: pipeline.0000 state: SCHEDULING
Update: pipeline.0000.WriteSourcesStage state: SCHEDULING
Update: pipeline.0000.WriteSourcesStage.WriteSourcesTask state: SCHEDULING
Update: pipeline.0000.WriteSourcesStage.WriteSourcesTask state: SCHEDULED
Update: pipeline.0000.WriteSourcesStage state: SCHEDULED
MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
                                                           ok
submit: ########################################################################
Update: pipeline.0000.WriteSourcesStage.WriteSourcesTask state: SUBMITTING

[Ctrl + C]

close unit manager                                                            ok
...

Stack

  python               : /home/lsawade/.conda/envs/ve-entk/bin/python3
  pythonpath           : 
  version              : 3.8.2
  virtualenv           : ve-entk

  radical.entk         : 1.5.12-v1.5.12@HEAD-detached-at-v1.5.12
  radical.gtod         : 1.5.0
  radical.pilot        : 1.5.12
  radical.saga         : 1.5.9
  radical.utils        : 1.5.12

Client zip

client.session.zip

Session zip

sandbox.session.zip

Hi @lsawade - this is a surprising one. The task's stderr shows:

$ cat *err
srun: Job 126172 step creation temporarily disabled, retrying (Requested nodes are busy)

This one does look like a slurm problem. Is this reproducible?

Reproduced! The step-creation message appears after a while: I kept checking the task's error file, and eventually the message showed up.

@lsawade, would you please open a ticket with Traverse support? Maybe our srun command is not well-formed for Traverse's Slurm installation? Please include the srun command:

/usr/bin/srun --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --gpus-per-task 0 --nodelist=/scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.traverse.princeton.edu.lsawade.018666.0003/pilot.0000/unit.000000//unit.000000.nodes --export=ALL,NODE_LFS_PATH="/tmp" write-sources "-f" "/tigress/lsawade/entkdatabase/C200709121110A/C200709121110A.cmt" "-p" "/home/lsawade/gcmt3d/workflow/params" 

and the nodelist file which just contains:

traverse-k04g10

It throws the following error:

srun: error: Unable to create step for job 126202: Requested node configuration is not available

If I take out the nodelist argument, it runs.

Hmm, is that node name not valid somehow?

I tried running it with the nodename as a string and that worked

/usr/bin/srun --nodelist=traverse-k05g10 --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --gpus-per-task 0 --export=ALL,NODE_LFS_PATH="/tmp" write-sources "-f" "/tigress/lsawade/entkdatabase/C200709121110A/C200709121110A.cmt" "-p" "/home/lsawade/gcmt3d/workflow/params"

Note that I'm using salloc and hence a different nodename

I found the solution. When SLURM takes in a file for a nodelist, one has to use the node file option:

/usr/bin/srun --nodefile=nodelistfile --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --gpus-per-task 0 --export=ALL,NODE_LFS_PATH="/tmp" write-sources "-f" "/tigress/lsawade/entkdatabase/C200709121110A/C200709121110A.cmt" "-p" "/home/lsawade/gcmt3d/workflow/params"
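
As a rough illustration of the distinction found above, the choice between the two options could be sketched in Python as follows. This is not RADICAL-Pilot's actual launcher code; the helper name `node_selection_args` is hypothetical, and it assumes host names are passed to `--nodelist` while a path to a file with one host per line is passed to `--nodefile`.

```python
import os


def node_selection_args(nodes):
    """Return the srun arguments that select nodes.

    `nodes` is either a list/tuple of host names or a path to a file
    listing one host name per line (hypothetical helper, for illustration).
    """
    if isinstance(nodes, (list, tuple)):
        # Host names are passed inline as a comma-separated string.
        return ["--nodelist=" + ",".join(nodes)]
    # Anything else is treated as a path to a node file.
    return ["--nodefile=" + os.fspath(nodes)]


print(node_selection_args(["traverse-k05g10"]))
print(node_selection_args("nodelistfile"))
```

Keeping the two cases in separate branches makes it harder to hand a file path to the option that expects host names.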

Oh! Thanks for tracking that down, we'll fix this!

It is puzzling, though, that srun doesn't throw an error. When I do it by hand, srun throws an error when I feed a nodelist file to the --nodelist= option.

@lsawade : the fix has been released, please let us know if that problem still happens!

@andre-merzky, will test!

Sorry for the extraordinarily late feedback, but the issue seems to persist. It already hangs in the Hello World task.
Did I update correctly?


My stack:

  python               : /home/lsawade/.conda/envs/ve-entk/bin/python3
  pythonpath           : 
  version              : 3.8.2
  virtualenv           : ve-entk

  radical.entk         : 1.6.0
  radical.gtod         : 1.5.0
  radical.pilot        : 1.6.2
  radical.saga         : 1.6.1
  radical.utils        : 1.6.2

My script:

from radical.entk import Pipeline, Stage, Task, AppManager
import os  # traceback and sys were imported but never used


hostname = os.environ.get('RMQ_HOSTNAME', 'localhost')
port = int(os.environ.get('RMQ_PORT', 5672))
password = os.environ.get('RMQ_PASSWORD', None)
username = os.environ.get('RMQ_USERNAME', None)


specfem = "/scratch/gpfs/lsawade/MagicScripts/specfem3d_globe"

if __name__ == '__main__':
    p = Pipeline()

    # Hello World########################################################
    test_stage = Stage()
    test_stage.name = "HelloWorldStage"

    # Create 'Hello world' task
    t = Task()
    t.cpu_reqs = {'cpu_processes': 1, 'cpu_process_type': None, 'cpu_threads': 1, 'cpu_thread_type': None}
    t.pre_exec = ['module load openmpi/gcc']
    t.name = "HelloWorldTask"
    t.executable = '/bin/echo'
    t.arguments = ['Hello world!']
    t.download_output_data = ['STDOUT', 'STDERR']

    # Add task to stage and stage to pipeline
    test_stage.add_tasks(t)
    p.add_stages(test_stage)

    #########################################################    
    specfem_stage = Stage()
    specfem_stage.name = 'SimulationStage'
    
    for i in range(2):

        # Create Task
        t = Task()
        t.name = f"SIMULATION.{i}"
        tdir = f"/home/lsawade/simple_entk_specfem/specfem_run_{i}"
        t.pre_exec = [
            # Load necessary modules
            'module load openmpi/gcc',
            'module load cudatoolkit/11.0',
            
            # Change to your specfem run directory
            f'rm -rf {tdir}',
            f'mkdir {tdir}',
            f'cd {tdir}',
            
            # Create data structure in place
            f'ln -s {specfem}/bin .',
            f'ln -s {specfem}/DATABASES_MPI .',
            f'cp -r {specfem}/OUTPUT_FILES .',
            'mkdir DATA',
            f'cp {specfem}/DATA/CMTSOLUTION ./DATA/',
            f'cp {specfem}/DATA/STATIONS ./DATA/',
            f'cp {specfem}/DATA/Par_file ./DATA/'
        ]
        t.executable = './bin/xspecfem3D'
        t.cpu_reqs = {'cpu_processes': 4, 'cpu_process_type': 'MPI', 'cpu_threads': 1, 'cpu_thread_type' : 'OpenMP'}
        t.gpu_reqs = {'gpu_processes': 4, 'gpu_process_type': 'MPI', 'gpu_threads': 1, 'gpu_thread_type' : 'CUDA'}
        t.download_output_data = ['STDOUT', 'STDERR']

        # Add task to stage
        specfem_stage.add_tasks(t)
        
    p.add_stages(specfem_stage)
        
    res_dict = {
        'resource': 'princeton.traverse', # 'local.localhost',
        'schema'   : 'local',
        'walltime':  20, #2 * 30,
        'cpus': 16, #2 * 10 * 1,
        'gpus': 8, #2 * 4 * 2,          
    }

    appman = AppManager(hostname=hostname, port=port, username=username, password=password, resubmit_failed=False)
    appman.resource_desc = res_dict
    appman.workflow = set([p])
    appman.run()        
    

Tarball:

sandbox.tar.gz

Bugger... - the code though is using --nodefile=:

$ grep srun task.0000.sh
task.0000.sh:/usr/bin/srun --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --gpus-per-task 0 --nodefile=/scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.traverse.princeton.edu.lsawade.018719.0005/pilot.0000/task.0000//task.0000.nodes --export=ALL,NODE_LFS_PATH="/tmp" /bin/echo "Hello world!" 

but that task indeed never returns. Does that line work on an interactive node? FWIW, task.0000.nodes contains:

$ cat task.0000.nodes 
traverse-k02g1

Yes, it works in interactive mode once I change the nodefile to the node I land on.

Edit: In my interactive job I'm using one node only, let me try with two...

Update

It also works when using two nodes in the interactive job and editing task.0000.nodes to contain one of the accessible nodes. Either node works, so this does not seem to be the problem.

Hmm, where does that leave us... - so it is not the srun command format which is at fault after all?

Can you switch your workload to, say, /bin/date to make sure we are not looking at the wrong place, and that the application code behaves as expected when we run under EnTK?

Would you mind running one more test: interactively get two nodes, and run the command against the node other than the one you land on.

See Update above

You should see the allocated nodes via cat $SLURM_NODEFILE or something like that (env | grep SLURM will be helpful)

echo $SLURM_NODELIST works, I don't seem to have the nodefile environment variable.
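
Since only the compact $SLURM_NODELIST is available here, a small Python sketch of expanding it into individual host names may help when writing node files by hand. This is a simplified illustration that assumes at most one bracket expression per host group; the function name `expand_hostlist` is hypothetical, and on a real cluster `scontrol show hostnames "$SLURM_NODELIST"` remains the authoritative tool.

```python
import re


def expand_hostlist(hostlist):
    """Expand a compact Slurm host list such as 'node[1-3,7],other'."""
    hosts = []
    # Split on commas that are not inside square brackets.
    for part in re.findall(r"[^,\[]+(?:\[[^\]]*\])?[^,\[]*", hostlist):
        m = re.fullmatch(r"(.*?)\[([^\]]*)\](.*)", part)
        if not m:
            hosts.append(part)
            continue
        prefix, ranges, suffix = m.groups()
        for item in ranges.split(","):
            lo, _, hi = item.partition("-")
            hi = hi or lo
            width = len(lo)  # keep zero padding, e.g. 'n[08-10]'
            for i in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{i:0{width}d}{suffix}")
    return hosts


print(expand_hostlist("traverse-k04g[7,9],traverse-k05g10"))
```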

What do you mean by switching my workload to /bin/date?

I also tested running the entire task.0000.sh in interactive mode, and it had no problem.

Slurm on Traverse seems to be working in a strange way. Lucas is in contact with the research service at Princeton.

Two things that have come up:

  1. The srun command needs a -G0 flag (no GPUs) if a non-GPU task is executed with a resource set that contains GPUs. The command only hangs if the resource set contains GPUs; otherwise it runs.
  2. Make sure your print statement does not contain an unescaped "!". My hello world task also ran into issues because I didn't properly escape the "!" in "Hello, World!". Use "Hello, World\!" instead. (facepalm)
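
Both points can be sketched together in Python. This is only an illustration under stated assumptions, not the fix that went into RP: the helper name `build_srun` is hypothetical, `-G0` is taken from the discussion above, and `shlex.quote` is used as one way to keep characters such as "!" away from the shell by single-quoting the argument.

```python
import shlex


def build_srun(executable, args, gpus_per_task=0):
    """Assemble a shell-safe srun command line for a single-process task."""
    cmd = ["srun", "--ntasks", "1", "--cpus-per-task", "1"]
    if gpus_per_task == 0:
        # Explicitly request zero GPUs for CPU-only tasks (see point 1).
        cmd.append("-G0")
    else:
        cmd += ["--gpus-per-task", str(gpus_per_task)]
    cmd.append(executable)
    cmd += list(args)
    # Single-quote anything that contains shell-special characters (point 2).
    return " ".join(shlex.quote(c) for c in cmd)


print(build_srun("/bin/echo", ["Hello, World!"]))
```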

Most quick debugging discussions were held on Slack, but here is a summary for posterity:
@andre-merzky published a quick fix for the srun command on one of the RP branches (radical-cybertools/radical.pilot@aee4fb8), but EnTK now raises an AttributeError when calling into the pilot.


Error

EnTK session: re.session.traverse.princeton.edu.lsawade.018720.0001
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.traverse.princeton.edu.lsawade.018720.0001]           \
database   : [mongodb://specfm:****@129.114.17.185/specfm]                    ok
All components terminated
Traceback (most recent call last):
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/execman/rp/resource_manager.py", line 147, in _submit_resource_request
    self._pmgr    = rp.PilotManager(session=self._session)
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/pilot/pilot_manager.py", line 93, in __init__
    self._pilots_lock = ru.RLock('%s.pilots_lock' % self._uid)
AttributeError: 'PilotManager' object has no attribute '_uid'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/appman/appmanager.py", line 428, in run
    self._rmgr._submit_resource_request()
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/execman/rp/resource_manager.py", line 194, in _submit_resource_request
    raise EnTKError(ex) from ex
radical.entk.exceptions.EnTKError: 'PilotManager' object has no attribute '_uid'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "solver.py", line 104, in <module>
    appman.run()        
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/appman/appmanager.py", line 459, in run
    raise EnTKError(ex) from ex
radical.entk.exceptions.EnTKError: 'PilotManager' object has no attribute '_uid'

Stack

  python               : /home/lsawade/.conda/envs/ve-entk/bin/python3
  pythonpath           : 
  version              : 3.8.2
  virtualenv           : ve-entk
  radical.entk         : 1.6.0
  radical.gtod         : 1.5.0
  radical.pilot        : 1.6.2-v1.6.2-78-gaee4fb886@fix-hpc_wf_138
  radical.saga         : 1.6.1
  radical.utils        : 1.6.2

My apologies, that error is now fixed in RP.

Getting a new one again!

EnTK session: re.session.traverse.princeton.edu.lsawade.018720.0008
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.traverse.princeton.edu.lsawade.018720.0008]           \
database   : [mongodb://specfm:****@129.114.17.185/specfm]                    ok
create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   princeton.traverse       16 cores       8 gpus           ok
closing session re.session.traverse.princeton.edu.lsawade.018720.0008          \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
session lifetime: 16.1s                                                       ok
wait for 1 pilot(s)
              0                                                          timeout
All components terminated
Traceback (most recent call last):
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/appman/appmanager.py", line 428, in run
    self._rmgr._submit_resource_request()
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/execman/rp/resource_manager.py", line 177, in _submit_resource_request
    self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
  File "/home/lsawade/thirdparty/python/radical.pilot/src/radical/pilot/pilot.py", line 558, in wait
    time.sleep(0.1)
KeyboardInterrupt

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "solver.py", line 104, in <module>
    appman.run()        
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/appman/appmanager.py", line 453, in run
    raise KeyboardInterrupt from ex
KeyboardInterrupt

So the way I install entk and the pilot at the moment is as follows:

# Install EnTK
conda create -n conda-entk python=3.7 -c conda-forge -y
conda activate conda-entk
pip install radical.entk

(Note, I'm not changing pilot here and just keep the default one.)

Then I get the radical.pilot repo to create the static ve.rp. Log out, log in:

# Create environment
module load anaconda3
conda create -n ve -y python=3.7
conda activate ve

# Install Pilot
git clone git@github.com:radical-cybertools/radical.pilot.git
cd radical.pilot
pip install .

# Create static environment
./bin/radical-pilot-create-static-ve -p /scratch/gpfs/$USER/ve.rp/

Log out, Log in:

conda activate conda-entk
python workflow.py

Is there any news here?

Alright, I got the workflow manager to -- at least -- run. Not hanging, yay

One of the issues is that when I create the static environment using radical-pilot-create-static-ve, it does not install any dependencies, so I installed all requirements into the ve.rp.

However, I'm sort of back to square one. A serial task executes, task.0000.out has "Hello World" in it, and the log shows that task.0000 returns with a 0 exit code, but the workflow manager still reports it as failed, and task.0000.err contains the following line:

cpu-bind=MASK - traverse-k01g10, task 0 0 [140895]: mask 0xf set

I'll attach the tarball.

It is also important to state that the Manager seems to drop scheduling other jobs upon failure of the first task. I wasn't able to find anything about it in the log.


sandbox.tar.gz

hi @lsawade , that message in err file looks like a verbose message and doesn't indicate an error

and some additional comments:
(a) radical-pilot-create-static-ve: for dependencies there is an extra option -d (== set default modules)
(b) if for running your workflow you use a shared FS (using the same virtual env for client and pilot), then you can set that in the resource config, e.g.,

        "python_dist"                 : "anaconda",  # there are two options: "default" or "anaconda" (for conda env)
        "virtenv_mode"                : "use",
        "virtenv"                     : <name/path>,  # better to use a full path
        "rp_version"                  : "installed",  # if RCT packages are pre-installed

@andre-merzky, just as a side comment, with pre-set resource configs should we set default value for python_dist as anaconda? (since there is module load anaconda3 in pre_bootstrap_0)

Hi @mtitov, thanks for getting back to me. Aah, I missed that when installing the static-ve.

that message in err file looks like a verbose message and doesn't indicate an error

That's what I thought, too. I mean, the task finishes successfully (STDOUT is fine). It just flags itself as failed when I run the appmanager. So, I'm a bit unsure why the task fails.

Yeah, that is what I missed: the task ends in the final state FAILED, and it gets there after TMGR_STAGING_OUTPUT, so I assume something went wrong on the client side. @lsawade can you please attach the client sandbox as well?

Sorry I only saw the notification now, attached the corresponding client sandbox.


client_sandbox.tar.gz

hi @lsawade, thank you for the sandbox. It looks like the issue is with the name of the output: by default RP now names the task output <task_uid>.out (and similarly for the err file), whereas before it was STDOUT for all tasks. For now, if you want to collect the outputs without using task IDs, the output file name can be set explicitly:

t = Task()
t.stdout = 'STDOUT'
...
t.download_output_data = ['STDOUT']

(*) With your run everything went fine, just at the end TaskManager couldn't collect STDOUT
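
The naming rule described above can be sketched as a tiny helper for deciding which files to list for download. This is an illustration only, not RP code; the function name `staged_output_names` is hypothetical, and the `.err` default is assumed to mirror the `.out` one as stated above.

```python
def staged_output_names(task_uid, stdout=None, stderr=None):
    """Return the [stdout, stderr] file names a task is expected to produce.

    Explicitly set names win; otherwise fall back to '<task_uid>.out'
    and '<task_uid>.err'.
    """
    out = stdout if stdout else f"{task_uid}.out"
    err = stderr if stderr else f"{task_uid}.err"
    return [out, err]


print(staged_output_names("task.0000"))
print(staged_output_names("task.0000", stdout="STDOUT"))
```

Requesting 'STDOUT' for download without also setting the task's stdout name would ask for a file that is never written, which matches the failure seen at the end of the run.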

Lord, if that ends up being the final issue, that would be wild... Let me test this later today, and I will get back to you!

So, I tested stuff yesterday, and things seem to work out! There is one catch that is probably solvable. When I need GPUs from different nodes, I feel like the mpirun in task.000x.sh has to fail because it does not know which GPUs to use. That is, I want to run 2 instances of specfem simultaneously, each needing 6 GPUs, but I only have 4 GPUs per node and am running on 3 nodes (12 GPUs total). That means the two tasks overlap on one node, and I am not sure mpirun can handle that by itself.

Task 1:

mpirun  -np 6  -host traverse-k04g9,traverse-k04g9,traverse-k05g10,traverse-k05g10,traverse-k05g10,traverse-k05g10 -x ...

Task 2:

mpirun  -np 6  -host traverse-k04g7,traverse-k04g7,traverse-k04g7,traverse-k04g7,traverse-k04g9,traverse-k04g9  -x ... 

Note that both use traverse-k04g9 but in the rest of the executable, there is no sign of which GPU is supposed to be used, and both Tasks never execute.
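
A quick way to check whether the two host lists at least fit into the allocation is to count ranks per node, sketched here in Python. The host strings are copied from the two mpirun lines above and the limit of 4 GPUs per node is the one stated for this setup; the helper name `slots_per_node` is hypothetical. The count says nothing about which GPU index each rank ends up using, which is the open question.

```python
from collections import Counter

GPUS_PER_NODE = 4  # as stated for this allocation

task1 = ("traverse-k04g9,traverse-k04g9,traverse-k05g10,"
         "traverse-k05g10,traverse-k05g10,traverse-k05g10")
task2 = ("traverse-k04g7,traverse-k04g7,traverse-k04g7,"
         "traverse-k04g7,traverse-k04g9,traverse-k04g9")


def slots_per_node(*host_args):
    """Count how many ranks each node receives across all -host lists."""
    usage = Counter()
    for arg in host_args:
        usage.update(arg.split(","))
    return usage


usage = slots_per_node(task1, task2)
oversubscribed = {n: c for n, c in usage.items() if c > GPUS_PER_NODE}
print(dict(usage))
print("oversubscribed:", oversubscribed)
```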

Update:

I tried to run the mpirun line in interactive mode and it hangs. I do not know why. It even hangs when I do not specify the nodes. But(!), this one does not:

srun -n 6 --gpus-per-task=1 ./bin/xspecfem3D

Just a quick update. I'm still looking for a workaround and am in contact with the Research Computing people here.

srun -n 6 --gpus=6 <some test task>

works, but when I do

srun -n 6 --gpus=6 ./bin/xspecfem3D

it doesn't. Very curious, but I'm on it, and will put more info here eventually.

Just a quick update: the commands described above are executed differently depending on which Princeton cluster they run on, which means it will be hard to generalize Slurm submission. I have been talking to people from PICSciE, and there is no obvious solution right now. I will get back here once I have more info.

@lsawade to provide an example that we can test on other clusters with SLURM.

I cannot test whether this would work, but below is an example that I expect to work if Slurm is configured correctly.

The jobs submit, just not in parallel. This submission setup is for 3 nodes with 4 GPUs each, and two GPU-requiring sruns have to be executed, each with 6 tasks and 1 GPU per task. For this setup to run in parallel, the two sruns would have to share a node.

Let the batch script be:

#!/bin/bash
#SBATCH -t00:05:00
#SBATCH --gpus 12
#SBATCH -n 12
#SBATCH --output=mixed_gpu.txt

module load openmpi/gcc cudatoolkit

srun -n 6 --cpus-per-task 1 --gpus-per-task 1 show_devices.sh 0 &
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 show_devices.sh 1 &

wait

where show_devices.sh:

#!/bin/bash

echo Script $1
echo JOB $SLURM_JOB_ID STEP $SLURM_STEP_ID 
echo $CUDA_VISIBLE_DEVICES
sleep 60

The output in mixed_gpu.txt should look somewhat like this:

Script 0
Script 0
JOB 203751 STEP 0
JOB 203751 STEP 0
0,1
0,1
Script 0
Script 0
Script 0
JOB 203751 STEP 0
0,1
JOB 203751 STEP 0
0,1
Script 0
JOB 203751 STEP 0
JOB 203751 STEP 0
0,1
0,1
srun: Step created for job 203751
Script 1
Script 1
JOB 203751 STEP 1
Script 1
JOB 203751 STEP 1
0,1
0,1
Script 1
JOB 203751 STEP 1
JOB 203751 STEP 1
0,1
0,1
Script 1
Script 1
JOB 203751 STEP 1
JOB 203751 STEP 1
0,1
0,1

and the job steps 203755.0 and 203755.1 should start at roughly the same time, unlike here:

sacct --format JobID%20,Start,End,Elapsed,ReqCPUS,JobName%20, -j 203755
               JobID               Start                 End    Elapsed  ReqCPUS              JobName 
-------------------- ------------------- ------------------- ---------- -------- -------------------- 
              203755 2021-06-25T14:51:42 2021-06-25T14:53:45   00:02:03        4         testslurm.sh 
        203755.batch 2021-06-25T14:51:42 2021-06-25T14:53:45   00:02:03        4                batch 
       203755.extern 2021-06-25T14:51:42 2021-06-25T14:53:45   00:02:03        8               extern 
            203755.0 2021-06-25T14:51:43 2021-06-25T14:52:44   00:01:01        8      show_devices.sh 
            203755.1 2021-06-25T14:52:44 2021-06-25T14:53:45   00:01:01        8      show_devices.sh 
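
Whether two job steps actually ran side by side can be read off the Start and End columns. A minimal Python sketch, using the two step rows from the sacct output above (the function name `overlap_seconds` is hypothetical):

```python
from datetime import datetime

# (Start, End) of the two job steps, copied from the sacct output above.
steps = {
    "203755.0": ("2021-06-25T14:51:43", "2021-06-25T14:52:44"),
    "203755.1": ("2021-06-25T14:52:44", "2021-06-25T14:53:45"),
}


def overlap_seconds(a, b):
    """Return how many seconds two (start, end) intervals overlap."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    a_start, a_end = (datetime.strptime(t, fmt) for t in a)
    b_start, b_end = (datetime.strptime(t, fmt) for t in b)
    latest_start = max(a_start, b_start)
    earliest_end = min(a_end, b_end)
    return max(0.0, (earliest_end - latest_start).total_seconds())


print(overlap_seconds(steps["203755.0"], steps["203755.1"]))
```

An overlap of zero seconds means the second step started only when the first one ended, i.e. the steps were serialized.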

ping

Hi @lsawade, I will have time on Friday to work on this and hope to have results back before our call.

Hey @lsawade - the reason for the behavior eludes me completely. I can confirm that the same is observed on at least one other Slurm cluster (expanse @ SDSC), and I opened a ticket there to hopefully get some useful feedback. At the moment I simply don't know how we can possibly resolve this. I am really sorry about that; I understand that this has been blocking progress for several months now :-/

Yeah, I have had a really long thread with the people from the research computing group and they did not understand why this is not working either. Maybe we should contact the slurm people?

Yes, I think we should resort to that. I'll open a ticket if the XSEDE support is not able to suggest a solution within a week.

We got some useful feedback from XSEDE after all: slurm seems indeed to be unable to do correct auto-placement for non-node-local tasks. I find this surprising, and it may still be worthwhile to open a slurm ticket about this. Either way though: a workaround is to start the job with a specific node file. From your example above:

srun -n 6 --cpus-per-task 1 --gpus-per-task 1 show_devices.sh 0 &
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 show_devices.sh 1 &

should work as expected with

export SLURM_HOSTFILE=host1.list
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 --distribution=arbitrary show_devices.sh 0 &
export SLURM_HOSTFILE=host2.list
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 --distribution=arbitrary show_devices.sh 1 &

where the host files look like, for example:

$ cat host2.list
exp-1-57
exp-1-57
exp-6-58
exp-6-58
exp-6-58
exp-6-58
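
For reference, writing such a host file and pointing srun at it could be sketched in Python as follows. The helper names `write_hostfile` and `srun_env` are hypothetical; the only Slurm-specific pieces are the SLURM_HOSTFILE variable and the one-host-per-task file layout shown above. The returned environment would be passed to whatever launches srun with --distribution=arbitrary.

```python
import os
import tempfile


def write_hostfile(path, placement):
    """Write one host name per task; `placement` is [(host, n_tasks), ...]."""
    lines = [host for host, n in placement for _ in range(n)]
    with open(path, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    return lines


def srun_env(hostfile):
    """Return a copy of the environment with SLURM_HOSTFILE set."""
    env = dict(os.environ)
    env["SLURM_HOSTFILE"] = hostfile
    return env


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "host2.list")
    write_hostfile(path, [("exp-1-57", 2), ("exp-6-58", 4)])
    with open(path) as fh:
        print(fh.read(), end="")
    env = srun_env(path)  # pass as env= when launching srun
```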

Now, that brings us back to RP / EnTK: we actually do use a hostfile, we just omit the --distribution=arbitrary flag. Before we include that, could you please confirm that the above does in fact work on Traverse?

Hi @andre-merzky,

I have been playing with this and I can't seem to get it to work. I explain what I do here:
https://github.com/lsawade/slurm-job-step-shared-res

I'm not sure whether it's me or Traverse.

Can you adjust this mini example to see whether it runs on XSEDE? Things you would have to change are the automatic writing of the hostfile and how many tasks per job step. If you give me the hardware setup of XSEDE, I could also adjust the script and give you something that should run out of the box to check.

The hardware setup on Expanse is really similar to Traverse: 4 GPUs/node.

I pasted something incorrect above, apologies! Too many scripts lying around :-/ The --gpus=6 flag was missing. Here should be the correct one, showing the same syntax working for both cyclic and block:

This is the original script:

$ cat test2.slurm
#!/bin/bash

#SBATCH -t00:10:00
#SBATCH --account UNC100
#SBATCH --nodes 3
#SBATCH --gpus 12
#SBATCH -n 12
#SBATCH --output=test2.out
#SBATCH --error=test2.out

my_srun() {
  export SLURM_HOSTFILE="$1"
  srun -n 6 --gpus=6 --cpus-per-task=1 --gpus-per-task=1 --distribution=arbitrary show_devices.sh 
}

cyclic() {
  scontrol show hostnames "${SLURM_JOB_NODELIST}"  > host1.cyclic.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" >> host1.cyclic.list
  
  scontrol show hostnames "${SLURM_JOB_NODELIST}"  > host2.cyclic.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" >> host2.cyclic.list
  
  my_srun host1.cyclic.list > cyclic.1.out 2>&1 &
  my_srun host2.cyclic.list > cyclic.2.out 2>&1 &
  wait
}

block() {
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 | tail -1  > host1.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 | tail -1 >> host1.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 | tail -1 >> host1.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 | tail -1 >> host1.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -2 | tail -1 >> host1.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -2 | tail -1 >> host1.block.list
  
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -2 | tail -1  > host2.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -2 | tail -1 >> host2.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -3 | tail -1 >> host2.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -3 | tail -1 >> host2.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -3 | tail -1 >> host2.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -3 | tail -1 >> host2.block.list
  
  my_srun host1.block.list > block.1.out 2>&1 &
  my_srun host2.block.list > block.2.out 2>&1 &
  wait
}

block
cyclic

These are the resulting node files:

$ for f in *list; do echo $f; cat $f; echo; done
host1.block.list
exp-6-57
exp-6-57
exp-6-57
exp-6-57
exp-6-59
exp-6-59

host1.cyclic.list
exp-6-57
exp-6-59
exp-10-58
exp-6-57
exp-6-59
exp-10-58

host2.block.list
exp-6-59
exp-6-59
exp-10-58
exp-10-58
exp-10-58
exp-10-58

host2.cyclic.list
exp-6-57
exp-6-59
exp-10-58
exp-6-57
exp-6-59
exp-10-58
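
The block and cyclic node files above follow a simple pattern, which can be sketched in Python. This is an illustration rather than a replacement for the shell functions: the function names are hypothetical, the node names are the ones from this Expanse run, and 4 slots per node is assumed to match the 4 GPUs per node.

```python
def cyclic_layout(nodes, n_tasks):
    """Assign tasks round-robin over the nodes."""
    return [nodes[i % len(nodes)] for i in range(n_tasks)]


def block_layout(nodes, slots_per_node, n_tasks, step_index):
    """Fill nodes one after another; each step takes the next n_tasks slots."""
    slots = [node for node in nodes for _ in range(slots_per_node)]
    start = step_index * n_tasks
    return slots[start:start + n_tasks]


nodes = ["exp-6-57", "exp-6-59", "exp-10-58"]
print(block_layout(nodes, 4, 6, 0))   # host1.block.list
print(block_layout(nodes, 4, 6, 1))   # host2.block.list
print(cyclic_layout(nodes, 6))        # host1/host2.cyclic.list
```

In the block case the two steps share the middle node, which is exactly the situation that automatic placement failed to handle.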

These are the resulting outputs:

$ for f in *out; do echo $f; cat $f; echo; done
block.1.out
6664389.1.2 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-57 : 0,1,2,3
6664389.1.1 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-57 : 0,1,2,3
6664389.1.0 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-57 : 0,1,2,3
6664389.1.3 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-57 : 0,1,2,3
6664389.1.5 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-59 : 0,1
6664389.1.4 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-59 : 0,1
6664389.1.2 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.1.1 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.1.0 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.1.3 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.1.4 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.1.5 STOP  Mon Oct 25 02:54:52 PDT 2021

block.2.out
6664389.0.1 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-59 : 0,1
6664389.0.2 START Mon Oct 25 02:54:42 PDT 2021 @ exp-10-58 : 0,1,2,3
6664389.0.0 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-59 : 0,1
6664389.0.3 START Mon Oct 25 02:54:42 PDT 2021 @ exp-10-58 : 0,1,2,3
6664389.0.5 START Mon Oct 25 02:54:42 PDT 2021 @ exp-10-58 : 0,1,2,3
6664389.0.4 START Mon Oct 25 02:54:42 PDT 2021 @ exp-10-58 : 0,1,2,3
6664389.0.0 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.0.1 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.0.4 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.0.3 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.0.5 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.0.2 STOP  Mon Oct 25 02:54:52 PDT 2021

cyclic.1.out
6664389.2.3 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-57 : 0,1
6664389.2.2 START Mon Oct 25 02:54:52 PDT 2021 @ exp-10-58 : 0,1
6664389.2.4 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-59 : 0,1
6664389.2.5 START Mon Oct 25 02:54:52 PDT 2021 @ exp-10-58 : 0,1
6664389.2.0 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-57 : 0,1
6664389.2.1 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-59 : 0,1
6664389.2.3 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.2.2 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.2.4 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.2.0 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.2.1 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.2.5 STOP  Mon Oct 25 02:55:02 PDT 2021

cyclic.2.out
6664389.3.3 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-57 : 0,1
6664389.3.5 START Mon Oct 25 02:54:52 PDT 2021 @ exp-10-58 : 0,1
6664389.3.4 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-59 : 0,1
6664389.3.0 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-57 : 0,1
6664389.3.2 START Mon Oct 25 02:54:52 PDT 2021 @ exp-10-58 : 0,1
6664389.3.1 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-59 : 0,1
6664389.3.5 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.3.3 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.3.4 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.3.2 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.3.1 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.3.0 STOP  Mon Oct 25 02:55:02 PDT 2021

So, I have some good news: I have also tested this on Andes, and it definitely works there as well. I added a batch_andes.sh batch script to the repo to test the arbitrary distribution for cyclic and block with nodes [1,2], [1,2] and [1], [1,2,2], respectively.

The annoying news is that it does not seem to work on Traverse. At least I was able to rule out a user error...

So, how do we proceed? I'm sure it's a setting in the slurm setup. Do we open a ticket with the Andes/Expanse support? I'll for sure open a ticket with PICSciE and see whether they can find a solution.

UPDATE:

The unexpected/unwanted output on Traverse:

block.1.out
srun: Job 258710 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 258710
258710.3.3 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g2: 0
258710.3.2 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g2: 0
258710.3.0 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g2: 0
258710.3.1 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g2: 0
258710.3.5 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g3: 0
258710.3.4 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g3: 0
258710.3.0 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.1 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.2 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.3 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.4 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.5 STOP Mon Oct 25 19:41:25 EDT 2021
block.2.out
258710.2.0 START Mon Oct 25 19:39:24 EDT 2021 @ traverse-k05g3: 0
258710.2.1 START Mon Oct 25 19:39:24 EDT 2021 @ traverse-k05g3: 0
258710.2.0 STOP Mon Oct 25 19:40:24 EDT 2021
258710.2.1 STOP Mon Oct 25 19:40:24 EDT 2021
cyclic.1.out
258710.0.1 START Mon Oct 25 19:37:23 EDT 2021 @ traverse-k05g3: 0
258710.0.0 START Mon Oct 25 19:37:23 EDT 2021 @ traverse-k05g2: 0
258710.0.3 START Mon Oct 25 19:37:23 EDT 2021 @ traverse-k05g3: 0
258710.0.2 START Mon Oct 25 19:37:23 EDT 2021 @ traverse-k05g2: 0
258710.0.1 STOP Mon Oct 25 19:38:23 EDT 2021
258710.0.3 STOP Mon Oct 25 19:38:23 EDT 2021
258710.0.0 STOP Mon Oct 25 19:38:23 EDT 2021
258710.0.2 STOP Mon Oct 25 19:38:23 EDT 2021
cyclic.2.out
srun: Job 258710 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 258710
258710.1.0 START Mon Oct 25 19:38:24 EDT 2021 @ traverse-k05g2: 0
258710.1.1 START Mon Oct 25 19:38:24 EDT 2021 @ traverse-k05g3: 0
258710.1.2 START Mon Oct 25 19:38:24 EDT 2021 @ traverse-k05g2: 0
258710.1.3 START Mon Oct 25 19:38:24 EDT 2021 @ traverse-k05g3: 0
258710.1.0 STOP Mon Oct 25 19:39:24 EDT 2021
258710.1.2 STOP Mon Oct 25 19:39:24 EDT 2021
258710.1.1 STOP Mon Oct 25 19:39:24 EDT 2021
258710.1.3 STOP Mon Oct 25 19:39:24 EDT 2021

Doesn't it almost look like there is a misunderstanding between Slurm and CUDA? Surely the tasks should not all report the same CUDA_VISIBLE_DEVICES?

PS: I totally stole the way you made the block and cyclic functions as well as the printing. Why did I not think of that...?

Ok I can run things on Traverse using this setup. But there are some things I have learnt:

On Traverse, to avoid giving a job step the entire CPU affinity of the involved nodes, I have to use the --exclusive flag in srun, which indicates that certain CPUs/cores are used exclusively by that job step and nothing else.

Furthermore, I cannot use --cpus-per-task=1. This makes a lot of sense, and the CPU affinity prints should have rung a bell for me. I feel dense.

So, in the resource request, I ask sbatch like so:

#SBATCH -n 8
#SBATCH --cpus-per-task=4
#SBATCH --gpus-per-task=1

and then

srun --ntasks=4 --gpus-per-task=1 --cpus-per-task=4 --distribution=arbitrary --exclusive script.sh

or even

srun --ntasks=4 --distribution=arbitrary --exclusive script.sh

would work.

What does not work is the following:

...
#SBATCH -n 8
#SBATCH -G 8
srun --ntasks=4 --gpus-per-task=1 --cpus-per-task=4 --distribution=arbitrary --exclusive 

For some reason, I cannot request a pool of GPUs and take from it.

> Ok I can run things on Traverse using this setup. But there are some things I have learnt:
> ...
> For some reason, I cannot request a pool of GPUs and take from it.

I am not sure I appreciate the distinction - isn't 'this setup' also using GPUs from a pool of requested GPUs?

Given the first statement (I can run things on Traverse using this setup), it sounds like we should encode just this in RP to get you running on Traverse, correct?

Well, I'm not quite sure. It seems to me that if I request #SBATCH --gpus-per-task=1, I already prescribe how many GPUs a task uses, which worries me. Maybe it's a misunderstanding on my end...

This batch script here does not use that directive. The sbatch only needs to provision the right number of nodes - the per_task parameters should not matter (even if you need to specify them in your case for some reason), as we overwrite them in the srun directives anyway?

Exactly! But this does not seem to work!

#SBATCH -n 4
#SBATCH --gpus-per-task=1

srun -n 4 --gpus-per-task=1 a.o

works;

#SBATCH -n 4
#SBATCH --gpus=4

srun -n 4 --gpus-per-task=1 a.o

does not work!


Unless I'm making a dumb mistake...

Sorry, I have not worked on this further yet.

Hi @lsawade - I still can't make sense of it and wasn't able to reproduce it on other Slurm clusters :-( But either way, please do give the RS branch fix/traverse (radical-cybertools/radical.saga#840) a try. It now hardcodes the #SBATCH --gpus-per-task=1 for Traverse.

Hi @andre-merzky - So, I was getting errors in the submission, and I finally had a chance to go through the log. I found the error: the submitted sbatch script can't work like this:

#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=32
#SBATCH --gpus-per-task=1
#SBATCH -J "pilot.0000"
#SBATCH -D "/scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.traverse.princeton.edu.lsawade.019013.0000/pilot.0000/"
#SBATCH --output "bootstrap_0.out"
#SBATCH --error "bootstrap_0.err"
#SBATCH --partition "test"
#SBATCH --time 00:20:00

In this case, you are asking for 32 GPUs on a single node, while a Traverse node only has 4. I have no solution for this, because the alternative, requesting 4 tasks, seems stupid. And research computing staff seemed to be immovable in terms of Slurm settings on Traverse.

We discussed this topic on this week's devel call. At this point we are inclined to not support Traverse: the Slurm configuration on Traverse contradicts the Slurm documentation, and also how other Slurm deployments work. To support Traverse we would basically have to break support on other Slurm resources.
We can in principle create a separate slurm_traverse launch method and pilot launcher in RP to accommodate the machine. That, however, is a fair amount of effort. Not insurmountable, but still quite some work. Let's discuss how to handle this on the HPC-Workflows call. Maybe there is also a chance to iterate with the admins (although we wanted to stay out of the business of dealing with system admins directly :-/ )

We will have to write an executor specific to Traverse. This will require allocating specific resources, and we will report back once we have discussed it internally. RADICAL remains available to discuss the configuration of new machines, in case that is useful or needed. Meanwhile, Lucas is using Summit while waiting for Traverse to become viable with EnTK.

@andre-merzky

Today I was working on something completely separate, but -- again -- I had issues with Traverse even for an embarrassingly parallel submission. It turned out that there seems to be an issue with how hardware threads are assigned.

If I just ask for --ntasks=5, I will not get 5 physical cores from the Power9 CPU, but rather 4 hardware threads from one core and 1 hardware thread from another.
So, the CPU pool on Traverse by default has size 128 (hardware threads, not physical cores). I have to use the following to truly access 5 physical cores:

#SBATCH --ntasks=5
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=1

I will check whether this has an impact on how we are assigning the tasks during submission.
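The arithmetic behind this can be sketched as follows. This is a minimal illustration that assumes 4 hardware threads per core and dense packing of threads onto cores, as observed above; the function name is hypothetical and this is not how Slurm itself computes placements.

```python
import math

SMT = 4  # hardware threads per Power9 core, each counted by Slurm as one CPU


def cores_spanned(ntasks, cpus_per_task=1, smt=SMT):
    """Physical cores covered when the requested CPUs are packed densely."""
    return math.ceil(ntasks * cpus_per_task / smt)


# --ntasks=5 alone: 5 hardware threads, i.e. one full core plus one thread.
print(cores_spanned(5))
# --ntasks=5 --cpus-per-task=4: 20 hardware threads, i.e. 5 full cores.
print(cores_spanned(5, cpus_per_task=4))
```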

Just an additional example to build understanding:

This

#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=1

is OK.

This

#SBATCH --nodes=1
#SBATCH --ntasks=33
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=1

is not OK.
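A minimal sketch of the capacity check implied by these two examples, assuming the per-node pool of 128 CPUs mentioned above. The function name is hypothetical, and Slurm may of course reject a request for other reasons; this only captures the arithmetic.

```python
def fits_on_one_node(ntasks, cpus_per_task, cpus_per_node=128):
    """True if the requested hardware threads fit into one node's CPU pool."""
    return ntasks * cpus_per_task <= cpus_per_node


print(fits_on_one_node(32, 4))  # 32 * 4 = 128 CPUs
print(fits_on_one_node(33, 4))  # 33 * 4 = 132 CPUs
```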

I have confirmed my suspicions. I have finally found a resource and task description that definitely works. Test scripts are located here traverse-slurm-repo, but I will summarize below:

The sbatch header:

#!/bin/bash
#SBATCH -t00:05:00
#SBATCH -N 2
#SBATCH -n 64
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=1
#SBATCH --output=mixed_gpu.txt
#SBATCH --reservation=test
#SBATCH --gres=gpu:4

So, in the sbatch header, I'm explicitly asking for 64 tasks on 2 nodes, i.e. 32 tasks per node, where each task has access to 4 CPUs. In Slurm language, Power9 hardware threads are apparently equal to CPUs. Hence, each physical core has to be assigned 4 CPUs. Then, I also specify that each core is only assigned a single task. Finally, instead of implicitly specifying some notion of GPU need somewhere, I simply tell Slurm I want the 4 GPUs in each node with --gres=gpu:4.

If you want to provide the hostfile you will have to decorate the srun command as follows:

# Define Hostfile
export SLURM_HOSTFILE=<some_hostfile with <N> entries>

# Run command
srun --ntasks=<N> --gpus-per-task=1 --cpus-per-task=4 --ntasks-per-core=1 --distribution=arbitrary <my-executable>

dropping the --gpus-per-task if none are needed. Otherwise, if you want to let slurm handle the resource allocation, the following works as well:

srun --ntasks=$1 --gpus-per-task=1 --cpus-per-task=4 --ntasks-per-core=1 <my-executable>

again, dropping the --gpus-per-task if none are needed.
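As a rough sketch of how a launcher could compose these two srun variants, the following builds the argument list from the flags listed above. The function name and its parameters are hypothetical; this is not the EnTK or RADICAL-Pilot implementation. For the arbitrary distribution, SLURM_HOSTFILE would still have to be exported separately.

```python
def srun_command(ntasks, executable, gpus_per_task=1, arbitrary=False):
    """Compose the srun argument list described above."""
    argv = ["srun", f"--ntasks={ntasks}"]
    if gpus_per_task:
        # Dropped entirely when the task needs no GPU.
        argv.append(f"--gpus-per-task={gpus_per_task}")
    # One task per physical core, each core exposing 4 hardware threads.
    argv += ["--cpus-per-task=4", "--ntasks-per-core=1"]
    if arbitrary:
        # Task placement is then read from the file named by SLURM_HOSTFILE.
        argv.append("--distribution=arbitrary")
    argv.append(executable)
    return argv


print(" ".join(srun_command(4, "./my-executable", arbitrary=True)))
print(" ".join(srun_command(4, "./my-executable", gpus_per_task=0)))
```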

From past experience, I think this is relatively easy to put into EnTK?

@lsawade - thanks for your patience! In radical-saga and radical-pilot, you should now find two branches named fix/issue_138_hpcwf. They hopefully implement the right special cases for Traverse to work as expected. Would you please give them a try? Thank you!

Will give it a whirl!

@andre-merzky , I find the branch in the pilot but not in saga? Should I just use fix/traverse for saga?

@lsawade : Apologies, I missed a push for the branch... It should be there now in RS also.

Hey @lsawade - did you have the chance to look into this again?

Sorry, @andre-merzky , I thought I had updated the issue before I started driving on Friday...

So, the issue persists. An error is still thrown when --cpus_per_task is used due to the underscores.


  python               : /home/lsawade/.conda/envs/conda-entk/bin/python3
  pythonpath           : 
  version              : 3.7.12
  virtualenv           : conda-entk

  radical.entk         : 1.14.0
  radical.gtod         : 1.13.0
  radical.pilot        : 1.13.0-v1.13.0-149-g211a82593@fix-issue_138_hpcwf
  radical.saga         : 1.13.0-v1.13.0-1-g7a950d53@fix-issue_138_hpcwf
  radical.utils        : 1.14.0

$ cat re.session.traverse.princeton.edu.lsawade.019111.0001/radical.log | grep -b10 ERROR | head -20
136162-1651239844.198 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : write: [   84] [   82] (cd ~ && "/usr/bin/cp" -v  "/tmp/rs_pty_staging_f19k3a1g.tmp" "tmp_jp8rdthi.slurm"\n)
136348-1651239844.202 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : read : [   84] [   60] ('/tmp/rs_pty_staging_f19k3a1g.tmp' -> 'tmp_jp8rdthi.slurm'\n)
136511-1651239844.244 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : read : [   84] [    1] ($)
136615-1651239844.244 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : copy done: ['/tmp/rs_pty_staging_f19k3a1g.tmp', '$']
136745-1651239844.245 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : flush: [   83] [     ] (flush pty read cache)
136868-1651239844.346 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : run_sync: sbatch 'tmp_jp8rdthi.slurm'; echo rm -f 'tmp_jp8rdthi.slurm'
137016-1651239844.347 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : write: [   83] [   61] (sbatch 'tmp_jp8rdthi.slurm'; echo rm -f 'tmp_jp8rdthi.slurm'\n)
137181-1651239844.352 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : read : [   83] [   91] (sbatch: unrecognized option '--cpus_per_task=4'\nTry "sbatch --help" for more information\n)
137375-1651239844.352 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : read : [   83] [   36] (rm -f tmp_jp8rdthi.slurm\nPROMPT-0->)
137514-1651239844.352 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : submit SLURM script (tmp_jp8rdthi.slurm) (0)
137636:1651239844.352 : radical.saga.cpi     : 715866 : 35185202950512 : ERROR    : NoSuccess: Couldn't get job id from submitted job! sbatch output:
137779-sbatch: unrecognized option '--cpus_per_task=4'
137827-Try "sbatch --help" for more information
137868-rm -f tmp_jp8rdthi.slurm
137893-
137894:1651239844.354 : pmgr_launching.0000  : 715866 : 35184434934128 : ERROR    : bulk launch failed
137990-Traceback (most recent call last):
138025-  File "/home/lsawade/.conda/envs/conda-entk/lib/python3.7/site-packages/radical/pilot/pmgr/launching/default.py", line 405, in work
138158-    self._start_pilot_bulk(resource, schema, pilots)
138211-  File "/home/lsawade/.conda/envs/conda-entk/lib/python3.7/site-packages/radical/pilot/pmgr/launching/default.py", line 609, in _start_pilot_bulk
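The failure in the log comes down to option spelling: sbatch expects hyphens in long option names, not underscores. A minimal sketch of how Python-style keys could be rendered into directives follows; the function name is hypothetical and this is not the actual radical.saga code.

```python
def sbatch_directive(name, value=None):
    """Render one '#SBATCH' line, turning underscores in the name into hyphens."""
    option = "--" + name.replace("_", "-")
    if value is not None:
        option += f"={value}"
    return f"#SBATCH {option}"


print(sbatch_directive("cpus_per_task", 4))
print(sbatch_directive("ntasks_per_core", 1))
```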

@lsawade Hi Lucas, can you please give it another try? That was a typo in the option setup, and it has been fixed in that branch. The stack should now look like this:

% radical-stack           

  python               : /Users/mtitov/.miniconda3/envs/test_rp/bin/python3
  pythonpath           : 
  version              : 3.7.12
  virtualenv           : test_rp

  radical.entk         : 1.14.0
  radical.gtod         : 1.13.0
  radical.pilot        : 1.14.0-v1.14.0-119-ga6886ca58@fix-issue_138_hpcwf
  radical.saga         : 1.13.0-v1.13.0-9-g1875aa88@fix-issue_138_hpcwf
  radical.utils        : 1.14.0

@lsawade : ping :-)