radical-collaboration/hpc-workflows

ENTK hangs on Summit

Closed this issue · 44 comments

Hi, my current EnTK Python script hangs when running on Summit.

I copied my files to the world-shared directory here so you may replicate my tests.

/gpfs/alpine/geo111/world-shared/lei/entk

I also prepared a bash script for you to launch the job directly.

/gpfs/alpine/geo111/world-shared/lei/entk/specfem/job_solver.bash

The system modules I used:

module load gcc/4.8.5
module load spectrum-mpi
module load hdf5/1.8.18
module load cuda

module load zlib
module load sz
module load zfp
module load c-blosc

At line 46 of run_job.py, the schema should be 'local' instead of 'jsrun'.

As per the Slack exchange: the pilot sees a SIGNAL 2 while running; no other ERROR logs, no suspicious *.err/*.out files.

@lee212 I actually used 'local' on Summit... sorry, 'jsrun' was just one attempt... I will edit it back to remove confusion.

@wjlei1990 , thanks for the confirmation. No worries, I used your script on my account and have a similar issue.

Thanks for the help :)

Is the queue missing, e.g. "queue": "batch" in the res_dict?

OK Let me try now...

I tried adding "queue":"batch" and the job still hangs...

Talked to @wjlei1990; the issue I'm facing on Tiger in #95 seems to be the same.

@lee212 : you said you were able to reproduce this, right? Do you already have any idea what's up?

This seems related to resource over-allocation, and the correct description would be:

    res_dict = {
        'resource': 'ornl.summit',
        'project': 'GEOxxx',
        'schema': 'local',
        'walltime': 10,
        'cpus': 168,
        'gpus': 6,
        'queue': 'batch'
    }

and the task resource would be:

    t1.cpu_reqs = {
        'processes': 6,
        'process_type': 'MPI',
        'threads_per_process': 4,
        'thread_type': 'OpenMP'}

    t1.gpu_reqs = {
        'processes': 1,
        'process_type': None,
        'threads_per_process': 1,
        'thread_type': 'CUDA'}

this will result in:

rank: 0: { host: 1; cpu: {0,1,2,3}; gpu: {0}}
rank: 1: { host: 1; cpu: {4,5,6,7}; gpu: {1}}
rank: 2: { host: 1; cpu: {8,9,10,11}; gpu: {2}}
rank: 3: { host: 1; cpu: {12,13,14,15}; gpu: {3}}
rank: 4: { host: 1; cpu: {16,17,18,19}; gpu: {4}}
rank: 5: { host: 1; cpu: {20,21,22,23}; gpu: {5}}

Hi, could you explain why the value of cpus in res_dict is 168?


Updates on the current test status:

  1. A single job using CPUs and GPUs is working, and the running time is as expected.

  2. I am testing multiple tasks running at the same time, and EnTK does not seem happy with it; the job still hangs.

Summit compute nodes have two 22-core Power9 CPUs, where each core supports 4 hardware threads, resulting in 168 = 2 * (22 - 1) * 4. One core on each socket is set aside (the -1) for overhead and is not available for allocation through jsrun.
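
For reference, a quick way to sanity-check the cpus/gpus values in res_dict is to derive them from the node count; this is just an illustrative sketch (the constant and function names are mine, not part of EnTK):

    # Sketch of the Summit per-node accounting described above; names are illustrative.
    SOCKETS_PER_NODE  = 2        # two Power9 sockets per node
    CORES_PER_SOCKET  = 22 - 1   # one core per socket is reserved, not schedulable via jsrun
    THREADS_PER_CORE  = 4        # SMT4 hardware threads
    GPUS_PER_NODE     = 6

    def summit_resources(nnodes):
        cpus = nnodes * SOCKETS_PER_NODE * CORES_PER_SOCKET * THREADS_PER_CORE
        gpus = nnodes * GPUS_PER_NODE
        return cpus, gpus

    # summit_resources(1)  -> (168, 6)
    # summit_resources(64) -> (10752, 384)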

I will look into the hangs on multiple tasks.

This is critical for Summit allocation renewal. Data need to be ready showing that we are running in production on Summit with EnTK.

@wjlei1990 to address this we will need to reproduce your issue. Unfortunately, this means we will need some information from you:

  • A batch script (without EnTK) that correctly executes your executable. This will be our baseline.
  • The workflow you are trying to use with EnTK, with instructions on how to run it.

We will run your workflow first with a single task, comparing it to our baseline and confirming the behavior you reported. We will then run the same workflow with two concurrent tasks to confirm the issue you reported.

@lee212 do you see anything else you will need to debug this?

I think the cores-per-node calculation above may need some corrections.

I just tried one simulation which used 384 GPUs and 384 CPU cores. On Summit the job should use 384/6 = 64 nodes. However, when I asked for 64 * 2 * (22-1) * 4 = 10752 CPUs and 384 GPUs, the scheduler showed the job requesting 128 nodes, which is not correct.

My resource allocation:

    res_dict = {
        'resource': 'ornl.summit',
        'project': 'GEO111',
        'schema': 'local',
        'walltime': 30,
        'gpus': 384,
        'cpus': 10752,
        'queue': 'batch'
    } 

    t1.cpu_reqs = {
        'processes': 384,
        'process_type': 'MPI',
        'threads_per_process': 4,
        'thread_type': 'OpenMP'}

    t1.gpu_reqs = {
        'processes': 1,
        'process_type': None,
        'threads_per_process': 1,                                               
        'thread_type': 'CUDA'}

Could you please double-check it?

  • The baseline example is located here:
    /gpfs/alpine/world-shared/geo111/lei/entk/specfem3d_globe_990cd4.
    The bash script to launch the job is job_solver.bash. You may submit and run it directly to use as the baseline.

  • The current EnTK Python script I used is:
    /gpfs/alpine/world-shared/geo111/lei/entk/run_entk.py. It has only 1 stage and 1 task in that stage. The task is a forward SPECFEM simulation.

@wjlei1990 : thanks for the batch script! What runtime should we expect to see? I don't mind running this test, but getting jobs of that size will always take a bit and will burn some allocation. If you happen to have a smaller test case available, please let us know :-)

Hi Andre, the running time should be around 1 min 20 sec if submitted using the LSF batch script.

When running the batch script, I see:

solver starts at: Mon Mar 30 18:12:47 EDT 2020
jsrun -n 384 -a 1 -c 1 -g 1 ./bin/xspecfem3D
Mon Mar 30 18:12:47 EDT 2020

 **************
 **************
 ADIOS significantly slows down small or medium-size runs, which is the case here, please consider turning it off
 **************
 **************

User defined signal 2
ERROR:  One or more process (first noticed rank 259) terminated with signal 12

The runtime was about 4 seconds. Do you have any suggestions? I did a recursive copy of your specfem3d_globe_990cd4 directory and ran from there (I needed write permissions to OUTPUT_FILES/). Also, I had to enable the module load commands in the batch script to avoid unresolved library links.

Could you share the location of your running directory? May I take a look?

Sure! It lives here:

/gpfs/alpine/med110/scratch/merzky1/covid/radical.pilot/specfem3d_globe_990cd4

But you will need to be in the med110 group :-( If you are not (which I guess is the case), I can move it to a world-readable dir - but that will have to wait 'til tomorrow...

I don't have access.

From the error message itself, I can't tell what is going wrong. Just to do a quick check, could you submit the job again?

I am also trying to provide a cleaner and leaner SPECFEM build. I will do some tests and update you later.

Hi Andre, I rebuilt SPECFEM; could you copy it again to test? The new build only uses system modules and libraries. The previous one had a dependency on a library that sits in my own home directory.

The SPECFEM3D build sits in the same directory:
$WORLDWORK/geo111/lei/entk/specfem3d_globe_990cd4

@wjlei1990 , I tried different node counts, i.e. 1/2/4/8/16/32/64, which goes up to the equivalent of 384 GPUs. It seems I was not able to replicate the hanging issue, but my test runs showed that, for example, 384 GPUs for a single task produce a resource file like:

cpu_index_using: physical
rank: 0: { host: 1; cpu: {0,1,2,3}; gpu: {0}}
rank: 1: { host: 1; cpu: {4,5,6,7}; gpu: {1}}
rank: 2: { host: 1; cpu: {8,9,10,11}; gpu: {2}}
rank: 3: { host: 1; cpu: {12,13,14,15}; gpu: {3}}
rank: 4: { host: 1; cpu: {16,17,18,19}; gpu: {4}}
rank: 5: { host: 1; cpu: {20,21,22,23}; gpu: {5}}
rank: 6: { host: 2; cpu: {0,1,2,3}; gpu: {0}}
rank: 7: { host: 2; cpu: {4,5,6,7}; gpu: {1}}
rank: 8: { host: 2; cpu: {8,9,10,11}; gpu: {2}}
rank: 9: { host: 2; cpu: {12,13,14,15}; gpu: {3}}
rank: 10: { host: 2; cpu: {16,17,18,19}; gpu: {4}}
...

rank: 372: { host: 63; cpu: {0,1,2,3}; gpu: {0}}
rank: 373: { host: 63; cpu: {4,5,6,7}; gpu: {1}}
rank: 374: { host: 63; cpu: {8,9,10,11}; gpu: {2}}
rank: 375: { host: 63; cpu: {12,13,14,15}; gpu: {3}}
rank: 376: { host: 63; cpu: {16,17,18,19}; gpu: {4}}
rank: 377: { host: 63; cpu: {20,21,22,23}; gpu: {5}}
rank: 378: { host: 64; cpu: {0,1,2,3}; gpu: {0}}
rank: 379: { host: 64; cpu: {4,5,6,7}; gpu: {1}}
rank: 380: { host: 64; cpu: {8,9,10,11}; gpu: {2}}
rank: 381: { host: 64; cpu: {12,13,14,15}; gpu: {3}}
rank: 382: { host: 64; cpu: {16,17,18,19}; gpu: {4}}
rank: 383: { host: 64; cpu: {20,21,22,23}; gpu: {5}}

I observed similar placements for the other runs.

My script is almost identical to yours and it is located at $WORLDWORK/csc393/hrlee/hpc-workflow/run_entk.py. The changes I made are:

$ diff /gpfs/alpine/world-shared/geo111/lei/entk/run_entk.py $WORLDWORK/csc393/hrlee/hpc-workflow/run_entk.py
55c55
<         'process_type': 'MPI',
---
>         'process_type': None,
57c57
<         'thread_type': 'OpenMP'}
---
>         'thread_type': 'CUDA'}
105c105
<     ncpus = int(nnodes * (22 - 1) * 4)
---
>     ncpus = int(nnodes * 2 * (22 - 1) * 4)
111c111
<         'project': 'GEO111',
---
>         'project': 'CSC393',
116c116
<         'walltime': 10,
---
>         'walltime': 5,

One comment I have is that your calculation of ncpus is missing the factor of 2, so the numbers will be half of what you expected. I doubt this is the main cause, though.

Can you try with smaller node counts and see if it works?

BTW, I can't tell whether the new executable, specfem3d_globe_990cd4/bin/xspecfem3D, runs as expected. I just ran a sanity check to evaluate whether it completes scheduling with the requested resources.

Thanks @lee212 : that task layout looks correct to me. I had the impression, though, that MPI would be needed? That should result in the same layout anyway.

Hi @lee212 , I copied your script to my directory:

lei@login5 /gpfs/alpine/world-shared/geo111/lei/entk $ diff /gpfs/alpine/world-shared/geo111/lei/entk/run_entk.hrlee.py $WORLDWORK/csc393/hrlee/hpc-workflow/run_entk.py
111c111
<         'project': 'GEO111',
---
>         'project': 'CSC393',

However, the job was not successful. I doubt your job was successful either, since when I checked your job output directory, there were no output files generated:

ls $WORLDWORK/csc393/hrlee/hpc-workflow/run_0000/OUTPUT_FILES | wc -l
2

In a successful run, the output should be like this directory:

ls /gpfs/alpine/world-shared/geo111/lei/entk/specfem3d_globe_990cd4/OUTPUT_FILES/ | wc -l
739

Did you remove the output files of your job?

One more interesting behaviour I found in EnTK: the Python script seems to finish while the job is still running in the job queue. My impression was that the job should end first, and then the EnTK Python script would finish and exit.

Okay, I re-ran with the executable, and saw:

/bin/xspecfem3D: error while loading shared libraries: libblosc.so.1: cannot open shared object file: No such file or directory

I think you may have missed some modules:

module load gcc/4.8.5
module load spectrum-mpi
module load hdf5/1.8.18
module load cuda

module load zlib
module load sz
module load zfp
module load c-blosc

Maybe I should put them into my scripts.
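
For reference, attaching those module loads to the task's pre_exec would look roughly like this; a minimal sketch only, with an illustrative executable path (EnTK runs the pre_exec commands before launching the task executable):

    from radical.entk import Task

    t = Task()
    # Commands executed in the task's environment before the executable is launched;
    # the module list above goes here.
    t.pre_exec = [
        'module load gcc/4.8.5',
        'module load spectrum-mpi',
        'module load hdf5/1.8.18',
        'module load cuda',
        'module load zlib',
        'module load sz',
        'module load zfp',
        'module load c-blosc',
    ]
    t.executable = './bin/xspecfem3D'   # illustrative path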

I added these modules and ran a test with only 6 GPUs; the output is here: /gpfs/alpine/world-shared/csc393/hrlee/hpc-workflow/run_with_module_load_6gpus
I see some warning/error messages, but are these okay to ignore? Can you confirm?

Does it have to run with 384 GPUs? I submitted a new job anyway, which will likely start around 2pm on Monday.

Okay, the job with 384 GPUs is also complete; it seems to have failed with some errors even though these modules were added to pre_exec. The output is here: /gpfs/alpine/world-shared/csc393/hrlee/hpc-workflow/run_with_module_load_384gpus

Just in case, my stack versions are:

  radical.entk         : 1.0.2
  radical.pilot        : 1.2.1
  radical.saga         : 1.2.0
  radical.utils        : 1.2.2

Hi @lee212 , I think I have resolved most of the issues; most of them were just problems in my own script. Now I can successfully launch a few tasks using EnTK.

There is one remaining question. I found that when EnTK exits, the job still stays in the job queue and keeps burning hours, even though I think all the tasks have finished.

Here is what EnTK prints to the terminal.

...
submit: ########################################################################
Update: pipeline.0000.stage.0000.task.0000 state: EXECUTED
Update: pipeline.0000.stage.0000.task.0000 state: DONE
Update: pipeline.0000.stage.0000.task.0001 state: EXECUTED
Update: pipeline.0000.stage.0000.task.0001 state: DONE
Update: pipeline.0000.stage.0000.task.0002 state: EXECUTED
Update: pipeline.0000.stage.0000.task.0002 state: DONE
Update: pipeline.0000.stage.0000.task.0003 state: EXECUTED
Update: pipeline.0000.stage.0000.task.0003 state: DONE
Update: pipeline.0000.stage.0000 state: DONE
Update: pipeline.0000 state: DONE
close unit manager                                                            ok
wait for 1 pilot(s)
              0                                                               ok
closing session re.session.login4.lei.018359.0009                              \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
+ re.session.login4.lei.018359.0009 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 457.8s                                                      ok
All components terminated

I suppose the job (in the LSF job queue) should finish together with the 'All components terminated' message. Am I correct?

Yes, the job should finish. @wjlei1990 , can you please provide (or point me to) a client side sandbox to check what is happening? That is most likely an RP-level error.

a client side sandbox

Hi Andre, I put one example here:

/gpfs/alpine/world-shared/geo111/lei/entk.small/sandbox/re.session.login4.lei.018359.0011/

Hi, I would also like to know how to do performance benchmarking. Things that are interesting to me include how to measure the overhead, and also the time spent in each task and stage.

Following @lee212's suggestion, I put the following flags in my .bashrc file.

  export RADICAL_PROFILE="TRUE"
  export RADICAL_ENTK_PROFILE="TRUE"
  export RADICAL_PILOT_PROFILE="TRUE"

I think EnTK will generate some profiling files. Could you provide me with some instructions on how to use them?

radical.analytics might provide you with some numbers/plots, but be aware that it is heavily under development and you may see errors/issues often. Please use it with caution. I think I have full instructions somewhere, but a quick guide to try it out is below:

git clone https://github.com/radical-cybertools/radical.analytics.git
cd radical.analytics
pip install .
export RADICAL_PILOT_DBURL=mongodb://rct:rct_test@two.radical-project.org/rct_test

Once this is complete, you can run analytics on a particular session. In practice I do, for example:

ln -s /gpfs/alpine/world-shared/geo111/lei/entk.small/re.session.login4.lei.018359.0011/ .
bin/radical-analytics-inspect re.session.login4.lei.018359.0011

If this completes successfully, you may find files generated like:

re.session.login4.lei.018359.0011.stats
re.session.login4.lei.018359.0011_conc.png
re.session.login4.lei.018359.0011_dur.png
re.session.login4.lei.018359.0011_rate.png
re.session.login4.lei.018359.0011_util.png

The *.stats file provides timing values in plain text, and the others are plots with different filters applied.

Matteo and Andre can provide a more in-depth explanation of its usage, and can correct me if something is missing.

@wjlei1990 , I know this is about testing 384 GPUs for one task for now, but did you run a test with multiple tasks as well? I am just curious whether the current version works seamlessly when we increase the number of concurrent tasks.

I did a test with 5 concurrent tasks, each task with 384 nodes. The job ran successfully and the output files look good to me. I haven't done any performance checks yet.
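
For context, the concurrent tasks are set up simply by adding several tasks to the same stage; this is only a rough sketch (the executable path and per-task requirements are illustrative, copied from earlier in this thread rather than from my actual script):

    from radical.entk import Pipeline, Stage, Task

    p = Pipeline()
    s = Stage()

    for _ in range(5):                      # tasks within one stage may run concurrently
        t = Task()
        t.executable = './bin/xspecfem3D'   # illustrative
        t.cpu_reqs = {'processes': 384, 'process_type': 'MPI',
                      'threads_per_process': 4, 'thread_type': 'OpenMP'}
        t.gpu_reqs = {'processes': 1, 'process_type': None,
                      'threads_per_process': 1, 'thread_type': 'CUDA'}
        s.add_tasks(t)

    p.add_stages(s)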

Here is one example output from radical-analytics-inspect.

Maybe you can teach me how to interpret it at this week's meeting.


1. Small Scale Test

This one is from a small scale test.

re.session.login4.lei.018359.0011 [4]
    Agent Nodes         :          0.000     0.000%   !  ['agent']
    Pilot Startup       :      10608.329     8.095%      ['boot', 'setup_1']
    Warmup              :       2509.556     1.915%      ['warm']
    Prepare Execution   :          2.492     0.002%      ['exec_queue', 'exec_prep']
    Pilot Termination   :     120440.127    91.906%      ['term']
    Execution RP        :         13.004     0.010%      ['exec_rp', 'exec_sh', 'term_sh', 'term_rp']
    Execution Cmd       :      15596.728    11.902%      ['exec_cmd']
    Unschedule          :          8.740     0.007%      ['unschedule']
    Draining            :       2084.176     1.590%      ['drain']
    Idle                :      97030.667    74.043%      ['idle']
    total               :     131046.778   100.000%      

    total               :     131046.778   100.000%
    over                :     232697.090   177.568%
    work                :      15596.728    11.902%
    miss                :    -117247.040   -89.470%

[Attached plots for re.session.login4.lei.018359.0011: state, concurrency, duration, rate, utilization]


2. Full Scale Test

This one is a full-scale test, with 5 concurrent tasks, each with 384 nodes. There are a total of 50 tasks in the stage.

re.session.login5.lei.018358.0000 [1]
    Agent Nodes         :          0.000     0.000%   !  ['agent']
    Pilot Startup       :     352848.755    18.936%      ['boot', 'setup_1']
    Warmup              :      78204.028     4.197%      ['warm']
    Prepare Execution   :         52.962     0.003%      ['exec_queue', 'exec_prep']
    Pilot Termination   :    1511431.231    81.113%      ['term']
    Execution RP        :        216.342     0.012%      ['exec_rp', 'exec_sh', 'term_sh', 'term_rp']
    Execution Cmd       :     132558.541     7.114%      ['exec_cmd']
    Unschedule          :         72.372     0.004%      ['unschedule']
    Draining            :      32980.902     1.770%      ['drain']
    Idle                :    1171608.708    62.876%      ['idle']
    total               :    1863362.381   100.000%      

    total               :    1863362.381   100.000%
    over                :    3147415.300   168.911%
    work                :     132558.541     7.114%
    miss                :   -1416611.460   -76.024%

Below are the figures.
[Attached plots for re.session.login5.lei.018358.0000: state, concurrency, duration, rate, utilization]

Hi @wjlei1990 :

a client side sandbox
Hi Andre, I put one example here:
/gpfs/alpine/world-shared/geo111/lei/entk.small/sandbox/re.session.login4.lei.018359.0011/

Thanks - but that is the pilot sandbox. I meant the session directory on the client side, i.e., the one created in the location where you run the EnTK script. Thanks!

As for the analysis: there is obviously something off with the utilization; I'll look into it. But in general the utilization won't look great, since you are not using the CPU cores and we then count those as idle resources.

Hi Andre, the directory is here:

/gpfs/alpine/world-shared/geo111/lei/entk.small/re.session.login4.lei.018359.0011

This one is a small-scale job.

If you are looking for a full-scale job:

/gpfs/alpine/world-shared/geo111/lei/entk/re.session.login5.lei.018358.0000

Thanks. From the logs, it looks like the pilot job gets canceled all right:

radical.log:1586271908.106 : pmgr_launching.0000  : 52789 : 140735340868016 : DEBUG    : update cancel req: pilot.0000 1586271908.1062713
radical.log:1586271908.107 : pmgr_launching.0000  : 52789 : 140735340868016 : DEBUG    : killing pilots: last cancel: 1586271908.1062713
radical.log:1586271916.831 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Running']
radical.log:1586271926.845 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Running']
radical.log:1586271936.858 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Running']
radical.log:1586271946.871 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Running']
radical.log:1586271956.884 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Done']

Cancellation takes a while, but that's LSF taking its time. Do you see the job alive for longer than a couple of minutes?

Got it. So as long as the job finishes within a few minutes of the EnTK script exiting, it should be fine. I observed a lag of a few minutes for my small-scale job.

I haven't monitored the large-scale job yet, since it is a bit difficult to predict when it will start running, but I will keep an eye on it.