radical-collaboration/hpc-workflows

ENTK hangs on Summit

Closed this issue · 44 comments

Hi, my current EnTK Python script hangs when running on Summit.

I copied my files to the world-shared directory here so you may replicate my tests.

/gpfs/alpine/geo111/world-shared/lei/entk

I also prepared a bash script for you to launch the job directly.

/gpfs/alpine/geo111/world-shared/lei/entk/specfem/job_solver.bash

The system modules I used:

module load gcc/4.8.5
module load spectrum-mpi
module load hdf5/1.8.18
module load cuda

module load zlib
module load sz
module load zfp
module load c-blosc

At line 46 of run_job.py, the schema should be 'local' instead of 'jsrun'.

As per the Slack exchange: the pilot sees a SIGNAL 2 while running; no other ERROR logs, no suspicious *.err/*.out files.

@lee212 I actually used 'local' on Summit... sorry, 'jsrun' was just one attempt... I will edit it back to remove confusion.

@wjlei1990 , thanks for the confirmation. No worries, I used your script on my account and have a similar issue.

Thanks for the help :)

Is the queue missing, e.g. "queue": "batch" in the res_dict?

OK Let me try now...

I tried adding "queue":"batch" and the job still hangs...

Talked to @wjlei1990; the issue I'm facing on Tiger in #95 seems to be the same.

@lee212 : you said you were able to reproduce this, right? Do you already have any idea what's up?

This seems related to resource over-allocation, and the correct description would be:

    res_dict = {
        'resource': 'ornl.summit',
        'project': 'GEOxxx',
        'schema': 'local',
        'walltime': 10,
        'cpus': 168,
        'gpus': 6,
        'queue': 'batch'
    }

and the task resource would be:

    t1.cpu_reqs = {
        'processes': 6,
        'process_type': 'MPI',
        'threads_per_process': 4,
        'thread_type': 'OpenMP'}

    t1.gpu_reqs = {
        'processes': 1,
        'process_type': None,
        'threads_per_process': 1,
        'thread_type': 'CUDA'}

this will result in:

rank: 0: { host: 1; cpu: {0,1,2,3}; gpu: {0}}
rank: 1: { host: 1; cpu: {4,5,6,7}; gpu: {1}}
rank: 2: { host: 1; cpu: {8,9,10,11}; gpu: {2}}
rank: 3: { host: 1; cpu: {12,13,14,15}; gpu: {3}}
rank: 4: { host: 1; cpu: {16,17,18,19}; gpu: {4}}
rank: 5: { host: 1; cpu: {20,21,22,23}; gpu: {5}}

Hi, could you explain why the value of cpus in res_dict is 168?


Updates on the current test status:

  1. A single job using CPUs and GPUs is working, and the running time is as expected.

  2. I am testing multiple tasks running at the same time, and EnTK does not seem happy with it; the job still hangs.

Summit compute nodes have two 22-core Power9 CPUs, where each core supports 4 hardware threads, resulting in 168 = 2 * (22 - 1) * 4. One core on each socket is set aside (the -1) for overhead and is not available for allocation through jsrun.
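
For reference, a quick way to sanity-check the cpus/gpus values in res_dict is to derive them from the node count; this is just an illustrative sketch (the constant and function names are mine, not part of EnTK):

    # Sketch of the Summit per-node accounting described above; names are illustrative.
    SOCKETS_PER_NODE  = 2        # two Power9 sockets per node
    CORES_PER_SOCKET  = 22 - 1   # one core per socket is reserved, not schedulable via jsrun
    THREADS_PER_CORE  = 4        # SMT4 hardware threads
    GPUS_PER_NODE     = 6

    def summit_resources(nnodes):
        cpus = nnodes * SOCKETS_PER_NODE * CORES_PER_SOCKET * THREADS_PER_CORE
        gpus = nnodes * GPUS_PER_NODE
        return cpus, gpus

    # summit_resources(1)  -> (168, 6)
    # summit_resources(64) -> (10752, 384)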

I will look into the hangs on multiple tasks.

This is critical for Summit allocation renewal. Data need to be ready showing that we are running in production on Summit with EnTK.

@wjlei1990 to address this we will need to reproduce your issue. Unfortunately, this means we will need some information from you:

  • A batch script (without EnTK) that correctly executes your executable. This will be our baseline.
  • The workflow you are trying to use with EnTK, with instructions on how to run it.

We will run your workflow first with a single task, comparing it to our baseline and confirming the behavior you reported. We will then run the same workflow with two concurrent tasks to confirm the issue you reported.

@lee212 do you see anything else you will need to debug this?

I think the cores-per-node calculation above may need some corrections.

I just tried one simulation which used 384 GPUs and 384 CPU cores. On Summit the job should use 384/6 = 64 nodes. However, when I asked for 64 * 2 * (22-1) * 4 = 10752 CPUs and 384 GPUs, the scheduler showed the job requesting 128 nodes, which is not correct.

My resource allocation:

    res_dict = {
        'resource': 'ornl.summit',
        'project': 'GEO111',
        'schema': 'local',
        'walltime': 30,
        'gpus': 384,
        'cpus': 10752,
        'queue': 'batch'
    } 

    t1.cpu_reqs = {
        'processes': 384,
        'process_type': 'MPI',
        'threads_per_process': 4,
        'thread_type': 'OpenMP'}

    t1.gpu_reqs = {
        'processes': 1,
        'process_type': None,
        'threads_per_process': 1,                                               
        'thread_type': 'CUDA'}

Could you please double-check it?

  • The baseline example is located here:
    /gpfs/alpine/world-shared/geo111/lei/entk/specfem3d_globe_990cd4.
    The bash script to launch the job is job_solver.bash. You may submit and run it directly to use as the baseline.

  • The current EnTK Python script I used is:
    /gpfs/alpine/world-shared/geo111/lei/entk/run_entk.py. It has only 1 stage and 1 task in that stage. The task is a forward SPECFEM simulation.

@wjlei1990 : thanks for the batch script! What runtime should we expect to see? I don't mind running this test, but getting jobs of that size will always take a bit and will burn some allocation. If you happen to have a smaller test case available, please let us know :-)

Hi Andre, the running time should be around 1 min 20 sec if submitted using the LSF batch script.

When running the batch script, I see:

solver starts at: Mon Mar 30 18:12:47 EDT 2020
jsrun -n 384 -a 1 -c 1 -g 1 ./bin/xspecfem3D
Mon Mar 30 18:12:47 EDT 2020

 **************
 **************
 ADIOS significantly slows down small or medium-size runs, which is the case here, please consider turning it off
 **************
 **************

User defined signal 2
ERROR:  One or more process (first noticed rank 259) terminated with signal 12

The runtime was about 4 seconds. Do you have any suggestions? I did a recursive copy of your specfem3d_globe_990cd4 directory and ran from there (I needed write permissions to OUTPUT_FILES/). Also, I had to enable the module load commands in the batch script to avoid unresolved library links.

Could you share the location of your running directory? May I take a look?

Sure! It lives here:

/gpfs/alpine/med110/scratch/merzky1/covid/radical.pilot/specfem3d_globe_990cd4

But you will need to be in the med110 group :-( If you are not (which I guess is the case), I can move it to a world-readable dir - but that will have to wait 'til tomorrow...

I don't have access.

From the error message itself, I can't tell what is going wrong. Just to do a quick check, could you submit the job again?

I am also trying to provide a cleaner and leaner SPECFEM build. I will do some tests and update you later.

Hi Andre, I rebuilt SPECFEM; could you copy it again to test? The new build only uses system modules and libraries. The previous one had a dependency on a library that sits in my own home directory.

The SPECFEM3D build sits in the same directory:
$WORLDWORK/geo111/lei/entk/specfem3d_globe_990cd4

@wjlei1990 , I tried different node counts, i.e. 1/2/4/8/16/32/64, which goes up to the equivalent of 384 GPUs. It seems I was not able to replicate the hanging issue, but my test runs showed that, for example, 384 GPUs for a single task produce a resource file like:

cpu_index_using: physical
rank: 0: { host: 1; cpu: {0,1,2,3}; gpu: {0}}
rank: 1: { host: 1; cpu: {4,5,6,7}; gpu: {1}}
rank: 2: { host: 1; cpu: {8,9,10,11}; gpu: {2}}
rank: 3: { host: 1; cpu: {12,13,14,15}; gpu: {3}}
rank: 4: { host: 1; cpu: {16,17,18,19}; gpu: {4}}
rank: 5: { host: 1; cpu: {20,21,22,23}; gpu: {5}}
rank: 6: { host: 2; cpu: {0,1,2,3}; gpu: {0}}
rank: 7: { host: 2; cpu: {4,5,6,7}; gpu: {1}}
rank: 8: { host: 2; cpu: {8,9,10,11}; gpu: {2}}
rank: 9: { host: 2; cpu: {12,13,14,15}; gpu: {3}}
rank: 10: { host: 2; cpu: {16,17,18,19}; gpu: {4}}
...

rank: 372: { host: 63; cpu: {0,1,2,3}; gpu: {0}}
rank: 373: { host: 63; cpu: {4,5,6,7}; gpu: {1}}
rank: 374: { host: 63; cpu: {8,9,10,11}; gpu: {2}}
rank: 375: { host: 63; cpu: {12,13,14,15}; gpu: {3}}
rank: 376: { host: 63; cpu: {16,17,18,19}; gpu: {4}}
rank: 377: { host: 63; cpu: {20,21,22,23}; gpu: {5}}
rank: 378: { host: 64; cpu: {0,1,2,3}; gpu: {0}}
rank: 379: { host: 64; cpu: {4,5,6,7}; gpu: {1}}
rank: 380: { host: 64; cpu: {8,9,10,11}; gpu: {2}}
rank: 381: { host: 64; cpu: {12,13,14,15}; gpu: {3}}
rank: 382: { host: 64; cpu: {16,17,18,19}; gpu: {4}}
rank: 383: { host: 64; cpu: {20,21,22,23}; gpu: {5}}

I observed similar placements for the other runs.

My script is almost identical to yours and it is located at $WORLDWORK/csc393/hrlee/hpc-workflow/run_entk.py. The changes I made are:

$ diff /gpfs/alpine/world-shared/geo111/lei/entk/run_entk.py $WORLDWORK/csc393/hrlee/hpc-workflow/run_entk.py
55c55
<         'process_type': 'MPI',
---
>         'process_type': None,
57c57
<         'thread_type': 'OpenMP'}
---
>         'thread_type': 'CUDA'}
105c105
<     ncpus = int(nnodes * (22 - 1) * 4)
---
>     ncpus = int(nnodes * 2 * (22 - 1) * 4)
111c111
<         'project': 'GEO111',
---
>         'project': 'CSC393',
116c116
<         'walltime': 10,
---
>         'walltime': 5,

One comment I have is that your calculation of ncpus is missing the factor of 2, so the numbers will be half of what you expected. I doubt this is the main cause, though.

Can you try with smaller node counts and see if it works?

BTW, I can't tell whether the new executable, specfem3d_globe_990cd4/bin/xspecfem3D, runs as expected. I just ran a sanity check to evaluate whether it completes scheduling with the requested resources.

Thanks @lee212 : that task layout looks correct to me. I had the impression, though, that MPI would be needed? That should result in the same layout anyway.

Hi @lee212 , I copied your script to my directory:

lei@login5 /gpfs/alpine/world-shared/geo111/lei/entk $ diff /gpfs/alpine/world-shared/geo111/lei/entk/run_entk.hrlee.py $WORLDWORK/csc393/hrlee/hpc-workflow/run_entk.py
111c111
<         'project': 'GEO111',
---
>         'project': 'CSC393',

However, the job was not successful. I doubt your job was successful either, since when I checked your job output directory, there were no output files generated:

ls $WORLDWORK/csc393/hrlee/hpc-workflow/run_0000/OUTPUT_FILES | wc -l
2

In a successful run, the output should be like this directory:

ls /gpfs/alpine/world-shared/geo111/lei/entk/specfem3d_globe_990cd4/OUTPUT_FILES/ | wc -l
739

Did you remove the output files of your job?

One more interesting behaviour I found in EnTK: the Python script seems to finish while the job is still running in the job queue. My impression was that the job should end first, and then the EnTK Python script would finish and exit.

Okay, I re-ran with the executable, and saw:

/bin/xspecfem3D: error while loading shared libraries: libblosc.so.1: cannot open shared object file: No such file or directory

I think you may have missed some modules:

module load gcc/4.8.5
module load spectrum-mpi
module load hdf5/1.8.18
module load cuda

module load zlib
module load sz
module load zfp
module load c-blosc

Maybe I should put them into my scripts.
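
For reference, attaching those module loads to the task's pre_exec would look roughly like this; a minimal sketch only, with an illustrative executable path (EnTK runs the pre_exec commands before launching the task executable):

    from radical.entk import Task

    t = Task()
    # Commands executed in the task's environment before the executable is launched;
    # the module list above goes here.
    t.pre_exec = [
        'module load gcc/4.8.5',
        'module load spectrum-mpi',
        'module load hdf5/1.8.18',
        'module load cuda',
        'module load zlib',
        'module load sz',
        'module load zfp',
        'module load c-blosc',
    ]
    t.executable = './bin/xspecfem3D'   # illustrative path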

I added these modules and ran a test with only 6 GPUs; the output is here: /gpfs/alpine/world-shared/csc393/hrlee/hpc-workflow/run_with_module_load_6gpus
I see some warning/error messages, but are these okay to ignore? Can you confirm?

Does it have to run with 384 GPUs? I submitted a new job anyway, which will likely start around 2pm on Monday.

Okay, the job with 384 GPUs is also complete; it seems to have failed with some errors even though these modules were added to pre_exec. The output is here: /gpfs/alpine/world-shared/csc393/hrlee/hpc-workflow/run_with_module_load_384gpus

Just in case, my stack versions are:

  radical.entk         : 1.0.2
  radical.pilot        : 1.2.1
  radical.saga         : 1.2.0
  radical.utils        : 1.2.2

Hi @lee212 , I think I have resolved most of the issues; most of them were just problems in my own script. Now I can successfully launch a few tasks using EnTK.

There is one remaining question. I found that when EnTK exits, the job still stays in the job queue and keeps burning hours, even though I think all the tasks have finished.

Here is what EnTK prints to the terminal.

...
submit: ########################################################################
Update: pipeline.0000.stage.0000.task.0000 state: EXECUTED
Update: pipeline.0000.stage.0000.task.0000 state: DONE
Update: pipeline.0000.stage.0000.task.0001 state: EXECUTED
Update: pipeline.0000.stage.0000.task.0001 state: DONE
Update: pipeline.0000.stage.0000.task.0002 state: EXECUTED
Update: pipeline.0000.stage.0000.task.0002 state: DONE
Update: pipeline.0000.stage.0000.task.0003 state: EXECUTED
Update: pipeline.0000.stage.0000.task.0003 state: DONE
Update: pipeline.0000.stage.0000 state: DONE
Update: pipeline.0000 state: DONE
close unit manager                                                            ok
wait for 1 pilot(s)
              0                                                               ok
closing session re.session.login4.lei.018359.0009                              \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
+ re.session.login4.lei.018359.0009 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 457.8s                                                      ok
All components terminated

I suppose the job (in the LSF job queue) should finish together with the 'All components terminated' message. Am I correct?

Yes, the job should finish. @wjlei1990 , can you please provide (or point me to) a client side sandbox to check what is happening? That is most likely an RP-level error.

a client side sandbox

Hi Andre, I put one example here:

/gpfs/alpine/world-shared/geo111/lei/entk.small/sandbox/re.session.login4.lei.018359.0011/

Hi, I would also like to know how to do performance benchmarking. Things that are interesting to me include how to measure the overhead, and also the time spent in each task and stage.

Following @lee212's suggestion, I put the following flags in my .bashrc file.

  export RADICAL_PROFILE="TRUE"
  export RADICAL_ENTK_PROFILE="TRUE"
  export RADICAL_PILOT_PROFILE="TRUE"

I think EnTK will generate some profiling files. Could you provide me with some instructions on how to use them?

radical.analytics might provide you with some numbers/plots, but be aware that it is heavily under development and you may see errors/issues often. Please use it with caution. I think I have full instructions somewhere, but a quick guide to try it out is below:

git clone https://github.com/radical-cybertools/radical.analytics.git
cd radical.analytics
pip install .
export RADICAL_PILOT_DBURL=mongodb://rct:rct_test@two.radical-project.org/rct_test

Once this is complete, you can run analytics on a particular session. In practice I do, for example:

ln -s /gpfs/alpine/world-shared/geo111/lei/entk.small/re.session.login4.lei.018359.0011/ .
bin/radical-analytics-inspect re.session.login4.lei.018359.0011

If this completes successfully, you may find files generated like:

re.session.login4.lei.018359.0011.stats
re.session.login4.lei.018359.0011_conc.png
re.session.login4.lei.018359.0011_dur.png
re.session.login4.lei.018359.0011_rate.png
re.session.login4.lei.018359.0011_util.png

The *.stats file provides timing values in plain text, and the others are plots with different filters applied.

Matteo and Andre can provide a more in-depth explanation of its usage, and can correct me if something is missing.

@wjlei1990 , I know this is about testing 384 GPUs for one task for now, but did you run a test with multiple tasks as well? I am just curious whether the current version works seamlessly when we increase the number of concurrent tasks.

I did a test with 5 concurrent tasks, each task with 384 nodes. The job ran successfully and the output files look good to me. I haven't done any performance checks yet.
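
For context, the concurrent tasks are set up simply by adding several tasks to the same stage; this is only a rough sketch (the executable path and per-task requirements are illustrative, copied from earlier in this thread rather than from my actual script):

    from radical.entk import Pipeline, Stage, Task

    p = Pipeline()
    s = Stage()

    for _ in range(5):                      # tasks within one stage may run concurrently
        t = Task()
        t.executable = './bin/xspecfem3D'   # illustrative
        t.cpu_reqs = {'processes': 384, 'process_type': 'MPI',
                      'threads_per_process': 4, 'thread_type': 'OpenMP'}
        t.gpu_reqs = {'processes': 1, 'process_type': None,
                      'threads_per_process': 1, 'thread_type': 'CUDA'}
        s.add_tasks(t)

    p.add_stages(s)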

Here is one example output from radical-analytics-inspect.

Maybe you can teach me how to interpret it at this week's meeting.


1. Small Scale Test

This one is from a small scale test.

re.session.login4.lei.018359.0011 [4]
    Agent Nodes         :          0.000     0.000%   !  ['agent']
    Pilot Startup       :      10608.329     8.095%      ['boot', 'setup_1']
    Warmup              :       2509.556     1.915%      ['warm']
    Prepare Execution   :          2.492     0.002%      ['exec_queue', 'exec_prep']
    Pilot Termination   :     120440.127    91.906%      ['term']
    Execution RP        :         13.004     0.010%      ['exec_rp', 'exec_sh', 'term_sh', 'term_rp']
    Execution Cmd       :      15596.728    11.902%      ['exec_cmd']
    Unschedule          :          8.740     0.007%      ['unschedule']
    Draining            :       2084.176     1.590%      ['drain']
    Idle                :      97030.667    74.043%      ['idle']
    total               :     131046.778   100.000%      

    total               :     131046.778   100.000%
    over                :     232697.090   177.568%
    work                :      15596.728    11.902%
    miss                :    -117247.040   -89.470%

[Attached plots for re.session.login4.lei.018359.0011: state, concurrency, duration, rate, utilization]


2. Full Scale Test

This one is a full-scale test, with 5 concurrent tasks, each with 384 nodes. There are a total of 50 tasks in the stage.

re.session.login5.lei.018358.0000 [1]
    Agent Nodes         :          0.000     0.000%   !  ['agent']
    Pilot Startup       :     352848.755    18.936%      ['boot', 'setup_1']
    Warmup              :      78204.028     4.197%      ['warm']
    Prepare Execution   :         52.962     0.003%      ['exec_queue', 'exec_prep']
    Pilot Termination   :    1511431.231    81.113%      ['term']
    Execution RP        :        216.342     0.012%      ['exec_rp', 'exec_sh', 'term_sh', 'term_rp']
    Execution Cmd       :     132558.541     7.114%      ['exec_cmd']
    Unschedule          :         72.372     0.004%      ['unschedule']
    Draining            :      32980.902     1.770%      ['drain']
    Idle                :    1171608.708    62.876%      ['idle']
    total               :    1863362.381   100.000%      

    total               :    1863362.381   100.000%
    over                :    3147415.300   168.911%
    work                :     132558.541     7.114%
    miss                :   -1416611.460   -76.024%

Below are the figures.
[Attached plots for re.session.login5.lei.018358.0000: state, concurrency, duration, rate, utilization]

Hi @wjlei1990 :

a client side sandbox
Hi Andre, I put one example here:
/gpfs/alpine/world-shared/geo111/lei/entk.small/sandbox/re.session.login4.lei.018359.0011/

Thanks - but that is the pilot sandbox. I meant the session directory on the client side, i.e., the one created in the location where you run the EnTK script. Thanks!

As for the analysis: there is obviously something off with the utilization; I'll look into it. But in general the utilization won't look great, since you are not using the CPU cores and we then count those as idle resources.

Hi Andre, the directory is here:

/gpfs/alpine/world-shared/geo111/lei/entk.small/re.session.login4.lei.018359.0011

This one is a small-scale job.

If you are looking for a full-scale job:

/gpfs/alpine/world-shared/geo111/lei/entk/re.session.login5.lei.018358.0000

Thanks. From the logs, it looks like the pilot job gets canceled all right:

radical.log:1586271908.106 : pmgr_launching.0000  : 52789 : 140735340868016 : DEBUG    : update cancel req: pilot.0000 1586271908.1062713
radical.log:1586271908.107 : pmgr_launching.0000  : 52789 : 140735340868016 : DEBUG    : killing pilots: last cancel: 1586271908.1062713
radical.log:1586271916.831 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Running']
radical.log:1586271926.845 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Running']
radical.log:1586271936.858 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Running']
radical.log:1586271946.871 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Running']
radical.log:1586271956.884 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Done']

Cancellation takes a while, but that's LSF taking its time. Do you see the job alive for longer than a couple of minutes?

Got it. So as long as the job finishes within a few minutes of the EnTK script exiting, it should be fine. I observed a lag of a few minutes for my small-scale job.

I haven't monitored the large-scale job yet, since it is a bit difficult to predict when it will start running, but I will keep an eye on it.