radical-collaboration/hpc-workflows

mpiexec or mpirun

Weiming-Hu opened this issue ยท 15 comments

I have tested the OpenMP MPI hybrid program on NCAR Cheyenne. It appears to me that on Cheyenne, I need to run the program with mpiexec_mpt in order to achieve full performance.

I noticed that EnTK by default uses mpirun to launch programs (maybe I'm mistaken?), is there a proper way to change this? Do you have any suggestions for running hybrid programs on Stampede2?

Thank you

We will provide an mpiexec_mpt configuration for RCT.

Thank you Andre. Just to double-check this. mpiexec_mpt is on Cheyenne, but I'm not sure if it is available on Stampede2. Are you aware of any of this? Thank you.

Uh, I may read this ticket incorrectly - are you asking for mpiexec_mpt on Stamedee then?

BTW: it seems we run out of allocation on cheyenne?

qsub: Account is Overspent: Standby Queue Disabled. Contact cislhelp@ucar.edu

To expand on MPT: MPT is an SGI specific MPI flavor. Cheyenne is an SGI machine, and thus has that MPI version. Stampede2 is a Dell machine, and has no MPT - and will not have.

Do you have specific problem with the used mpi version, like performance degredation, or was the question for MPT an attempt to reproduce the Cheyenne runs as closely as possible?

Hi @andre-merzky, Thank you for your reply.

For your first question on the overspending account, I'm going to pin @cervone so that he can help to resolve this.

For your second question: I guess right now, I don't have specific problems with MPI versions. Sorry for the confusion. Please let me expand on this a bit more.

The analog program is OpenMP and MPI hybrid but there are no OpenMP codes within the MPI parallel processes, in other words, MPI and OpenMP are not nested or worker processes are only single-threaded. OpenMP is only used by the master process to speed up the bottleneck. That way, other worker processes don't wait for too long while the master process is going through the bottleneck.

On Cheyenne, when I launch my program with mpirun, each process is only allowed 1 thread. Even multiple threads can be created, they are sharing only one core. Therefore, I need to use mpiexec_mpt with the additional argument omplace. So I would do something like mpiexec_mpt omplace -np 360 anen_grib_mpi.

I have consulted the helpdesk of Stampede2, and they suggested, to achieve the full performance in my case, I need to use ibrun. I have tested these and multi-threading is working in this fashion.

To give you some more information, file I/O (parallelized with MPI) takes about 20 minutes and analog computation (parallelized with MPI) takes about 20 minutes. The bottleneck (master process only with one thread) takes about 10 ~ 15 minutes. That's why I parallelized the bottleneck with OpenMP and hope to bring the bottleneck runtime down because I don't want other worker processes to waste too much time waiting for the master.

I apologize for my verbosity. But I hope this is clear. Thank you.

Thanks for the verbosity actually, that cleans things up!

Please set the EnTK equivalent of

cpu_threads = <n>
cpu_thread_type = OpenMP

that will provide all processes with the respective number of codes and will set OMP_NUM_THREADS accordingly. Note though that this holds for all processes in the MPI task, not only for rank.0 ! At the moment, we can't express heterogeneous task layout, i.e., layouts where ranks in the same task need different resources provided. But I assume that is the same what mpiexec_mpt would have provided you with?

FWIW, we can switch to ibrun if that turns out to make a difference, but we I am not sure how well our ibrun connector handles those mappings. Happy to look into it, but would prefer to stay on mpirun or mpiexec unless we detect a problem with that.

The EnTK settings should be:

    'threads_per_process' : <n>,
    'thread_type'         : 'OpenMP

That makes perfect sense. And for the heterogeneous task layout, actually, in anen program, worker processes are allowed to create only one thread. It is assumed that all cores have been occupied by one process through proper setup and launching. So I wouldn't worry about "overloading" the machine.

So this is what I plan to do: I'm going to test run with the new settings (threads_per_process and thread_type) you just provided on Stampede2 and see how well it goes. Does that sound good?

So I wouldn't worry about "overloading" the machine

I am not worried about overloading, I am worried about underloading (underutilizing) the machine :-) Assume you defined treads_per_process as 4. Now every MPI rank is assumed to have 4 OpenMP threads, and thus will get 4 cores assigned. If the ranks (apart from rank.0) then only use one thread, then 3 out of 4 cores remain idle...

Lets see how much of a problem that will create... Either way, you need to make sure that the pilot is large enough to hold all ranks (n_procs * nthreads cores)

This is the profiling for one task

****************     Profiler Summary    *****************
                                            Total: processor time (00:36:48,    100%)
                                reading forecasts: processor time (00:05:00, 13.587%)	 current memory (7727300608 bytes)	 peak memory (7755616256 bytes)
                                 reading analysis: processor time (00:03:07,  8.469%)	 current memory (12742877184 bytes)	 peak memory (12742877184 bytes)
                 calculating wind speed/direction: processor time (00:01:43,  4.665%)	 current memory (12743110656 bytes)	 peak memory (12743110656 bytes)
                            reformatting analysis: processor time (00:00:36,  1.630%)	 current memory (17758724096 bytes)	 peak memory (17758724096 bytes)
                     extracting test/search times: processor time (00:00:00,  0.000%)	 current memory (17758961664 bytes)	 peak memory (17758961664 bytes)
          Master scattering forecasts (AnEnISMPI): processor time (00:00:35,  1.585%)	 current memory (17759236096 bytes)	 peak memory (17759236096 bytes)
       Master scattering observations (AnEnISMPI): processor time (00:00:53,  2.400%)	 current memory (17759612928 bytes)	 peak memory (17759612928 bytes)
       Master scattering observations (AnEnISMPI): processor time (00:00:00,  0.000%)	 current memory (17759830016 bytes)	 peak memory (17759830016 bytes)
                 Master preprocessing (AnEnISMPI): processor time (00:00:16,  0.725%)	 current memory (33780224000 bytes)	 peak memory (33780224000 bytes)
Master waiting for analog computation (AnEnISMPI): processor time (00:03:11,  8.650%)	 current memory (33780424704 bytes)	 peak memory (33780424704 bytes)
             Master receiving analogs (AnEnISMPI): processor time (00:00:53,  2.400%)	 current memory (33780469760 bytes)	 peak memory (33780469760 bytes)
                     writing multivariate analogs: processor time (00:20:01, 54.393%)	 current memory (33782976512 bytes)	 peak memory (38302924800 bytes)
               writing forecasts and observations: processor time (00:00:27,  1.223%)	 current memory (36248170496 bytes)	 peak memory (38302924800 bytes)
**************** End of Profiler Summary *****************

The dimension of the task is

Number of nodes: 10
Number of MPI tasks: 680
Number of stations: 4000 (USA total 101280 grids, ~ 26 such tasks)
Number of test times: 347 (~ 1 year)
Number of lead times: 37
Number of search times: 1075 (2 years + 1 operational year)

I would like to discuss during our meeting today about the possibilities of running the task at scale and whether we need further optimization.

This is the profiling with the second stage for the PV simulation with a single thread.

(venv) geogadmins-Air:RenewableSimulator wuh20$ python evergreen.py --profile
Simulation dimensions:

-- 1 scenarios
-- 1910 stations
-- 347 test days
-- 37 lead times
-- 11 analog memebrs

Start PV simulation with AnEn ...
Early stopping is engaged. Progress will be terminated after 2035 simulated instances including 5 * 37 lead times * 11 analog members
PV simulation |โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 100.0% - 0s
Simulation terminated due to the profiling tool engaged

  _     ._   __/__   _ _  _  _ _/_   Recorded: 13:56:07  Samples:  20724
 /_//_/// /_\ / //_// / //_'/ //     Duration: 21.768    CPU time: 21.658
/   _/                      v3.1.3

Program: evergreen.py --profile

21.767 <module>  evergreen.py:17
โ””โ”€ 21.767 run_pv_simulations_with_analogs  evergreen.py:38
   โ”œโ”€ 15.698 simulate_single_instance  Functions.py:70
   โ”‚  โ”œโ”€ 5.941 disc  pvlib/irradiance.py:1273
   โ”‚  โ”‚     [1973 frames hidden]  pvlib, pandas, numpy, abc, <__array_f...
   โ”‚  โ”œโ”€ 4.476 sapm  pvlib/pvsystem.py:1586
   โ”‚  โ”‚     [1137 frames hidden]  pvlib, pandas, numpy, abc, <__array_f...
   โ”‚  โ”œโ”€ 1.638 haydavies  pvlib/irradiance.py:700
   โ”‚  โ”‚     [1192 frames hidden]  pvlib, pandas, numpy, abc
   โ”‚  โ”œโ”€ 1.012 sapm_effective_irradiance  pvlib/pvsystem.py:1868
   โ”‚  โ”‚     [763 frames hidden]  pvlib, pandas, numpy, abc, <__array_f...
   โ”‚  โ”œโ”€ 1.002 poa_components  pvlib/irradiance.py:440
   โ”‚  โ”‚     [899 frames hidden]  pvlib, pandas, numpy, abc
   โ”‚  โ”œโ”€ 0.998 aoi  pvlib/irradiance.py:192
   โ”‚  โ”‚     [757 frames hidden]  pvlib, pandas, numpy, abc
   โ”‚  โ””โ”€ 0.479 sapm_cell  pvlib/temperature.py:33
   โ”‚        [371 frames hidden]  pvlib, pandas, numpy, abc
   โ”œโ”€ 2.495 [self]  
   โ”œโ”€ 1.957 get_solarposition  pvlib/location.py:166
   โ”‚     [521 frames hidden]  pvlib, pandas, numpy, abc, <__array_f...
   โ”œโ”€ 0.717 _StartCountStride  netCDF4/utils.py:89
   โ”‚     [12 frames hidden]  netCDF4, numpy
   โ””โ”€ 0.275 __new__  numpy/ma/core.py:2782
         [18 frames hidden]  numpy

I noticed that EnTK is using srun. But I have encountered the following issue when using srun on Stampede.

c431-041[knl](1019)$ srun -N 1 -n 1 anen_grib_mpi -c anen.cfg chunk.cfg --out  /scratch/04672/tg839717/test.nc > test.log
Error: This is an MPI program. You need to launch this program with an MPI launcher, e.g. mpirun or mpiexec.
[proxy:0:1@c475-023.stampede2.tacc.utexas.edu] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "c475-023.stampede2.tacc.utexas.edu" to "localhost" (Connection refused)
[proxy:0:1@c475-023.stampede2.tacc.utexas.edu] HYD_singleton_connect (../../pm/pmiserv/pmip_cb.c:4309): cannot connect to singleton port
[proxy:0:1@c475-023.stampede2.tacc.utexas.edu] main (../../pm/pmiserv/pmip.c:522): unable to establish singleton connection
[mpiexec@c431-041.stampede2.tacc.utexas.edu] control_cb (../../pm/pmiserv/pmiserv_cb.c:864): connection to proxy 1 at host c475-023 failed
[mpiexec@c431-041.stampede2.tacc.utexas.edu] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@c431-041.stampede2.tacc.utexas.edu] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:520): error waiting for event
[mpiexec@c431-041.stampede2.tacc.utexas.edu] main (../../ui/mpich/mpiexec.c:1149): process manager error waiting for completion
[unset]: readline failed
[unset]: readline failed
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(435).....: MPI_Finalize failed
MPI_Finalize(346).....: fail failed
MPID_Finalize(219)....: fail failed
MPIDI_PG_Finalize(141): PMI_Finalize failed, error -1
[unset]: write_line error; fd=9 buf=:cmd=abort exitcode=806978831
:
system msg for write_line failure : Bad file descriptor
srun: error: c431-041: task 0: Exited with exit code 15

@Weiming-Hu : what MPI module do you use on Stampede2 and Cheyenne to compile your code? Thanks.

These are the modules I use:

module load boost pnetcdf/1.11.0 python3/3.7.0 phdf5/1.10.4 parallel-netcdf/4.6.2

This is now about running Penn State on Cheyenne