GlobalArrays/ga

When using GA_get, both processes get stuck in epoll_wait

Closed this issue · 12 comments

I run two processes with Intel MPI (oneAPI 2021). When the program reaches ga->get, process 0 passes it but gets stuck at a later point, while process 1 gets stuck inside the get itself. This happens whether the targets of the two processes are the same or completely different. I used strace to check the two stuck processes; the results show that both are waiting in epoll_wait. Why does this happen, and how can I track down the error?

strace and lsof output (pfci.x is my application):

//use `strace -p  31352`  
epoll_wait(18, [], 100, 0)              = 0
//use `strace -p 31351` 
epoll_wait(19, [], 100, 0)              = 0
//use `lsof -p 31352`
pfci.x  31352 jslo   18u  a_inode               0,13         0     11815 [eventpoll]
//use `lsof -p 31351`
pfci.x  31351 jslo   19u  a_inode               0,13         0     11815 [eventpoll]

By the way, the same environment can run the example programs with the "get"/"put" operations successfully.
The relevant part of my code is below:

GA::Initialize(argc, argv, heap, stack, MT_F_REAL, 0);

int gaSpace_type = MT_C_CHAR;
int gaSpace_ndim = 2;
int gaSpace_dims[2] = {100, 129};
int nblockSpace[2] = {nprocs, 1};
int mapsSpace[nprocs+1];

GA::GlobalArray *gaSpace = GA::SERVICES.createGA(gaSpace_type, gaSpace_ndim, gaSpace_dims, (char*)"gaSpace", nblockSpace, mapsSpace);

// the put part (omitted here) writes the char array into the global array `gaSpace`

// lo, hi, lo1, hi1, ld, rowState, colState, and rank are set elsewhere in the full code
gaSpace->get(lo, hi, rowState, &ld);
lo1[0] = 0 + rank*33; lo1[1] = 0;
hi1[0] = 32 + rank*33; hi1[1] = 128;
gaSpace->get(lo1, hi1, colState, &ld);

Please tell me if anything else is needed. I hope to get your reply, thank you!

Please provide MPI details and how you compiled ARMCI (e.g. ARMCI_NETWORK).

Thanks for your reply.
My MPI version is:

~$ mpiexec --version
Intel(R) MPI Library for Linux* OS, Version 2021.2 Build 20210302 (id: f4f7c92cd)
Copyright 2003-2021, Intel Corporation.

I didn't set ARMCI_NETWORK or anything related to ARMCI when compiling GA (5.8.0). My configure command is:

./configure F77=gfortran CC=gcc CXX=g++ MPIF77=mpif77 MPICXX=mpicxx MPICC=mpicc --with-gnu-ld --enable-cxx

followed by make && make install.
I didn't compile ARMCI independently.

Can you share the config.log in the GA directory plus the ARMCI and COMEX subdirectories? Thanks!

The attachments are these files.
I also think I have found the superficial problem:
I use Boost.MPI and PETSc together with GA; the initialization order is 1. GA, 2. Boost, 3. PETSc. The stuck point is after PETSc initializes. Rank 1 is stuck in the GA get operation and rank 2 is stuck in PETSc's MatAssemblyBegin (all the printf calls before MatAssemblyBegin are printed by rank 0), and the MatAssemblyBegin call comes after the GA get.
I tried calling GA::SERVICES.sync() before MatAssemblyBegin, and the hang no longer occurs! Maybe there is some conflict between ga.get and PETSc's MatAssemblyBegin? I really hope this can be fixed, since both PETSc and GA are powerful tools in high-performance computing.
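
A minimal sketch of the workaround as I applied it, reusing the variable names from the code excerpt above (hamMat is only a placeholder for my PETSc matrix, not its real name):

gaSpace->get(lo1, hi1, colState, &ld);        // one-sided GA read
GA::SERVICES.sync();                          // barrier + flush of outstanding GA operations
MatAssemblyBegin(hamMat, MAT_FINAL_ASSEMBLY); // hamMat: placeholder Mat; collective PETSc assembly
MatAssemblyEnd(hamMat, MAT_FINAL_ASSEMBLY);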

comex-config.log
armci-config.log
ga-config.log

configure:10551: WARNING: No ARMCI_NETWORK specified, defaulting to MPI_TS

MPI-TS does not make asynchronous progress, which likely explains why this program deadlocks. Please set ARMCI_NETWORK=MPI-PR and recompile GA. Note that this will require you to run on at least 2 processes per node, because 1 process per node is dedicated to communication.

I would also recommend initializing MPI before GA and PETSc, just to be safe. You can do this with Boost.MPI.

In any case, a GA program that deadlocks because of lack of progress is suggestive of a suboptimal usage model. I cannot say incorrect, because there is not a proper specification to reason about GA, but all of the GA programs I've studied are only sensitive to asynchronous progress in performance, not successful termination.

Hi, I am sorry for the late reply. I have reconfigured GA with the option "--with-mpi-pr" and rebuilt it. However, when I run my application it now gets stuck in the first boost broadcast (I have put the Boost.MPI init before the GA init).

I think this is because of the dedicated communication rank. From the output, I guess that by default the last rank is set as the dedicated communication rank. So I made a new communicator "local" with world.split (the Boost.MPI API) that excludes the last rank. Using it for the broadcast, the program gets past that point. A stranger thing is that after this first broadcast, there is another broadcast that uses the communicator "world" and passes through successfully. The difference between the two broadcasts is that the first, "stuck" broadcast sends a class object handled by boost.serialization, while the second, "passed" broadcast sends an integer.
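
Roughly what I did for the split (a sketch; the assumption that the last rank is the dedicated communication rank is only my guess from the output, and someObject stands for the serialized class object):

boost::mpi::communicator world;
// color 1 = compute ranks, color 0 = the (guessed) dedicated communication rank
int color = (world.rank() == world.size() - 1) ? 0 : 1;
boost::mpi::communicator local = world.split(color);
boost::mpi::broadcast(local, someObject, 0);  // someObject: placeholder for the boost.serialization object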

But then it gets stuck again in the PETSc initialization (actually SLEPc, the extension built on top of PETSc). I have no idea how to fix that. And I have not yet tested whether MPI_PR fixes the hang described at the beginning of this issue.

My questions are:

  1. How can I fix the hang in the PETSc init?
  2. How do I select the suitable mode among MPI_PR, MPI_MT, etc.? Is there any document or web page introducing their differences?
  3. When using MPI_PR, should I write if (rank != last rank) everywhere? Or what is the proper way to write code in MPI_PR mode?
  4. Why does the first broadcast hang with the global communicator "world" but pass with the local communicator that excludes the last rank, while the other broadcast passes with "world"?

Thank you !

For now, it might be simpler to try ARMCI-MPI. Grab https://github.com/nwchemgit/nwchem/blob/master/src/tools/install-armci-mpi and configure with --with-armci=$EXTERNAL_ARMCI_PATH. You'll want to set NWCHEM_TOP to something appropriate given EXTERNAL_ARMCI_PATH=${NWCHEM_TOP}/external-armci is happening in that script.

Is it possible for me to reproduce what you are doing? I am familiar enough with PETSc and can figure out Boost.MPI. I just need some clue on how to build your app.

The other thing you could do with GA and MPI-PR is use GA_MPI_Comm to get the new "world" communicator for use with PETSc. See https://github.com/GlobalArrays/ga#how-to-use-progress-ranks for details.
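
For example, a minimal sketch of that idea (assuming PETSc is initialized after GA):

#include "ga-mpi.h"
// after GA::Initialize(): hand the compute-rank communicator to PETSc before PetscInitialize
PETSC_COMM_WORLD = GA_MPI_Comm();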

Do you mean the hang is caused by an inappropriate installation of ARMCI? I will try this soon.

And it's very kind of you to be willing to reproduce the problem. The attachment is an excerpt of my app. I removed a lot of things to make it as clear as possible. I am sorry that I do not have the right to show some of the classes, but I don't think that hinders understanding of the program. I use //!!!!!! to mark the stuck point in the program.

It may help to know the theory behind my app. This is a full CI calculation program. What I want to do is:

  1. Produce the CI vectors and store them in the global char array.
  2. Every rank gets a part of the CI vector <psi_i| and calculates the Hamiltonian matrix elements in the matrix rows corresponding to the rank's local <psi_i|, then stores these elements with PETSc's MatSetValue (see the sketch after this list).
  3. Call SLEPc's Davidson solver to get the lowest eigenvalues and the corresponding eigenvectors.
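
A rough sketch of step 2 with the names from my excerpt (rowStart, rowEnd, nDet, computeHij, and hamMat are placeholders I use here, not the real code):

gaSpace->get(lo1, hi1, colState, &ld);             // fetch this rank's part of the CI vector
for (int i = rowStart; i < rowEnd; ++i) {          // rows owned by this rank (placeholder bounds)
    for (int j = 0; j < nDet; ++j) {
        double hij = computeHij(i, j, colState);   // placeholder: Hamiltonian matrix element <psi_i|H|psi_j>
        MatSetValue(hamMat, i, j, hij, INSERT_VALUES);
    }
}
MatAssemblyBegin(hamMat, MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(hamMat, MAT_FINAL_ASSEMBLY);
// then SLEPc's Davidson-type solver (EPSGD/EPSJD) computes the lowest eigenpairs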

Just tell me if anything is unclear or anything else is needed. Thanks for your help!

fullCI.txt

At this point, I don't know the root cause but more information is better. I'll see what I can learn from this.

Hello! You were right. The reason the app got stuck in PetscInitialize is that I didn't pass the GA_MPI_Comm communicator to PETSc. When I use GA_MPI_Comm with Boost and PETSc, everything works normally.

Thanks for your help!


Summary (please correct me if anything is wrong)

  1. Without setting the ARMCI network mode at configure time, the MPI_TS network mode is adopted. In MPI_TS mode, GA does not make asynchronous progress, which may cause a conflict between GA's and PETSc's communication and a hang (in my case, between GA get and PETSc's MatAssembly). So designate the ARMCI network as MPI_PR, like
./configure --with-mpi-pr=1  --...(other options)
  2. About ARMCI and MPI_PR, here is some introduction. In short, ARMCI manages the one-sided communication on top of MPI. With MPI_PR it uses one process on each node to manage the communication, which means only 3 ranks can be used for the actual computation when running mpirun -n 4.

  3. How do you select the processes that do the actual computation with MPI_PR? Here is how. Following that introduction, what I did when using GA (MPI_PR) with Boost and PETSc (the same goes for SLEPc) is described below.

Attention: the app will get stuck in the PETSc initialization if steps a) and c) are not done.

a) Make the new communicator ('comm' below) from GA. This new communicator contains only the processes that do the actual computation.

#include "ga-mpi.h"
MPI_Comm comm = GA_MPI_Comm();

b) Pass the new communicator to Boost and create a new Boost communicator ('local' below). Use this new Boost communicator for every Boost communication operation.

boost::mpi::comm_create_kind kind;
kind = boost::mpi::comm_duplicate;  
boost::mpi::communicator local(comm, kind);

c) Pass the new communicator to PETSc (this is enough for SLEPc as well); nothing else needs to be done for PETSc or SLEPc.

PETSC_COMM_WORLD = comm;   // must be set before PetscInitialize/SlepcInitialize
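
Putting steps a) to c) together, the initialization order that worked for me looks roughly like this (a sketch with error handling omitted; SlepcInitialize also initializes PETSc):

boost::mpi::environment env(argc, argv);                            // MPI is initialized first, as recommended above
GA::Initialize(argc, argv, heap, stack, MT_F_REAL, 0);              // GA spawns the progress ranks (MPI_PR)
MPI_Comm comm = GA_MPI_Comm();                                      // a) communicator holding only the compute ranks
boost::mpi::communicator local(comm, boost::mpi::comm_duplicate);   // b) Boost on the same ranks
PETSC_COMM_WORLD = comm;                                            // c) must be set before (Petsc/Slepc)Initialize
SlepcInitialize(&argc, &argv, NULL, NULL);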

Awesome. I'm glad it works.