marekandreas/elpa

Deadlock in ELPA-gpu when making consecutive calls with different comm sizes (due to check_for_gpu()'s MPI_Allreduce)

Closed this issue · 16 comments

Hi,

I have observed that CP2K deadlocks with certain rank counts when running with the ELPA backend. I don't know ELPA very well, but with the debugger I think I could gather enough info to pinpoint the problem.

This is what happens in the run (with 8 MPI ranks):

  1. CP2K calls the ELPA solver with a communicator of size 4.
  2. ELPA initializes the GPUs for the 4 ranks in check_for_gpu() #442.
  3. The ELPA solver succeeds and execution continues normally.
  4. Later, CP2K calls the ELPA solver again, but with a communicator of size 8.
  5. Now there is a deadlock at check_for_gpu() #507. Since the first 4 ranks were already initialized, they exited the function earlier #453 and never reach the Allreduce, so the remaining 4 ranks hang there forever (a simplified sketch of this pattern follows below).
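
For context, here is a heavily simplified sketch of the pattern (not the actual ELPA source; names and structure are condensed for illustration):

! Simplified sketch of the problematic pattern in check_for_gpu().
! Because of the initializer, gpuIsInitialized is implicitly SAVEd and
! keeps its value across calls on a given rank.
module gpu_check_sketch
  use mpi
  implicit none
  logical :: gpuIsInitialized = .false.

contains

  subroutine check_for_gpu_sketch(comm, success)
    integer, intent(in)  :: comm
    logical, intent(out) :: success
    integer :: ierr, localSuccess, globalSuccess

    if (gpuIsInitialized) then
      ! Ranks that already initialized their GPU in an earlier call
      ! (with a smaller communicator) return here ...
      success = .true.
      return
    end if

    localSuccess = 1   ! pretend GPU initialization worked on this rank

    ! ... and therefore never reach this collective call, so ranks that
    ! still need to initialize block here forever.
    call MPI_Allreduce(localSuccess, globalSuccess, 1, MPI_INTEGER, &
                       MPI_MIN, comm, ierr)

    success = (globalSuccess == 1)
    gpuIsInitialized = success
  end subroutine check_for_gpu_sketch

end module gpu_check_sketch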

I don't know whether calling ELPA with different comm sizes is allowed or not. But my first thought would be that check_for_gpu() should first query the value of every rank's gpuIsInitialized and restart the initialization for everyone if any rank was not yet initialized; this seems to fix the deadlock in my case (a sketch of the idea follows below).
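
A minimal sketch of that workaround, reusing the simplified module from the snippet above (this is only the idea I tried, not the fix that ELPA eventually adopted):

! Workaround sketch: before anyone takes the early-return path, all ranks of
! the current communicator agree whether any of them still needs to
! initialize; if so, everybody goes through the collective path again.
subroutine check_for_gpu_with_reset(comm, success)
  use mpi
  use gpu_check_sketch, only : gpuIsInitialized, check_for_gpu_sketch
  implicit none
  integer, intent(in)  :: comm
  logical, intent(out) :: success
  logical :: myUninitialized, anyUninitialized
  integer :: ierr

  myUninitialized = .not. gpuIsInitialized
  call MPI_Allreduce(myUninitialized, anyUninitialized, 1, MPI_LOGICAL, &
                     MPI_LOR, comm, ierr)

  ! Force every rank of comm through the collective initialization path.
  if (anyUninitialized) gpuIsInitialized = .false.
  call check_for_gpu_sketch(comm, success)
end subroutine check_for_gpu_with_reset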

Thank you for this detailed analysis. We will verify this and then correct it. I do agree that the check_for_gpu option should be "reset" if the communicator changes.

Dear @LStuber,

could you provide pseudo code of what you do when the deadlock appears? In particular, when do you call elpa%setup, and when do you set mpi_comm_parent and the other communicators? And when you create new communicators and want ELPA to use them, what do you set via elpa%set?
Thank you!

Unfortunately I can't provide that easily, as I don't know which part of CP2K handles that setup; the only thing I know is that the comm size changes between calls. So your message seems to imply that this is not an ELPA bug but rather a wrong setup, correct? Could you advise what type of error could lead to this deadlock?

@LStuber: No, it is a bug. I am just trying to find the best way to clean up the code and remove the bug. When doing so I stumbled upon the question of whether it is a reasonable approach to require users to call elpa%setup again when communicators are changed, or whether ELPA should handle this transparently. Both approaches have pros and cons...

Well, since you ask me, I would by far prefer the "transparent" approach, as otherwise the application code would have to be updated, which not only might take some time but in my case would also mean there is no way to run older CP2K versions, which some users prefer. I agree that what I suggested is not the cleanest fix, but considering that this is a very annoying and difficult issue for users to debug, it would be great to have a quick workaround in the short term if that is possible.

In CP2K we actually call elpa%setup each time before we call elpa%eigenvectors.

@oschuett When using ELPA, do you provide the "mpi_comm_cols" and "mpi_comm_rows" communicators yourself, or do you provide on each rank the id of the row/column via "process_row" and "process_col"?

We use process_row/col. Essentially, all our calls to ELPA are in cp_fm_elpa.F:473 - 556.
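
For reference, the per-invocation pattern looks roughly like this (a hedged sketch only: variable names and the exact set of parameters are illustrative, not the actual code in cp_fm_elpa.F):

! Rough sketch of a single CP2K-style ELPA invocation: a fresh object is
! created, configured, set up, used for one solve, and deallocated.
subroutine solve_with_elpa(comm, na, nev, nblk, nrows_local, ncols_local, &
                           my_prow, my_pcol, a, ev, z)
  use elpa
  implicit none
  integer, intent(in) :: comm, na, nev, nblk, nrows_local, ncols_local
  integer, intent(in) :: my_prow, my_pcol
  real(kind=8), intent(inout) :: a(nrows_local, ncols_local)
  real(kind=8), intent(out)   :: ev(na), z(nrows_local, ncols_local)

  class(elpa_t), pointer :: e
  integer :: error

  ! elpa_init is normally called once per run; shown here for completeness.
  if (elpa_init(20170403) /= elpa_ok) stop "ELPA API version not supported"

  e => elpa_allocate(error)
  call e%set("na", na, error)                    ! global matrix size
  call e%set("nev", nev, error)                  ! number of eigenpairs wanted
  call e%set("local_nrows", nrows_local, error)  ! local part of the BLACS distribution
  call e%set("local_ncols", ncols_local, error)
  call e%set("nblk", nblk, error)                ! BLACS block size
  call e%set("mpi_comm_parent", comm, error)     ! the (possibly smaller) communicator
  call e%set("process_row", my_prow, error)      ! position in the process grid ...
  call e%set("process_col", my_pcol, error)      ! ... instead of row/col communicators

  if (e%setup() /= elpa_ok) stop "ELPA setup failed"

  call e%eigenvectors(a, ev, z, error)           ! a: matrix, ev: eigenvalues, z: eigenvectors

  call elpa_deallocate(e, error)
end subroutine solve_with_elpa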

Fixing this bug will take a bit more time. It turned out that this bug can only be triggered when using the ELPA object wrongly. ELPA was designed with the idea that certain characteristics of the object are immutable (essentially everything one has to set via ELPA's set methods before calling the ELPA setup method), especially the size of the matrix problem and how the matrix is distributed with BLACS.
However, due to another bug some of these characteristics could be changed after the setup of the ELPA object, which leads to inconsistencies and eventually to the bug reported above.
Fixing this is not very difficult per se; fixing it without changing how an application uses ELPA is the difficult part.

this bug can only be triggered when using the ELPA object wrongly.

So, how would I have to change CP2K's code to avoid the deadlock?

Dear @oschuett,
I do not know about the internals of CP2K, but what I recommend is to set up one ELPA object (with a fixed mpi_comm_parent) for the number of MPI tasks you want to use. So if you first want to run ELPA with 4 MPI processes, create such a communicator and create an ELPA object with it. If you later want to use a different number of MPI tasks, 8 in the above example, create a new MPI communicator and from this create a new ELPA object. Of course, once not needed anymore you can free "old" ELPA objects. Would this be a viable approach for CP2K?
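
A sketch of that recommendation (communicator handling only; the solve itself would be the elpa_allocate ... elpa_deallocate sequence sketched earlier, and all names here are illustrative):

! One ELPA object per communicator: a 4-rank object for the first phase,
! and a fresh object on the full communicator for the second phase.
program elpa_per_communicator_sketch
  use mpi
  implicit none
  integer :: ierr, my_rank, sub_comm, color

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)

  ! Phase 1: only ranks 0..3 take part in the eigensolve.
  color = merge(0, MPI_UNDEFINED, my_rank < 4)
  call MPI_Comm_split(MPI_COMM_WORLD, color, my_rank, sub_comm, ierr)

  if (sub_comm /= MPI_COMM_NULL) then
    ! Create an ELPA object with mpi_comm_parent = sub_comm, solve,
    ! then elpa_deallocate that object and free the communicator.
    call MPI_Comm_free(sub_comm, ierr)
  end if

  ! Phase 2: all 8 ranks participate. Build a *new* ELPA object with
  ! mpi_comm_parent = MPI_COMM_WORLD instead of reusing the 4-rank one.

  call MPI_Finalize(ierr)
end program elpa_per_communicator_sketch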

Of course, once not needed anymore you can free "old" ELPA objects. Would this be a viable approach for CP2K?

In CP2K we are in fact creating a new ELPA object for each invocation.

The problem is that gpuIsInitialized is an implicit SAVE variable:

logical :: gpuIsInitialized=.false.

So, some of the state is stored in a global variable instead of the ELPA object.
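
For readers less familiar with Fortran: a variable declared with an initializer implicitly gets the SAVE attribute, so the flag keeps its value across calls and effectively acts as a per-process global. A tiny standalone illustration (not ELPA code):

! The "= .false." initializer gives the flag the SAVE attribute, so it is
! set only once per process, no matter how often the routine is called.
module flag_demo
  implicit none
contains
  subroutine touch()
    logical :: initialized = .false.   ! implicitly SAVEd
    if (.not. initialized) then
      print *, "initializing (happens only once per process)"
      initialized = .true.
    else
      print *, "already initialized"
    end if
  end subroutine touch
end module flag_demo

program demo
  use flag_demo
  implicit none
  call touch()   ! prints "initializing ..."
  call touch()   ! prints "already initialized"
end program demo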

I do not know about the internals of CP2K

All the relevant code is here - it's literally less than 100 lines between elpa_allocate() and elpa_deallocate().

Dear @oschuett,
oh indeed, the global variable slipped my attention and is bad. I will fix this for rc2 of the 2022.11.001 release and ensure that all settings are per object and not global.

Dear @oschuett , @fstein, @LStuber ,

this problem has now been fixed by storing the GPU settings per ELPA object, and the global variables have been removed. This is currently in the master_pre_stage branch and will appear in the next few days in rc2 of ELPA 2022.11.001.
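
Conceptually, the change amounts to moving that state from module scope into the object itself, along these lines (a schematic sketch, not the actual ELPA 2022.11.001 code):

! Schematic of "GPU settings per object": each ELPA object carries its own
! initialization flag, so different objects/communicators no longer interact.
type :: elpa_obj_sketch
  logical :: gpuIsInitialized = .false.   ! per-object, not module-global
  integer :: mpi_comm_parent  = -1        ! illustrative field
end type elpa_obj_sketch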

Release candidate ELPA 2022.11.001.rc2 has been published, which fixes this issue.

Dear @marekandreas,
thanks a lot for the fix and the release candidate!
I've upgraded the CP2K toolchain and everything looks good :-)