ecmwf/fckit

f_comm%size and f_comm%rank always return 1 and 0

sebastienbarthelemy opened this issue · 8 comments

Hi,

I use fckit within a Fortran code, bump-standalone. While investigating another bug in bump, I noticed that f_comm%size and f_comm%rank are always equal to 1 and 0 respectively, whereas in this configuration f_comm%size should be 6 and f_comm%rank should range from 0 to 5. I have no idea where the problem comes from. Can you help?

Could it be that eckit was not compiled with MPI support?
You can double-check with the fckit executable:

fckit --info

CC @benjaminmenetrier

"fckit --info" returns:

fckit version (0.7.0), git-sha1 8502aae

Build:
build type : Debug
timestamp : 20200728163322
op. system : Linux-3.10.0-1062.18.1.el7.x86_64 (linux.64)
processor : x86_64
sources : /cluster/projects/nn9039k/NorCPM/Code/code_bump/bump-standalone/fckit
c++ compiler : GNU 9.3.0
flags : -pipe -O0 -g
fortran compiler: GNU 9.3.0
flags : -O0 -g -fcheck=bounds -fbacktrace -finit-real=snan

Features:
MPI : ON
final : ON
eckit : ON

Dependencies:
eckit version (1.11.6), git-sha1 cb9a4d1cebbc9

That looks OK to me.
If you can create a branch with a failing unit-test within fckit, that would surely help to understand / diagnose the problem.

Hi, the minimal failing test I can provide is the following. It is equivalent to the test "test_default_comm" in the test_mpi.F90 file provided with fckit:

program debug
  use fckit_module
  use fckit_mpi_module, only: fckit_mpi_comm
  implicit none

  type(fckit_mpi_comm) :: f_comm

  call fckit_main%init()

  ! Default communicator; expected to span all ranks started by mpirun
  f_comm = fckit_mpi_comm()
  write(*,*) 'f_comm%size()', f_comm%size()
  write(*,*) 'f_comm%rank', f_comm%rank()
end program debug

Both this code and the test "test_default_comm" give the same result when run with "mpirun -n 4 debug":

 f_comm%size()           1
 f_comm%rank           0
 f_comm%size()           1
 f_comm%rank           0
 f_comm%size()           1
 f_comm%rank           0
 f_comm%size()           1
 f_comm%rank           0

That actually works OK for me.
There are heuristics in eckit to determine whether MPI is actually being used (with mpirun, aprun, srun, ...). If not, it falls back to a "serial" eckit MPI implementation. See https://github.com/ecmwf/eckit/blob/develop/src/eckit/mpi/Comm.cc#L38-L54

As you can see, it is possible to force the "parallel" backend by setting

export ECKIT_MPI_FORCE=parallel

Try that and see whether it works. If so, we need to improve the eckit heuristics to better detect that you are running with mpirun.
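To see what the launcher actually exports to each rank, it can help to dump the launcher-related environment from inside the launched processes. The sketch below assumes eckit's detection is based on such variables. The names listed are ones that Open MPI, Hydra-based MPIs (MPICH, Intel MPI) and Slurm are known to set, but which of them eckit inspects is an assumption here, and the linked Comm.cc is authoritative. `launcher_vars` is a hypothetical helper name.

```shell
# Print the launcher-related environment variables visible to this process.
# The variable list is illustrative, not taken from eckit's source.
launcher_vars() {
    # "|| true" keeps the exit status at 0 when nothing matches (serial run).
    env | grep -E '^(OMPI_COMM_WORLD_(SIZE|RANK)|PMI_(SIZE|RANK)|SLURM_(NTASKS|PROCID))=' || true
}
```

With the function saved in a hypothetical file launcher_vars.sh, running `mpirun -n 2 sh -c '. ./launcher_vars.sh; launcher_vars'` prints one set of variables per rank. These variables only show which launcher started the process; they say nothing about which MPI library the executable itself was linked against.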

Hello. Unfortunately, exporting the variable ECKIT_MPI_FORCE did not bring any improvement, and the output of the test remains the same.

I have built the bump-standalone project with the JCSDA release-stable branch of ecbuild.
I have successfully run ctest -R fckit_test_mpi -VV:

test 118
    Start 118: fckit_test_mpi

118: Test command: /usr/local/apps/openmpi/4.0.3/GNU/7.3.0/bin/mpiexec "-n" "4" "/tmp/nawd/src/bump/bump-standalone/build/fckit/src/tests/fckit_test_mpi"
118: Environment variables: 
118:  OMP_NUM_THREADS=1
118: Test timeout computed to be: 1500
118:  test_default_comm
118:  default size:           4
118:  default rank:           2
118:  test_comm
118:  default size:           4
118:  default rank:           2
118:  world size:           4
118:  default world:           2
118:  test_default_comm
118:  test_set_comm_default
118:  test_uninitialised
118:  test_allreduce
118:  default size:           4
118:  default rank:           0
118:  test_comm
118:  default size:           4
118:  default rank:           0
118:  test_default_comm
118:  world size:           4
118:  default world:           0
118:  default size:           4
118:  default rank:           3
118:  test_comm
118:  default size:           4
118:  default rank:           3
118:  world size:           4
118:  default world:           3
118:  test_set_comm_default
118:  test_set_comm_default
118:  test_uninitialised
118:  test_allreduce
118:  test_uninitialised
118:  test_allreduce
118:  test_default_comm
118:  default size:           4
118:  default rank:           1
118:  test_comm
118:  default size:           4
118:  default rank:           1
118:  world size:           4
118:  default world:           1
118:  test_set_comm_default
118:  test_uninitialised
118:  test_allreduce
118:  test_allreduce_inplace
118:  test_allreduce_inplace
118:  test_allreduce_inplace
118:  test_allreduce_inplace
118:  test_allgather
118:  test_allgather
118:  test_allgather
118:  test_allgather
118:  test_broadcast
118:  test_broadcast
118:  test_broadcast
118:  test_broadcast
118:  test_nonblocking_send_receive
118:  test_blocking_send_receive
118:  test_blocking_send_receive_rank1
118:  test_blocking_send_receive_int32_rank1
118:  test_nonblocking_send_receive
118:  test_nonblocking_send_receive
118:  test_nonblocking_send_receive
118:  test_blocking_send_receive
118:  test_blocking_send_receive_rank1
118:  test_blocking_send_receive_int32_rank1
118:  test_blocking_send_receive_int64_rank1
118:  test_blocking_send_receive_int64_rank1
118:  receive-request:           2
118:  test_blocking_send_receive
118:  send-request:           1
118:  test_blocking_send_receive
118:  test_blocking_send_receive_rank1
118:  test_blocking_send_receive_int32_rank1
118:  test_blocking_send_receive_rank1
118:  test_blocking_send_receive_int64_rank1
118:  test_blocking_send_receive_int32_rank1
118:  test_blocking_send_receive_int64_rank1
1/1 Test #118: fckit_test_mpi ...................   Passed    1.71 sec

My bet is that eckit is not picking up MPI somehow during configuration. You can help it detect MPI by setting the environment variable MPI_HOME to the root of the MPI installation.
Can you check that libeckit_mpi.so actually links with MPI?

ldd lib/libeckit_mpi.so
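The full ldd listing can be long, so a small filter that keeps only the resolved MPI libraries makes the result easier to read. This is a minimal sketch: `mpi_libs` is a hypothetical helper, and it assumes the MPI runtime libraries have names starting with `libmpi`, which holds for the common Open MPI, MPICH and Intel MPI builds but is not guaranteed everywhere.

```shell
# Keep only resolved MPI libraries from `ldd` output read on stdin.
# Usage: ldd lib/libeckit_mpi.so | mpi_libs
mpi_libs() {
    # Match "name => /path (addr)" lines whose name starts with libmpi,
    # skip unresolved entries ("=> not found"), and print the resolved path.
    awk '$2 == "=>" && $1 ~ /^libmpi/ && $3 ~ /^\// { print $3 }'
}
```

An empty result suggests the library was built without MPI; a non-empty one shows which MPI installation it will load at run time.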

Hi Willem, thank you for taking the time to investigate this bug. Following a comment by Gilles Gouillardet on Stack Overflow, I was able to fix it. When running an MPI application with mpiexec, you have to make sure that mpiexec comes from the same MPI installation as the libraries linked into your executable.

What happened in my case is that, in order to compile the whole package (saber, fckit, ...), I had to load some modules, and loading them pulled in another module, "OpenMPI/4.0.3-GCC-9.3.0". In the end, libfckit.so was linked against the following MPI libraries:

libeckit_mpi.so => /cluster/projects/nn9039k/NorCPM/Code/code_bump/build/bump-standalone_debug/lib/libeckit_mpi.so (0x00002b2b10ea4000)
libmpicxx.so.12 => /cluster/software/impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28/lib64/libmpicxx.so.12 (0x00002b2b124f6000)
libmpifort.so.12 => /cluster/software/impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28/lib64/libmpifort.so.12 (0x00002b2b12716000)
libmpi.so.12 => /cluster/software/impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28/lib64/libmpi.so.12 (0x00002b2b12abf000)

while the command to run the test with bump was (file test/CTestTestfile.cmake):

"/cluster/software/OpenMPI/4.0.3-GCC-9.3.0/bin/mpiexec" "-n" "6" "/cluster/projects/nn9039k/NorCPM/Code/code_bump/build/bump-standalone_debug/bin/saber_bump.x" "testinput/bump_norcpm.yaml" "testoutput"

I have now corrected this command to avoid the library conflict, and it works fine. The correct command reads:

"/cluster/software/impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28/bin64/mpiexec" "-n" "6" "/cluster/projects/nn9039k/NorCPM/Code/code_bump/build/bump-standalone_debug/bin/saber_bump.x" "testinput/bump_norcpm.yaml" "testoutput"
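A mismatch like this can be caught before running by comparing the launcher's install root with the location of the MPI library the executable resolves to. The sketch below is a heuristic only: `same_mpi_root` is a hypothetical helper, and it assumes the conventional layout in which the launcher sits in `<root>/bin*/` and the libraries in a directory under the same `<root>`. Installations that rely on symlinks or wrapper scripts would need a different check.

```shell
# Succeed if the MPI library path lies under the launcher's install root.
# $1: absolute path to mpiexec/mpirun, $2: absolute path to a libmpi* file.
same_mpi_root() {
    launcher=$1
    lib=$2
    # <root>/bin*/mpiexec -> <root>
    root=$(dirname "$(dirname "$launcher")")
    case "$lib" in
        "$root"/*) return 0 ;;
        *)         return 1 ;;
    esac
}
```

Applied to the paths in this thread, the check fails for the OpenMPI/4.0.3-GCC-9.3.0 mpiexec paired with the impi libmpi.so.12, and succeeds for the impi mpiexec.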