f_comm%size and f_comm%rank always return 1 and 0
sebastienbarthelemy opened this issue · 8 comments
Hi,
I use fckit within a Fortran code, bump-standalone. While investigating another bug in bump, I noticed that f_comm%size and f_comm%rank are always equal to 1 and 0 respectively, whereas in this configuration f_comm%size should be 6 and f_comm%rank should range from 0 to 5. I have no idea where the problem is coming from. Can you provide some help?
Could it be that eckit was not compiled with MPI support?
You can double-check with the fckit executable:
fckit --info
"fckit --info" returns:
fckit version (0.7.0), git-sha1 8502aae
Build:
build type : Debug
timestamp : 20200728163322
op. system : Linux-3.10.0-1062.18.1.el7.x86_64 (linux.64)
processor : x86_64
sources : /cluster/projects/nn9039k/NorCPM/Code/code_bump/bump-standalone/fckit
c++ compiler : GNU 9.3.0
flags : -pipe -O0 -g
fortran compiler: GNU 9.3.0
flags : -O0 -g -fcheck=bounds -fbacktrace -finit-real=snan
Features:
MPI : ON
final : ON
eckit : ON
Dependencies:
eckit version (1.11.6), git-sha1 cb9a4d1cebbc9
That looks OK to me.
If you can create a branch with a failing unit-test within fckit, that would surely help to understand / diagnose the problem.
Hi, the minimal failing test I can provide is the following. It is equivalent to the test "test_default_comm" in the test_mpi.F90 file provided with fckit:
program debug
use fckit_module
use fckit_mpi_module, only: fckit_mpi_comm
type(fckit_mpi_comm) :: f_comm
call fckit_main%init()
f_comm = fckit_mpi_comm()
write(*,*) 'f_comm%size()', f_comm%size()
write(*,*) 'f_comm%rank', f_comm%rank()
end program debug
Both this code and the test "test_default_comm" provide the same results when running the command "mpirun -n 4 debug":
f_comm%size() 1
f_comm%rank 0
f_comm%size() 1
f_comm%rank 0
f_comm%size() 1
f_comm%rank 0
f_comm%size() 1
f_comm%rank 0
That actually works OK for me.
There are heuristics in eckit to determine whether MPI is actually being used (with mpirun, aprun, srun, ...). If not, it falls back to a "serial" eckit MPI implementation. See https://github.com/ecmwf/eckit/blob/develop/src/eckit/mpi/Comm.cc#L38-L54
As you can see, it is possible to force the "parallel" backend by setting
export ECKIT_MPI_FORCE=parallel
Check whether that works. If so, we need to improve the eckit heuristics to better detect that you are running with mpirun.
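To see what such a heuristic would find in a given environment, a small shell check can be run under the launcher. This is only a rough sketch: the function name guess_mpi_backend is hypothetical, and the list of environment variables is an assumption based on what common MPI launchers export. The authoritative list is in the eckit source linked above.

```shell
#!/bin/sh
# Rough guess at which eckit MPI backend would be selected.
# ASSUMPTION: the variable names in the loop are ones commonly exported by
# MPI launchers; they are not copied from eckit and may differ from it.
guess_mpi_backend() {
    # An explicit ECKIT_MPI_FORCE setting takes precedence.
    if [ -n "${ECKIT_MPI_FORCE:-}" ]; then
        echo "$ECKIT_MPI_FORCE"
        return
    fi
    for var in OMPI_COMM_WORLD_SIZE PMI_SIZE ALPS_APP_PE SLURM_NTASKS; do
        # Indirect lookup of the variable named by $var.
        eval "value=\${$var:-}"
        if [ -n "$value" ]; then
            echo "parallel"
            return
        fi
    done
    echo "serial"
}

guess_mpi_backend
```

Saving this as a script and starting it with the same launcher as the application (for example mpirun -n 4 sh followed by the script name) shows what each process sees in its environment.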
Hello. I am sorry to say that exporting the variable ECKIT_MPI_FORCE did not bring any improvement; the output of the test remains the same.
I have built the bump-standalone project with the JCSDA release-stable branch of ecbuild.
I have successfully run ctest -R fckit_test_mpi -VV:
test 118
Start 118: fckit_test_mpi
118: Test command: /usr/local/apps/openmpi/4.0.3/GNU/7.3.0/bin/mpiexec "-n" "4" "/tmp/nawd/src/bump/bump-standalone/build/fckit/src/tests/fckit_test_mpi"
118: Environment variables:
118: OMP_NUM_THREADS=1
118: Test timeout computed to be: 1500
118: test_default_comm
118: default size: 4
118: default rank: 2
118: test_comm
118: default size: 4
118: default rank: 2
118: world size: 4
118: default world: 2
118: test_default_comm
118: test_set_comm_default
118: test_uninitialised
118: test_allreduce
118: default size: 4
118: default rank: 0
118: test_comm
118: default size: 4
118: default rank: 0
118: test_default_comm
118: world size: 4
118: default world: 0
118: default size: 4
118: default rank: 3
118: test_comm
118: default size: 4
118: default rank: 3
118: world size: 4
118: default world: 3
118: test_set_comm_default
118: test_set_comm_default
118: test_uninitialised
118: test_allreduce
118: test_uninitialised
118: test_allreduce
118: test_default_comm
118: default size: 4
118: default rank: 1
118: test_comm
118: default size: 4
118: default rank: 1
118: world size: 4
118: default world: 1
118: test_set_comm_default
118: test_uninitialised
118: test_allreduce
118: test_allreduce_inplace
118: test_allreduce_inplace
118: test_allreduce_inplace
118: test_allreduce_inplace
118: test_allgather
118: test_allgather
118: test_allgather
118: test_allgather
118: test_broadcast
118: test_broadcast
118: test_broadcast
118: test_broadcast
118: test_nonblocking_send_receive
118: test_blocking_send_receive
118: test_blocking_send_receive_rank1
118: test_blocking_send_receive_int32_rank1
118: test_nonblocking_send_receive
118: test_nonblocking_send_receive
118: test_nonblocking_send_receive
118: test_blocking_send_receive
118: test_blocking_send_receive_rank1
118: test_blocking_send_receive_int32_rank1
118: test_blocking_send_receive_int64_rank1
118: test_blocking_send_receive_int64_rank1
118: receive-request: 2
118: test_blocking_send_receive
118: send-request: 1
118: test_blocking_send_receive
118: test_blocking_send_receive_rank1
118: test_blocking_send_receive_int32_rank1
118: test_blocking_send_receive_rank1
118: test_blocking_send_receive_int64_rank1
118: test_blocking_send_receive_int32_rank1
118: test_blocking_send_receive_int64_rank1
1/1 Test #118: fckit_test_mpi ................... Passed 1.71 sec
My bet is that eckit is not picking up MPI somehow during configuration. You can help it detect MPI by setting the environment variable MPI_HOME to the root of the MPI installation.
Can you check that libeckit_mpi.so actually links with MPI?
ldd lib/libeckit_mpi.so
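The ldd output can be long, so it may help to filter it down to the MPI runtime libraries. The following is a minimal sketch; the function name mpi_libs_from_ldd is hypothetical, and it assumes the MPI libraries have names starting with "libmpi", which is common but not guaranteed for every MPI distribution.

```shell
# Print the resolved paths of MPI runtime libraries from ldd-style output
# read on stdin. ASSUMPTION: MPI library names start with "libmpi".
mpi_libs_from_ldd() {
    # ldd lines look like: "<name> => <resolved path> (<load address>)"
    awk '$1 ~ /^libmpi/ && $2 == "=>" { print $3 }'
}

# Usage:
#   ldd lib/libeckit_mpi.so | mpi_libs_from_ldd
```

If nothing is printed, the library was most likely not linked against MPI at all. If paths are printed, their directory shows which MPI installation was picked up at link time.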
Hi Willem, thank you for taking the time to investigate that bug. Following the comment of Gilles Gouaillardet on that page on Stack Overflow, I could fix the bug. When running an MPI application with the mpiexec command, you have to make sure that mpiexec comes from the same MPI installation as the libraries linked to your executable.
What happened in my case is that, in order to compile the whole package (saber, fckit, ...), I had to load some modules, and loading these modules led to the loading of another module, "OpenMPI/4.0.3-GCC-9.3.0". So, in the end, libfckit.so was linked to the following MPI libraries:
libeckit_mpi.so => /cluster/projects/nn9039k/NorCPM/Code/code_bump/build/bump-standalone_debug/lib/libeckit_mpi.so (0x00002b2b10ea4000)
libmpicxx.so.12 => /cluster/software/impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28/lib64/libmpicxx.so.12 (0x00002b2b124f6000)
libmpifort.so.12 => /cluster/software/impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28/lib64/libmpifort.so.12 (0x00002b2b12716000)
libmpi.so.12 => /cluster/software/impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28/lib64/libmpi.so.12 (0x00002b2b12abf000)
while the command to run the test with bump was (file test/CTestTestfile.cmake):
"/cluster/software/OpenMPI/4.0.3-GCC-9.3.0/bin/mpiexec" "-n" "6" "/cluster/projects/nn9039k/NorCPM/Code/code_bump/build/bump-standalone_debug/bin/saber_bump.x" "testinput/bump_norcpm.yaml" "testoutput"
I have now corrected this command to avoid the library conflict, and it works fine. The correct command reads:
"/cluster/software/impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28/bin64/mpiexec" "-n" "6" "/cluster/projects/nn9039k/NorCPM/Code/code_bump/build/bump-standalone_debug/bin/saber_bump.x" "testinput/bump_norcpm.yaml" "testoutput"