LLNL/Caliper

Question about MPI_Finalize

Opened this issue · 4 comments

We are using the Annotation interface in our code. We used to (in pre-2.10) be able to rely on the destructor or some other Caliper-internal mechanism to call MPI_Finalize before Caliper is shut down. However, in 2.10, we got something like

== CALIPER: default: mpireport: MPI is already finalized. Cannot aggregate output

Do we now have to call MPI_Finalize ourselves?
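For reference, our annotation usage follows the usual Annotation-interface pattern, roughly like this (simplified; the function and region names are made up):

```cpp
#include <caliper/cali.h>  // Caliper C++ Annotation interface

void simulate_step()
{
    // Mark a region with the Annotation API; begin()/end() bracket the work.
    cali::Annotation phase("phase");
    phase.begin("simulate");
    // ... application work ...
    phase.end();
}
```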

Hi @ollielo, thanks for the report. Caliper should still trigger a flush on MPI_Finalize() itself.

What's your runtime configuration? It looks like you're manually configuring Caliper with CALI_SERVICES_ENABLE etc. In that case make sure you have the mpi service activated, e.g. CALI_SERVICES_ENABLE=aggregate,event,mpi,mpireport,timer. You can also try one of Caliper's built-in configuration recipes, e.g. CALI_CONFIG=runtime-report.
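For example, an invocation might look like this (the application name and launcher flags are placeholders):

```shell
# Manual configuration: the mpi service must be listed so Caliper can
# hook MPI_Finalize and flush its output before MPI shuts down.
export CALI_SERVICES_ENABLE=aggregate,event,mpi,mpireport,timer
mpirun -n 4 ./my_app

# Alternatively, a built-in recipe enables the required services itself:
export CALI_CONFIG=runtime-report
mpirun -n 4 ./my_app
```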

Thanks for your answer. I had it somewhat reversed. Previously, we did not need to explicitly shut down Caliper; it seemed like Caliper would intercept our call to MPI_Finalize and shut itself down properly. Now we need to add an explicit call to

cali::Caliper::instance().finalize();

before calling MPI_Finalize. What I want to understand is:

  1. Does Caliper intercept the call to MPI_Finalize, and if so, how?
  2. Is calling cali::Caliper::instance().finalize() the proper way to manually shut down Caliper?

Caliper should still intercept the MPI_Finalize() call. If you're configuring Caliper with CALI_SERVICES_ENABLE=... you'll need to add the mpi service for this to work. If you're using one of the built-in recipes (CALI_CONFIG=...) it should do that automatically.

By default we use the GOTCHA library, which comes with Caliper, to intercept MPI calls. It explicitly intercepts the C API MPI_Finalize() call, so it can fail if you're using the C++ or Fortran MPI API. However, you mentioned that it used to work in v2.9, which is curious: I don't think there were any changes between v2.9 and v2.10 in the way Caliper intercepts MPI_Finalize(), but we did update the GOTCHA library. You could try running with CALI_LOG_VERBOSITY=1 and GOTCHA_DEBUG=2 set as environment variables; that should give us some debug output and tell us whether we at least attempt to intercept MPI_Finalize.
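Concretely, a debug run could look like this (the launcher and application names are placeholders):

```shell
# Capture Caliper and GOTCHA debug output on stderr, then check whether
# GOTCHA reports a binding for MPI_Finalize.
CALI_LOG_VERBOSITY=1 GOTCHA_DEBUG=2 mpirun -n 1 ./my_app 2> debug.log
grep MPI_Finalize debug.log
```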

Finally, calling cali::Caliper::instance().finalize() explicitly is a bit of a hacky workaround, but it should do the trick for now. Again, it should not be necessary, and I'm curious what's going wrong.
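A minimal sketch of that workaround, assuming the usual Caliper and MPI headers (adjust the includes to your installation):

```cpp
#include <caliper/cali.h>     // annotation macros
#include <caliper/Caliper.h>  // cali::Caliper
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    CALI_MARK_BEGIN("work");
    // ... application work ...
    CALI_MARK_END("work");

    // Workaround: flush and shut down Caliper explicitly while MPI is
    // still initialized, in case the MPI_Finalize hook does not fire.
    cali::Caliper::instance().finalize();

    MPI_Finalize();
    return 0;
}
```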

I tried what you suggested. Here is the relevant part of the log

[1721697/1721697][gotcha.c:150] - gotcha_rewrite_wrapper_orders for binding MPI_Init in tool caliper/mpi of priority -1
[1721697/1721697][gotcha.c:156] - Adding new entry for MPI_Init to hash table
[1721697/1721697][gotcha.c:324] - Symbol MPI_Init needs lookup operation
[1721697/1721697][gotcha.c:98] - Looking up exported symbols for MPI_Init
[1721697/1721697][gotcha.c:87] - Symbol MPI_Init found in /home/ollie/opt/spack/opt/spack/linux-fedora39-skylake_avx512/gcc-12.3.0/openmpi-4.1.6-fitrx5rjvdbidrx6xdkt5hu3s2dv7cij/lib/libmpi.so.40 at 0x7f02e8eb1b70
[1721697/1721697][gotcha.c:334] - Symbol MPI_Init needs binding from application
[1721697/1721697][gotcha.c:150] - gotcha_rewrite_wrapper_orders for binding MPI_Init_thread in tool caliper/mpi of priority -1
[1721697/1721697][gotcha.c:156] - Adding new entry for MPI_Init_thread to hash table
[1721697/1721697][gotcha.c:324] - Symbol MPI_Init_thread needs lookup operation
[1721697/1721697][gotcha.c:98] - Looking up exported symbols for MPI_Init_thread
[1721697/1721697][gotcha.c:87] - Symbol MPI_Init_thread found in /home/ollie/opt/spack/opt/spack/linux-fedora39-skylake_avx512/gcc-12.3.0/openmpi-4.1.6-fitrx5rjvdbidrx6xdkt5hu3s2dv7cij/lib/libmpi.so.40 at 0x7f02e8eb1cc0
[1721697/1721697][gotcha.c:334] - Symbol MPI_Init_thread needs binding from application
[1721697/1721697][gotcha.c:150] - gotcha_rewrite_wrapper_orders for binding MPI_Finalize in tool caliper/mpi of priority -1
[1721697/1721697][gotcha.c:156] - Adding new entry for MPI_Finalize to hash table
[1721697/1721697][gotcha.c:324] - Symbol MPI_Finalize needs lookup operation
[1721697/1721697][gotcha.c:98] - Looking up exported symbols for MPI_Finalize
[1721697/1721697][gotcha.c:87] - Symbol MPI_Finalize found in /home/ollie/opt/spack/opt/spack/linux-fedora39-skylake_avx512/gcc-12.3.0/openmpi-4.1.6-fitrx5rjvdbidrx6xdkt5hu3s2dv7cij/lib/libmpi.so.40 at 0x7f02e8eaba90
[1721697/1721697][gotcha.c:334] - Symbol MPI_Finalize needs binding from application
== CALIPER: default: Registered MPI service
== CALIPER: default: mpireport: MPI is already finalized. Cannot aggregate output.

It looks to me like both MPI_Init and MPI_Finalize are intercepted by GOTCHA. Any other suggestions?