ufs-community/ufs-weather-model

Should we update FindESMF.cmake and maybe the name of the imported esmf target

Opened this issue ยท 38 comments

After updating esmf and mapl libraries to 8.6.1 and 2.46.2 respectively, compilation of the ufs-weather-model fails with the following error:

CMake Error at /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/envs/ue-esmf-8.6.1-mapl-2.46.2/install/intel/2021.9.0/mapl-2.46.2-uiwt3at/lib64/cmake/MAPL/MAPL-targets.cmake:73 (set_target_properties):
  The link interface of target "MAPL_cfio_r4" contains:

    ESMF::ESMF

  but the target was not found.  Possible reasons include:

    * There is a typo in the target name.
    * A find_package call is missing for an IMPORTED target.
    * An ALIAS target is missing.

See: #2345 (comment)

This happens due to name mismatch of the esmf imported target. The name of the imported target from FindESMF module that we currently use is esmf, while mapl expects ESMF::ESMF.

In order to fix this error we can either define an alias that maps old names to the new one or (maybe) update FindESMF.cmake modules.

In ufs-weather-model and all of its components I see FindESMF.cmake in these places and there are at least 4 different versions:

$ find . -name FindESMF.cmake -exec md5sum {} \;
a1aa4edbb8da8d79205ff1f67c195fd4  ./CMEPS-interface/CMEPS/cmake/FindESMF.cmake
cbb94fb7f41a58091874309b805ad9ba  ./NOAHMP-interface/noahmp/cmake/FindESMF.cmake
a1aa4edbb8da8d79205ff1f67c195fd4  ./CDEPS-interface/CDEPS/cmake/FindESMF.cmake
b9fe2be3b7fec1f9b94b787784e20e32  ./FV3/atmos_cubed_sphere/cmake/FindESMF.cmake
1b4b05db0c4ba67dfc8b95fd9041866f  ./WW3/cmake/FindESMF.cmake
1b4b05db0c4ba67dfc8b95fd9041866f  ./CMakeModules/Modules/FindESMF.cmake

It would be nice to update all of these with the one from the ESMF library itself, this one:

https://github.com/esmf-org/esmf/blob/6c1de1445587a95a77f5c45a9f2b61df5297f9e4/cmake/FindESMF.cmake

We could also use this opportunity to rename all imported esmf targets to 'ESMF::ESMF' since that is the target esmf library exports. They also provide aliases for other names but the actual target is ESMF::ESMF.

However, this practice of keeping local copies of FindESMF.cmake everywhere is not ideal and better solution will probably be to update all CMakeLists.txt files to find 'FindESMF.cmake' in CMAKE_MODULE_PATH which can be defined to:

list(APPEND CMAKE_MODULE_PATH $ENV{esmf_ROOT}/cmake)

and simply remove all FindESMF.cmake from all above locations.

Which solution do you prefer:

  1. Add yet another esmf target alias
  2. Update all FindESMF.cmake modules and rename esmf targets
  3. Find FindESMF in esmf_ROOT and get rid of all local copies of FindESMF.cmake

@DusanJovic-NOAA I came across this as well. I am in support of 3. See also NOAA-EMC/CMakeModules#70 which I created after running into a different problem with a findESMF in jedi-cmake. I think that with spack-stack, loading the esmf module may automatically append the location of it's findESMF.cmake to CMAKE_MODULE_PATH.

Note that I used a workaround like your option 1 to test the ufs-weather-model with esmf@8.6.1 and mapl@2.46.2, but I ran into runtime errors that something was trying to initialize MPI twice. That happened for the fully-coupled run, but not the control_c48 run. See JCSDA/spack-stack#1226 (comment) and comments below.

I support solution 3 as well. That will resolve future maintenance issue for FindESMF.cmake.

@climbfuji Thanks. I also see error when I run cpld_control_p8:

121: Abort(805361423) on node 121 (rank 121 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
121: MPI_Init_thread(307): Cannot call MPI_INIT or MPI_INIT_THREAD more than once
 99: Abort(805361423) on node 99 (rank 99 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
 99: MPI_Init_thread(307): Cannot call MPI_INIT or MPI_INIT_THREAD more than once

However cpld_control_noaero_p8 works fine. It even reproduces b4b current baselines.

The difference between two tests is that cpld_control_p8 runs with GOCART (and MAPL). The cpld_control_noaero_p8 does not. Which means it must be something with MAPL library. Problem with MAPL code or ufs (GOCART) code that calls MAPL, or how MAPL is built, or how it is configured during build or at runtime.

Not sure if changes are needed in MAPL side. Let me include @mathomp4, he mentioned the findESMF.cmake in MAPL in the ESMFv8.6.1/MAPLv2.46.2 installation issue #1168::

The one we have in MAPL and the one we have in ESMA_cmake are identical to the one in ESMF.

Wow. I was totally unaware of all these spac-stack/jedi/esmf/mapl issues with esmf target names, aliases and multiple copies of FindESMF.cmake.

Now I'm even more convinced that the best solution is to get rid of FindESMF.cmake from everywhere and use the one that comes with the esmf installation.

If everybody agrees I can create a ufs-weather-model branch and start removing FindESMF. First from CMakeModules (@climbfuji do you have a CMakeModule branch that removes it, that I can just point from ufs's gitmodules?), then from other components.

Wow. I was totally unaware of all these spac-stack/jedi/esmf/mapl issues with esmf target names, aliases and multiple copies of FindESMF.cmake.

Now I'm even more convinced that the best solution is to get rid of FindESMF.cmake from everywhere and use the one that comes with the esmf installation.

If everybody agrees I can create a ufs-weather-model branch and start removing FindESMF. First from CMakeModules (@climbfuji do you have a CMakeModule branch that removes it, that I can just point from ufs's gitmodules?), then from other components.

I don't have a branch in CMakeModules. Both spack-stack and jedi-cmake use CMakeModules as a submodule, i.e. once we have findESMF.cmake removed from CMakeModules I can take care of removing it from the other two repos by updating the submodule pointer

Wow. I was totally unaware of all these spac-stack/jedi/esmf/mapl issues with esmf target names, aliases and multiple copies of FindESMF.cmake.
Now I'm even more convinced that the best solution is to get rid of FindESMF.cmake from everywhere and use the one that comes with the esmf installation.
If everybody agrees I can create a ufs-weather-model branch and start removing FindESMF. First from CMakeModules (@climbfuji do you have a CMakeModule branch that removes it, that I can just point from ufs's gitmodules?), then from other components.

I don't have a branch in CMakeModules. Both spack-stack and jedi-cmake use CMakeModules as a submodule, i.e. once we have findESMF.cmake removed from CMakeModules I can take care of removing it from the other two repos by updating the submodule pointer

I can fork CMakeModules and create a branch, but I see you have https://github.com/climbfuji/CMakeModules/tree/feature/remove_findesmf. Do you want to update your branch with origin and I can just point to your branch from my ufs remove_findesmf branch, or do you prefer that I create a CMakeModules branch?

Oh I do! Let me update it. Sorry!

Unfortunately option 3 (removing FindESMF.cmake) will not work, at least for now, because on WCOSS2, the esmf install tree does not have FindESMF.cmake in ${ESMF_ROOT}/cmake.

$ ll /apps/prod/hpc-stack/i-19.1.3.304__m-8.1.12__h-1.14.0__n-4.9.2__p-2.5.10__e-8.6.0pnetcdf/intel-19.1.3.304/cray-mpich-8.1.12/esmf/8.6.0/
total 28
drwxr-xr-x 2 hpc-adm hpc-adm  4096 Apr 22 19:19 bin
drwxr-xr-x 2 hpc-adm hpc-adm  4096 Apr 22 19:19 doc
drwxr-xr-x 3 hpc-adm hpc-adm  4096 Apr 22 19:19 include
drwxr-xr-x 2 hpc-adm hpc-adm  4096 Apr 22 19:19 lib
drwxr-xr-x 2 hpc-adm hpc-adm 12288 Apr 22 19:19 mod

Why is this esmf install different?

Why is this esmf install different?

I'm not sure, but my guess is because it is not installed by spack-stack, but either by hpc-stack, or maybe manually.

In my opinion, at some point we need to stop making our (spack-stack developers/users) lives harder because one single system doesn't want to use it. Can we forge ahead and put a workaround in place for wcoss2? Ask Hang to fix the install and put the correct findESMF.cmake in place for esmf 8.6.1 and going forward?

In my opinion, at some point we need to stop making our (spack-stack developers/users) lives harder because one single system doesn't want to use it. Can we forge ahead and put a workaround in place for wcoss2? Ask Hang to fix the install and put the correct findESMF.cmake in place for esmf 8.6.1 and going forward?

I agree, let's move forward.

Regarding WCOSS2, I'm afraid Hang is not in the control of WCOSS2's library installation. @Hang-Lei-NOAA

Regarding the installation of cmake/FindESMF.cmake in the esmf install tree, I see that is done by spack in post_install step. I think esmf's make install step should do that, not spack. That way you get 'correct' install tree regardless on what tool is used for installation.

Regarding the installation of cmake/FindESMF.cmake in the esmf install tree, I see that is done by spack in post_install step. I think esmf's make install step should do that, not spack. That way you get 'correct' install tree regardless on what tool is used for installation.

That warrants an issue in the esmf repo ;-)

But for now, can we just manually add that file on wcoss2 for 8.6.1 and later releases until that is fixed upstream?

Regarding that runtime error:

 121: Abort(805361423) on node 121 (rank 121 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
121: MPI_Init_thread(307): Cannot call MPI_INIT or MPI_INIT_THREAD more than once
 99: Abort(805361423) on node 99 (rank 99 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
 99: MPI_Init_thread(307): Cannot call MPI_INIT or MPI_INIT_THREAD more than once

I looked at the differences in MAPS v2.40.3 that we currently use vs. v2.46.2 that we are now testing in updated spack-stack installation, and I see that call to MPI_INIT_THREAD is always called. Look at this diff (gridcomps/Cap/MAPL_Cap.F90):

GEOS-ESM/MAPL@v2.40.3...v2.46.2#diff-76548b59d6d641f6b265b8bd082aa2886774eb1966b327f0864a64c777c726c6

In v2.46.2 this%mpi_already_initialized is initialized to .false. and then in subroutine initialize_mpi:

450       !call MPI_Initialized(this%mpi_already_initialized, ierror)
451       !_VERIFY(ierror)
452       call  ESMF_InitializePreMPI(_RC)
453
454       if (.not. this%mpi_already_initialized) then
455          call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierror)
456          _ASSERT(provided == MPI_THREAD_MULTIPLE, 'MPI_THREAD_MULTIPLE not supported by this MPI.')
457 !         call MPI_Init_thread(MPI_THREAD_SINGLE, provided, ierror)
458 !         _VERIFY(ierror)
459 !         _ASSERT(provided == MPI_THREAD_SINGLE, "MPI_THREAD_SINGLE not supported by this MPI.")
460       end if

The if condition is always .true. since call MPI_Initialized(this%mpi_already_initialized, ierror) is commented out, and so call MPI_Init_thread is always executed.

In v2.40.3 that call MPI_Initialized(this%mpi_already_initialized, ierror) was not commented out.

Could this be the reason for the error we see?

Hmm. That is an odd one. I'm going to ping @tclune and @aoloso (as he commented that out). Perhaps we are not handling "MPI initialized outside of MAPL" in a nice way anymore.

Maybe one reason it was commented out was because MPI has no way to to tell how it was initialized? I mean MPI_Initialized() only says true/false, not "you initialized as MPI_THREAD_MULTIPLE"? Hoo boy.

Hmm. That is an odd one. I'm going to ping @tclune and @aoloso (as he commented that out). Perhaps we are not handling "MPI initialized outside of MAPL" in a nice way anymore.

Maybe one reason it was commented out was because MPI has no way to to tell how it was initialized? I mean MPI_Initialized() only says true/false, not "you initialized as MPI_THREAD_MULTIPLE"? Hoo boy.

Maybe you can first check if MPI was initialized already with MPI_Initialized(), and if it is use MPI_Query_thread() to get thread support level, and then based on that do something, either continue or return some error, which should be better than just crashing the program.

@DusanJovic-NOAA Ahh. Good idea. Let me try and figure out something for you to test at least. We never run with externally initialized MPI so it's hard for me to test (and I suppose why we never saw this).

Thanks @mathomp4 and @DusanJovic-NOAA - we want to cut release branches for spack-stack-1.8.0 next week (and then allow for a week or two of testing). It would be great if we could get this fixed before we roll out 1.8.0 (currently we have mapl@2.46.2 on the books).

@climbfuji Well, at most I can promise I can get something that GEOS-MAPL is happy with (since I can test that...assuming the disks on discover let me grr...).

I'm going to ask around our group to see if I can figure out if anyone has run GEOS-MAPL with externally-initialized MPI but it would be good if @DusanJovic-NOAA or someone could test my PR changes in their code.

Actually, we might have figured out a cheeky way to get MPI initialized outside how we normally do it. I'm testing now. (It's an ugly hardcoded hack, but that's fine.)

Okay. MAPL 2.46.3 is out and a PR is into spack mainline:

spack/spack#45795

In my testing, it seems to work with an externally-initialized MPI as:

      call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierror)

Note that if you use ESMF SSI support, you might need to call:

call ESMF_InitializePreMPI()

before MPI_Init_thread but if you do not know what that means, it shouldn't matter.

Please test and let us know if another fix is needed.

Thanks @mathomp4.
@jkbk2004 can somebody from your team rebuild the spack-stack using MAPL 2.46.3.

@mathomp4 I have another question for you regarding the esmf target name. I see you changed the name to ESMF::ESMF in MAPL. Is there a plan to do the same in GOCART? Do you think GOCART code managers would accept a change that renames it to ESMF::ESMF?

@DusanJovic-NOAA That change is now in GOCART develop see: GEOS-ESM/GOCART#286

GOCART develop also has a change from @rmontuoro as well: GEOS-ESM/GOCART#284

I gather the thinking is for a release of GOCART 2.3 "soon", but for testing you could try GEOS-ESM/GOCART@83b5c22

@DusanJovic-NOAA That change is now in GOCART develop see: GEOS-ESM/GOCART#286

GOCART develop also has a change from @rmontuoro as well: GEOS-ESM/GOCART#284

I gather the thinking is for a release of GOCART 2.3 "soon", but for testing you could try GEOS-ESM/GOCART@83b5c22

Good. Thank you. I should have looked at the top of develop branch first.

When I try GEOS-ESM/GOCART@83b5c22 I get this cmake error:

CMake Error at GOCART/ESMF/UFS/cmake/ufs_esma_add_library.cmake:111 (export):
  export FILE <filename> option missing.
Call Stack (most recent call first):
  GOCART/ESMF/Aerosol_GridComp/CMakeLists.txt:5 (esma_add_library)

Of course this does not look like it is not related to esmf target issue. It must be something else.

We definitely don't usually go into the UFS/ directory. I mean, I changed the CMakeLists.txt in there, but only the esmf --> ESMF::ESMF change.

From what I see that variable PROJECT_TARGETS_FILE is declared in https://github.com/GEOS-ESM/GOCART/blob/develop/ESMF/UFS/cmake/ufs_ecbuild_declare_project.cmake

There was one other change recently to prevent a circular CMake dependency: GEOS-ESM/GOCART#273

I wonder if my test for standalone is too broad or just wrong for UFS and it's never getting into the bits where project is declared:

foreach (dir IN LISTS ESMA_CMAKE_DIRS)
  if (EXISTS ${CMAKE_CURRENT_LIST_DIR}/${dir})
    list (APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_LIST_DIR}/${dir}")
    set (ESMA_CMAKE_PATH "${CMAKE_CURRENT_LIST_DIR}/${dir}" CACHE PATH "Path to ESMA_cmake code")
    set(GOCART_STANDALONE TRUE)
  endif ()
endforeach ()

if (GOCART_STANDALONE)
  project (
          GOCART
          VERSION 2.2.1
          LANGUAGES Fortran CXX C)  # Note - CXX is required for ESMF
...

Can you try setting -DGOCART_STANDALONE=ON in your build? If that works, maybe we need an:

if (UFS_GOCART)
  set(GOCART_STANDALONE ON)
endif ()

?

We definitely don't usually go into the UFS/ directory. I mean, I changed the CMakeLists.txt in there, but only the esmf --> ESMF::ESMF change.

From what I see that variable PROJECT_TARGETS_FILE is declared in https://github.com/GEOS-ESM/GOCART/blob/develop/ESMF/UFS/cmake/ufs_ecbuild_declare_project.cmake

There was one other change recently to prevent a circular CMake dependency: GEOS-ESM/GOCART#273

I wonder if my test for standalone is too broad or just wrong for UFS and it's never getting into the bits where project is declared:

foreach (dir IN LISTS ESMA_CMAKE_DIRS)
  if (EXISTS ${CMAKE_CURRENT_LIST_DIR}/${dir})
    list (APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_LIST_DIR}/${dir}")
    set (ESMA_CMAKE_PATH "${CMAKE_CURRENT_LIST_DIR}/${dir}" CACHE PATH "Path to ESMA_cmake code")
    set(GOCART_STANDALONE TRUE)
  endif ()
endforeach ()

if (GOCART_STANDALONE)
  project (
          GOCART
          VERSION 2.2.1
          LANGUAGES Fortran CXX C)  # Note - CXX is required for ESMF
...

Can you try setting -DGOCART_STANDALONE=ON in your build? If that works, maybe we need an:

if (UFS_GOCART)
  set(GOCART_STANDALONE ON)
endif ()

?

@mathomp4 Thanks. With GOCART_STANDALONE set to ON I can compile the top of develop GOCART branch as a submodule in ufs. But I have to wait until MAPL 2.46.3 is installed to rerun regression tests.

@DusanJovic-NOAA Okay. I've made a PR to GOCART to "automate" that for you:

GEOS-ESM/GOCART#287

Though for now, might be best to add that to your CMake until we can get it in.

On WCOSS2 the currently used esmf installation does not have FindESMF in usual <prefix>/cmake/FindESMF.cmake location, but there are two copies of this file in two other directories:

$ pwd
/apps/prod/hpc-stack/i-19.1.3.304__m-8.1.12__h-1.14.0__n-4.9.2__p-2.5.10__e-8.6.0pnetcdf/intel-19.1.3.304/cray-mpich-8.1.12/esmf/8.6.0

$ find . -name FindESMF.cmake
./include/ESMX/Comps/ESMX_Data/cmake/FindESMF.cmake
./include/ESMX/Driver/cmake/FindESMF.cmake

So, temporarily (until we start using spack-stack on wcoss2) we can explicitly add one of these two directories to CMAKE_MODULE_PATH, for example:

if(CMAKE_Platform STREQUAL "wcoss2")
  list(APPEND CMAKE_MODULE_PATH $ENV{ESMF_ROOT}/include/ESMX/Driver/cmake)
endif()

Is this acceptable for now?

Thanks @mathomp4 and @DusanJovic-NOAA - we want to cut release branches for spack-stack-1.8.0 next week (and then allow for a week or two of testing). It would be great if we could get this fixed before we roll out 1.8.0 (currently we have mapl@2.46.2 on the books).

@climbfuji I see several nceplib libraries in the current spack-stack, or at least the versions used in ufs-weather-model, are behind the latest versions available, for example:

  bacio            2.4.1    ------> 2.6.0
  crtm             2.4.0    ------> 3.1.1
  ip               4.3.0    ------> 5.1.0
  w3emc            2.10.0   ------> 2.12.0

Will those be updated in 1.8.0?

Thanks @mathomp4 and @DusanJovic-NOAA - we want to cut release branches for spack-stack-1.8.0 next week (and then allow for a week or two of testing). It would be great if we could get this fixed before we roll out 1.8.0 (currently we have mapl@2.46.2 on the books).

@climbfuji I see several nceplib libraries in the current spack-stack, or at least the versions used in ufs-weather-model, are behind the latest versions available, for example:

  bacio            2.4.1    ------> 2.6.0
  crtm             2.4.0    ------> 3.1.1
  ip               4.3.0    ------> 5.1.0
  w3emc            2.10.0   ------> 2.12.0

Will those be updated in 1.8.0?

@AlexanderRichert-NOAA or @RatkoVasic-NOAA typically handle those updates. For crtm, 1.8.0 will have 2.4.0 and the very latest crtm 3.1.1 (still being finalized, bugs found in 3.1.0 are addressed in that branch).

Regarding the other versions, the unified environment is the one that you are after. The ufs-weather-model template unfortunately isn't always updated. One way to see which versions are used in the unified environment in spack-stack develop (and then in 1.8.0, we cut the release branches mid week) is to look at configs/common/packages.yaml. All packages that we care about have their version defined as required. Another way is to look at the github actions for Ubuntu (GNU or Intel, doesn't matter) to see which versions of certain libraries are chosen in case they are not hardcoded in the packages.yaml file: https://github.com/JCSDA/spack-stack/actions/runs/10532451273/job/29186637420

Thanks @mathomp4 and @DusanJovic-NOAA - we want to cut release branches for spack-stack-1.8.0 next week (and then allow for a week or two of testing). It would be great if we could get this fixed before we roll out 1.8.0 (currently we have mapl@2.46.2 on the books).

@climbfuji I see several nceplib libraries in the current spack-stack, or at least the versions used in ufs-weather-model, are behind the latest versions available, for example:

  bacio            2.4.1    ------> 2.6.0
  crtm             2.4.0    ------> 3.1.1
  ip               4.3.0    ------> 5.1.0
  w3emc            2.10.0   ------> 2.12.0

Will those be updated in 1.8.0?

@AlexanderRichert-NOAA or @RatkoVasic-NOAA typically handle those updates. For crtm, 1.8.0 will have 2.4.0 and the very latest crtm 3.1.1 (still being finalized, bugs found in 3.1.0 are addressed in that branch).

Regarding the other versions, the unified environment is the one that you are after. The ufs-weather-model template unfortunately isn't always updated. One way to see which versions are used in the unified environment in spack-stack develop (and then in 1.8.0, we cut the release branches mid week) is to look at configs/common/packages.yaml. All packages that we care about have their version defined as required. Another way is to look at the github actions for Ubuntu (GNU or Intel, doesn't matter) to see which versions of certain libraries are chosen in case they are not hardcoded in the packages.yaml file: https://github.com/JCSDA/spack-stack/actions/runs/10532451273/job/29186637420

The w3emc in 1.8.0 branch is still 2.10.0

https://github.com/JCSDA/spack-stack/blob/release/1.8.0/configs/common/packages.yaml#L268-L269

I suppose nobody from NOAA has put in a request to add and/or update to this version.

Thanks @mathomp4. @jkbk2004 can somebody from your team rebuild the spack-stack using MAPL 2.46.3.

@jkbk2004 can somebody from your team rebuild the spack-stack using MAPL 2.46.3.

FWIW, map@2.46.3 is in the new spack-stack-1.8.0 that is being rolled out this week