NOAA-GFDL/icebergs

CM4 highres crash in icebergs_framework with forrtl: severe (41): insufficient virtual memory

nikizadehgfdl opened this issue · 10 comments

CM4 highres model crashes on gaea with the following message and traceback on day 19 of the model run (probably because the model is trying to produce the first berg).

forrtl: severe (41): insufficient virtual memory
Image              PC                Routine            Line        Source
fms_cm4_sis2_comp  00000000055555DE  Unknown               Unknown  Unknown
fms_cm4_sis2_comp  0000000000E5D225  ice_bergs_framewo        2119  icebergs_framework.F90
fms_cm4_sis2_comp  0000000000E9E87A  ice_bergs_io_mp_w        1296  icebergs_io.F90
fms_cm4_sis2_comp  0000000000E19F8E  ice_bergs_mp_iceb        3371  icebergs.F90
fms_cm4_sis2_comp  00000000004C305F  sis_dyn_trans_mp_         285  SIS_dyn_trans.F90
fms_cm4_sis2_comp  000000000046FF0E  ice_model_mod_mp_         232  ice_model.F90
fms_cm4_sis2_comp  000000000046FB8F  ice_model_mod_mp_         157  ice_model.F90
fms_cm4_sis2_comp  0000000000403FB5  MAIN__                   1034  coupler_main.F90

Line 2119 is an allocate statement

    allocate(new%data(width,new_size))

I put a print statement before it that shows width=6 , new_size=1080275147 !!
Clearly new_size gets garbage which crashes the run.

This is happening on gaea. The ran before with the same executable for many years, but for some reason it started to crash like this a few weeks prior to the CLE7 update.
It still runs fine on Orion.

I put more print statements and found that the call at line 1296 of icebergs_io.F90 gets garbage numbers
ntrajs_rcvd_io= 1080346495 and 1081028687 from mpp_recv:

call mpp_recv(ntrajs_rcvd_io, glen=1, from_pe=from_pe, tag=COMM_TAG_11)
if (ntrajs_rcvd_io .gt. 0) then
       call increase_ibuffer(ibuffer_io, ntrajs_rcvd_io,buffer_width_traj)               line 1296

From print statements I see:
Tile-root pe receives the two bad numbers from pe127 and pe235.
Tile-root pe receives "0" from all other pes (normal behavior).
pe127 and pe235 both seem to send a "0" each.

So, how are these "0" sends interrupted and changed to bad numbers before they are received?

You've got to love the futuristic comments on icebergs_io.F90

io_layout=0,0 runs fine, in this case each pe writes its own file which bypasses the above send/recv block.

Hit the same issue when compiled and ran with the only other version of mpich available on gaea (cray-mpich/7.7.6).
The older successful runs were with cray-mpich/7.4.0 no longer available to test with.

Found that land was using the same MPI tags as icebergs. Changed the icebergs MPI tags to be unique, hit the same issue.

Hit the same issue after compiling SIS2/icebergs without openmp.

Finally I put a logic to trap the junk numbers and ignore them and saw a bunch them happening at the same timestep and then go away and do not show up again. This lets the model run.

         if (ntrajs_rcvd_io .gt. 0) then
+          if (ntrajs_rcvd_io .gt. 10000000) then
+           print*,'WARNING write_trajectory got junk ',ntrajs_rcvd_io,from_pe,mpp_pe()
+          else 
            call increase_ibuffer(ibuffer_io, ntrajs_rcvd_io,buffer_width_traj)

  WARNING write_trajectory got junk   1081003228         127         108
 WARNING write_trajectory got junk   1081286236         128         108
 WARNING write_trajectory got junk   1078358731         129         108
 WARNING write_trajectory got junk   1080275122         235         216
 WARNING write_trajectory got junk   1077851290         236         216
 WARNING write_trajectory got junk   1078175925         130         108
 WARNING write_trajectory got junk   1076590706         131         108
 WARNING write_trajectory got junk   1081520236         260         216
 WARNING write_trajectory got junk   1081457183         261         216
 WARNING write_trajectory got junk   1082146344         262         216
 WARNING write_trajectory got junk   1079991563         264         216
 WARNING write_trajectory got junk   1080958438         265         216
 WARNING write_trajectory got junk   1082072390         266         216
 WARNING write_trajectory got junk   1079800498         267         216
 WARNING write_trajectory got junk   1080201548         268         216
 WARNING write_trajectory got junk   1081105849         269         216
 WARNING write_trajectory got junk   1079399123         147         108
 WARNING write_trajectory got junk   1081196633         148         108
 WARNING write_trajectory got junk   1078023477         149         108
 WARNING write_trajectory got junk   1082303273         150         108
 WARNING write_trajectory got junk   1083201361         155         108
 WARNING write_trajectory got junk   1083201139         156         108
 WARNING write_trajectory got junk   1083190294         157         108
 WARNING write_trajectory got junk   1083130777         158         108
 WARNING write_trajectory got junk   1082965927         159         108
 WARNING write_trajectory got junk   1082547767         160         108
 WARNING write_trajectory got junk   1081079324         163         108
 WARNING write_trajectory got junk   1080730703         164         108
 WARNING write_trajectory got junk   1079544729         165         108
 WARNING write_trajectory got junk   1076337821         166         108
 WARNING write_trajectory got junk   1077993257         183         108
 WARNING write_trajectory got junk   1079117300         184         108
 WARNING write_trajectory got junk   1081362583         185         108
 WARNING write_trajectory got junk   1082067169         186         108
 WARNING write_trajectory got junk   1081809898         187         108
 WARNING write_trajectory got junk   1082697372         188         108
 WARNING write_trajectory got junk   1082910792         189         108
 WARNING write_trajectory got junk   1082867133         190         108
 WARNING write_trajectory got junk   1082775489         191         108
 WARNING write_trajectory got junk   1082319811         192         108
 WARNING write_trajectory got junk   1080387855         193         108
 WARNING write_trajectory got junk   1082932685         194         108
 WARNING write_trajectory got junk   1080222651         195         108
 WARNING write_trajectory got junk   1082477423         196         108
 WARNING write_trajectory got junk   1082375281         197         108
 WARNING write_trajectory got junk   1081634444         198         108
 WARNING write_trajectory got junk   1081207162         199         108
 WARNING write_trajectory got junk   1080192649         200         108
 WARNING write_trajectory got junk   1077080758         201         108
 WARNING write_trajectory got junk   1079164729         297         216
 WARNING write_trajectory got junk   1080845120         298         216
 WARNING write_trajectory got junk   1081012269         300         216
 WARNING write_trajectory got junk   1080673817         301         216
 WARNING write_trajectory got junk   1077248540         302         216
 WARNING write_trajectory got junk   1078530584         303         216
 WARNING write_trajectory got junk   1080136256         304         216

I guess some other component is losing messages. Odd that it doesn't seem to matter. If you make the icebergs passive you test whether the work around is changing the state of other components.

Does the originating PEs of junk messages change with different i/o layouts (other than working properly with 0,0)?

There are only two PEs (108 and 216) receiving those junk messages: were there only two i/o PEs? If more, what's special about these two?

SIS2_io_layout = "1,4", so there are 4 io-tiles and root pes.
SIS2_layout = "36,12", so the two pes receiving junks are the roots of the two middle latitude io-bands.

I changed all iceberg MPI messages to be uniquely tagged at mpp_send and mpp_rec, so I don't see how they could come from or go to other components unless something is wrong with MPICH. The above workaround might affect iceberg trajectories itself because there are some 0's being sent that were ignored by the receiver, but they might hang around till the next trajectory write time and get received then.

There are no icebergs produced by the end of this 1 month run and there is no icebergs trajectory file being written. This also indicates that all pes send 0's, but root pes sometimes get junk.

We are here because traj_write_hrs=480 =20 days. I wonder if increasing that would bypass the issue.

I'll try more layouts.

For ice_layout = 36,12 and ice_io_layout = 1,4 , 2 of 4 root pes get junk (2 polar tiles are clean).

For ice_layout = 36,12 and ice_io_layout = 1,6 , 4 of 6 root pes get junk (2 polar tiles are clean).

For ice_layout = 24,18 and ice_io_layout = 1,6 , same as above

Answers are the same for different layouts.

Changing traj_write_hrs=30 days, I again get junk numbers but on day 30. Must have the logic to ignore them.
Curious, root pes get junk from the same set of pes as before, but the junk numbers are different (still of the same order 108...).

Could you try io_layout=4,1? This way there might be bergs on all I/O PEs.

This happens at the very first call to write_trajectory for any io_layout (including 4,1), even when there are no icebergs anywhere in the run.

I found a trick to neutralize these junk numbers at initialization. It's like a pipe cleaner that flushes the buffers at initialization. It still remains to explain where the junks are coming from.

Introduce a call in ice_bergs_io_init

call mpp_cough

where mpp_cough just sends and receives 0's on I/O tiles:

subroutine mpp_cough
  integer nsent,nrcvd,np,from_pe
  nsent =0
  nrcvd =0
  if(is_io_tile_root_pe) then
     do np=2,size(io_tile_pelist) ! Note: np starts from 2 to exclude self
        from_pe=io_tile_pelist(np)
        call mpp_recv(nrcvd, glen=1, from_pe=from_pe, tag=COMM_TAG_11)
        if (nrcvd .ne. 0) then
           print*,'mpp_cough warning: Got junk ',nrcvd,from_pe,mpp_pe()
       endif
     enddo
  else
     call mpp_send(nsent, plen=1, to_pe=io_tile_root_pe, tag=COMM_TAG_11)
  endif
end subroutine mpp_cough

This call traps some junk numbers at initialization, but none appear in the main run afterwards. I will test this band-aid in a year long run to make sure it holds.

Rusty found the root cause of this issue in a FMS unit test I made. He wrote: "While the mpp_recv is blocking by default, the mpp_send is always non-blocking. Because of this, your non-root pes in the 'test_send_recv' and 'test_tile_send_recv' all exit the subroutine and nsent for those MPI-ranks becomes undefined. You have two ways to fix this - 1) add an mpp_sync_self at the end of each subroutine or 2) declare your variables as globals."

Based on his suggestion I set ntrajs_sent_io and ntrajs_rcvd_io as module variables and that fixed the CM4 highres issue.
I'll make a PR shortly.

Closed by #62