CM4 highres crash in icebergs_framework with forrtl: severe (41): insufficient virtual memory
nikizadehgfdl opened this issue · 10 comments
CM4 highres model crashes on gaea with the following message and traceback on day 19 of the model run (probably because the model is trying to produce the first berg).
forrtl: severe (41): insufficient virtual memory
Image PC Routine Line Source
fms_cm4_sis2_comp 00000000055555DE Unknown Unknown Unknown
fms_cm4_sis2_comp 0000000000E5D225 ice_bergs_framewo 2119 icebergs_framework.F90
fms_cm4_sis2_comp 0000000000E9E87A ice_bergs_io_mp_w 1296 icebergs_io.F90
fms_cm4_sis2_comp 0000000000E19F8E ice_bergs_mp_iceb 3371 icebergs.F90
fms_cm4_sis2_comp 00000000004C305F sis_dyn_trans_mp_ 285 SIS_dyn_trans.F90
fms_cm4_sis2_comp 000000000046FF0E ice_model_mod_mp_ 232 ice_model.F90
fms_cm4_sis2_comp 000000000046FB8F ice_model_mod_mp_ 157 ice_model.F90
fms_cm4_sis2_comp 0000000000403FB5 MAIN__ 1034 coupler_main.F90
Line 2119 is an allocate statement
allocate(new%data(width,new_size))
I put a print statement before it that shows width=6 , new_size=1080275147 !!
Clearly new_size gets garbage which crashes the run.
This is happening on gaea. The ran before with the same executable for many years, but for some reason it started to crash like this a few weeks prior to the CLE7 update.
It still runs fine on Orion.
I put more print statements and found that the call at line 1296 of icebergs_io.F90 gets garbage numbers
ntrajs_rcvd_io= 1080346495 and 1081028687 from mpp_recv:
call mpp_recv(ntrajs_rcvd_io, glen=1, from_pe=from_pe, tag=COMM_TAG_11)
if (ntrajs_rcvd_io .gt. 0) then
call increase_ibuffer(ibuffer_io, ntrajs_rcvd_io,buffer_width_traj) line 1296
From print statements I see:
Tile-root pe receives the two bad numbers from pe127 and pe235.
Tile-root pe receives "0" from all other pes (normal behavior).
pe127 and pe235 both seem to send a "0" each.
So, how are these "0" sends interrupted and changed to bad numbers before they are received?
You've got to love the futuristic comments on icebergs_io.F90
io_layout=0,0 runs fine, in this case each pe writes its own file which bypasses the above send/recv block.
Hit the same issue when compiled and ran with the only other version of mpich available on gaea (cray-mpich/7.7.6).
The older successful runs were with cray-mpich/7.4.0 no longer available to test with.
Found that land was using the same MPI tags as icebergs. Changed the icebergs MPI tags to be unique, hit the same issue.
Hit the same issue after compiling SIS2/icebergs without openmp.
Finally I put a logic to trap the junk numbers and ignore them and saw a bunch them happening at the same timestep and then go away and do not show up again. This lets the model run.
if (ntrajs_rcvd_io .gt. 0) then
+ if (ntrajs_rcvd_io .gt. 10000000) then
+ print*,'WARNING write_trajectory got junk ',ntrajs_rcvd_io,from_pe,mpp_pe()
+ else
call increase_ibuffer(ibuffer_io, ntrajs_rcvd_io,buffer_width_traj)
WARNING write_trajectory got junk 1081003228 127 108
WARNING write_trajectory got junk 1081286236 128 108
WARNING write_trajectory got junk 1078358731 129 108
WARNING write_trajectory got junk 1080275122 235 216
WARNING write_trajectory got junk 1077851290 236 216
WARNING write_trajectory got junk 1078175925 130 108
WARNING write_trajectory got junk 1076590706 131 108
WARNING write_trajectory got junk 1081520236 260 216
WARNING write_trajectory got junk 1081457183 261 216
WARNING write_trajectory got junk 1082146344 262 216
WARNING write_trajectory got junk 1079991563 264 216
WARNING write_trajectory got junk 1080958438 265 216
WARNING write_trajectory got junk 1082072390 266 216
WARNING write_trajectory got junk 1079800498 267 216
WARNING write_trajectory got junk 1080201548 268 216
WARNING write_trajectory got junk 1081105849 269 216
WARNING write_trajectory got junk 1079399123 147 108
WARNING write_trajectory got junk 1081196633 148 108
WARNING write_trajectory got junk 1078023477 149 108
WARNING write_trajectory got junk 1082303273 150 108
WARNING write_trajectory got junk 1083201361 155 108
WARNING write_trajectory got junk 1083201139 156 108
WARNING write_trajectory got junk 1083190294 157 108
WARNING write_trajectory got junk 1083130777 158 108
WARNING write_trajectory got junk 1082965927 159 108
WARNING write_trajectory got junk 1082547767 160 108
WARNING write_trajectory got junk 1081079324 163 108
WARNING write_trajectory got junk 1080730703 164 108
WARNING write_trajectory got junk 1079544729 165 108
WARNING write_trajectory got junk 1076337821 166 108
WARNING write_trajectory got junk 1077993257 183 108
WARNING write_trajectory got junk 1079117300 184 108
WARNING write_trajectory got junk 1081362583 185 108
WARNING write_trajectory got junk 1082067169 186 108
WARNING write_trajectory got junk 1081809898 187 108
WARNING write_trajectory got junk 1082697372 188 108
WARNING write_trajectory got junk 1082910792 189 108
WARNING write_trajectory got junk 1082867133 190 108
WARNING write_trajectory got junk 1082775489 191 108
WARNING write_trajectory got junk 1082319811 192 108
WARNING write_trajectory got junk 1080387855 193 108
WARNING write_trajectory got junk 1082932685 194 108
WARNING write_trajectory got junk 1080222651 195 108
WARNING write_trajectory got junk 1082477423 196 108
WARNING write_trajectory got junk 1082375281 197 108
WARNING write_trajectory got junk 1081634444 198 108
WARNING write_trajectory got junk 1081207162 199 108
WARNING write_trajectory got junk 1080192649 200 108
WARNING write_trajectory got junk 1077080758 201 108
WARNING write_trajectory got junk 1079164729 297 216
WARNING write_trajectory got junk 1080845120 298 216
WARNING write_trajectory got junk 1081012269 300 216
WARNING write_trajectory got junk 1080673817 301 216
WARNING write_trajectory got junk 1077248540 302 216
WARNING write_trajectory got junk 1078530584 303 216
WARNING write_trajectory got junk 1080136256 304 216
I guess some other component is losing messages. Odd that it doesn't seem to matter. If you make the icebergs passive you test whether the work around is changing the state of other components.
Does the originating PEs of junk messages change with different i/o layouts (other than working properly with 0,0)?
There are only two PEs (108 and 216) receiving those junk messages: were there only two i/o PEs? If more, what's special about these two?
SIS2_io_layout = "1,4", so there are 4 io-tiles and root pes.
SIS2_layout = "36,12", so the two pes receiving junks are the roots of the two middle latitude io-bands.
I changed all iceberg MPI messages to be uniquely tagged at mpp_send and mpp_rec, so I don't see how they could come from or go to other components unless something is wrong with MPICH. The above workaround might affect iceberg trajectories itself because there are some 0's being sent that were ignored by the receiver, but they might hang around till the next trajectory write time and get received then.
There are no icebergs produced by the end of this 1 month run and there is no icebergs trajectory file being written. This also indicates that all pes send 0's, but root pes sometimes get junk.
We are here because traj_write_hrs=480 =20 days. I wonder if increasing that would bypass the issue.
I'll try more layouts.
For ice_layout = 36,12 and ice_io_layout = 1,4 , 2 of 4 root pes get junk (2 polar tiles are clean).
For ice_layout = 36,12 and ice_io_layout = 1,6 , 4 of 6 root pes get junk (2 polar tiles are clean).
For ice_layout = 24,18 and ice_io_layout = 1,6 , same as above
Answers are the same for different layouts.
Changing traj_write_hrs=30 days, I again get junk numbers but on day 30. Must have the logic to ignore them.
Curious, root pes get junk from the same set of pes as before, but the junk numbers are different (still of the same order 108...).
Could you try io_layout=4,1? This way there might be bergs on all I/O PEs.
This happens at the very first call to write_trajectory for any io_layout (including 4,1), even when there are no icebergs anywhere in the run.
I found a trick to neutralize these junk numbers at initialization. It's like a pipe cleaner that flushes the buffers at initialization. It still remains to explain where the junks are coming from.
Introduce a call in ice_bergs_io_init
call mpp_cough
where mpp_cough just sends and receives 0's on I/O tiles:
subroutine mpp_cough
integer nsent,nrcvd,np,from_pe
nsent =0
nrcvd =0
if(is_io_tile_root_pe) then
do np=2,size(io_tile_pelist) ! Note: np starts from 2 to exclude self
from_pe=io_tile_pelist(np)
call mpp_recv(nrcvd, glen=1, from_pe=from_pe, tag=COMM_TAG_11)
if (nrcvd .ne. 0) then
print*,'mpp_cough warning: Got junk ',nrcvd,from_pe,mpp_pe()
endif
enddo
else
call mpp_send(nsent, plen=1, to_pe=io_tile_root_pe, tag=COMM_TAG_11)
endif
end subroutine mpp_cough
This call traps some junk numbers at initialization, but none appear in the main run afterwards. I will test this band-aid in a year long run to make sure it holds.
Rusty found the root cause of this issue in a FMS unit test I made. He wrote: "While the mpp_recv is blocking by default, the mpp_send is always non-blocking. Because of this, your non-root pes in the 'test_send_recv' and 'test_tile_send_recv' all exit the subroutine and nsent for those MPI-ranks becomes undefined. You have two ways to fix this - 1) add an mpp_sync_self at the end of each subroutine or 2) declare your variables as globals."
Based on his suggestion I set ntrajs_sent_io and ntrajs_rcvd_io as module variables and that fixed the CM4 highres issue.
I'll make a PR shortly.