Exawind/amr-wind

Out-of-memory crash with large sampling

AMR-Wind crashes on Kestrel CPU nodes with an out-of-memory error when large sampling is requested. The error is below:

 slurmstepd: error: Detected 1 oom_kill event in StepId=4893832.0. Some of the step tasks have been OOM Killed.
 srun: error: x1008c5s2b0n0: task 108: Out Of Memory

Here is the sampling portion that is creating the out-of-memory error:

incflo.post_processing                =  box_lr 

box_lr.output_format    = netcdf
box_lr.output_frequency = 4
box_lr.fields           = velocity
box_lr.labels           = Low1 

box_lr.Low1.type         = PlaneSampler
box_lr.Low1.num_points   = 1451 929
box_lr.Low1.origin       = -985.0000 -10045.0000 5.0000
box_lr.Low1.axis1        = 29000.0000 0.0 0.0
box_lr.Low1.axis2        = 0.0 18560.0000 0.0
box_lr.Low1.normal       = 0.0 0.0 1.0
box_lr.Low1.offsets      = 0.0 20.0 40.0 60.0 80.0 100.0 120.0 140.0 160.0 180.0 200.0 220.0 240.0 260.0 280.0 300.0 320.0 340.0 360.0 380.0 400.0 420.0 440.0 460.0 480.0 500.0 520.0 540.0 560.0 580.0

I understand I'm asking for ~43M points, which will come with a performance slowdown. I can deal with the slowdown, but I find it strange that it crashes: the memory footprint of u, v, and w for all of these points should only be about 1 GB.
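
For reference, a quick back-of-the-envelope version of that estimate (a Python sketch; it assumes only the three velocity components are stored per point, as 8-byte doubles):

# Raw size of the requested velocity samples.
points_per_plane = 1451 * 929    # box_lr.Low1.num_points
num_planes = 30                  # entries in box_lr.Low1.offsets
components = 3                   # u, v, w
total_points = points_per_plane * num_planes
total_bytes = total_points * components * 8
print(f"{total_points / 1e6:.1f} M points, {total_bytes / 2**30:.2f} GiB")
# -> roughly 40 M points and about 0.9 GiB of raw velocity data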

Things I tried (none worked):

  • AMR-Wind compiled with GNU
  • AMR-Wind compiled with Intel (oneapi classic)
  • AMR-Wind at different commit points (not the very latest, though; note that the offsets keyword hasn't changed yet at those commits)
  • native instead of netcdf
  • Splitting the large sampling into several smaller ones (keeping the total number of sampling points the same)
  • Runs with different numbers of nodes (all the way to 50)

Some observations:

  • If I comment out about half of the offsets and request >=4 nodes, it works.
    • If I request very few nodes (2 or 3), it crashes in the first time step (around the temperature_solve), before any actual sampling takes place.
    • If I request more nodes (>=4), the main time loop starts, and it crashes on the first time step where the sampling happens (in the example above, the 4th).
  • If I leave all the offsets in, it crashes right after the "Creating SamplerBase instance: PlaneSampler" message.

Edit: Adding the input files for reproducibility
static_box.txt
setup_seagreen_prec_neutral.startAt0.i.txt

With a run that completes, what's the output of a build with this flag switched on? https://github.com/Exawind/amr-wind/blob/main/cmake/set_amrex_options.cmake#L19

Here it is. This run uses about half of the offsets in the list above. I also enabled the tiny profiler and can share those results if useful.

Pinned Memory Usage:
------------------------------------------------------------------------------------------------------------------------------------------------                                      
Name                                                       Nalloc  Nfree  AvgMem min  AvgMem avg  AvgMem max  MaxMem min  MaxMem avg  MaxMem max
------------------------------------------------------------------------------------------------------------------------------------------------
The_Pinned_Arena::Initialize()                                312    312      60   B     119   B     153   B    8192 KiB    8192 KiB    8192 KiB
amr-wind::PlaneAveragingFine::compute_averages               6864   6864       7   B       7   B       7   B    3072   B    3072   B    3072   B
amr-wind::VelPlaneAveragingFine::compute_hvelmag_averages   10296  10296       4   B       4   B       4   B    3072   B    3072   B    3072   B
amr-wind::PlaneAveraging::compute_averages                  10296  10296       0   B       1   B       2   B     768   B     768   B     768   B
amr-wind::VelPlaneAveraging::compute_hvelmag_averages        3432   3432       0   B       0   B       0   B     256   B     256   B     256   B
------------------------------------------------------------------------------------------------------------------------------------------------

Just realized that a memlog was created

Final Memory Profile Report Across Processes:
      | Name            |       Current      |   High Water Mark  |
      |-----------------+--------------------+--------------------|
      | Fab             |     0 ... 0      B |  1496 ... 1869  MB |
      | MemPool         |  8192 ... 8192  KB |  8192 ... 8192  KB |
      | BoxArrayHash    |     0 ... 0      B |  4560 ... 4569  KB |
      | BoxArray        |     0 ... 0      B |  3896 ... 3896  KB |
      |-----------------+--------------------+--------------------|
      | Total           |  8192 ... 8192  KB |                    |

      | Name            |      Current #     |  High Water Mark # |
      |-----------------+--------------------+--------------------|
      | BoxArray Innard |      0 ... 0       |     40 ... 40      |
      | MultiFab        |      0 ... 0       |    994 ... 999     |

 * Proc VmPeak          VmSize               VmHWM                VmRSS
   [ 5236 ... 8173  MB] [ 2175 ... 3136  MB] [ 4553 ... 7116  MB] [ 1677 ... 2639  MB]

 * Node total           free                 free+buffers+cached  shared
   [  250 ... 250   GB] [  122 ... 135   GB] [  142 ... 145   GB] [  534 ... 641   MB]

I got the same error message for a test case with 360 million grid points on 6 Kestrel nodes (using 96 cores on each). The case is similar to the one Regis submitted.

The failure seems to be happening only on CPU. I tried running the same simulations on GPU and the simulations ran without any issues.

I did not have success on the GPU. The time per time step increased five-fold on 2 GPU nodes, and the case still crashed with OOM on a single GPU node.

I tried a larger case and OOM happened on both CPU and GPU.

Ganesh has tried it as well with his exawind-manager build that includes some of the extra HDF5 flags. He tried with 8 and 100 nodes. No luck, same error.

I will be looking into this with the case files @rthedin gave me. Hopefully this week.

I got it to work using 8 GPUs (2 nodes on Kestrel). If I use fewer, I can see the memory creeping up, followed by a crash. Each GPU has about 80 GB of memory.

Some preliminary data that Jon and I were looking at:

Case:

No AMR, ABL case, not very big:

  Level 0   375 grids  12288000 cells  100 % of domain
            smallest grid: 32 x 32 x 32  biggest grid: 32 x 32 x 32

Running on 1 Kestrel node, 104 ranks, 250GB of RAM, Intel build.

Sampling section of the input file is the interesting part:

incflo.post_processing                = box_lr

# ---- Low-res sampling parameters ----
# box_lr.output_format    = netcdf
box_lr.output_format    = native
box_lr.output_frequency = 2
box_lr.fields           = velocity
box_lr.labels           = Low
# Low sampling grid spacing = 20 m
box_lr.Low.type         = PlaneSampler
box_lr.Low.num_points   = 1451 929
box_lr.Low.origin       = -985.0000 -10045.0000 5.0000
box_lr.Low.axis1        = 29000.0000 0.0 0.0
box_lr.Low.axis2        = 0.0 18560.0000 0.0
box_lr.Low.normal       = 0.0 0.0 1.0
box_lr.Low.offsets      = 0.0 20.0 40.0 60.0 80.0 100.0 120.0 140.0 160.0 180.0 200.0 220.0 240.0 260.0 280.0 300.0 320.0 340.0 360.0 380.0 400.0 420.0 440.0 460.0 480.0 500.0 520.0 540.0 560.0 580.0

I ran with 0, 1, 2, and 4 sampling planes so I could get a simulation that completes. 4 time steps total, with a sampling frequency of 2.

Results

  • no sampling: memory_0plane (memory usage plot)
  • 1 plane: memory_1plane (memory usage plot)
  • 2 planes: memory_2plane (memory usage plot)
  • 4 planes: memory_4plane (memory usage plot)

Conclusions

  • with all the sampling planes, this causes an OOM
  • with no sampling, AMR-Wind uses about 200 MB * 104 ranks = 20 GB, which works out to ~1.7 KB/cell. Is that reasonable? I don't know; maybe, if I thought about it enough.
  • each additional sampling plane adds about 100 MB of RAM usage per rank. Naively I would expect (1451*929) particles * ((3+3) double fields * 8 bytes + 2 int fields * 4 bytes) / (1024*1024) ≈ 75 MB per plane, in total, not per rank (see the sketch after this list). It is unclear why all the ranks need that much extra memory.
  • rank 0 doing IO (I think; need to confirm) is clearly visible. Or it is creating the particles and then calling redistribute to send them to the other ranks (which would explain the delay between the spike on rank 0 and the memory increase on the other ranks).
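
A worked version of the two estimates above (a Python sketch; the 6-double + 2-int per-particle layout is the same naive assumption as in the bullet, not a confirmed description of the real particle type):

# Baseline memory per cell, from the no-sampling case.
cells = 12_288_000                       # Level 0 cell count above
ranks = 104
baseline_bytes = 200 * 2**20 * ranks     # ~200 MB per rank with no sampling
print(f"baseline: {baseline_bytes / 2**30:.1f} GiB total, "
      f"{baseline_bytes / cells / 1024:.2f} KiB per cell")

# Naive raw data per sampling plane (6 double fields + 2 int fields assumed).
particles_per_plane = 1451 * 929
bytes_per_particle = 6 * 8 + 2 * 4
plane_bytes = particles_per_plane * bytes_per_particle
print(f"one plane: {plane_bytes / 2**20:.0f} MiB in total, across all ranks")
# -> ~20 GiB baseline (~1.7 KiB/cell) and ~72 MiB per plane in total,
#    versus the observed ~100 MB of extra memory per rank per plane.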

Next steps

  • Validate/invalidate hypothesis: the particles are created on rank 0 and then redistributed; they should instead be created on all ranks at the same time, in parallel (I've done this in other projects)

  • Validate/invalidate hypothesis: the mismatch between my naive estimate of how much memory a particle needs and how much it actually uses comes from my incomplete picture of how much data a particle carries. Maybe we are attaching more data fields to the particles than they actually need.

  • Run through a "real" profiler to get finer grained metrics.

OK, I think I understand why "native" is not behaving the way I would expect, and I think I know why the netcdf IO is using so much memory. Each rank is carrying m_output_buf and m_sample_buf of size nparticles * nvars. This is totally unnecessary for the native IO, and probably way too big for the netcdf IO. My first step is to make the native IO behave the way I expect; then I'll deal with netcdf.
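
For scale, here is roughly what those two buffers amount to on the full 30-plane case (a sketch; nvars = 3 for velocity, 8-byte doubles, and one 104-rank Kestrel node are my assumptions):

# Size of m_output_buf / m_sample_buf if every rank holds all samples.
nparticles = 1451 * 929 * 30          # all sampling points
nvars = 3                             # velocity components (assumed)
buf_bytes = nparticles * nvars * 8    # 8-byte doubles (assumed)
print(f"one buffer:   {buf_bytes / 2**30:.2f} GiB per rank")
print(f"both buffers: {2 * buf_bytes / 2**30:.2f} GiB per rank")
print(f"104 ranks:    {2 * buf_bytes * 104 / 2**30:.0f} GiB per node")
# -> ~0.9 GiB per buffer, ~1.8 GiB per rank, ~188 GiB on a 104-rank node,
#    most of Kestrel's 250 GB before any field data is counted.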

Got some good news, at least on the native side:

  • 4 planes, current amr-wind: memory_4plane (memory usage plot)
  • 4 planes, #1207: memory_4plane_new (memory usage plot)
  • 30 planes, my branch (not possible with current amr-wind): memory_30plane_new (memory usage plot)

Conclusions

  • for 4 planes, most ranks use only around 200 MB, which is probably just slightly more than with no planes
  • there is one rank using much more memory, in spikes. That has to do with how we do particle init (I think, though it doesn't explain everything). I will be working on this next.
  • netcdf is still going to be an issue, but I have thoughts on how to fix that too

#1209 has another round of improvements. Repeating the conclusion of that PR here:

- This PR removes 3 of the 4 huge memory spikes on rank 0 (over 2 time steps).
- Instead of a 40 GB memory spike, there is now only a single 10 GB memory spike, a 4x improvement.
- We also got a speedup: 4.25x per time step (2x over the total run time, 1.7x for init).

This should help the native and netcdf samplers.

This issue is stale because it has been open 30 days with no activity.

I keep seeing quite high memory consumption for large data sampling with netcdf output. As far as I can tell, there is no significant improvement compared to the older versions.

Has the fix for this issue already been validated with netcdf sampling? Or am I just sampling too much data?

Hi, thank you for reaching out. Improving the memory consumption of the samplers is an ongoing effort. Right now our focus is on the native output pathway, since it is more performant to begin with (and improvements there benefit the netcdf samplers at the same time), so my recommendation has been to encourage users to use that pathway. There are example scripts in the tools directory for manipulating the resulting data with Python.
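
For anyone moving to the native pathway, a rough sketch of what the post-processing step can look like. Everything here is illustrative: read_native_sampling is a placeholder (not a real AMR-Wind or AMReX API), the output directory name is made up, and the point ordering should be checked against the actual reader scripts in the tools directory:

import numpy as np

def read_native_sampling(path):
    # Placeholder: return (ids, velocities) for all sampler particles.
    # In practice, use one of the Python reader scripts in tools/.
    raise NotImplementedError

# Hypothetical output directory name for one sampled step.
ids, vel = read_native_sampling("post_processing/box_lr00004")

nx, ny = 1451, 929                  # box_lr.Low1.num_points from the input file
order = np.argsort(ids)             # particle order on disk is not guaranteed
u_sorted = vel[order, 0]
# Assuming ids enumerate the points plane by plane, row-major within a plane:
u_plane0 = u_sorted[: nx * ny].reshape(ny, nx)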