Out-of-memory crash with large sampling
Opened this issue · 17 comments
AMR-Wind crashes on Kestrel CPU nodes with out-of-memory error when large sampling is requested. The error is below:
slurmstepd: error: Detected 1 oom_kill event in StepId=4893832.0. Some of the step tasks have been OOM Killed.
srun: error: x1008c5s2b0n0: task 108: Out Of Memory
Here is the sampling portion that is creating the out-of-memory error:
incflo.post_processing = box_lr
box_lr.output_format = netcdf
box_lr.output_frequency = 4
box_lr.fields = velocity
box_lr.labels = Low1
box_lr.Low1.type = PlaneSampler
box_lr.Low1.num_points = 1451 929
box_lr.Low1.origin = -985.0000 -10045.0000 5.0000
box_lr.Low1.axis1 = 29000.0000 0.0 0.0
box_lr.Low1.axis2 = 0.0 18560.0000 0.0
box_lr.Low1.normal = 0.0 0.0 1.0
box_lr.Low1.offsets = 0.0 20.0 40.0 60.0 80.0 100.0 120.0 140.0 160.0 180.0 200.0 220.0 240.0 260.0 280.0 300.0 320.0 340.0 360.0 380.0 400.0 420.0 440.0 460.0 480.0 500.0 520.0 540.0 560.0 580.0
I understand I'm asking for ~43M points, which will come with performance slowdowns. I can deal with the slowdown, but I find it weird that it is crashing. The memory footprint of u, v, and w for all these points should be about 1 GB.
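For what it's worth, that ~1 GB figure checks out. A quick sketch of the arithmetic (assuming the payload per point is just the three velocity doubles, nothing else):

```python
# Back-of-the-envelope memory estimate for the sampling request above.
num_points_per_plane = 1451 * 929     # num_points from the input file
num_planes = 30                       # length of the offsets list
total_points = num_points_per_plane * num_planes

# Assumption: 3 velocity components (u, v, w), double precision.
bytes_per_point = 3 * 8
total_bytes = total_points * bytes_per_point

print(f"total points: {total_points / 1e6:.1f} M")
print(f"velocity footprint: {total_bytes / 2**30:.2f} GiB")
```

So roughly 40M points and just under 1 GiB for the sampled velocities, nowhere near enough on its own to exhaust a 250 GB node.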
Things I tried (none worked):
- AMR-Wind compiled with GNU
- AMR-Wind compiled with Intel (oneAPI classic)
- AMR-Wind at different commit points (not the very latest, though -- note my offset keyword hasn't changed yet)
- native output instead of netcdf
- Splitting the large sampling into smaller ones (keeping the total sampling the same)
- Runs with different numbers of nodes (all the way up to 50)
Some observations:
- If I comment out about half of the
offsets
and request >=4 nodes, it works.- If I request very few nodes (2 or 3), it crashes in the first time step (around the temperature_solve), before any actual sampling is about to take place.
- If I request more nodes, (>=4), the main time loop starts, and it crashes on the very first time step where such sampling is happening (in the example above, the 4th).
- If I leave all the offsets there, it crashes right after the
Creating SamplerBase instance: PlaneSampler
message.
Edit: Adding the input files for reproducibility
static_box.txt
setup_seagreen_prec_neutral.startAt0.i.txt
With a run that completes, what's the output of a build with this flag switched to on? https://github.com/Exawind/amr-wind/blob/main/cmake/set_amrex_options.cmake#L19
Here it is. This run uses about half of the offsets in the list above. I also added the tiny profiler and can share the results if useful.
Pinned Memory Usage:
------------------------------------------------------------------------------------------------------------------------------------------------
Name Nalloc Nfree AvgMem min AvgMem avg AvgMem max MaxMem min MaxMem avg MaxMem max
------------------------------------------------------------------------------------------------------------------------------------------------
The_Pinned_Arena::Initialize() 312 312 60 B 119 B 153 B 8192 KiB 8192 KiB 8192 KiB
amr-wind::PlaneAveragingFine::compute_averages 6864 6864 7 B 7 B 7 B 3072 B 3072 B 3072 B
amr-wind::VelPlaneAveragingFine::compute_hvelmag_averages 10296 10296 4 B 4 B 4 B 3072 B 3072 B 3072 B
amr-wind::PlaneAveraging::compute_averages 10296 10296 0 B 1 B 2 B 768 B 768 B 768 B
amr-wind::VelPlaneAveraging::compute_hvelmag_averages 3432 3432 0 B 0 B 0 B 256 B 256 B 256 B
------------------------------------------------------------------------------------------------------------------------------------------------
Just realized that a memlog was created:
Final Memory Profile Report Across Processes:
| Name | Current | High Water Mark |
|-----------------+--------------------+--------------------|
| Fab | 0 ... 0 B | 1496 ... 1869 MB |
| MemPool | 8192 ... 8192 KB | 8192 ... 8192 KB |
| BoxArrayHash | 0 ... 0 B | 4560 ... 4569 KB |
| BoxArray | 0 ... 0 B | 3896 ... 3896 KB |
|-----------------+--------------------+--------------------|
| Total | 8192 ... 8192 KB | |
| Name | Current # | High Water Mark # |
|-----------------+--------------------+--------------------|
| BoxArray Innard | 0 ... 0 | 40 ... 40 |
| MultiFab | 0 ... 0 | 994 ... 999 |
* Proc VmPeak VmSize VmHWM VmRSS
[ 5236 ... 8173 MB] [ 2175 ... 3136 MB] [ 4553 ... 7116 MB] [ 1677 ... 2639 MB]
* Node total free free+buffers+cached shared
[ 250 ... 250 GB] [ 122 ... 135 GB] [ 142 ... 145 GB] [ 534 ... 641 MB]
I got the same error message for a test case with 360 million grid points on 6 Kestrel nodes (using 96 cores on each). The case is similar to the one Regis submitted.
The failure seems to be happening only on CPU. I tried running the same simulations on GPU and the simulations ran without any issues.
I did not have success on the GPU. Time per time step increased five-fold on 2 GPU nodes, and the case still crashed OOM on a single GPU node.
I tried a larger case and OOM happened on both CPU and GPU.
Ganesh has tried it as well with his exawind-manager build that includes some of the extra HDF5 flags. He tried with 8 and 100 nodes. No luck, same error.
I will be looking into this with the case files @rthedin gave me. Hopefully this week.
I got it to work using 8 GPUs (2 nodes on Kestrel). If I use fewer, I can see the memory creeping up, followed by a crash. Each GPU has about 80 GB of memory.
Some preliminary data that Jon and I were looking at:
Case:
No AMR, ABL case, not very big:
Level 0 375 grids 12288000 cells 100 % of domain
smallest grid: 32 x 32 x 32 biggest grid: 32 x 32 x 32
Running on 1 Kestrel node, 104 ranks, 250GB of RAM, Intel build.
Sampling section of the input file is the interesting part:
incflo.post_processing = box_lr
# ---- Low-res sampling parameters ----
# box_lr.output_format = netcdf
box_lr.output_format = native
box_lr.output_frequency = 2
box_lr.fields = velocity
box_lr.labels = Low
# Low sampling grid spacing = 20 m
box_lr.Low.type = PlaneSampler
box_lr.Low.num_points = 1451 929
box_lr.Low.origin = -985.0000 -10045.0000 5.0000
box_lr.Low.axis1 = 29000.0000 0.0 0.0
box_lr.Low.axis2 = 0.0 18560.0000 0.0
box_lr.Low.normal = 0.0 0.0 1.0
box_lr.Low.offsets = 0.0 20.0 40.0 60.0 80.0 100.0 120.0 140.0 160.0 180.0 200.0 220.0 240.0 260.0 280.0 300.0 320.0 340.0 360.0 380.0 400.0 420.0 440.0 460.0 480.0 500.0 520.0 540.0 560.0 580.0
I ran with 0, 1, 2, and 4 sampling planes so I could get a simulation that completes. 4 time steps total, with a sampling frequency of 2.
Results
(plots: no sampling, 1 plane, 2 planes, 4 planes)
Conclusions
- With all the sampling planes, this causes an OOM.
- With no sampling, AMR-Wind is using about 200 MB * 104 ranks = 20 GB, which works out to ~1.7 kB/cell. Is that reasonable? Not sure; maybe, if I thought about it enough.
- Each additional sampling plane adds about 100 MB of RAM usage per rank. Naively I would expect (1451*929) particles * ((3+3) double fields * 8 bytes + 2 int fields * 4 bytes) / (1024*1024) ≈ 75 MB needed per plane, total -- but not per rank. Unclear why all the ranks need that much extra memory.
- Rank 0 doing IO (I think; need to confirm) is clearly visible. Or it is creating the particles and then calling redistribute to the other ranks (which would explain the time delay of the spike on rank 0 and then the other ranks' memory increasing).
Next steps
- Validate/invalidate hypothesis: particles created on rank 0 and then redistributed should instead be created on all ranks at the same time, in parallel (I've done this in other projects).
- Validate/invalidate hypothesis: the mismatch between my naive estimate of how much memory a particle needs and how much it actually uses is due to my lack of understanding of how much data a particle carries. Maybe we are adding more data fields to the particles than they actually need.
- Run through a "real" profiler to get finer-grained metrics.
Ok, I think I understand why "native" is not behaving the way I would expect, and I think I know why netcdf IO is using so much memory. Each rank is carrying m_output_buf and m_sample_buf of size nparticles * nvars. This is totally unnecessary for the native IO, and probably way too big for the netcdf IO. My first step is going to be making the native IO behave as I expect it to; then I'll deal with netcdf.
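To illustrate the scale of the problem: if every rank carries a full nparticles * nvars buffer, the per-rank cost does not shrink as you add ranks. A sketch with hypothetical numbers (nvars = 3 for the velocity components is an assumption about the buffer layout; the factor of two follows the m_output_buf/m_sample_buf description above):

```python
# Replicated-buffer cost for the full 30-plane sampling request.
nparticles = 1451 * 929 * 30          # all planes
nvars = 3                             # assumed: u, v, w
buf_bytes = nparticles * nvars * 8    # one double-precision buffer

# Two such buffers per rank (m_output_buf and m_sample_buf).
two_bufs_mb = 2 * buf_bytes / 1024**2
print(f"replicated buffers per rank: {two_bufs_mb:.0f} MB")

# On a 104-rank Kestrel node this alone eats most of the 250 GB of RAM,
# before the solver allocates anything.
per_node_gb = two_bufs_mb * 104 / 1024
print(f"per 104-rank node: {per_node_gb:.0f} GB")
```

Under these assumptions the replicated buffers alone cost ~1.8 GB per rank, ~188 GB per node, which is consistent with the crashes happening as soon as the samplers are created.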
Got some good news, at least on the native side:
(plots: 4 planes with current amr-wind; 4 planes with #1207; 30 planes with my branch, not possible with current amr-wind)
Conclusions
- For 4 planes, most ranks use just around 200 MB, which is probably only slightly more than with no planes.
- There is one rank using much more memory, in spikes. That has to do with how we do particle init (I think, though it doesn't explain everything). I will be working on this next.
- netcdf is still going to be an issue, but I have thoughts on how to fix that too.
#1209 has another round of improvements. Repeating the conclusion of that PR here:
- This PR removes (over 2 time steps) 3 of the 4 huge memory spikes on rank 0.
- Instead of a 40 GB memory spike, there is now only a single 10 GB memory spike, a 4x improvement.
- We also got a speedup: 4.25x per time step (2x over the total run time, 1.7x for init).
This should help the native and netcdf samplers.
This issue is stale because it has been open 30 days with no activity.
I keep seeing quite high memory consumption for large data sampling using netcdf sampling. My impression is that there is no significant improvement compared to the older versions.
Has the fix for this issue already been validated with netcdf sampling? Or am I just sampling too much data?
Hi, thank you for reaching out. Improving the memory consumption of the samplers is an ongoing effort. Right now our focus is on improving the memory consumption of the native pathway for the samplers, since it is more performant to begin with (and improvements there impact the netcdf samplers at the same time). So my recommendation has been to encourage users to use that pathway. There are example scripts in the tools directory for manipulating the resulting data with Python.