NVIDIA/earth2mip

🐛[BUG]: Failed to allocate memory for requested buffer of size 1851310080

melodicdeath opened this issue · 2 comments

Version

source - main

On which installation method(s) does this occur?

Pip

Describe the issue

I run the example 02_model_comparison:

print("Running Pangu inference")
pangu_ds = inference_ensemble.run_basic_inference(
pangu_inference_model,
n=24, # Note we run 24 steps here because Pangu is at 6 hour dt (6 day forecast)
data_source=pangu_data_source,
time=time,
)
pangu_ds.to_netcdf(f"{output_dir}/pangu_inference_out.nc")
print(pangu_ds)


RuntimeError Traceback (most recent call last)
in <cell line: 2>()
1 print("Running Pangu inference")
----> 2 pangu_ds = inference_ensemble.run_basic_inference(
3 pangu_inference_model,
4 n=24, # Note we run 24 steps here because Pangu is at 6 hour dt (6 day forecast)
5 data_source=pangu_data_source,

5 frames
/usr/local/lib/python3.10/dist-packages/earth2mip/inference_ensemble.py in run_basic_inference(model, n, data_source, time)
284 arrays = []
285 times = []
--> 286 for k, (time, data, _) in enumerate(model(time, x)):
287 arrays.append(data.cpu().numpy())
288 times.append(time)

/usr/local/lib/python3.10/dist-packages/earth2mip/networks/pangu.py in __call__(self, time, x, normalize, restart)
247 dt = torch.tensor(self.time_step.total_seconds())
248 x1 += self.source(x1, time1) * dt
--> 249 x1 = self.model_6(x1)
250 yield time1, x1, restart_data
251

/usr/local/lib/python3.10/dist-packages/earth2mip/networks/pangu.py in __call__(self, x)
142
143 def __call__(self, x):
--> 144 return self.forward(x)
145
146 def to(self):

/usr/local/lib/python3.10/dist-packages/earth2mip/networks/pangu.py in forward(self, x)
156 pl = pl.resize(*pl_shape)
157 sl = surface[0]
--> 158 plo, slo = self.model(pl, sl)
159 return torch.cat(
160 [

/usr/local/lib/python3.10/dist-packages/earth2mip/networks/pangu.py in __call__(self, fields_pl, fields_sfc)
122 output = bind_output("output", like=fields_pl)
123 output_sfc = bind_output("output_surface", like=fields_sfc)
--> 124 self.ort_session.run_with_iobinding(binding)
125 return output, output_sfc
126

/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py in run_with_iobinding(self, iobinding, run_options)
329 :param run_options: See :class:onnxruntime.RunOptions.
330 """
--> 331 self._sess.run_with_iobinding(iobinding._iobinding, run_options)
332
333 def get_tuning_results(self):

RuntimeError: Error in execution: Non-zero status code returned while running BiasSoftmax node. Name:'BiasSoftmax' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 1851310080

I don't know what went wrong. In the same environment, I tried loading pangu_weather_6.onnx directly and running inference, and the results were normal.
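For reference, loading pangu_weather_6.onnx directly with onnxruntime looks roughly like the sketch below; the input names and tensor shapes are assumptions based on the published Pangu-Weather ONNX release, not details confirmed in this issue:

import numpy as np
import onnxruntime as ort

# Create an inference session on the GPU, falling back to CPU if CUDA is unavailable.
session = ort.InferenceSession(
    "pangu_weather_6.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Dummy inputs: 5 upper-air variables on 13 pressure levels and 4 surface
# variables on a 721 x 1440 lat/lon grid (assumed shapes and input names).
fields_pl = np.zeros((5, 13, 721, 1440), dtype=np.float32)
fields_sfc = np.zeros((4, 721, 1440), dtype=np.float32)

out_pl, out_sfc = session.run(
    None, {"input": fields_pl, "input_surface": fields_sfc}
)
print(out_pl.shape, out_sfc.shape)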

Environment details

Kaggle, 2x Tesla T4 GPUs

!pip install ort-nightly-gpu --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ort-cuda-12-nightly/pypi/simple/

$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   49C    P0             26W /   70W |   13623MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00000000:00:05.0 Off |                    0 |
| N/A   38C    P8              9W /   70W |       3MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Sorry, it's not a bug. I got it working with the following changes:

  1. Install the optional dependencies for Pangu weather:
    $ pip install .[pangu]
  2. Change n from 24 to 12.
  3. Load only pangu_weather_6.onnx:
    pangu.load_6(package)

That's it; a sketch combining steps 2 and 3 is below.
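This is a rough sketch only, assuming package, pangu_data_source, time, and output_dir are set up as in the 02_model_comparison example:

from earth2mip import inference_ensemble
from earth2mip.networks import pangu

# Load only the 6-hour Pangu model instead of the combined 24h/6h pair,
# so only one ONNX session has to fit in GPU memory.
pangu_6_model = pangu.load_6(package)

print("Running Pangu inference")
pangu_ds = inference_ensemble.run_basic_inference(
    pangu_6_model,
    n=12,  # 12 steps at a 6 hour dt, i.e. a 3 day forecast
    data_source=pangu_data_source,
    time=time,
)
pangu_ds.to_netcdf(f"{output_dir}/pangu_inference_out.nc")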

Thanks for the update. I'll close this then.