🐛[BUG]: Failed to allocate memory for requested buffer of size 1851310080
melodicdeath opened this issue · 2 comments
Version
source - main
On which installation method(s) does this occur?
Pip
Describe the issue
I ran the example 02_model_comparison, and the Pangu inference step fails:
print("Running Pangu inference")
pangu_ds = inference_ensemble.run_basic_inference(
    pangu_inference_model,
    n=24,  # Note we run 24 steps here because Pangu is at 6 hour dt (6 day forecast)
    data_source=pangu_data_source,
    time=time,
)
pangu_ds.to_netcdf(f"{output_dir}/pangu_inference_out.nc")
print(pangu_ds)
RuntimeError Traceback (most recent call last)
in <cell line: 2>()
1 print("Running Pangu inference")
----> 2 pangu_ds = inference_ensemble.run_basic_inference(
3 pangu_inference_model,
4 n=24, # Note we run 24 steps here because Pangu is at 6 hour dt (6 day forecast)
5 data_source=pangu_data_source,
5 frames
/usr/local/lib/python3.10/dist-packages/earth2mip/inference_ensemble.py in run_basic_inference(model, n, data_source, time)
284 arrays = []
285 times = []
--> 286 for k, (time, data, _) in enumerate(model(time, x)):
287 arrays.append(data.cpu().numpy())
288 times.append(time)
/usr/local/lib/python3.10/dist-packages/earth2mip/networks/pangu.py in __call__(self, time, x, normalize, restart)
247 dt = torch.tensor(self.time_step.total_seconds())
248 x1 += self.source(x1, time1) * dt
--> 249 x1 = self.model_6(x1)
250 yield time1, x1, restart_data
251
/usr/local/lib/python3.10/dist-packages/earth2mip/networks/pangu.py in __call__(self, x)
142
143 def __call__(self, x):
--> 144 return self.forward(x)
145
146 def to(self):
/usr/local/lib/python3.10/dist-packages/earth2mip/networks/pangu.py in forward(self, x)
156 pl = pl.resize(*pl_shape)
157 sl = surface[0]
--> 158 plo, slo = self.model(pl, sl)
159 return torch.cat(
160 [
/usr/local/lib/python3.10/dist-packages/earth2mip/networks/pangu.py in __call__(self, fields_pl, fields_sfc)
122 output = bind_output("output", like=fields_pl)
123 output_sfc = bind_output("output_surface", like=fields_sfc)
--> 124 self.ort_session.run_with_iobinding(binding)
125 return output, output_sfc
126
/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py in run_with_iobinding(self, iobinding, run_options)
329 :param run_options: See :class:`onnxruntime.RunOptions`.
330 """
--> 331 self._sess.run_with_iobinding(iobinding._iobinding, run_options)
332
333 def get_tuning_results(self):
RuntimeError: Error in execution: Non-zero status code returned while running BiasSoftmax node. Name:'BiasSoftmax' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 1851310080
I don't know what went wrong. However, when I load pangu_weather_6.onnx directly in the same environment and run inference on it, the results are normal.
Environment details
Kaggle, GPU: 2 x T4
!pip install ort-nightly-gpu --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ort-cuda-12-nightly/pypi/simple/
$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   49C    P0             26W /   70W |   13623MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00000000:00:05.0 Off |                    0 |
| N/A   38C    P8              9W /   70W |       3MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
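As a rough sanity check, the failed request can be compared with the memory that nvidia-smi reports for GPU 0. The sketch below only redoes the arithmetic on figures copied from this report; it does not query a GPU. It also assumes the nvidia-smi snapshot reflects the state at the time of the failure, which the report does not confirm.

```python
# Figures copied from the report above; nothing here talks to a real GPU.
REQUESTED_BYTES = 1_851_310_080  # size named in the BFCArena error message
GPU0_TOTAL_MIB = 15_360          # nvidia-smi: GPU 0 total memory
GPU0_USED_MIB = 13_623           # nvidia-smi: GPU 0 memory in use

BYTES_PER_MIB = 1024 * 1024

requested_mib = REQUESTED_BYTES / BYTES_PER_MIB
free_mib = GPU0_TOTAL_MIB - GPU0_USED_MIB
shortfall_mib = requested_mib - free_mib

print(f"requested:     {requested_mib:.1f} MiB")
print(f"free on GPU 0: {free_mib} MiB")
print(f"shortfall:     {shortfall_mib:.1f} MiB")
```

Under that assumption, the single buffer of about 1765.5 MiB is slightly larger than the 1737 MiB still free on GPU 0, while GPU 1 sits almost entirely idle at 3 MiB used.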
Sorry, it's not a bug. I resolved it as follows:
- Installed the optional dependencies for Pangu weather:
  $ pip install .[pangu]
- Changed n from 24 to 12.
- Loaded only pangu_weather_6.onnx:
  pangu.load_6(package)
That was all it took.
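For reference, reducing n also shortens the forecast. Here is a minimal sketch of the lead-time arithmetic, assuming the 6-hour step stated in the example's comment. The helper name forecast_horizon is hypothetical and is not part of earth2mip.

```python
from datetime import timedelta

# Time step taken from the example's comment ("Pangu is at 6 hour dt").
PANGU_6H_STEP = timedelta(hours=6)


def forecast_horizon(n_steps: int, step: timedelta = PANGU_6H_STEP) -> timedelta:
    """Return the total lead time covered by n_steps model steps."""
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return n_steps * step


print(forecast_horizon(24))  # original setting: 6 days, 0:00:00
print(forecast_horizon(12))  # reduced setting:  3 days, 0:00:00
```

So with n=12 the run covers a 3-day forecast rather than the 6-day forecast in the original example.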
Thanks for the update. I'll close this then.