RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
nagolinc opened this issue · 4 comments
when running the 7b model on WSL2 (Ubuntu):

root@loganrtx:~/chameleon# python -m chameleon.miniviewer

torch is using CUDA 12.1:

>>> import torch
>>> torch.cuda.is_available()
True
>>> print(torch.version.cuda)
12.1
I'm getting the following error:
...
Process SpawnProcess-2:
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/root/anaconda3/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/chameleon/chameleon/inference/chameleon.py", line 509, in _worker_impl
for token in Generator(
File "/root/chameleon/chameleon/inference/chameleon.py", line 403, in next
piece = next(self.dyngen)
File "/root/chameleon/chameleon/inference/utils.py", line 20, in next
return next(self.gen)
File "/root/chameleon/chameleon/inference/chameleon.py", line 279, in next
tok = next(self.gen)
File "/root/anaconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/chameleon/chameleon/inference/generation.py", line 91, in next
next_tokens = self.token_selector(
File "/root/chameleon/chameleon/inference/token_selector.py", line 31, in call
return probs.multinomial(num_samples=1).squeeze(1)
File "/root/anaconda3/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in torch_function
return func(*args, **kwargs)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
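For context on the error at the bottom of the trace: `torch.multinomial` refuses to sample when the probability tensor contains `inf`, `nan`, or negative entries, which usually means the upstream logits went non-finite (numerical overflow, or a corrupted forward pass). A minimal sketch that reproduces the failure and shows one possible guard — `safe_multinomial` is a hypothetical helper for illustration, not part of the Chameleon repo:

```python
import torch

def safe_multinomial(probs: torch.Tensor, num_samples: int = 1) -> torch.Tensor:
    """Sample like probs.multinomial(...), but sanitize inf/nan/negative entries first."""
    invalid = ~torch.isfinite(probs) | (probs < 0)
    if invalid.any():
        # Zero out the bad entries so multinomial's validation passes.
        probs = torch.where(invalid, torch.zeros_like(probs), probs)
    # If an entire row was zeroed out, fall back to a uniform distribution.
    zero_rows = probs.sum(dim=-1) == 0
    if zero_rows.any():
        probs = torch.where(zero_rows.unsqueeze(-1), torch.ones_like(probs), probs)
    return probs.multinomial(num_samples=num_samples).squeeze(1)

# Reproduce the failure from the traceback: a nan in the probability tensor.
bad = torch.tensor([[0.5, float("nan"), 0.5]])
try:
    bad.multinomial(num_samples=1)
except RuntimeError as e:
    print("raises:", e)  # probability tensor contains either `inf`, `nan` or element < 0

tok = safe_multinomial(bad)  # samples index 0 or 2; the nan entry is masked out
```

Masking is only a band-aid, of course — if the logits are nan the generation output is garbage either way; the real fix is finding why they went non-finite.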
Can you share more information? Output of `nvidia-smi`, generation inputs, ...
^C(base) root@loganrtx:~/chameleon# nvidia-smi
Thu Jun 20 13:24:15 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.00 Driver Version: 560.38 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:01:00.0 On | N/A |
| 0% 47C P8 46W / 370W | 2106MiB / 24576MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 39 G /Xwayland N/A |
+-----------------------------------------------------------------------------------------+
I'm not able to reproduce this. I'm assuming it's maybe running out of memory, since the model, without activations or cache, is using 2.1GB of the 24.5GB on the RTX 3090.
@jacobkahn Any ideas?
Yeah, it looks like it's OOM.
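One quick way to check the OOM hypothesis before loading the model is to query free device memory directly. This is a diagnostic suggestion, not something from the thread; `report_cuda_memory` is a hypothetical helper:

```python
import torch

def report_cuda_memory():
    """Return (free, total) bytes on the current CUDA device, or (0, 0) without CUDA."""
    if not torch.cuda.is_available():
        return (0, 0)
    free, total = torch.cuda.mem_get_info()
    print(f"free: {free / 2**30:.1f} GiB / total: {total / 2**30:.1f} GiB")
    return (free, total)

free, total = report_cuda_memory()
# A 7B model in bf16 needs roughly 14 GiB for weights alone, before
# activations and the KV cache, so much less than ~16 GiB free is a red flag.
if torch.cuda.is_available() and free < 16 * 2**30:
    print("likely to OOM during generation")
```

Note the `nvidia-smi` above shows 2.1 GiB already taken by the desktop (Xwayland) even before the model loads, which shrinks the headroom further.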