facebookresearch/chameleon

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

nagolinc opened this issue · 4 comments

When running the 7B model on WSL2 (Ubuntu):

root@loganrtx:~/chameleon# python -m chameleon.miniviewer

PyTorch is using CUDA 12.1:

>>> import torch
>>> torch.cuda.is_available()
True
>>> print(torch.version.cuda)
12.1
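
Two more checks from the same session that may be worth including with this kind of report (both standard torch.cuda calls; the outputs below are what this card should return, not a verbatim paste):

```python
>>> torch.cuda.get_device_name(0)    # which GPU PyTorch actually sees
'NVIDIA GeForce RTX 3090'
>>> torch.cuda.is_bf16_supported()   # bfloat16 support (True on Ampere cards like the 3090)
True
```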

I'm getting the following error:

...

Process SpawnProcess-2:
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/root/anaconda3/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/chameleon/chameleon/inference/chameleon.py", line 509, in _worker_impl
for token in Generator(
File "/root/chameleon/chameleon/inference/chameleon.py", line 403, in next
piece = next(self.dyngen)
File "/root/chameleon/chameleon/inference/utils.py", line 20, in next
return next(self.gen)
File "/root/chameleon/chameleon/inference/chameleon.py", line 279, in next
tok = next(self.gen)
File "/root/anaconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/chameleon/chameleon/inference/generation.py", line 91, in next
next_tokens = self.token_selector(
File "/root/chameleon/chameleon/inference/token_selector.py", line 31, in call
return probs.multinomial(num_samples=1).squeeze(1)
File "/root/anaconda3/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in torch_function
return func(*args, **kwargs)
RuntimeError: probability tensor contains either inf, nan or element < 0
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
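
For reference, the failing call is the plain sampling step in token_selector.py (probs.multinomial(...)), and multinomial rejects probability tensors containing inf, nan or negative entries, so the values must be going non-finite somewhere upstream. A small, self-contained check that reproduces the condition and could be dropped in right before the multinomial call (the helper name is illustrative, not part of the repo):

```python
import torch

def report_bad_probs(probs: torch.Tensor) -> None:
    # Illustrative helper (not part of the repo): flags entries that
    # multinomial would reject -- non-finite or negative probabilities.
    bad = ~torch.isfinite(probs) | (probs < 0)
    if bad.any():
        print(f"{int(bad.sum())} bad entries, "
              f"min={probs.min().item()}, max={probs.max().item()}")

# Logits that already contain a NaN give NaN probabilities after softmax,
# which is exactly the state multinomial refuses to sample from.
logits = torch.tensor([[0.0, float("nan"), 1.0]])
probs = torch.softmax(logits, dim=-1)
report_bad_probs(probs)  # prints: 3 bad entries, min=nan, max=nan
```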

Can you share more information? Output of nvidia-smi, generation inputs, ...

inputs: [screenshot of generation inputs attached]

nvidia-smi

^C(base) root@loganrtx:~/chameleon# nvidia-smi
Thu Jun 20 13:24:15 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.00                 Driver Version: 560.38         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0  On |                  N/A |
|  0%   47C    P8             46W /  370W |    2106MiB /  24576MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        39      G   /Xwayland                                   N/A      |
+-----------------------------------------------------------------------------------------+

I'm not able to reproduce this. I'm assuming it may be running out of memory, since the model without activations or cache is using 2.1GB out of the 24.5GB on the RTX 3090.
@jacobkahn Any ideas?

Yeah, it looks like it's OOM.
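
If it is OOM, one quick way to confirm on the affected machine is to check free VRAM right before loading the checkpoint, e.g. a minimal sketch using torch.cuda.mem_get_info:

```python
import torch

# Free vs. total VRAM in bytes as PyTorch sees it. Running this right
# before loading the 7B checkpoint shows how much headroom is actually
# left once Xwayland / other host processes have taken their share.
free, total = torch.cuda.mem_get_info(0)
print(f"free:  {free / 2**30:.2f} GiB")
print(f"total: {total / 2**30:.2f} GiB")
```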