facebookresearch/chameleon

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

nagolinc opened this issue · 4 comments

When running the 7B model on WSL2 (Ubuntu):

root@loganrtx:~/chameleon# python -m chameleon.miniviewer

PyTorch is using CUDA 12.1:

>>> import torch
>>> torch.cuda.is_available()
True
>>> print(torch.version.cuda)
12.1
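
Two more checks from the same session that may be worth including with this kind of report (both standard torch.cuda calls; the outputs below are what this card should return, not a verbatim paste):

```python
>>> torch.cuda.get_device_name(0)    # which GPU PyTorch actually sees
'NVIDIA GeForce RTX 3090'
>>> torch.cuda.is_bf16_supported()   # bfloat16 support (True on Ampere cards like the 3090)
True
```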

I'm getting the following error:

...

Process SpawnProcess-2:
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/root/anaconda3/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/chameleon/chameleon/inference/chameleon.py", line 509, in _worker_impl
for token in Generator(
File "/root/chameleon/chameleon/inference/chameleon.py", line 403, in next
piece = next(self.dyngen)
File "/root/chameleon/chameleon/inference/utils.py", line 20, in next
return next(self.gen)
File "/root/chameleon/chameleon/inference/chameleon.py", line 279, in next
tok = next(self.gen)
File "/root/anaconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/chameleon/chameleon/inference/generation.py", line 91, in next
next_tokens = self.token_selector(
File "/root/chameleon/chameleon/inference/token_selector.py", line 31, in call
return probs.multinomial(num_samples=1).squeeze(1)
File "/root/anaconda3/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in torch_function
return func(*args, **kwargs)
RuntimeError: probability tensor contains either inf, nan or element < 0
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
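
For reference, the failing call is the plain sampling step in token_selector.py (probs.multinomial(...)), and multinomial rejects probability tensors containing inf, nan or negative entries, so the values must be going non-finite somewhere upstream. A small, self-contained check that reproduces the condition and could be dropped in right before the multinomial call (the helper name is illustrative, not part of the repo):

```python
import torch

def report_bad_probs(probs: torch.Tensor) -> None:
    # Illustrative helper (not part of the repo): flags entries that
    # multinomial would reject -- non-finite or negative probabilities.
    bad = ~torch.isfinite(probs) | (probs < 0)
    if bad.any():
        print(f"{int(bad.sum())} bad entries, "
              f"min={probs.min().item()}, max={probs.max().item()}")

# Logits that already contain a NaN give NaN probabilities after softmax,
# which is exactly the state multinomial refuses to sample from.
logits = torch.tensor([[0.0, float("nan"), 1.0]])
probs = torch.softmax(logits, dim=-1)
report_bad_probs(probs)  # prints: 3 bad entries, min=nan, max=nan
```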

Can you share more information? Output of nvidia-smi, generation inputs, ...

inputs: [screenshot of generation inputs attached]

nvidia-smi

^C(base) root@loganrtx:~/chameleon# nvidia-smi
Thu Jun 20 13:24:15 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.00                 Driver Version: 560.38         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0  On |                  N/A |
|  0%   47C    P8             46W /  370W |    2106MiB /  24576MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        39      G   /Xwayland                                   N/A      |
+-----------------------------------------------------------------------------------------+

I'm not able to reproduce this. I'm assuming it may be running out of memory, since the model without activations or cache is using 2.1GB out of the 24.5GB on the RTX 3090.
@jacobkahn Any ideas?

Yeah, it looks like it's OOM.
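
If it is OOM, one quick way to confirm on the affected machine is to check free VRAM right before loading the checkpoint, e.g. a minimal sketch using torch.cuda.mem_get_info:

```python
import torch

# Free vs. total VRAM in bytes as PyTorch sees it. Running this right
# before loading the 7B checkpoint shows how much headroom is actually
# left once Xwayland / other host processes have taken their share.
free, total = torch.cuda.mem_get_info(0)
print(f"free:  {free / 2**30:.2f} GiB")
print(f"total: {total / 2**30:.2f} GiB")
```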