shyamsn97/mario-gpt

RuntimeError: invalid multinomial distribution

TheFiZi opened this issue · 13 comments

Using the prompts: many pipes, many enemies, no blocks, low elevation

shape: torch.Size([1, 673]), torch.Size([1, 1304]) first: 56, last: 88:  93%|██████████████████████████████████████████████████████████████▎    | 1303/1400 [02:43<00:12,  7.97it/s]
Traceback (most recent call last):
  File "/home/me/apps/mariogpt/capturePlay.py", line 38, in <module>
    generated_level = mario_lm.sample(
  File "/home/me/apps/mariogpt/lib/python3.10/site-packages/mario_gpt/lm/gpt.py", line 54, in sample
    return sampler(
  File "/home/me/apps/mariogpt/lib/python3.10/site-packages/mario_gpt/sampler.py", line 248, in __call__
    return self.sample(*args, **kwargs)
  File "/home/me/apps/mariogpt/lib/python3.10/site-packages/mario_gpt/sampler.py", line 223, in sample
    next_tokens, encoder_hidden_states = self.step(
  File "/home/me/apps/mariogpt/lib/python3.10/site-packages/mario_gpt/sampler.py", line 172, in step
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)
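
For reference, here is a minimal standalone sketch (not the mario-gpt sampler itself) that trips the same error: torch.multinomial raises it whenever a row of the probability tensor sums to zero, which is presumably what the sampler's probs collapse to here.

import torch

# An all-zero probability row is enough to reproduce the message on CPU:
# "RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)"
probs = torch.zeros(1, 10)
torch.multinomial(probs, num_samples=1)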

Oh interesting, what temperature value are you using?

Defaults

generated_level = mario_lm.sample(
    prompts=prompts,
    num_steps=1400,
    temperature=2.0,
    use_tqdm=True
)

How frequently does this happen? I haven't really seen this, but it looks like some of the logit values are NaNs.
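
If it is NaNs, a wrapper like the sketch below (hypothetical, not the actual code in sampler.py; safe_multinomial is a made-up name) would either confirm it or avoid the crash by falling back to greedy decoding.

import torch

def safe_multinomial(probs: torch.Tensor, num_samples: int = 1) -> torch.Tensor:
    # Hypothetical guard, not part of mario-gpt: if the distribution contains
    # NaN/inf or sums to zero, fall back to greedy argmax instead of crashing.
    if not torch.isfinite(probs).all() or probs.sum() <= 0:
        return probs.nan_to_num(0.0).argmax(dim=-1, keepdim=True)
    return torch.multinomial(probs, num_samples=num_samples)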

This is the first time in ~30-40 runs. It could just be something I'm doing wrong, to be honest. I can let you know if I see it again. Is there anything more I could capture that would be helpful if it happens again?

Not sure actually haha, never really encountered this, especially with temperature 2.0. Maybe a torch update is needed? What version are you using right now?

I am using

Name: torch
Version: 1.13.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /home/sa_schewee/venv/thegoose/lib/python3.10/site-packages
Requires: nvidia-cublas-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cuda-runtime-cu11, nvidia-cudnn-cu11, typing-extensions
Required-by: mario-gpt

There does not appear to be a newer version available:

(thegoose) me@nightshade:~$ pip3 install --upgrade torch
Requirement already satisfied: torch in ./venv/thegoose/lib/python3.10/site-packages (1.13.1)
Requirement already satisfied: nvidia-cudnn-cu11==8.5.0.96 in ./venv/thegoose/lib/python3.10/site-packages (from torch) (8.5.0.96)
Requirement already satisfied: nvidia-cuda-nvrtc-cu11==11.7.99 in ./venv/thegoose/lib/python3.10/site-packages (from torch) (11.7.99)
Requirement already satisfied: typing-extensions in ./venv/thegoose/lib/python3.10/site-packages (from torch) (4.5.0)
Requirement already satisfied: nvidia-cuda-runtime-cu11==11.7.99 in ./venv/thegoose/lib/python3.10/site-packages (from torch) (11.7.99)
Requirement already satisfied: nvidia-cublas-cu11==11.10.3.66 in ./venv/thegoose/lib/python3.10/site-packages (from torch) (11.10.3.66)
Requirement already satisfied: setuptools in ./venv/thegoose/lib/python3.10/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch) (59.6.0)
Requirement already satisfied: wheel in ./venv/thegoose/lib/python3.10/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch) (0.38.4)
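
For completeness, the same information as torch itself reports it at runtime (standard attributes, nothing mario-gpt-specific):

import torch

print(torch.__version__)          # 1.13.1 here
print(torch.version.cuda)         # CUDA version this torch build was compiled against
print(torch.cuda.is_available())  # whether torch can actually see the GPU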

Yeah, I feel like this and #12 might be related somehow; it could be some weird CUDA issue. I'll look into reproducing this, but I hope this doesn't happen too frequently for you.

Sadly, 3 times in a row today. I haven't had a successful run yet today.

And does it only happen to you on GPU? Or is it both CPU and GPU?

No issues on CPU so far, and I have generated multiple images successfully. I wonder if it's a memory issue: my GPU only has 2GB of VRAM and my VM has 4GB of RAM.
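
For reference, this is roughly how I'm switching between CPU and GPU, following the .to(device) pattern from the README (the prompt string is the one from the top of this issue):

import torch
from mario_gpt import MarioLM

mario_lm = MarioLM()
# Runs on CPU by default; move to CUDA only when testing the GPU path.
device = torch.device("cpu")  # or torch.device("cuda")
mario_lm = mario_lm.to(device)

generated_level = mario_lm.sample(
    prompts=["many pipes, many enemies, no blocks, low elevation"],
    num_steps=1400,
    temperature=2.0,
    use_tqdm=True,
)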

I have access to a 3080. Will do some tests with that and see if I can replicate the problem.

I'm testing with my laptop right now, which has a Quadro (4GB), and it seems to be running fine. Quite strange haha

I am going to close this off as an out-of-memory issue. I ran the default generation example and it peaked at ~6GB of VRAM.

The Quadro I was running it on only has 2GB.
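
For anyone who wants to check their own headroom, a rough way to measure that peak (this only counts tensors torch allocates, so the true footprint including the CUDA context is somewhat higher):

import torch

torch.cuda.reset_peak_memory_stats()
# ... run mario_lm.sample(...) here ...
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated VRAM: {peak_gib:.2f} GiB")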