openai/glide-text2im

CUDA out of memory

loretoparisi opened this issue · 8 comments

I get an OOM error when loading the upsample model:

options_up = model_and_diffusion_defaults_upsampler()
options_up['use_fp16'] = has_cuda
options_up['timestep_respacing'] = 'fast27' # use 27 diffusion steps for very fast sampling
model_up, diffusion_up = create_model_and_diffusion(**options_up)
model_up.eval()
if has_cuda:
    model_up.convert_to_fp16()
model_up.to(device)
model_up.load_state_dict(load_checkpoint('upsample', device))
print('total upsampler parameters', sum(x.numel() for x in model_up.parameters()))

The allocation error was:

RuntimeError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 3.94 GiB total capacity; 3.00 GiB already allocated; 30.94 MiB free; 3.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
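
The message also points at the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch for setting max_split_size_mb from Python (128 is just an illustrative value, and this only helps when fragmentation rather than total usage is the problem):

import os

# Must be set before CUDA is initialized, so do it before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch as th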

My nvidia-smi output is:

loreto@ombromanto:~/Projects/glide-text2im$ nvidia-smi
Wed Dec 22 20:39:15 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:01:00.0  On |                  N/A |
| 45%   23C    P5    N/A /  75W |   3994MiB /  4033MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1094      G   /usr/lib/xorg/Xorg                121MiB |
|    0   N/A  N/A      1926      G   /usr/bin/gnome-shell               26MiB |
|    0   N/A  N/A      3532      G   ...AAAAAAAA== --shared-files       22MiB |
|    0   N/A  N/A      4795      C   /usr/bin/python                  3819MiB |
+-----------------------------------------------------------------------------+

I haven't tested this code with less than 16GB of GPU memory, but this is a bit surprising, since each model is roughly 400M parameters and therefore only around 800MB of memory at fp16.

One suggestion: try loading the checkpoint on CPU, and then moving to GPU, like so:

options_up = model_and_diffusion_defaults_upsampler()
options_up['use_fp16'] = has_cuda
options_up['timestep_respacing'] = 'fast27' # use 27 diffusion steps for very fast sampling
model_up, diffusion_up = create_model_and_diffusion(**options_up)
model_up.load_state_dict(load_checkpoint('upsample', th.device('cpu')))
model_up.eval()
if has_cuda:
    model_up.convert_to_fp16()
model_up.to(device)
print('total upsampler parameters', sum(x.numel() for x in model_up.parameters()))

Amazing, thanks! In fact I read total upsampler parameters 398361286.
Using the CPU trick it worked with the 4GB GTX 1050. It also took only a few seconds to generate this curious dog:
[image: dog]

Maybe this approach could be a guideline in the docs...

I still get the following error even when trying the above code:
RuntimeError: CUDA out of memory. Tried to allocate 5.27 GiB (GPU 0; 11.77 GiB total capacity; 6.51 GiB already allocated; 1.50 GiB free; 7.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

It looks like you have a 12 GB GPU, while the OP managed to make it work on a 4 GB GPU.

Try clearing the memory, e.g. by restarting the Colab session if you use Google Colab.
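
For a local run, the rough in-code equivalent is to drop references to the models and empty PyTorch's cache. A minimal sketch with standard PyTorch calls (model and model_up refer to the variables in the snippets above):

import gc
import torch as th

# Drop references to models you no longer need, let Python collect them, then
# return the cached blocks held by PyTorch's allocator back to the driver.
del model, model_up
gc.collect()
th.cuda.empty_cache()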

Hi @woctezuma. Thank you for your suggestion. I am running the code locally, and I am not sure how to clear the memory in the code.
Also, one thing to note: I am using the inpainting script:

from typing import Tuple
from PIL import Image
from datetime import datetime

import numpy as np
import torch as th
import torch.nn.functional as F

from glide_text2im.download import load_checkpoint
from glide_text2im.model_creation import (
    create_model_and_diffusion,
    model_and_diffusion_defaults,
    model_and_diffusion_defaults_upsampler
)


def read_image(path: str, size: int = 256) -> th.Tensor:
    pil_img = Image.open(path).convert('RGB')
    pil_img = pil_img.resize((size, size), resample=Image.BICUBIC)
    img = np.array(pil_img)
    return th.from_numpy(img)[None].permute(0, 3, 1, 2).float() / 127.5 - 1


#########################################
# Sampling parameters
prompt = "a evil cyborg"
batch_size = 16
guidance_scale = 5.0

# Tune this parameter to control the sharpness of 256x256 images.
# A value of 1.0 is sharper, but sometimes results in grainy artifacts.
upsample_temp = 0.997

# Source image we are inpainting
source_image_256 = read_image('notebooks/img_219.png', size=256)
source_image_64 = read_image('notebooks/img_219.png', size=64)

# The mask should always be a boolean 64x64 mask, and then we
# can upsample it for the second stage.
source_mask_64 = th.ones_like(source_image_64)[:, :1]
source_mask_64[:, :, 20:] = 0
source_mask_256 = F.interpolate(source_mask_64, (256, 256), mode='nearest')
#########################################


#########################################
# This notebook supports both CPU and GPU.
# On CPU, generating one sample may take on the order of 20 minutes.
# On a GPU, it should be under a minute.
has_cuda = th.cuda.is_available()
device = th.device('cpu' if not has_cuda else 'cuda')

# Make a filename
xprompt = prompt.replace(" ", "_") + "-gs_" + str(guidance_scale)

# Create base model.
options = model_and_diffusion_defaults()
options['inpaint'] = True
options['use_fp16'] = has_cuda
options['timestep_respacing'] = '100'  # use 100 diffusion steps for fast sampling
model, diffusion = create_model_and_diffusion(**options)
model.eval()
if has_cuda:
    model.convert_to_fp16()
model.to(device)
model.load_state_dict(load_checkpoint('base-inpaint', device))
print('total base parameters', sum(x.numel() for x in model.parameters()))

# Create upsampler model.
options_up = model_and_diffusion_defaults_upsampler()
options_up['inpaint'] = True
options_up['use_fp16'] = has_cuda
options_up['timestep_respacing'] = 'fast27'  # use 27 diffusion steps for very fast sampling
model_up, diffusion_up = create_model_and_diffusion(**options_up)
model_up.eval()
if has_cuda:
    model_up.convert_to_fp16()
model_up.to(device)
model_up.load_state_dict(load_checkpoint('upsample-inpaint', device))
print('total upsampler parameters', sum(x.numel() for x in model_up.parameters()))


def save_images(batch: th.Tensor):
    """ Save images """
    scaled = ((batch + 1) * 127.5).round().clamp(0, 255).to(th.uint8).cpu()
    reshaped = scaled.permute(2, 0, 3, 1).reshape([batch.shape[2], -1, 3])

    # Save strip
    stamp = datetime.today().strftime('%H%M%S')
    Image.fromarray(reshaped.numpy()).save(f'output-{stamp}.png')

    # Save individual
    for i in range(batch.shape[0]):
        test_single = scaled.select(0, i)
        test_reshape = test_single.permute(1, 2, 0).reshape([batch.shape[2], -1, 3])
        Image.fromarray(test_reshape.numpy()).save(f'{xprompt}-{i}-{stamp}.png')


# Visualise the image we are inpainting - if you want to, uncomment
# save_images(source_image_256 * source_mask_256)


##############################
# Sample from the base model #
##############################

# Create the text tokens to feed to the model.
tokens = model.tokenizer.encode(prompt)
tokens, mask = model.tokenizer.padded_tokens_and_mask(
    tokens, options['text_ctx']
)

# Create the classifier-free guidance tokens (empty)
full_batch_size = batch_size * 2
uncond_tokens, uncond_mask = model.tokenizer.padded_tokens_and_mask(
    [], options['text_ctx']
)

# Pack the tokens together into model kwargs.
model_kwargs = dict(
    tokens=th.tensor(
        [tokens] * batch_size + [uncond_tokens] * batch_size, device=device
    ),
    mask=th.tensor(
        [mask] * batch_size + [uncond_mask] * batch_size,
        dtype=th.bool,
        device=device,
    ),

    # Masked inpainting image
    inpaint_image=(source_image_64 * source_mask_64).repeat(full_batch_size, 1, 1, 1).to(device),
    inpaint_mask=source_mask_64.repeat(full_batch_size, 1, 1, 1).to(device),
)


# Create a classifier-free guidance sampling function
def model_fn(x_t, ts, **kwargs):
    half = x_t[: len(x_t) // 2]
    combined = th.cat([half, half], dim=0)
    model_out = model(combined, ts, **kwargs)
    eps, rest = model_out[:, :3], model_out[:, 3:]
    cond_eps, uncond_eps = th.split(eps, len(eps) // 2, dim=0)
    half_eps = uncond_eps + guidance_scale * (cond_eps - uncond_eps)
    eps = th.cat([half_eps, half_eps], dim=0)
    return th.cat([eps, rest], dim=1)


def denoised_fn(x_start):
    # Force the model to have the exact right x_start predictions
    # for the part of the image which is known.
    return (
            x_start * (1 - model_kwargs['inpaint_mask'])
            + model_kwargs['inpaint_image'] * model_kwargs['inpaint_mask']
    )


# Sample from the base model.
model.del_cache()
samples = diffusion.p_sample_loop(
    model_fn,
    (full_batch_size, 3, options["image_size"], options["image_size"]),
    device=device,
    clip_denoised=True,
    progress=True,
    model_kwargs=model_kwargs,
    cond_fn=None,
    denoised_fn=denoised_fn,
)[:batch_size]
model.del_cache()

# 64x64 output - not worth saving, but uncomment if you like!
# save_images(samples)

##############################
# Upsample the 64x64 samples #
##############################

tokens = model_up.tokenizer.encode(prompt)
tokens, mask = model_up.tokenizer.padded_tokens_and_mask(
    tokens, options_up['text_ctx']
)

# Create the model conditioning dict.
model_kwargs = dict(
    # Low-res image to upsample.
    low_res=((samples + 1) * 127.5).round() / 127.5 - 1,

    # Text tokens
    tokens=th.tensor(
        [tokens] * batch_size, device=device
    ),
    mask=th.tensor(
        [mask] * batch_size,
        dtype=th.bool,
        device=device,
    ),

    # Masked inpainting image.
    inpaint_image=(source_image_256 * source_mask_256).repeat(batch_size, 1, 1, 1).to(device),
    inpaint_mask=source_mask_256.repeat(batch_size, 1, 1, 1).to(device),
)


def denoised_fn(x_start):
    # Force the model to have the exact right x_start predictions
    # for the part of the image which is known.
    return (
            x_start * (1 - model_kwargs['inpaint_mask'])
            + model_kwargs['inpaint_image'] * model_kwargs['inpaint_mask']
    )


# Sample from the upsampler model.
model_up.del_cache()
up_shape = (batch_size, 3, options_up["image_size"], options_up["image_size"])
up_samples = diffusion_up.p_sample_loop(
    model_up,
    up_shape,
    noise=th.randn(up_shape, device=device) * upsample_temp,
    device=device,
    clip_denoised=True,
    progress=True,
    model_kwargs=model_kwargs,
    cond_fn=None,
    denoised_fn=denoised_fn,
)[:batch_size]
model_up.del_cache()

# Show the output
save_images(up_samples)


Setting the batch size to anything greater than 16 results in an OOM error.
Here is my nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:43:00.0  On |                  N/A |
|  0%   44C    P0    34W / 170W |   1100MiB / 12288MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3555      G   /usr/lib/xorg/Xorg                643MiB |
|    0   N/A  N/A      4427      G   /usr/bin/gnome-shell              185MiB |
|    0   N/A  N/A      9394      G   ...275038181146881444,131072       23MiB |
|    0   N/A  N/A     14262      G   telegram-desktop                    2MiB |
|    0   N/A  N/A     14661      G   /usr/lib/firefox/firefox          204MiB |
|    0   N/A  N/A     16530      G   gnome-control-center                2MiB |
|    0   N/A  N/A     16806      G   .../debug.log --shared-files       21MiB |
|    0   N/A  N/A     24302      G   ..._24179.log --shared-files       13MiB |
+-----------------------------------------------------------------------------+

I ran into the same issue on a 4GB GPU. Oddly enough, loading the upsample model before the base model worked for me with no other changes.

@kgullion Thanks, I will try your suggestion!

@kgullion thanks for your suggestion. But I have tried running the upsample model before the base model and I still get the following error:

RuntimeError: CUDA out of memory. Tried to allocate 1.32 GiB (GPU 0; 11.77 GiB total capacity; 4.32 GiB already allocated; 540.94 MiB free; 5.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Note that this error only occurs if I run the code with a higher batch size, e.g. > 50.
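
If the failure only appears at large batch sizes, one workaround is to split the upsampling pass into smaller chunks and stitch the results together on the CPU. A minimal, untested sketch against the inpainting script above (chunk_size is a hypothetical value to tune for your GPU):

chunk_size = 8  # hypothetical value; lower it until one chunk fits in memory
all_up_samples = []
for start in range(0, batch_size, chunk_size):
    n = min(chunk_size, batch_size - start)

    # Slice every per-sample entry of the conditioning dict for this chunk.
    chunk_kwargs = {k: v[start:start + n] for k, v in model_kwargs.items()}

    def chunk_denoised_fn(x_start):
        # Same known-region constraint as denoised_fn, applied to the chunk.
        return (
            x_start * (1 - chunk_kwargs['inpaint_mask'])
            + chunk_kwargs['inpaint_image'] * chunk_kwargs['inpaint_mask']
        )

    model_up.del_cache()
    chunk_shape = (n, 3, options_up["image_size"], options_up["image_size"])
    chunk = diffusion_up.p_sample_loop(
        model_up,
        chunk_shape,
        noise=th.randn(chunk_shape, device=device) * upsample_temp,
        device=device,
        clip_denoised=True,
        progress=True,
        model_kwargs=chunk_kwargs,
        cond_fn=None,
        denoised_fn=chunk_denoised_fn,
    )
    model_up.del_cache()

    # Move finished chunks off the GPU right away.
    all_up_samples.append(chunk.cpu())

up_samples = th.cat(all_up_samples, dim=0)
save_images(up_samples)

The same slicing idea applies to the base-model pass (where the effective batch is full_batch_size) if that is where the allocation fails.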