CUDA out of memory
loretoparisi opened this issue · 8 comments
I get an OOM when loading the upsample model:
options_up = model_and_diffusion_defaults_upsampler()
options_up['use_fp16'] = has_cuda
options_up['timestep_respacing'] = 'fast27' # use 27 diffusion steps for very fast sampling
model_up, diffusion_up = create_model_and_diffusion(**options_up)
model_up.eval()
if has_cuda:
    model_up.convert_to_fp16()
model_up.to(device)
model_up.load_state_dict(load_checkpoint('upsample', device))
print('total upsampler parameters', sum(x.numel() for x in model_up.parameters()))
The allocation error was:
RuntimeError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 3.94 GiB total capacity; 3.00 GiB already allocated; 30.94 MiB free; 3.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
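The error message itself suggests tuning max_split_size_mb; a minimal, untested way to try that is to set PYTORCH_CUDA_ALLOC_CONF before the first CUDA allocation (the value 128 below is only an illustrative guess):
import os
# Must be set before the first CUDA tensor is allocated; 128 MiB is only an illustrative split size.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'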
My nvidia-smi output is:
loreto@ombromanto:~/Projects/glide-text2im$ nvidia-smi
Wed Dec 22 20:39:15 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 105... Off | 00000000:01:00.0 On | N/A |
| 45% 23C P5 N/A / 75W | 3994MiB / 4033MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1094 G /usr/lib/xorg/Xorg 121MiB |
| 0 N/A N/A 1926 G /usr/bin/gnome-shell 26MiB |
| 0 N/A N/A 3532 G ...AAAAAAAA== --shared-files 22MiB |
| 0 N/A N/A 4795 C /usr/bin/python 3819MiB |
+-----------------------------------------------------------------------------+
I haven't tested this code with less than 16GB of GPU memory, but this is a bit surprising since each model is roughly 400M parameters, which at 2 bytes per parameter in fp16 is only around 800MB of memory.
One suggestion: try loading the checkpoint on CPU, and then moving to GPU, like so:
options_up = model_and_diffusion_defaults_upsampler()
options_up['use_fp16'] = has_cuda
options_up['timestep_respacing'] = 'fast27' # use 27 diffusion steps for very fast sampling
model_up, diffusion_up = create_model_and_diffusion(**options_up)
model_up.load_state_dict(load_checkpoint('upsample', th.device('cpu')))
model_up.eval()
if has_cuda:
    model_up.convert_to_fp16()
model_up.to(device)
print('total upsampler parameters', sum(x.numel() for x in model_up.parameters()))
Amazing, thanks! In fact I read total upsampler parameters 398361286.
Using the CPU trick, it worked on the 4GB GTX 1050. It also took only a few seconds to generate this curious dog.
Maybe this approach could be a guideline in the docs...
I still get the following error, even when trying the above code:
RuntimeError: CUDA out of memory. Tried to allocate 5.27 GiB (GPU 0; 11.77 GiB total capacity; 6.51 GiB already allocated; 1.50 GiB free; 7.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
It looks like you have a 12-GB-VRAM GPU, while the OP managed to make it work with a 4-GB-VRAM GPU.
Try clearing the memory, e.g. by restarting the Colab session if you use Google Colab.
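If you want to free GPU memory from code rather than restarting the session, a rough sketch (assuming the model variables from the notebook above) is:
import gc
import torch as th

del model, diffusion   # drop references to objects you no longer need (notebook variables)
gc.collect()           # let Python reclaim the freed objects
th.cuda.empty_cache()  # release cached allocator blocks back to the driver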
Hi @woctezuma. Thank you for your suggestion. I am running the code locally, and I am not sure how to clear the memory from within the code.
Also, one thing to note: I am using the 'Inpaint' script:
from typing import Tuple
from PIL import Image
from datetime import datetime
import numpy as np
import torch as th
import torch.nn.functional as F
from glide_text2im.download import load_checkpoint
from glide_text2im.model_creation import (
    create_model_and_diffusion,
    model_and_diffusion_defaults,
    model_and_diffusion_defaults_upsampler
)
def read_image(path: str, size: int = 256) -> Tuple[th.Tensor, th.Tensor]:
    pil_img = Image.open(path).convert('RGB')
    pil_img = pil_img.resize((size, size), resample=Image.BICUBIC)
    img = np.array(pil_img)
    return th.from_numpy(img)[None].permute(0, 3, 1, 2).float() / 127.5 - 1
#########################################
# Sampling parameters
prompt = "a evil cyborg"
batch_size = 16
guidance_scale = 5.0
# Tune this parameter to control the sharpness of 256x256 images.
# A value of 1.0 is sharper, but sometimes results in grainy artifacts.
upsample_temp = 0.997
# Source image we are inpainting
source_image_256 = read_image('notebooks/img_219.png', size=256)
source_image_64 = read_image('notebooks/img_219.png', size=64)
# The mask should always be a boolean 64x64 mask, and then we
# can upsample it for the second stage.
source_mask_64 = th.ones_like(source_image_64)[:, :1]
source_mask_64[:, :, 20:] = 0
source_mask_256 = F.interpolate(source_mask_64, (256, 256), mode='nearest')
#########################################
#########################################
# This notebook supports both CPU and GPU.
# On CPU, generating one sample may take on the order of 20 minutes.
# On a GPU, it should be under a minute.
has_cuda = th.cuda.is_available()
device = th.device('cpu' if not has_cuda else 'cuda')
# Make a filename
xprompt = prompt.replace(" ", "_")[:] + "-gs_" + str(guidance_scale)
# Create base model.
options = model_and_diffusion_defaults()
options['inpaint'] = True
options['use_fp16'] = has_cuda
options['timestep_respacing'] = '100' # use 100 diffusion steps for fast sampling
model, diffusion = create_model_and_diffusion(**options)
model.eval()
if has_cuda:
    model.convert_to_fp16()
model.to(device)
model.load_state_dict(load_checkpoint('base-inpaint', device))
print('total base parameters', sum(x.numel() for x in model.parameters()))
# Create upsampler model.
options_up = model_and_diffusion_defaults_upsampler()
options_up['inpaint'] = True
options_up['use_fp16'] = has_cuda
options_up['timestep_respacing'] = 'fast27' # use 27 diffusion steps for very fast sampling
model_up, diffusion_up = create_model_and_diffusion(**options_up)
model_up.eval()
if has_cuda:
    model_up.convert_to_fp16()
model_up.to(device)
model_up.load_state_dict(load_checkpoint('upsample-inpaint', device))
print('total upsampler parameters', sum(x.numel() for x in model_up.parameters()))
def save_images(batch: th.Tensor):
""" Save images """
scaled = ((batch + 1) * 127.5).round().clamp(0, 255).to(th.uint8).cpu()
reshaped = scaled.permute(2, 0, 3, 1).reshape([batch.shape[2], -1, 3])
# Save strip
stamp = datetime.today().strftime('%H%M%S')
Image.fromarray(reshaped.numpy()).save(f'output-{stamp}.png')
# Save individual
for _ in range(0, batch.shape[0]):
test_single = scaled.select(0, _)
test_reshape = test_single.permute(1, 2, 0).reshape([batch.shape[2], -1, 3])
Image.fromarray(test_reshape.numpy()).save(f'{xprompt}-{_}-{stamp}.png')
# Visualise the image we are inpainting - if you want to, uncomment
# save_images(source_image_256 * source_mask_256)
##############################
# Sample from the base model #
##############################
# Create the text tokens to feed to the model.
tokens = model.tokenizer.encode(prompt)
tokens, mask = model.tokenizer.padded_tokens_and_mask(
    tokens, options['text_ctx']
)
# Create the classifier-free guidance tokens (empty)
full_batch_size = batch_size * 2
uncond_tokens, uncond_mask = model.tokenizer.padded_tokens_and_mask(
    [], options['text_ctx']
)
# Pack the tokens together into model kwargs.
model_kwargs = dict(
    tokens=th.tensor(
        [tokens] * batch_size + [uncond_tokens] * batch_size, device=device
    ),
    mask=th.tensor(
        [mask] * batch_size + [uncond_mask] * batch_size,
        dtype=th.bool,
        device=device,
    ),
    # Masked inpainting image
    inpaint_image=(source_image_64 * source_mask_64).repeat(full_batch_size, 1, 1, 1).to(device),
    inpaint_mask=source_mask_64.repeat(full_batch_size, 1, 1, 1).to(device),
)
# Create a classifier-free guidance sampling function
def model_fn(x_t, ts, **kwargs):
    half = x_t[: len(x_t) // 2]
    combined = th.cat([half, half], dim=0)
    model_out = model(combined, ts, **kwargs)
    eps, rest = model_out[:, :3], model_out[:, 3:]
    cond_eps, uncond_eps = th.split(eps, len(eps) // 2, dim=0)
    half_eps = uncond_eps + guidance_scale * (cond_eps - uncond_eps)
    eps = th.cat([half_eps, half_eps], dim=0)
    return th.cat([eps, rest], dim=1)
def denoised_fn(x_start):
    # Force the model to have the exact right x_start predictions
    # for the part of the image which is known.
    return (
        x_start * (1 - model_kwargs['inpaint_mask'])
        + model_kwargs['inpaint_image'] * model_kwargs['inpaint_mask']
    )
# Sample from the base model.
model.del_cache()
samples = diffusion.p_sample_loop(
    model_fn,
    (full_batch_size, 3, options["image_size"], options["image_size"]),
    device=device,
    clip_denoised=True,
    progress=True,
    model_kwargs=model_kwargs,
    cond_fn=None,
    denoised_fn=denoised_fn,
)[:batch_size]
model.del_cache()
# 64x64 output - not worth saving, but uncomment if you like!
# save_images(samples)
##############################
# Upsample the 64x64 samples #
##############################
tokens = model_up.tokenizer.encode(prompt)
tokens, mask = model_up.tokenizer.padded_tokens_and_mask(
    tokens, options_up['text_ctx']
)
# Create the model conditioning dict.
model_kwargs = dict(
    # Low-res image to upsample.
    low_res=((samples + 1) * 127.5).round() / 127.5 - 1,
    # Text tokens
    tokens=th.tensor(
        [tokens] * batch_size, device=device
    ),
    mask=th.tensor(
        [mask] * batch_size,
        dtype=th.bool,
        device=device,
    ),
    # Masked inpainting image.
    inpaint_image=(source_image_256 * source_mask_256).repeat(batch_size, 1, 1, 1).to(device),
    inpaint_mask=source_mask_256.repeat(batch_size, 1, 1, 1).to(device),
)
def denoised_fn(x_start):
    # Force the model to have the exact right x_start predictions
    # for the part of the image which is known.
    return (
        x_start * (1 - model_kwargs['inpaint_mask'])
        + model_kwargs['inpaint_image'] * model_kwargs['inpaint_mask']
    )
# Sample from the upsampler model.
model_up.del_cache()
up_shape = (batch_size, 3, options_up["image_size"], options_up["image_size"])
up_samples = diffusion_up.p_sample_loop(
    model_up,
    up_shape,
    noise=th.randn(up_shape, device=device) * upsample_temp,
    device=device,
    clip_denoised=True,
    progress=True,
    model_kwargs=model_kwargs,
    cond_fn=None,
    denoised_fn=denoised_fn,
)[:batch_size]
model_up.del_cache()
# Show the output
save_images(up_samples)
Setting the batch size to anything greater than 16 results in an OOM error.
Here is my nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:43:00.0 On | N/A |
| 0% 44C P0 34W / 170W | 1100MiB / 12288MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3555 G /usr/lib/xorg/Xorg 643MiB |
| 0 N/A N/A 4427 G /usr/bin/gnome-shell 185MiB |
| 0 N/A N/A 9394 G ...275038181146881444,131072 23MiB |
| 0 N/A N/A 14262 G telegram-desktop 2MiB |
| 0 N/A N/A 14661 G /usr/lib/firefox/firefox 204MiB |
| 0 N/A N/A 16530 G gnome-control-center 2MiB |
| 0 N/A N/A 16806 G .../debug.log --shared-files 21MiB |
| 0 N/A N/A 24302 G ..._24179.log --shared-files 13MiB |
+-----------------------------------------------------------------------------+
I ran into the same issue on a 4GB GPU. Oddly enough, loading the upsample model before the base model worked for me with no other changes.
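For reference, that reordering looks roughly like the sketch below, pieced together from the notebook code above and the CPU-loading trick earlier in this thread (untested here; 'upsample' and 'base' are the standard non-inpaint checkpoint names):
# Create and load the upsampler first...
options_up = model_and_diffusion_defaults_upsampler()
options_up['use_fp16'] = has_cuda
options_up['timestep_respacing'] = 'fast27'
model_up, diffusion_up = create_model_and_diffusion(**options_up)
model_up.load_state_dict(load_checkpoint('upsample', th.device('cpu')))
model_up.eval()
if has_cuda:
    model_up.convert_to_fp16()
model_up.to(device)

# ...then create and load the base model.
options = model_and_diffusion_defaults()
options['use_fp16'] = has_cuda
options['timestep_respacing'] = '100'
model, diffusion = create_model_and_diffusion(**options)
model.load_state_dict(load_checkpoint('base', th.device('cpu')))
model.eval()
if has_cuda:
    model.convert_to_fp16()
model.to(device)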
@kgullion Thanks, I will try your suggestion!
@kgullion thanks for your suggestion. But I have tried running the upsample model before the base model, and I still get the following error:
RuntimeError: CUDA out of memory. Tried to allocate 1.32 GiB (GPU 0; 11.77 GiB total capacity; 4.32 GiB already allocated; 540.94 MiB free; 5.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Note that this error only occurs if I run the code with a higher batch size, e.g. > 50.