[Bug]: Immediately Out of Memory
HeinrichAD opened this issue · 1 comment
Module
Layers
Contact Details
No response
Current Behavior
I saw in your paper that deel-lip also supports models which are larger than tiny toy models. Hence, I tried to build the VGG16 network architecture with deel-torchlip layers and train it on the CIFAR-10 dataset.
Unfortunately, I always get an out-of-memory exception while training. It already happens in the first batch of the first epoch, while calling `outputs = model(inputs)`, i.e. even before the first loss function call. I also noticed that training works fine with the vanilla version of this model (but the same loss function etc.).
Expected Behavior
Maybe the VGG16 network is too large for deel-lip, but I at least expected to get further than the first batch of the first epoch.
I would appreciate it if somebody could offer a hint in the right direction, or maybe suggest another well-known network architecture like VGG16 that would work with deel-torchlip and the CIFAR-10 dataset.
Version
v0.1.0
Environment
- OS: Linux arch 5.18.9-arch1-1
- Python version: 3.7
- PyTorch version: 1.11.0+cu102
- Cuda version: 10.2
- Packages used version: deel-torchlip torch torchvision tqdm
Relevant log output
Sequential model contains a layer which is not a Lipschitz layer: ReLU(inplace=True)
Sequential model contains a layer which is not a Lipschitz layer: ReLU(inplace=True)
Sequential model contains a layer which is not a Lipschitz layer: MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
Sequential model contains a layer which is not a Lipschitz layer: ReLU(inplace=True)
Sequential model contains a layer which is not a Lipschitz layer: ReLU(inplace=True)
Sequential model contains a layer which is not a Lipschitz layer: MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
Sequential model contains a layer which is not a Lipschitz layer: ReLU(inplace=True)
Sequential model contains a layer which is not a Lipschitz layer: ReLU(inplace=True)
Sequential model contains a layer which is not a Lipschitz layer: ReLU(inplace=True)
Sequential model contains a layer which is not a Lipschitz layer: MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
Sequential model contains a layer which is not a Lipschitz layer: ReLU(inplace=True)
Sequential model contains a layer which is not a Lipschitz layer: ReLU(inplace=True)
Sequential model contains a layer which is not a Lipschitz layer: ReLU(inplace=True)
Sequential model contains a layer which is not a Lipschitz layer: MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
Sequential model contains a layer which is not a Lipschitz layer: ReLU(inplace=True)
Sequential model contains a layer which is not a Lipschitz layer: ReLU(inplace=True)
Sequential model contains a layer which is not a Lipschitz layer: ReLU(inplace=True)
Sequential model contains a layer which is not a Lipschitz layer: MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
Sequential model contains a layer which is not a Lipschitz layer: ReLU(inplace=True)
Sequential model contains a layer which is not a Lipschitz layer: Dropout(p=0, inplace=False)
Sequential model contains a layer which is not a Lipschitz layer: ReLU(inplace=True)
Sequential model contains a layer which is not a Lipschitz layer: Dropout(p=0, inplace=False)
Sequential model contains a layer which is not a Lipschitz layer: Flatten(start_dim=1, end_dim=-1)
Files already downloaded and verified
Epochs: 0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
File "./test_lib_vgg16.py", line 109, in <module>
outputs = model(inputs)
File "/home/user/projects/Test/.venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/projects/Test/.venv/lib/python3.7/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/home/user/projects/Test/.venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/projects/Test/.venv/lib/python3.7/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/home/user/projects/Test/.venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1117, in _call_impl
result = hook(self, input)
File "/home/user/projects/Test/.venv/lib/python3.7/site-packages/deel/torchlip/utils/hook_norm.py", line 103, in __call__
setattr(module, self.name, self.compute_weight(module, inputs))
File "/home/user/projects/Test/.venv/lib/python3.7/site-packages/deel/torchlip/utils/bjorck_norm.py", line 49, in compute_weight
return bjorck_normalization(self.weight(module), self.n_iterations)
File "/home/user/projects/Test/.venv/lib/python3.7/site-packages/deel/torchlip/normalizers.py", line 75, in bjorck_normalization
w_mat, torch.mm(w_mat.t(), w_mat)
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 11.91 GiB total capacity; 11.04 GiB already allocated; 41.00 MiB free; 11.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
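For what it's worth, the 64 MiB that fails to allocate matches the size of the Gram matrix torch.mm(w_mat.t(), w_mat) would produce for the square 4096 x 4096 SpectralLinear weight in the classifier below. A rough back-of-the-envelope check (float32 assumed, purely illustrative):

# Rough size of the intermediate w.t() @ w that bjorck_normalization builds for
# the SpectralLinear(4096, 4096) weight (float32 assumed; purely illustrative).
features = 4096
bytes_per_float32 = 4
gram_bytes = features * features * bytes_per_float32  # the 4096 x 4096 product
print(gram_bytes / 2**20)  # 64.0 -> matches "Tried to allocate 64.00 MiB"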
To Reproduce
#!/usr/bin/env python3
from collections import OrderedDict
from deel.torchlip import (
    HKRMulticlassLoss,
    ScaledAdaptiveAvgPool2d,
    Sequential,
    SpectralConv2d,
    SpectralLinear,
)
import torch
from torch.nn import Dropout, Flatten, MaxPool2d, ReLU
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from tqdm.auto import tqdm
# config
seed = 42
batch_size = 1024
epochs = 2
learning_rate = 1e-3
lip_alpha = 200
lip_min_margin = 0.125
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# determinism
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
# model
num_classes = 10 # CIFAR10 has 10 classes
dropout_p = 0  # deactivate dropout since it breaks the Lipschitz property
model = Sequential(OrderedDict([
    ("features", Sequential(
        SpectralConv2d(3, 64, kernel_size=3, padding=1),
        ReLU(True),
        SpectralConv2d(64, 64, kernel_size=3, padding=1),
        ReLU(True),
        MaxPool2d(kernel_size=2, stride=2),
        SpectralConv2d(64, 128, kernel_size=3, padding=1),
        ReLU(True),
        SpectralConv2d(128, 128, kernel_size=3, padding=1),
        ReLU(True),
        MaxPool2d(kernel_size=2, stride=2),
        SpectralConv2d(128, 256, kernel_size=3, padding=1),
        ReLU(True),
        SpectralConv2d(256, 256, kernel_size=3, padding=1),
        ReLU(True),
        SpectralConv2d(256, 256, kernel_size=3, padding=1),
        ReLU(True),
        MaxPool2d(kernel_size=2, stride=2),
        SpectralConv2d(256, 512, kernel_size=3, padding=1),
        ReLU(True),
        SpectralConv2d(512, 512, kernel_size=3, padding=1),
        ReLU(True),
        SpectralConv2d(512, 512, kernel_size=3, padding=1),
        ReLU(True),
        MaxPool2d(kernel_size=2, stride=2),
        SpectralConv2d(512, 512, kernel_size=3, padding=1),
        ReLU(True),
        SpectralConv2d(512, 512, kernel_size=3, padding=1),
        ReLU(True),
        SpectralConv2d(512, 512, kernel_size=3, padding=1),
        ReLU(True),
        MaxPool2d(kernel_size=2, stride=2),
    )),
    ("avgpool", ScaledAdaptiveAvgPool2d((1, 1))),  # CIFAR-10: (7, 7) becomes (1, 1)
    ("flatten", Flatten()),
    ("classifier", Sequential(
        SpectralLinear(512 * 1 * 1, 4096),  # CIFAR-10: 512*7*7 becomes 512*1*1
        ReLU(True),
        Dropout(p=dropout_p),
        SpectralLinear(4096, 4096),
        ReLU(True),
        Dropout(p=dropout_p),
        SpectralLinear(4096, num_classes),
    )),
]))
# model = model.vanilla_export()  # vanilla model training has no OOM issues
model.to(device)
# data
trainset = datasets.CIFAR10("data/raw", train=True, download=True, transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
]))
trainloader = DataLoader(
    trainset,
    batch_size=batch_size,
    generator=None if seed is None else torch.Generator().manual_seed(seed),
    shuffle=True,
    pin_memory=True,
)
labels = ["plane", "car", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"]
# train
optimizer = torch.optim.Adam(model.parameters(), learning_rate)
loss_fn = HKRMulticlassLoss(alpha=lip_alpha, min_margin=lip_min_margin)
for epoch_idx in tqdm(range(epochs), position=0, leave=True, desc="Epochs"):
    progressbar = tqdm(trainloader, position=1, leave=False, desc=f"{(epoch_idx+1):>3}. Training")
    for idx, (inputs, targets) in enumerate(progressbar):
        # print(idx + 1)  # OOM in first epoch and first batch
        targets = torch.nn.functional.one_hot(targets, num_classes=len(labels))
        inputs, targets = inputs.to(device, non_blocking=True), targets.to(device, non_blocking=True)
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
print("done")
Thanks for the issue.
Each Lipschitz layer has twice the number of weights of a classical layer, so depending on your GPU an OOM can occur.
We will check with your code on our platforms.
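If it helps while this is being checked, a minimal way to see where the 12 GiB goes is to watch the CUDA allocator around the failing call; only standard torch.cuda counters are used here, and the helper name is made up:

import torch

# Hypothetical helper: wrap the failing forward call and report allocator
# statistics via the standard torch.cuda counters (a CUDA device is assumed).
def probe_forward(model, inputs):
    torch.cuda.reset_peak_memory_stats()
    before = torch.cuda.memory_allocated()
    outputs = model(inputs)
    peak = torch.cuda.max_memory_allocated()
    print(f"allocated before forward: {before / 2**20:.0f} MiB, "
          f"peak during forward: {peak / 2**20:.0f} MiB")
    return outputs

Running it with a much smaller batch_size than 1024 should also show whether the footprint scales with the batch (activations) or stays roughly constant (weights and normalization intermediates).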