google-research/text-to-text-transfer-transformer

CUDA OOM with HF Model

fadebek1 opened this issue

Hi, has the HF model been tested for training on CUDA? I'm getting OOM errors no matter how small the batch size is. I'm using a single V100 with 32 GB. A repro snippet and the full error are attached below. I've profiled the individual steps of hf_model.train with nvidia-smi and narrowed the issue down to here: GPU memory spikes to fill the whole 32 GB right after the dataset is loaded. Is all of the data being loaded onto the GPU? Is this supposed to happen, and is there a way to disable it? (A guess at the kind of thing I mean is sketched at the bottom of this issue.) The error message also supports this, since PyTorch itself only reserved about 2.8 GB for the model.
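For context, this is roughly the kind of check I ran between steps (a simplified sketch, not the exact script; the actual readings came from nvidia-smi). torch.cuda only accounts for PyTorch's own allocations, so comparing it against nvidia-smi shows how much memory something else in the same process is holding:

import subprocess
import torch

def log_gpu_memory(tag):
    # PyTorch-side accounting: only tensors allocated/cached by PyTorch itself.
    allocated = torch.cuda.memory_allocated() / 1024 ** 3
    reserved = torch.cuda.memory_reserved() / 1024 ** 3
    # nvidia-smi sees everything on the device, including memory grabbed by
    # other libraries (e.g. TensorFlow) living in the same process.
    used = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout.strip()
    print(f"[{tag}] torch allocated={allocated:.2f} GiB, "
          f"reserved={reserved:.2f} GiB, nvidia-smi used={used}")

Right after the dataset is loaded, the nvidia-smi number jumps to nearly the full 32 GB while the PyTorch numbers stay small.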

import functools

import seqio
import t5.data.mixtures  # registers the standard T5 tasks and mixtures
import t5.models
import tensorflow_datasets as tfds
import torch
from transformers import Adafactor


model = t5.models.HfPyTorchModel("google/t5-v1_1-base", "/tmp", torch.device("cuda"))

# Point the registered GLUE tasks at TFDS 2.0.0 instead of 1.0.0.
TaskRegistry = seqio.TaskRegistry
for b in tfds.text.glue.Glue.builder_configs.values():
    task = TaskRegistry.get("glue_%s_v002" % b.name)
    task.source._tfds_dataset._name = task.source._tfds_dataset._name.replace("1.0.0", "2.0.0")

model.train(
    mixture_or_task_name="glue_v002_proportional",
    steps=262144,
    save_steps=5000,
    sequence_length={"inputs": 512, "targets": 512},
    split="train",
    batch_size=16,
    optimizer=functools.partial(Adafactor, lr=1e-3, relative_step=False),
)
OutOfMemoryError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 31.75 GiB total capacity;
2.71 GiB already allocated; 45.75 MiB free; 2.79 GiB reserved in total by PyTorch) If reserved 
memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
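Is the fix simply to keep TensorFlow (which seqio/tfds use for the input pipeline) off the GPU entirely? Something like the sketch below is what I had in mind; this is an untested guess on my part, assuming it is the TF side that grabs the memory and that calling it before anything touches the device is enough:

import tensorflow as tf

# Hide all GPUs from TensorFlow so the tf.data input pipeline stays on CPU
# and the 32 GB remains available to PyTorch. Must run before any TF op
# initializes the device.
tf.config.set_visible_devices([], "GPU")

If there is already a supported way to do this through t5.models.HfPyTorchModel, I'd be happy to use that instead.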