pytorchbearer/torchbearer

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

FrancescoSaverioZuppichini opened this issue · 7 comments

Dear all,

It seems that torchbearer is not working for me. I am simply trying to classify images using a resnet. You can find my code here (https://github.com/FrancescoSaverioZuppichini/PyTorch-Deep-Learning-Template/tree/feature/cuda-error); the main training logic is:

import time
from comet_ml import Experiment
import torchbearer
import torch.optim as optim
import torch.nn as nn
from torchsummary import summary
from Project import Project
from data import get_dataloaders
from data.transformation import train_transform, val_transform
from models import MyCNN, resnet18
from utils import device, show_dl
from torchbearer import Trial
from torchbearer.callbacks import CSVLogger, ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from callbacks import CometCallback
from logger import logging

if __name__ == '__main__':
    project = Project()
    # our hyperparameters
    params = {
        'lr': 0.001,
        'batch_size': 64,
        'epochs': 1,
        'model': 'resnet18-finetune',
        'id': time.time()
    }

    logging.info(f'Using device={device} 🚀')
    # everything starts with the data
    train_dl, val_dl, test_dl = get_dataloaders(
        project.data_dir,
        val_transform=val_transform,
        train_transform=train_transform,
        batch_size=params['batch_size'],
        num_workers=4,
    )
    # it is always good practice to visualise some of the train and val images to be sure data-aug
    # is applied properly
    # show_dl(train_dl)
    # show_dl(test_dl)
    # define our comet experiment
    experiment = Experiment(api_key='8THqoAxomFyzBgzkStlY95MOf',
                            project_name="dl-pytorch-template", workspace="francescosaveriozuppichini")
    experiment.log_parameters(params)
    # create our special resnet18
    cnn = resnet18(n_classes=2).to(device)
    loss = nn.CrossEntropyLoss()
    # print the model summary to show useful information
    logging.info(summary(cnn, (3, 224, 244)))
    # define custom optimizer and instantiate the trainer `Model`
    optimizer = optim.Adam(cnn.parameters(), lr=params['lr'])
    # create our Trial object to train and evaluate the model
    trial = Trial(cnn, optimizer, loss, metrics=['acc', 'loss'],
                  callbacks=[
                      CometCallback(experiment),
                      ReduceLROnPlateau(monitor='val_loss',
                                        factor=0.1, patience=5),
                      EarlyStopping(monitor='val_acc', patience=5, mode='max'),
                      CSVLogger(str(project.checkpoint_dir / 'history.csv')),
                      ModelCheckpoint(str(project.checkpoint_dir / f'{params["id"]}-best.pt'), monitor='val_acc', mode='max')
    ]).to(device)
    trial.with_generators(train_generator=train_dl,
                          val_generator=val_dl, test_generator=test_dl)
    history = trial.run(epochs=params['epochs'], verbose=1)
    logging.info(history)
    preds = trial.evaluate(data_key=torchbearer.TEST_DATA)
    logging.info(f'test preds=({preds})')
    # experiment.log_metric('test_acc', test_acc)

I am running the same logic (same model) with poutyne and have no problems. I would really like to switch to torchbearer.

Error is:

2020-02-03 13:32:03,386 - [INFO] - None
  0%|                                                                                                                                                             | 0/1 [00:00<?, ?it/s]C:/w/1/s/tmp_conda_3.7_100118/conda/conda-bld/pytorch_1579082551706/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/w/1/s/tmp_conda_3.7_100118/conda/conda-bld/pytorch_1579082551706/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: block: [0,0,0], thread: [13,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/w/1/s/tmp_conda_3.7_100118/conda/conda-bld/pytorch_1579082551706/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: block: [0,0,0], thread: [17,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/w/1/s/tmp_conda_3.7_100118/conda/conda-bld/pytorch_1579082551706/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: block: [0,0,0], thread: [20,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/w/1/s/tmp_conda_3.7_100118/conda/conda-bld/pytorch_1579082551706/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: block: [0,0,0], thread: [21,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/w/1/s/tmp_conda_3.7_100118/conda/conda-bld/pytorch_1579082551706/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: block: [0,0,0], thread: [22,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/w/1/s/tmp_conda_3.7_100118/conda/conda-bld/pytorch_1579082551706/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: block: [0,0,0], thread: [23,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/w/1/s/tmp_conda_3.7_100118/conda/conda-bld/pytorch_1579082551706/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: block: [0,0,0], thread: [25,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/w/1/s/tmp_conda_3.7_100118/conda/conda-bld/pytorch_1579082551706/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: block: [0,0,0], thread: [29,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/w/1/s/tmp_conda_3.7_100118/conda/conda-bld/pytorch_1579082551706/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: block: [0,0,0], thread: [30,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "c:/Users/Francesco/Documents/PyTorch-Deep-Learning-Template/main.py", line 64, in <module>
    history = trial.run(epochs=params['epochs'], verbose=1)
  File "C:\Users\Francesco\Anaconda3\envs\dl\lib\site-packages\torchbearer\trial.py", line 133, in wrapper
    res = func(self, *args, **kwargs)
  File "C:\Users\Francesco\Anaconda3\envs\dl\lib\site-packages\torchbearer\trial.py", line 988, in run
    final_metrics = self._fit_pass(state)[torchbearer.METRICS]
  File "C:\Users\Francesco\Anaconda3\envs\dl\lib\site-packages\torchbearer\trial.py", line 298, in wrapper
    res = func(self, *args, **kwargs)
  File "C:\Users\Francesco\Anaconda3\envs\dl\lib\site-packages\torchbearer\trial.py", line 1033, in _fit_pass
    state[torchbearer.OPTIMIZER].step(lambda: self.closure(state))
  File "C:\Users\Francesco\Anaconda3\envs\dl\lib\site-packages\torch\optim\adam.py", line 58, in step
    loss = closure()
  File "C:\Users\Francesco\Anaconda3\envs\dl\lib\site-packages\torchbearer\trial.py", line 1033, in <lambda>
    state[torchbearer.OPTIMIZER].step(lambda: self.closure(state))
  File "C:\Users\Francesco\Anaconda3\envs\dl\lib\site-packages\torchbearer\bases.py", line 382, in closure
    state[loss].backward(**state[torchbearer.BACKWARD_ARGS])
  File "C:\Users\Francesco\Anaconda3\envs\dl\lib\site-packages\comet_ml\monkey_patching.py", line 246, in wrapper
    return_value = original(*args, **kwargs)
  File "C:\Users\Francesco\Anaconda3\envs\dl\lib\site-packages\torch\tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\Francesco\Anaconda3\envs\dl\lib\site-packages\torch\autograd\__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

Does your library work for you? Do you use it in your daily workflow?

Thank you.

Cheers,

Francesco Saverio

Hi Francesco,

I did some testing with the code you posted and I only got a CUDA error when the number of classes in the dataset and the number of classes the resnet expected were different. Once I set n_classes in the resnet18 to 200 (tiny imagenet has 200 classes), the error disappeared.

These are the typically opaque CUDA errors that pointed me to it:
pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
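
If it helps anyone else hitting this: that assert just means a target label fell outside [0, n_classes). A quick CPU-side check on one batch of labels (just a sketch with made-up numbers, nothing torchbearer-specific) makes the mismatch obvious before the kernel assert fires:

import torch

n_classes = 2                           # number of classes the model's head was built for
targets = torch.tensor([0, 1, 150, 1])  # e.g. a label from a 200-class dataset sneaks in

bad = (targets < 0) | (targets >= n_classes)
if bad.any():
    print(f'{bad.sum().item()} labels outside [0, {n_classes}): {targets[bad].tolist()}')
    # -> rebuild the model head with the right n_classes, or fix the dataset labels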

The only other thing different in my code was that I turned off the comet_ml callback, since I don't have my own API key for it, but I think that is just for logging?

For reference, my code:

import time
from comet_ml import Experiment
import torchbearer
import torch.optim as optim
import torch.nn as nn
from torchsummary import summary
from Project import Project
from data import get_dataloaders
from data.transformation import train_transform, val_transform
from models import MyCNN, resnet18
from utils import device, show_dl
from torchbearer import Trial
from torchbearer.callbacks import CSVLogger, ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from callbacks import CometCallback
from logger import logging

if __name__ == '__main__':
    project = Project()
    # our hyperparameters
    params = {
        'lr': 0.001,
        'batch_size': 64,
        'epochs': 1,
        'model': 'resnet18-finetune',
        'id': time.time()
    }

    logging.info(f'Using device={device} 🚀')
    # everything starts with the data
    train_dl, val_dl, test_dl = get_dataloaders(
        '/datasets/tinyimagenet/tiny-imagenet-200/train',
        '/datasets/tinyimagenet/tiny-imagenet-200/train',
        val_transform=val_transform,
        train_transform=train_transform,
        batch_size=params['batch_size'],
        num_workers=4,
    )
    # it is always good practice to visualise some of the train and val images to be sure data-aug
    # is applied properly
    # show_dl(train_dl)
    # show_dl(test_dl)
    # define our comet experiment
    # experiment = Experiment(api_key='8THqoAxomFyzBgzkStlY95MOf',
    #                         project_name="dl-pytorch-template", workspace="francescosaveriozuppichini")
    # experiment.log_parameters(params)
    # create our special resnet18
    cnn = resnet18(n_classes=200).to(device)
    loss = nn.CrossEntropyLoss()
    # print the model summary to show useful information
    logging.info(summary(cnn, (3, 224, 244)))
    # define custom optimizer and instantiate the trainer `Model`
    optimizer = optim.Adam(cnn.parameters(), lr=params['lr'])
    # create our Trial object to train and evaluate the model
    trial = Trial(cnn, optimizer, loss, metrics=['acc', 'loss'],
                  callbacks=[
                      # CometCallback(experiment),
                      ReduceLROnPlateau(monitor='val_loss',
                                        factor=0.1, patience=5),
                      EarlyStopping(monitor='val_acc', patience=5, mode='max'),
                      CSVLogger(str(project.checkpoint_dir / 'history.csv')),
                      ModelCheckpoint(str(project.checkpoint_dir / f'{params["id"]}-best.pt'), monitor='val_acc', mode='max')
    ]).to(device)
    trial.with_generators(train_generator=train_dl,
                          val_generator=val_dl, test_generator=test_dl)
    history = trial.run(epochs=params['epochs'], verbose=2)
    logging.info(history)
    preds = trial.evaluate(data_key=torchbearer.TEST_DATA)
    logging.info(f'test preds=({preds})')
    # experiment.log_metric('test_acc', test_acc)

Let me know if you still have problems,
Matt

Definitely an error somewhere in my dataset, sorry :) By the way, I have spotted a new error at https://github.com/pytorchbearer/torchbearer/blob/master/torchbearer/callbacks/early_stopping.py#L73: a check for self._monitor must be added. Also, in the TorchScheduler here https://github.com/pytorchbearer/torchbearer/blob/master/torchbearer/callbacks/early_stopping.py#L73 we should check whether self._monitor is present in the state.
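
Something along these lines is what I have in mind, just a sketch of the guard (simplified, and the names may not match the real callback code exactly):

import warnings
import torchbearer

def get_monitored_value(state, monitor):
    # look up the monitored metric; warn and bail out instead of raising a KeyError
    metrics = state[torchbearer.METRICS]
    if monitor not in metrics:
        warnings.warn(f'Metric {monitor} not found in state, skipping this callback step')
        return None
    return metrics[monitor]

# inside e.g. EarlyStopping.on_end_epoch(self, state):
#     value = get_monitored_value(state, self._monitor)
#     if value is None:
#         return  # quit the callback early instead of crashing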

The current version of torchbearer still does not work when .evaluate is called. Let me know if I can help.

Since we had the same bug in a couple of callbacks, I just made a generic get_metric function in bases.py which handles both checking for presence in the metrics dictionary and throwing the warning if the lookup fails. I changed all the places I found that were accessing the metrics dict, so I think we should be okay for .evaluate calls on labelled test data now.

I think those changes were in the most recent release, so they should be on pip now too.
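
If you want to double-check that you've picked up the new release, something like this should do it (assuming the package exposes __version__, which I believe it does):

import torchbearer
print(torchbearer.__version__)  # compare against the latest release on PyPI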

Thanks for pointing it out though, let us know if you notice anything else!
Matt

Hi Matt,

the current release does not work. You can use my code to test it; just run it, even with tiny imagenet. Would it be possible to see how you test it?

Thanks :)

Francesco Saverio

Sorry, you're right, that's my bad. We throw the warning but don't then exit the callbacks, as you say. I'll add some logic to make sure we quit the callbacks that use it if it fails.

The tests are all in the repo, but I only added a test to check that the warning was thrown, not whether the callback then continues afterwards. I should add extra tests to make sure we don't do this again, but we're fairly busy at the moment, so it might have to wait until later on.

Matt

Sure :) If you point me in the right direction I can fix the code and make a PR

Hi @MattPainter01, any news? I hope you have more spare time now; the fix should be easy (I think). Maybe you could put together a full example with all the callbacks and check that the code executes correctly.

I really would like to start using torchbearer :)