RuntimeError: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch

Question

amirj opened this issue 6 years ago · 1 comments

I have an implicit matrix factorization (MF) model, model_implicit. I want to reuse its user and item embeddings to initialize the user_embedding and item_embedding layers of a new model:

# create the representation layer
bilinear = BilinearNet(num_users=dataset_implicit.num_users,
                       num_items=dataset_implicit.num_items,
                       embedding_dim=LATENT_DIM,
                       user_embedding_layer=model_implicit._net.user_embeddings,
                       item_embedding_layer=model_implicit._net.item_embeddings)

When I try to train a model initialized with the above representation:

newmodel = ImplicitFactorizationModel(loss='bpr',
                                      representation=binonlinear,
                                      ....)

I got the following strange error:

~/user_preferences_model/multi_implicit.py in fit(self, interactions, verbose)
    240 
    241                 # leverage the current batch of users/items as positive instances
--> 242                 positive_prediction = self._net(batch_user, batch_item)
    243 
    244                 # find some negative instances for the current batch_users

/users/tr.amirhj/anaconda3/envs/keras_gpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    489             result = self._slow_forward(*input, **kwargs)
    490         else:
--> 491             result = self.forward(*input, **kwargs)
    492         for hook in self._forward_hooks.values():
    493             hook_result = hook(self, input, result)

~/user_preferences_model/representations.py in forward(self, user_ids, item_ids)
    105         item_embedding = item_embedding.squeeze()
    106 
--> 107         output_representation_users = self._net(user_embedding)
    108         output_representation_items = self._net(item_embedding)
    109 

/users/tr.amirhj/anaconda3/envs/keras_gpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    489             result = self._slow_forward(*input, **kwargs)
    490         else:
--> 491             result = self.forward(*input, **kwargs)
    492         for hook in self._forward_hooks.values():
    493             hook_result = hook(self, input, result)

/users/tr.amirhj/anaconda3/envs/keras_gpu/lib/python3.6/site-packages/torch/nn/modules/container.py in forward(self, input)
     89     def forward(self, input):
     90         for module in self._modules.values():
---> 91             input = module(input)
     92         return input
     93 

/users/tr.amirhj/anaconda3/envs/keras_gpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    489             result = self._slow_forward(*input, **kwargs)
    490         else:
--> 491             result = self.forward(*input, **kwargs)
    492         for hook in self._forward_hooks.values():
    493             hook_result = hook(self, input, result)

/users/tr.amirhj/anaconda3/envs/keras_gpu/lib/python3.6/site-packages/torch/nn/modules/linear.py in forward(self, input)
     53 
     54     def forward(self, input):
---> 55         return F.linear(input, self.weight, self.bias)
     56 
     57     def extra_repr(self):

/users/tr.amirhj/anaconda3/envs/keras_gpu/lib/python3.6/site-packages/torch/nn/functional.py in linear(input, weight, bias)
    990     if input.dim() == 2 and bias is not None:
    991         # fused op is marginally faster
--> 992         return torch.addmm(bias, input, weight.t())
    993 
    994     output = input.matmul(weight.t())

RuntimeError: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/THCGeneral.cpp:411

What's the problem?
@maciejkula

Answer 1 · 2018-11-14T07:54:09.000Z

It was my fault. I was using an old model with different dimensions.
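A mismatch like this can be caught before training by checking the reused layers against the sizes the new model expects. The following is only a minimal sketch, not part of the original code: `check_embedding` is a hypothetical helper, the sizes are made up, and the stand-in objects only mimic the `num_embeddings` and `embedding_dim` attributes that `torch.nn.Embedding` exposes. It assumes the reused layers are `torch.nn.Embedding` instances or subclasses.

```python
from types import SimpleNamespace


def check_embedding(layer, expected_rows, expected_dim, name="embedding"):
    """Raise ValueError if a reused embedding layer has unexpected dimensions.

    `layer` is any object exposing `num_embeddings` and `embedding_dim`,
    the attribute names used by torch.nn.Embedding.
    """
    actual = (layer.num_embeddings, layer.embedding_dim)
    expected = (expected_rows, expected_dim)
    if actual != expected:
        raise ValueError(f"{name}: expected shape {expected}, got {actual}")


# Stand-in for an embedding layer taken from an old, saved model
# (hypothetical sizes; a real torch.nn.Embedding has the same attributes).
old_user_embeddings = SimpleNamespace(num_embeddings=1000, embedding_dim=32)

LATENT_DIM = 64  # dimension the new model is built with (hypothetical)

try:
    check_embedding(old_user_embeddings, 1000, LATENT_DIM, name="user_embeddings")
    mismatch = None
except ValueError as err:
    mismatch = str(err)

print(mismatch)
```

In the setting of the question, the same check could be applied to model_implicit._net.user_embeddings and model_implicit._net.item_embeddings before they are passed to the new representation. Running the failing step on the CPU instead of the GPU is another way to look for a more descriptive size-mismatch message than the cuBLAS error.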