thunlp/OpenKE

Error when training with multiple GPUs on a single machine

Panhaolin2001 opened this issue · 0 comments

I added the following to train_transe_FB15K237.py:
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)

But it fails with this error:
Traceback (most recent call last):
  File "/root/OpenKE/train_transe_FB15K237.py", line 46, in <module>
    trainer.run()
  File "/root/OpenKE/openke/config/Trainer.py", line 93, in run
    loss = self.train_one_step(data)
  File "/root/OpenKE/openke/config/Trainer.py", line 45, in train_one_step
    loss = self.model({
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 644, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/OpenKE/openke/module/strategy/NegativeSampling.py", line 26, in forward
    n_score = self._get_negative_score(score)
  File "/root/OpenKE/openke/module/strategy/NegativeSampling.py", line 20, in _get_negative_score
    negative_score = negative_score.view(-1, self.batch_size).permute(1, 0)
RuntimeError: shape '[-1, 2721]' is invalid for input of size 14966
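
A possible reading of the numbers in the last line, offered as a sketch rather than a confirmed diagnosis: `DataParallel` splits each input tensor along dimension 0 across the GPUs, while `_get_negative_score` still reshapes with the full `self.batch_size` (2721). Assuming the example script's usual `neg_ent = 25` (this value does not appear in the traceback), one batch holds 2721 * 26 = 70746 rows. Split into 4 chunks, replica 0 would receive 17687 rows, and if the first 2721 are treated as positives, 14966 remain, which matches the reported size. The helper `replica_chunk_sizes` below is hypothetical and only mimics the ceiling-sized chunking that `torch.chunk` performs. The GPU count of 4 is inferred from the arithmetic, not stated in the report.

```python
import math


def replica_chunk_sizes(total_rows, n_gpus):
    """Hypothetical helper: row counts per replica when a tensor is split
    along dim 0 into ceil-sized chunks (the last chunk may be smaller)."""
    chunk = math.ceil(total_rows / n_gpus)
    sizes = []
    remaining = total_rows
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


batch_size = 2721   # from the error message: shape '[-1, 2721]'
neg_ent = 25        # ASSUMPTION: negatives per positive, not shown in the traceback
n_gpus = 4          # ASSUMPTION: inferred because it reproduces 14966

# Rows in one full batch: positives first, then neg_ent blocks of negatives.
total_rows = batch_size * (1 + neg_ent)

sizes = replica_chunk_sizes(total_rows, n_gpus)

# Replica 0 still slices off `batch_size` rows as positives,
# then tries to view the rest as (-1, batch_size).
negative_rows = sizes[0] - batch_size
leftover = negative_rows % batch_size

print(sizes)          # rows seen by each replica
print(negative_rows)  # rows passed to .view(-1, batch_size)
print(leftover)       # non-zero means the view cannot succeed
```

If this reading is right, wrapping the model in `DataParallel` alone is not enough: the strategy module would also need to work with the per-replica row count instead of the global `self.batch_size`, and the positive and negative rows of each triple would have to land on the same replica.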