lucidrains/byol-pytorch

Error when using torch.nn.DataParallel for multi-GPU training: AssertionError: hidden layer avgpool never emitted an output

SilverUnicorn opened this issue · 4 comments

First of all, thanks for your implementation.
The code runs well when I use a single GPU, but when I try to use multiple GPUs to speed up my pretraining, something goes wrong.

The error looks like this (4 GPUs used):
Traceback (most recent call last):
File "ssl_train.py", line 478, in
main()
File "ssl_train.py", line 474, in main
run(args)
File "ssl_train.py", line 320, in run
summary_pretrain = pretrain_epoch(summary_pretrain, model, optimizer, dataloader_pretrain)
File "ssl_train.py", line 102, in pretrain_epoch
loss = model(data_pretrain)
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
AssertionError: Caught AssertionError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/jzj/projects/medical/camelyon16/bin/data/byol_pytorch.py", line 240, in forward
online_proj_one, _ = self.online_encoder(image_one)
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/jzj/projects/medical/camelyon16/bin/data/byol_pytorch.py", line 155, in forward
representation = self.get_representation(x)
File "/home/jzj/projects/medical/camelyon16/bin/data/byol_pytorch.py", line 151, in get_representation
assert hidden is not None, f'hidden layer {self.layer} never emitted an output'
AssertionError: hidden layer avgpool never emitted an output

So I tried printing hidden in get_representation(self, x):
def get_representation(self, x):
    if self.layer == -1:
        return self.net(x)

    if not self.hook_registered:
        self._register_hook()

    _ = self.net(x)
    hidden = self.hidden
    print('###################################')
    print(hidden)
    print('###################################')
    self.hidden = None
    assert hidden is not None, f'hidden layer {self.layer} never emitted an output'
    return hidden

and the output turned out like this:
###################################
tensor([[1.0303, 1.0782, 1.0756, ..., 1.1541, 0.8629, 1.0167],
[0.9641, 1.0843, 1.1032, ..., 0.9906, 1.0737, 1.0357]],
grad_fn=<...>)
###################################
###################################
tensor([[1.0060, 1.0688, 1.0976, ..., 1.0569, 0.9572, 1.2543],
[1.0713, 1.1613, 1.0059, ..., 1.0245, 0.9633, 0.8983]],
grad_fn=<...>)
###################################
###################################
tensor([[1.0303, 1.0782, 1.0756, ..., 1.1541, 0.8629, 1.0167],
[0.9641, 1.0843, 1.1032, ..., 0.9906, 1.0737, 1.0357]])
###################################
###################################
tensor([[1.0060, 1.0688, 1.0976, ..., 1.0569, 0.9572, 1.2543],
[1.0713, 1.1613, 1.0059, ..., 1.0245, 0.9633, 0.8983]])
###################################
###################################
None
###################################
###################################
None
###################################
###################################
None
###################################
###################################
None
###################################
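The pattern above (some calls see the tensor, others see None) is consistent with a known DataParallel pitfall: a forward hook that stashes its output on an attribute of the wrapper keeps writing to the original wrapper instance, while each replica created by DataParallel reads its own shallow-copied attribute. The following is a minimal standalone sketch of that pitfall with made-up names (Wrapper, _hook); it is not the library's code and needs at least 2 GPUs to reproduce:

import torch
import torch.nn as nn

class Wrapper(nn.Module):
    # made-up wrapper: a forward hook stashes the layer output on self.hidden
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))
        self.hidden = None
        # hook the last layer, analogous to hooking 'avgpool' in a resnet
        self.net[-1].register_forward_hook(self._hook)

    def _hook(self, module, inp, out):
        # bound method of the ORIGINAL wrapper; every DataParallel replica
        # ends up calling this same method, so it writes to the original
        # object while each replica reads its own copied attribute
        self.hidden = out

    def forward(self, x):
        self.hidden = None
        _ = self.net(x)
        hidden = self.hidden
        assert hidden is not None, 'hook never fired for this replica'
        return hidden

if torch.cuda.device_count() >= 2:
    model = nn.DataParallel(Wrapper().cuda())
    model(torch.randn(4, 8).cuda())  # replicas read None -> AssertionError, like above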

@SilverUnicorn Hi! Thanks for reporting this! Would you like to give 0.5.3 a try? I implemented the solution that @Vurkty linked to!

8b0be48
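For reference, the idea behind the fix (paraphrased here, not the exact code of 8b0be48) is to store the hook output in a dictionary keyed by device; since DataParallel copies module attributes shallowly, the replicas end up referencing the same dict and each one can look up the result for its own device. A rough sketch with a hypothetical NetWrapperSketch class:

import torch
import torch.nn as nn

class NetWrapperSketch(nn.Module):
    # hypothetical paraphrase of the per-device bookkeeping idea,
    # not the library's actual NetWrapper code
    def __init__(self, net, layer='avgpool'):
        super().__init__()
        self.net = net
        self.layer = layer
        self.hidden = {}  # device -> hook output; dict object is shared across replicas
        getattr(net, layer).register_forward_hook(self._hook)

    def _hook(self, module, inp, out):
        self.hidden[out.device] = out.flatten(1)

    def forward(self, x):
        _ = self.net(x)
        # pop only this device's entry instead of clearing the whole dict,
        # so concurrent replicas do not disturb each other's entries
        hidden = self.hidden.pop(x.device, None)
        assert hidden is not None, f'hidden layer {self.layer} never emitted an output'
        return hidden

For example, NetWrapperSketch(torchvision.models.resnet50(), 'avgpool') wrapped in nn.DataParallel would return one (batch, 2048) representation per replica; the real 0.5.3 code may differ in details.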

Thanks, this works in my code and fixes that error.

However, another error emerges (4 GPUs used):

/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:64: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
Traceback (most recent call last):
File "ssl_train2.py", line 598, in
main()
File "ssl_train2.py", line 594, in main
run(args)
File "ssl_train2.py", line 435, in run
summary_pretrain = pretrain_epoch(summary_pretrain, model, optimizer, dataloader_pretrain)
File "ssl_train2.py", line 112, in pretrain_epoch
loss = model(data_pretrain)
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
KeyError: Caught KeyError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/jzj/projects/medical/camelyon16/bin/data/byol_pytorch2.py", line 243, in forward
target_proj_two, _ = target_encoder(image_two)
File "/home/jzj/anaconda3/envs/medical/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/jzj/projects/medical/camelyon16/bin/data/byol_pytorch2.py", line 149, in forward
representation = self.get_representation(x)
File "/home/jzj/projects/medical/camelyon16/bin/data/byol_pytorch2.py", line 142, in get_representation
hidden = self.hidden[x.device]
KeyError: device(type='cuda', index=0)
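One way to narrow this down (a hypothetical debugging patch to get_representation, in the spirit of the print-debugging above, not a fix) is to report which device keys the hook actually wrote before the lookup fails:

def get_representation(self, x):
    if self.layer == -1:
        return self.net(x)

    if not self.hook_registered:
        self._register_hook()

    _ = self.net(x)
    recorded = list(self.hidden.keys())      # which devices the hook stored
    hidden = self.hidden.get(x.device)       # .get instead of [] to avoid the bare KeyError
    assert hidden is not None, (
        f'hidden layer {self.layer} has no output for {x.device}; '
        f'recorded devices: {recorded}')
    return hidden

If the recorded list shows the output stored under a different device than x.device, or shows nothing at all, that would point at where the per-device bookkeeping goes wrong for the target encoder.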

===========================
And the code looks like this:

data_pretrain = next(dataiter_pretrain)
# Variable is a no-op since PyTorch 0.4; .to(device) alone is enough here
data_pretrain = Variable(data_pretrain.float().to(device, non_blocking=True))

# model is the BYOL learner wrapped in nn.DataParallel
loss = model(data_pretrain)

optimizer.zero_grad()
# DataParallel gathers one scalar loss per replica into a vector, hence .sum()
loss.sum().backward()
optimizer.step()
# call through .module since the learner itself is wrapped in DataParallel
model.module.update_moving_average()

loss_data = loss.data
loss_sum += loss_data

=============================
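As an aside, the UserWarning about gathering along dimension 0 just means nn.DataParallel collects one scalar loss per replica into a vector, so reducing it before backward (as loss.sum() above does) is expected. A minimal illustration, with a made-up per_replica_losses tensor standing in for DataParallel's gathered output:

import torch

# stand-in for what nn.DataParallel returns when each replica outputs a scalar loss:
# a vector with one entry per GPU used
per_replica_losses = torch.tensor([0.93, 1.02, 0.97, 1.01], requires_grad=True)

loss = per_replica_losses.mean()  # .sum() also works; .mean() keeps the value comparable to single-GPU
loss.backward()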