seba-1511/dist_tuto.pth

train_dist.py incompatible with new PyTorch

Opened this issue · 5 comments

When I tried to run train_dist.py, I ran into multiple errors that seem to stem from the code being written for older versions of Python and PyTorch.

One issue: torch.utils.data.DataLoader expects batch_size to be an integer.

File "", line 77, in partition_dataset
partition, batch_size= bsz, shuffle=True)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 179, in init
batch_sampler = BatchSampler(sampler, batch_size, drop_last)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 162, in init
"but got batch_size={}".format(batch_size))
ValueError: batch_size should be a positive integer value, but got batch_size=64.0

Another issue; I have no idea what is happening here:

File "", line 85, in average_gradients
dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM, group=0)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 902, in all_reduce
work = group.allreduce([tensor], opts)
AttributeError: 'int' object has no attribute 'allreduce'

Also, when I enabled CUDA in run():

model = model.cuda(rank)
...
data, target = Variable(data.cuda(rank)), Variable(target.cuda(rank))
...

I encountered the following error, which is caused by .cuda(rank): it uses the rank as the CUDA device index and therefore assumes I have multiple GPUs:

Process Process-4:
Traceback (most recent call last):
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "", line 120, in init_processes
fn(rank, size)
File "", line 95, in run
model = model.cuda(rank)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 199, in _apply
param.data = fn(param.data)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: invalid device ordinal
/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/ipykernel/main.py:55: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
Process Process-3:
Traceback (most recent call last):
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "", line 120, in init_processes
fn(rank, size)
File "", line 109, in run
average_gradients(model)
File "", line 85, in average_gradients
dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 907, in all_reduce
work.wait()
RuntimeError: [/opt/conda/conda-bld/pytorch_1556653183467/work/third_party/gloo/gloo/transport/tcp/pair.cc:572] Connection closed by peer [127.0.0.1]:57517
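Not a fix mentioned in this thread, but a common way around the invalid device ordinal error on a single-GPU machine is to map each rank onto a GPU that actually exists instead of passing the rank straight to .cuda(). A rough sketch (the helper name is mine, not the tutorial's):

import torch

def device_for_rank(rank):
    # Fold the rank onto the GPUs that actually exist; on a single-GPU box
    # every rank lands on cuda:0, and CPU-only machines fall back to the CPU.
    if torch.cuda.is_available():
        return torch.device('cuda', rank % torch.cuda.device_count())
    return torch.device('cpu')

# Inside run(rank, size), instead of model.cuda(rank):
#   device = device_for_rank(rank)
#   model = model.to(device)
#   data, target = data.to(device), target.to(device)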

Same here (the batch_size ValueError in partition_dataset), with

In [2]: torch.__version__                                                       
Out[2]: '1.2.0'

Workaround:

train_set = torch.utils.data.DataLoader(partition, batch_size=int(bsz), shuffle=True)
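For reference, a self-contained illustration of the cast; the toy dataset and world size below are made up, but the DataLoader behaviour is the same as in partition_dataset, where bsz ends up as the float 64.0 (presumably from a float division):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
world_size = 2

bsz = 128 / float(world_size)          # 64.0 -- a float, rejected by DataLoader
loader = DataLoader(dataset, batch_size=int(bsz), shuffle=True)   # the cast fixes it

bsz = 128 // world_size                # 64 -- integer division avoids the cast entirely
loader = DataLoader(dataset, batch_size=bsz, shuffle=True)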

The all_reduce error ('int' object has no attribute 'allreduce') is still there.

@marcomilanesio Just remove group=0 from dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM, group=0)
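With that change, average_gradients ends up looking roughly like this (a sketch assuming the tutorial's sum-then-divide averaging; with no group argument, all_reduce uses the default world group):

import torch.distributed as dist

def average_gradients(model):
    # All-reduce every gradient across the default (world) process group,
    # then divide by the number of workers to get the average.
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= world_size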

Perfect!

Also, line 120 should be changed to epoch_loss += loss.data.item() (or simply loss.item()), since indexing a 0-dim loss tensor no longer works on newer PyTorch.
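A quick self-contained check of that change (toy tensors; log_softmax is given dim=1 explicitly, which also silences the deprecation warning seen in the traceback above):

import torch
import torch.nn.functional as F

output = torch.randn(4, 10, requires_grad=True)
target = torch.randint(0, 10, (4,))
loss = F.nll_loss(F.log_softmax(output, dim=1), target)

epoch_loss = 0.0
epoch_loss += loss.item()        # works on current PyTorch
# epoch_loss += loss.data[0]     # fails: 0-dim tensors can no longer be indexed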