DDP training error

Question

wmn931201 opened this issue a year ago · 1 comment

During single-machine, multi-GPU distributed training, the following error occurs:

Traceback (most recent call last):
  File "classification_flow_resnet18_distributetrain.py", line 446, in <module>
    res = main()
  File "classification_flow_resnet18_distributetrain.py", line 315, in main
    train_one_epoch(model, criterion, optimizer, data_loader_train, "cuda", epoch, 100)
  File "classification_flow_resnet18_distributetrain.py", line 88, in train_one_epoch
    loss.backward()
  File "/venv/py38_pytorch110-rDELLU7P/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/venv/py38_pytorch110-rDELLU7P/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: setStorage: sizes [1000], strides [1], storage offset 2874352, and itemsize 4 requiring a storage size of 11501408 are out of bounds for storage of size 11497824
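The numbers in the error message can be checked by hand. The sketch below assumes the usual bound for a strided view (last addressable element plus one, times the item size); the helper name `required_storage_bytes` is made up for illustration and is not a PyTorch API. Under that assumption it reproduces the "requiring a storage size of" figure from the message and shows how few float32 slots remain in the storage after the given offset.

```python
def required_storage_bytes(sizes, strides, storage_offset, itemsize):
    """Bytes a strided view needs, counted from the start of its storage.

    Hypothetical helper: assumes the standard strided-view bound.
    """
    if any(n == 0 for n in sizes):
        return 0  # assumption: an empty view needs no storage
    last_index = storage_offset + sum((n - 1) * st for n, st in zip(sizes, strides))
    return (last_index + 1) * itemsize


# Values copied from the RuntimeError above.
needed = required_storage_bytes([1000], [1], 2874352, 4)
available = 11497824

# float32 elements left in the storage after the offset.
elements_left = available // 4 - 2874352

print(needed, available, elements_left)
```

In other words, the view asks for 1000 elements at an offset where far fewer remain, which is consistent with a tensor being larger at backward time than it was when the storage was laid out.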

The code is roughly as follows:
model.train()
backend = BackendType.Novatek
model = prepare_by_platform(model, backend)
if distributed:
    '''
    if args.sync_bn:
        model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    '''
    local_rank = args.local_rank
    torch.cuda.set_device(local_rank)
    model.cuda(local_rank)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)

# calibration loop
model.eval()
enable_calibration(model)
for i, (image, target) in enumerate(data_loader_train):
    image, target = image.cuda(), target.cuda()
    model(image)
    
model.train()
enable_quantization(model)
for epoch in range(args.num_finetune_epochs):
    # Train for a single epoch
    train_one_epoch(model, criterion, optimizer, data_loader_train, "cuda", epoch, 100)

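PyTorch == 1.10.0
Reportedly, downgrading PyTorch to 1.9.0 avoids the error. Also, the weights are quantized per channel with LSQ; when I switch to per tensor, training works normally. The per-tensor path calls torch._fake_quantize_learnable_per_tensor_affine directly, which is implemented in CUDA C, whereas the per-channel path is implemented in Python, as follows: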

def _fake_quantize_learnable_per_channel_affine_training(x, scale, zero_point, ch_axis, quant_min, quant_max, grad_factor):
    zero_point = (zero_point.round() - zero_point).detach() + zero_point
    new_shape = [1] * len(x.shape)
    new_shape[ch_axis] = x.shape[ch_axis]
    scale = grad_scale(scale, grad_factor).reshape(new_shape)
    zero_point = grad_scale(zero_point, grad_factor).reshape(new_shape)
    x = x / scale + zero_point
    x = (x.round() - x).detach() + x
    x = torch.clamp(x, quant_min, quant_max)
    return (x - zero_point) * scale
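To make the shape handling in the function above concrete, here is a small pure-Python sketch of the `new_shape` computation and of the element-count condition that `reshape` imposes. The helper names and the (64, 3, 7, 7) weight shape are illustrative assumptions, not taken from the issue. It suggests that `scale` and `zero_point` must already hold one value per channel by the time this function runs.

```python
def per_channel_broadcast_shape(x_shape, ch_axis):
    """Shape that lets a per-channel vector broadcast against x (hypothetical helper)."""
    new_shape = [1] * len(x_shape)
    new_shape[ch_axis] = x_shape[ch_axis]
    return new_shape


def can_reshape(numel, shape):
    """A tensor can only be reshaped to `shape` if the element counts match."""
    total = 1
    for dim in shape:
        total *= dim
    return numel == total


# Illustrative conv weight: 64 output channels, quantized along axis 0.
conv_weight_shape = (64, 3, 7, 7)
target = per_channel_broadcast_shape(conv_weight_shape, ch_axis=0)

print(target)                   # shape scale/zero_point are reshaped to
print(can_reshape(64, target))  # one value per channel fits
print(can_reshape(1, target))   # a single-element scale does not
```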

Is there any solution other than downgrading PyTorch?

Answer 1 · 2023-03-21T10:15:03.000Z

Found the cause. Modify the __init__ of LearnableFakeQuantize as follows:

class LearnableFakeQuantize(QuantizeBase):
    def __init__(self, observer, scale=1., zero_point=0., channel_len=-1, use_grad_scaling=True, **observer_kwargs):
        super(LearnableFakeQuantize, self).__init__(observer, **observer_kwargs)
        self.use_grad_scaling = use_grad_scaling
        if (channel_len != -1) and (self.is_per_channel):
            assert isinstance(channel_len, int) and channel_len > 0, "Channel size must be a positive integer"
            self.scale = Parameter(torch.tensor([scale] * channel_len))
            self.zero_point = Parameter(torch.tensor([zero_point] * channel_len))
        else:
            self.scale = Parameter(torch.tensor([scale]))
            self.zero_point = Parameter(torch.tensor([zero_point]))

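For per-channel quantization, it is best to define scale and zero_point with the correct size up front (previously they had shape 1 regardless of the quantization scheme). After this change, the following code in qat.conv also needs to be modified: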

    self.weight_fake_quant = self.qconfig.weight(channel_len=out_channels)
    self.bias_fake_quant = self.qconfig.bias(channel_len=out_channels)
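As a rough illustration of the sizing logic in the fix above, the sketch below isolates just the branch that chooses the parameter length. `make_qparams` is a hypothetical stand-in, not part of the library, and the `Conv2d` layer is only an example source for `out_channels`. The point is that the learnable parameters have their final per-channel shape at construction time, before the model is wrapped in DistributedDataParallel.

```python
import torch
from torch.nn import Parameter


def make_qparams(scale=1.0, zero_point=0.0, channel_len=-1, is_per_channel=False):
    """Hypothetical helper mirroring the sizing branch of the patched __init__."""
    if channel_len != -1 and is_per_channel:
        assert isinstance(channel_len, int) and channel_len > 0, \
            "Channel size must be a positive integer"
        n = channel_len  # one scale / zero point per output channel
    else:
        n = 1            # per-tensor: a single scale / zero point
    return (Parameter(torch.tensor([scale] * n)),
            Parameter(torch.tensor([zero_point] * n)))


# Example layer, used only to supply out_channels.
conv = torch.nn.Conv2d(3, 8, kernel_size=3)

pc_scale, pc_zero_point = make_qparams(channel_len=conv.out_channels, is_per_channel=True)
pt_scale, pt_zero_point = make_qparams()

print(tuple(pc_scale.shape), tuple(pt_scale.shape))
```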