Question about Mixed Precision DNNs
PuNeal opened this issue · 5 comments
Hello, I have a question about the implementation of CASE U3:
Why is the formula for calculating the bitwidth different between weights and activations?
https://github.com/sony/ai-research-code/blob/master/mixed-precision-dnns/train_resnet.py#L174
https://github.com/sony/ai-research-code/blob/master/mixed-precision-dnns/train_resnet.py#L264
Hoping for a reply. Thanks!
Hello, thank you for your interest.
Actually, the two lines you mention refer to two different setups: fixed-point parametrized by d and xmax for L174, and pow2 parametrized by x_min and x_max for L264.
The comparison should be between L174 and L252, which use the same setup. The only difference is the +1 in the weight computation. This corresponds to the sign bit, which is needed for the sign of the weights. The activations, on the other hand, do not require a sign bit when using a ReLU activation function.
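To make the +1 concrete, here is a minimal sketch of how the bitwidth can be derived from `d` and `xmax` in the fixed-point case. This is illustrative only: `fixedpoint_bitwidth` is a hypothetical helper, and the exact expression in `train_resnet.py` may differ in detail.

```python
import numpy as np

def fixedpoint_bitwidth(d, xmax, signed):
    """Hypothetical helper: bitwidth of a fixed-point quantizer with step size
    `d` and clipping value `xmax` (sketch, not the exact repository formula)."""
    # bits needed to index the quantization levels in [0, xmax]
    bits = np.ceil(np.log2(xmax / d + 1.0))
    # weights need one extra sign bit; ReLU activations are non-negative
    return bits + 1.0 if signed else bits

# same d and xmax: weights need exactly one bit more than activations
print(fixedpoint_bitwidth(d=0.125, xmax=1.0, signed=True))   # weights     -> 5.0
print(fixedpoint_bitwidth(d=0.125, xmax=1.0, signed=False))  # activations -> 4.0
```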
Hope this answers your question.
Got it, thank you!
Hi, I'm confused about the backward pass during training, regarding the code below:
```python
xmax = clip_scalar(xmax, xmax_min, xmax_max)

# compute min/max value that we can represent
if sign:
    xmin = -xmax
else:
    xmin = nn.Variable((1,), need_grad=False)
    xmin.d = 0.

# broadcast variables to correct size
d = broadcast_scalar(d, shape=x.shape)
xmin = broadcast_scalar(xmin, shape=x.shape)
xmax = broadcast_scalar(xmax, shape=x.shape)

# apply fixed-point quantization
return d * F.round(F.clip_by_value(x, xmin, xmax) / d)
```
`xmax` is a learnable parameter and is used to clamp `x`. How is the gradient produced by `F.clip_by_value` computed? Or is the gradient of `xmax` produced by the compression (size penalty) of the weights/activations? Thank you.
Hello @PuNeal, the gradients are backpropagated through both: `F.clip_by_value` inside the quantizer and also the weight/activation size penalty from the loss function.
For `F.clip_by_value`, the gradients are backpropagated according to the value of `x`, since the function is defined as a composition of `maximum2` and `minimum2` (see https://github.com/sony/nnabla/blob/7e9e97023ca89bf2056d7b7310c15a050ca438b6/python/src/nnabla/functions.py#L695-L729); a small sketch of this routing follows the list below:
- If `x < min`, then `maximum2` clips the value to `min` and the gradient flows to the `min` argument of `F.clip_by_value` (please see https://github.com/sony/nnabla/blob/master/include/nbla/function/maximum2.hpp#L41-L42).
- If `x > max`, then `minimum2` clips the value to `max` and the gradient flows to the `max` argument of `F.clip_by_value` (please see https://github.com/sony/nnabla/blob/master/include/nbla/function/minimum2.hpp#L41-L42).
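Here is a minimal NumPy sketch of that gradient routing. It assumes `xmin`/`xmax` are scalars broadcast to `x.shape` (as in the quantizer above) and routes ties to `x`; the authoritative behaviour is in the nnabla kernels linked above.

```python
import numpy as np

def clip_by_value_grads(x, xmin, xmax, g_y):
    """Illustrative gradient routing of clip_by_value = minimum2(maximum2(x, xmin), xmax)."""
    inside = (x >= xmin) & (x <= xmax)
    g_x    = g_y * inside                 # pass-through where x is not clipped
    g_xmin = np.sum(g_y * (x < xmin))     # clipped from below -> gradient goes to xmin
    g_xmax = np.sum(g_y * (x > xmax))     # clipped from above -> gradient goes to xmax
    return g_x, g_xmin, g_xmax

# example: xmax receives gradient only from the elements that were clipped to xmax
x = np.array([-2.0, 0.3, 5.0])
print(clip_by_value_grads(x, xmin=0.0, xmax=1.0, g_y=np.ones_like(x)))
```

Note that for signed weights `xmin = -xmax`, so `xmax` also collects the (negated) gradient from values clipped at the lower bound.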
@TE-StefanUhlich Thanks for your reply.