Question about Mixed Precision DNNs
PuNeal opened this issue · 5 comments
Hello, I have a question about the implementation of CASE U3:
Why is the formula for calculating the bitwidth different between weights and activations?
https://github.com/sony/ai-research-code/blob/master/mixed-precision-dnns/train_resnet.py#L174
https://github.com/sony/ai-research-code/blob/master/mixed-precision-dnns/train_resnet.py#L264
Hoping for a reply. Thanks!
Hello, thank you for your interest.
Actually, the two lines you mention refer to two different setups: fixed-point parametrized by d and xmax for L174, and pow2 parametrized by x_min and x_max for L264.
The comparison should be between L174 and L252, which use the same setup. The only difference is the +1 in the weight computation. This corresponds to the sign bit, which is needed for the sign of the weights. The activations, on the other hand, do not require a sign bit when using a ReLU activation function.
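To make the +1 concrete, here is a minimal sketch of how the bitwidth can be derived from `d` and `xmax` in the fixed-point case. This is illustrative only: `fixedpoint_bitwidth` is a hypothetical helper, and the exact expression in `train_resnet.py` may differ in detail.

```python
import numpy as np

def fixedpoint_bitwidth(d, xmax, signed):
    """Hypothetical helper: bitwidth of a fixed-point quantizer with step size
    `d` and clipping value `xmax` (sketch, not the exact repository formula)."""
    # bits needed to index the quantization levels in [0, xmax]
    bits = np.ceil(np.log2(xmax / d + 1.0))
    # weights need one extra sign bit; ReLU activations are non-negative
    return bits + 1.0 if signed else bits

# same d and xmax: weights need exactly one bit more than activations
print(fixedpoint_bitwidth(d=0.125, xmax=1.0, signed=True))   # weights     -> 5.0
print(fixedpoint_bitwidth(d=0.125, xmax=1.0, signed=False))  # activations -> 4.0
```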
Hope this answers your question.
Got it, thank you!
Hi, I'm confused about the backward pass during training, regarding the code below:
```python
xmax = clip_scalar(xmax, xmax_min, xmax_max)

# compute min/max value that we can represent
if sign:
    xmin = -xmax
else:
    xmin = nn.Variable((1,), need_grad=False)
    xmin.d = 0.

# broadcast variables to correct size
d = broadcast_scalar(d, shape=x.shape)
xmin = broadcast_scalar(xmin, shape=x.shape)
xmax = broadcast_scalar(xmax, shape=x.shape)

# apply fixed-point quantization
return d * F.round(F.clip_by_value(x, xmin, xmax) / d)
```
`xmax` is a learnable parameter and is used to clamp `x`. How is the gradient produced by `F.clip_by_value` computed? Or is the gradient of `xmax` produced by the compression (size penalty) of the weights/activations? Thank you.
Hello @PuNeal, the gradients are backpropagated through both: `F.clip_by_value` inside the quantizer and also the weight/activation size penalty from the loss function.
For `F.clip_by_value`, the gradients are backpropagated according to the value of `x`, since the function is defined as a composition of `maximum2` and `minimum2` (see https://github.com/sony/nnabla/blob/7e9e97023ca89bf2056d7b7310c15a050ca438b6/python/src/nnabla/functions.py#L695-L729); a small sketch of this routing follows the list below:
- If `x < min`, then `maximum2` clips the value to `min` and the gradient flows to the `min` argument of `F.clip_by_value` (please see https://github.com/sony/nnabla/blob/master/include/nbla/function/maximum2.hpp#L41-L42).
- If `x > max`, then `minimum2` clips the value to `max` and the gradient flows to the `max` argument of `F.clip_by_value` (please see https://github.com/sony/nnabla/blob/master/include/nbla/function/minimum2.hpp#L41-L42).
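Here is a minimal NumPy sketch of that gradient routing. It assumes `xmin`/`xmax` are scalars broadcast to `x.shape` (as in the quantizer above) and routes ties to `x`; the authoritative behaviour is in the nnabla kernels linked above.

```python
import numpy as np

def clip_by_value_grads(x, xmin, xmax, g_y):
    """Illustrative gradient routing of clip_by_value = minimum2(maximum2(x, xmin), xmax)."""
    inside = (x >= xmin) & (x <= xmax)
    g_x    = g_y * inside                 # pass-through where x is not clipped
    g_xmin = np.sum(g_y * (x < xmin))     # clipped from below -> gradient goes to xmin
    g_xmax = np.sum(g_y * (x > xmax))     # clipped from above -> gradient goes to xmax
    return g_x, g_xmin, g_xmax

# example: xmax receives gradient only from the elements that were clipped to xmax
x = np.array([-2.0, 0.3, 5.0])
print(clip_by_value_grads(x, xmin=0.0, xmax=1.0, g_y=np.ones_like(x)))
```

Note that for signed weights `xmin = -xmax`, so `xmax` also collects the (negated) gradient from values clipped at the lower bound.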
@TE-StefanUhlich Thanks for your reply.