ydwen/opensphere

weights normalization

mnikitin opened this issue · 4 comments

Hello!

What's the reason to skip gradient computation during normalization of classifier weights?

```python
with torch.no_grad():
    self.w.data = F.normalize(self.w.data, dim=0)
```
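
For context, here is roughly how that pattern sits in a loss module (a minimal sketch of my own; the class name and the margin-free cosine loss are illustrative, not the actual opensphere code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SphereHead(nn.Module):
    """Illustrative sketch only, not the actual opensphere module."""
    def __init__(self, feat_dim, num_class):
        super().__init__()
        self.w = nn.Parameter(torch.Tensor(feat_dim, num_class))
        nn.init.xavier_normal_(self.w)

    def forward(self, x, y):
        # re-project the class-center weights onto the unit sphere,
        # outside the autograd graph
        with torch.no_grad():
            self.w.data = F.normalize(self.w.data, dim=0)
        # cosine similarity between normalized features and class centers
        cos_theta = F.normalize(x, dim=1).mm(self.w)
        # margin / scaling terms omitted for brevity
        return F.cross_entropy(cos_theta, y)
```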

You've implemented all sphere losses in this way, so it's not a typo :)
I think I'm missing something in your implementation; could you explain it?

Thank you!

Hello @mnikitin,
This is not an answer, I have a question too :))
To my understanding, the weight is, geometrically, the class center. But why is it always initialized from some distribution,

```python
self.w = nn.Parameter(torch.Tensor(feat_dim, num_class))
nn.init.xavier_normal_(self.w)
```

and never updated again?

@tungdop2 hello!

xavier_normal is one of the default choices for initializing conv and dense layers.
Also, in the authors' implementation, the classifier weights are actually being updated: https://github.com/ydwen/opensphere/blob/main/runner.py#L98
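
For example (a toy sketch, not the repo's training loop; sizes are arbitrary), since the weight is a registered nn.Parameter, it receives gradients and is changed by optimizer.step():

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# toy stand-in for the classifier weights (sizes are arbitrary)
w = nn.Parameter(torch.randn(4, 3))
optimizer = torch.optim.SGD([w], lr=0.1)

x = torch.randn(8, 4)          # dummy features
y = torch.randint(0, 3, (8,))  # dummy labels

# same pattern as in the loss modules: re-normalize outside the graph,
# then use w directly in the forward pass
with torch.no_grad():
    w.data = F.normalize(w.data, dim=0)
logits = F.normalize(x, dim=1).mm(w)
loss = F.cross_entropy(logits, y)
loss.backward()

print(w.grad is not None)              # True: gradients reach the class centers
w_before_step = w.detach().clone()
optimizer.step()
print(torch.allclose(w_before_step, w.detach()))  # False: the optimizer updated w
```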

So it looks fine from this perspective.
But I'm still not sure about the grad-skipping during normalization.

Thanks @mnikitin,
To your question: I think we just optimize the weight after normalizing it and don't need to denormalize it. Its original (unnormalized) value doesn't need to be preserved; we don't care about it.
Back to my question: so the weight is updated by gradient descent? I don't really understand this step. Another version, in InsightFace, confuses me even more:
https://github.com/deepinsight/insightface/blob/149ea0ffae5cda765102bd7c2d28e27429f828e8/recognition/arcface_torch/partial_fc.py#L138

@tungdop2 hello,
First, the optimizer does change the weight values; no_grad on the normalization only skips gradient tracking for the normalization step itself.
The following is the reason for using no_grad:
The normalized weight is the class center. During training, the norm of the weight vector tends to grow. Without no_grad (i.e., if the weight were only normalized inside the computation graph and the stored parameter never re-projected), the norm of the stored weight would keep growing and growing; this doesn't affect the normalized weight (the direction), but it makes the network unstable.
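
A toy comparison of the two variants (illustrative only; the sizes, learning rate, and margin-free loss are arbitrary, not the repo's training setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def run(reproject, steps=500, lr=1.0):
    """Toy loop. reproject=True mimics the repo: re-normalize w.data under
    no_grad each iteration, so the stored parameter stays on the unit sphere.
    reproject=False keeps the normalization inside the autograd graph instead."""
    w = nn.Parameter(F.normalize(torch.randn(4, 3), dim=0))
    opt = torch.optim.SGD([w], lr=lr)
    for _ in range(steps):
        x = torch.randn(32, 4)
        y = torch.randint(0, 3, (32,))
        if reproject:
            with torch.no_grad():
                w.data = F.normalize(w.data, dim=0)
            cos_theta = F.normalize(x, dim=1).mm(w)
        else:
            # the gradient of this normalization is orthogonal to w, so every
            # SGD step strictly increases ||w||, even though the normalized
            # direction (the class center) is unaffected
            cos_theta = F.normalize(x, dim=1).mm(F.normalize(w, dim=0))
        loss = F.cross_entropy(cos_theta, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach().norm(dim=0)

print(run(reproject=True))   # column norms stay close to 1
print(run(reproject=False))  # column norms have grown above 1
```

With the re-projection, the stored parameter itself always stays approximately unit-norm, so the growing-norm issue never appears.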