fxmeng/Pruning-Filter-in-Filter

Why get the mask before optimizer.step()?


Thanks for your great work!
But I have a question about the mask of FilterSkeleton: why is it calculated before optimizer.step(), which updates all of the model's parameters, including FilterSkeleton.data? What if some parameters are updated to below the threshold? Will they still participate in the forward pass?

Looking forward to your reply!

fxmeng commented

After computing the gradients with loss.backward(), we can update the model's parameters using them. However, some stripes may not be crucial to the model's performance, and we may wish to remove them.

To achieve this, we apply the update_skeleton() method to set the gradients of such unimportant parameters to zero. Specifically, we mask the gradients of the parameters whose L1 norm is below a certain threshold, so that they are not updated during optimization.

After masking the gradients, we call optimizer.step() to update the remaining important parameters.
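
For concreteness, here is a minimal sketch of that ordering, assuming modules that own a FilterSkeleton parameter; threshold, loader, criterion, model, and optimizer are illustrative placeholders rather than the repo's exact code:

import torch

threshold = 0.05  # hypothetical threshold on the stripe importance (L1 norm)

for inputs, targets in loader:  # loader, model, criterion, optimizer assumed defined
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    for m in model.modules():
        if hasattr(m, 'FilterSkeleton'):
            # keep only the stripes whose importance is above the threshold
            mask = (m.FilterSkeleton.data.abs() > threshold).float()
            # zero the gradients of unimportant stripes so optimizer.step()
            # leaves them unchanged (with plain SGD; momentum or weight decay
            # could still move them slightly)
            m.FilterSkeleton.grad.data.mul_(mask)
    optimizer.step()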

Thanks for your reply! I understand the main idea of your work; I just wonder when the important parameters should be updated.

I mean, is it possible that an important parameter gets updated to below the threshold by optimizer.step()?

In more detail, once a parameter is identified as important, it will be updated. But if that parameter is very close to the threshold, optimizer.step() may push it just below the threshold, so it is no longer important.

Yet it will still participate in the forward pass. Is this reasonable?
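
As a sketch of what I am suggesting (using the same illustrative placeholder names as the sketch above), the mask would be recomputed after the update, so stripes that just fell below the threshold stop contributing immediately:

loss.backward()
optimizer.step()
for m in model.modules():
    if hasattr(m, 'FilterSkeleton'):
        # recompute importances after the update, then zero the stripes
        # that fell below the threshold so they no longer affect the forward pass
        mask = (m.FilterSkeleton.data.abs() > threshold).float()
        m.FilterSkeleton.data.mul_(mask)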

I have tried getting the mask (i.e., the important parameters) after the parameter update; here are the results:

  • before:

    • Number of params: 2.39M
    • Number of FLOPs: 283.33M
    • Best accuracy: 0.9398 (the reproduced accuracy is higher than reported in the paper)
  • after:

    • Number of params: 2.40M
    • Number of FLOPs: 275.43M
    • Best accuracy: 0.9397

The number of parameters and the accuracy are almost the same, but the FLOPs are lower.

fxmeng commented

Oh, I see. You want to prune, right away, the parameters that become invalid after optimizer.step(); in my implementation, those parameters are pruned in the next iteration instead. Your approach seems reasonable, but I would like to verify two details with you.

First, did you keep the L1-norm penalty before applying optimizer.step(), as in

self.FilterSkeleton.grad.data.add_(sr * torch.sign(self.FilterSkeleton.data))

Second, did you deactivate the updates for the invalid stripes, as in

self.FilterSkeleton.grad.data.mul_(mask)
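
Put together, here is a minimal sketch of that logic as a method of a module owning a FilterSkeleton parameter (sr is the sparsity-regularization coefficient; this is a simplified illustration, not necessarily the exact repo implementation):

import torch

def update_skeleton(self, sr, threshold):
    # (1) L1-norm penalty: push stripe importances toward zero through the gradient
    self.FilterSkeleton.grad.data.add_(sr * torch.sign(self.FilterSkeleton.data))
    # (2) deactivate invalid stripes: zero their gradients so optimizer.step() skips them
    mask = (self.FilterSkeleton.data.abs() > threshold).float()
    self.FilterSkeleton.grad.data.mul_(mask)
    return mask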