TimDettmers/sparse_learning

Using sparse learning in practice

iamanigeeit opened this issue · 1 comment

Hi Tim, thanks for making this library. I am trying to test it on speech generation models and I have some questions about your code template:

  1. The models come with their own schedulers and optimizers. Can I simply wrap them with decay = CosineDecay ... and mask = Masking(optimizer, ...)? Should I change the optimizer to optim.SGD(...) and ignore the scheduler? It looks like mask.step() runs every epoch and replaces the scheduler, but I think I should still keep the optimizer specific to my model.
  2. I understand that density/sparsity is the desired % of weights to keep, while the prune/death rate is an internal parameter that determines what % of weights should be redistributed at each iteration. Is this correct?
  3. Density appears to equal sparsity in your code, although normally I would expect density = 1 - sparsity.
  4. The code fails at core.py lines 221-223 when there are RNNs, because for them bias is a boolean and the bias terms are actually bias_ih and bias_hh. I think this would count the parameters better:
for name, tensor in self.modules[0].named_parameters():
    # named_parameters() also yields bias_ih/bias_hh for RNN layers
    total_size += tensor.numel()
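
For example, a quick standalone check (the layer sizes here are arbitrary) shows why the .bias attribute breaks for RNN modules and that named_parameters() counts everything without special-casing:

import torch.nn as nn

lstm = nn.LSTM(input_size=80, hidden_size=256, num_layers=2)

# For RNN modules, .bias is the boolean constructor flag, not a parameter
# tensor, so anything that tries module.bias.numel() fails.
print(type(lstm.bias))  # <class 'bool'>

# named_parameters() exposes weight_ih_l*, weight_hh_l*, bias_ih_l* and
# bias_hh_l* directly, so they are all counted.
total_size = sum(tensor.numel() for name, tensor in lstm.named_parameters())
print(total_size)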

Hi! Thanks for your questions.

  1. The mask scheduler is different from the learning rate scheduler. The learning rate scheduler should be unaffected by the code; a rough wiring sketch follows after this list.
  2. That is correct. The sparsity percentage is kept steady, but the prune rate changes over time.
  3. I think this is correct. For me, it feels more natural to think in terms of density (27% of weights seems more intuitive than 73% sparsity). However, I believe I kept the naming in the code as "sparsity" even though I use it to mean density.
  4. This is a good catch! Could you create a pull request for this? I did not test the code for RNNs.
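
For point 1, here is a rough wiring sketch. The model, data, and loss are placeholders standing in for your speech model, and the keyword names (prune_rate, prune_rate_decay, density) follow the prune_* spelling; depending on the library version they may be the death_* variants mentioned in your question, so check the signatures in core.py and the call sites in main.py:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sparselearning.core import Masking, CosineDecay

# Placeholders standing in for the real speech model and data.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))
train_loader = DataLoader(TensorDataset(torch.randn(128, 80), torch.randn(128, 80)), batch_size=16)
epochs = 10

# Keep the model's own optimizer and learning rate scheduler.
optimizer = optim.Adam(model.parameters(), lr=1e-3)
lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5)

# Separate schedule for the prune/death rate of the mask.
decay = CosineDecay(0.5, len(train_loader) * epochs)
mask = Masking(optimizer, prune_rate=0.5, prune_rate_decay=decay)
mask.add_module(model, density=0.27)  # fraction of weights to keep (see point 3)

criterion = nn.MSELoss()
for epoch in range(epochs):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        mask.step()      # used in place of optimizer.step() in the template (check main.py for the exact call site)
    lr_scheduler.step()  # the LR schedule runs independently of the mask schedule

Note that CosineDecay only anneals the prune/death rate (point 2); the density stays fixed at whatever you pass when adding the module.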