Using sparse learning in practice
iamanigeeit opened this issue · 1 comment
iamanigeeit commented
Hi Tim, thanks for making this library. I am trying to test it on speech generation models and i have some questions from your code template:
- The models come with their own schedulers and optimizers. Can I simply wrap them with `decay = CosineDecay(...)` and `mask = Masking(optimizer, ...)`? Should I change the optimizer to follow `optim.SGD(...)` and ignore the scheduler? It looks like `mask.step()` runs every epoch and replaces the scheduler, but I think I should still keep the optimizer specific to the model I have (see the sketch at the end of this comment).
- I understand that density/sparsity is the desired percentage of weights to keep, while the prune/death rate is an internal parameter that determines what percentage of weights is redistributed at each iteration. Is this correct?
- Density appears to equal sparsity in your code, although normally I would expect density = 1 - sparsity.
- The code fails at core.py lines 221-223 when there are RNNs, because for them `bias` is a boolean and the actual bias terms are `bias_ih` and `bias_hh`. I think this might count the parameters better:
```python
for name, tensor in self.modules[0].named_parameters():
    # named_parameters() yields every weight and bias tensor directly,
    # including the RNN biases bias_ih and bias_hh.
    total_size += tensor.numel()
```
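For reference, here is roughly how I imagined wiring this into my existing training loop. This is only a sketch of my reading of your template: the keyword names (`prune_rate`, `prune_rate_decay`, `density`) and the placement of `mask.step()` are my guesses and may not match the current API, and `model`, `train_loader`, `criterion`, `lr_scheduler`, and `epochs` are placeholders for my own setup.

```python
import torch.optim as optim
from sparselearning.core import Masking, CosineDecay

# Keep the optimizer that ships with my model instead of switching to SGD.
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Total number of mask-update steps over the whole run.
total_steps = len(train_loader) * epochs

# Prune-rate schedule and mask wrapper (argument names are my guess from the template).
decay = CosineDecay(0.5, total_steps)
mask = Masking(optimizer, prune_rate=0.5, prune_rate_decay=decay)
mask.add_module(model, density=0.05)   # keep 5% of the weights

for epoch in range(epochs):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        mask.step()        # in place of optimizer.step(); also applies the mask
    lr_scheduler.step()    # the model's own LR scheduler, kept as-is
```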
TimDettmers commented
Hi! Thanks for your questions.
- The mask scheduler is different from the learning rate scheduler. The learning rate scheduler should be unaffected by the code.
- That is correct. The sparsity percentage is kept steady, but the prune rate is decayed over time (see the sketch below for the general idea).
- I think this is correct. For me, it feels more natural to think in terms of density (27% of weights seems more intuitive than 73% sparsity). However, I think I kept the naming in the code as "sparsity" even though I use it conceptually as density.
- This is a good catch! Could you create a pull request for this? I did not test the code for RNNs.
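To illustrate the second point: the density target stays fixed for the whole run, while the fraction of weights pruned and regrown at each update is annealed toward zero. A minimal numeric sketch of a cosine-annealed prune rate (just the general idea, not the exact `CosineDecay` implementation):

```python
import math

def cosine_annealed_prune_rate(initial_rate, step, total_steps):
    # Cosine anneal from initial_rate down to 0 over total_steps;
    # the density/sparsity target itself never changes.
    return initial_rate * 0.5 * (1 + math.cos(math.pi * step / total_steps))

# With a 50% initial prune rate over 100 epochs:
for epoch in (0, 25, 50, 75, 100):
    print(epoch, round(cosine_annealed_prune_rate(0.5, epoch, 100), 3))
# 0 0.5, 25 0.427, 50 0.25, 75 0.073, 100 0.0
```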