Issues
How to use with SSL methods like DINOv2?
#78 opened by josephcappadona - 0
MuP for RNNs
#77 opened by norikazu99 - 0
Increasing coord check for the network output
#71 opened by AkshitaB - 0
MuP for Mamba
#74 opened by norikazu99 - 3
FSDP support?
#59 opened by platers - 2
About Learning rate decay
#64 opened by afcruzs - 6
Questions for training gpt-2 using mup
#66 opened by jiangjiadi - 0
Usage with torch.compile in Pytorch 2?
#60 opened by dreavjr - 0
Should `base=None` be used in `set_base_shapes` for model used for tuning?
#25 opened by callumm-graphcore - 0
Interpreting jitter in coordcheck
#58 opened by leenachennuru - 1
Unclear `assert_hidden_size_inf` triggers
#62 opened by dreavjr - 0
dim_feedforward
#61 opened by dreavjr - 6
Some questions about the implementation of muP.
#57 opened by lepodl - 0
_rescale_parameters() inconsistent with the paper for the tied embedding scenario?
#55 opened by ofivite - 2
Once the best HPs have been found, does the final model have to be trained with `mup` or can one just use the found HPs and train the model in a standard way?
#53 opened by ricomnl - 0
Warmup schedule when changing the number of tokens/steps (GPT-3 experiment detail)
#51 opened by sashaDoubov - 2
Does mup support fine tuning pretrained models
#46 opened by jhj0411jhj - 2
interpreting coord checks
#42 opened by llucid-97 - 1
in mlp example: 2 problems
#41 opened by yjjinjie - 12
Can base model be larger than target model?
#39 opened by jhj0411jhj - 2
Does mup support Swin Transformer v2 model?
#21 opened by shiyf129 - 2
Batch size, Seq len, Step Transfering
#24 opened by timothyxp - 4
Has MuP been tested on segmentation models?
#26 opened by isdj - 3
Finetuning a Pretrained Model Using MuP
#31 opened by zanussbaum - 5
LayerNorm Gain and Bias Multipliers
#28 opened by AWildridge - 5
Conv1D Coord check looks good (I think), but μTransfer does not seem to work?
#23 opened by zanussbaum - 1
muP for contrastive losses
#20 opened by xwjabc - 17
Coord-check for conv1d
#14 opened by bob80333 - 5
mu parametrization for channel attention
#18 opened by xwjabc - 3
Optimizers for coord check
#16 opened by xwjabc - 2
ResNet readout_zero_init=True?
#13 opened by D-X-Y