Reconcile S4 Optimizer w/ Original Implementation

Question

siddk opened this issue 3 years ago · 0 comments

Specifically:

Fixed small learning rates for the state space matrices, with no weight decay (our current AdamW setup does not respect this).
Larger learning rates and weight decay for all other parameters.
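
One way to express the two points above is to split the parameters into two groups with separate optimizer settings. The sketch below is only an illustration in plain Python: the parameter names in `SSM_PARAM_NAMES`, the helper `build_param_groups`, and all learning-rate and weight-decay values are assumptions made for the example, not values taken from the original implementation.

```python
# Hypothetical leaf names of the state space parameters (assumption);
# replace with whatever the model actually calls them.
SSM_PARAM_NAMES = {"Lambda_re", "Lambda_im", "P", "B", "log_step"}


def build_param_groups(named_params, ssm_lr=1e-3, base_lr=1e-2,
                       base_weight_decay=1e-2):
    """Split (name, param) pairs into two optimizer groups.

    All numeric defaults are placeholders, not tuned or sourced values.
    """
    ssm, other = [], []
    for name, param in named_params:
        # Match on the last path component, e.g. "layer0.B" -> "B".
        leaf = name.rsplit(".", 1)[-1]
        (ssm if leaf in SSM_PARAM_NAMES else other).append(param)
    return [
        # State space matrices: small fixed learning rate, no weight decay.
        {"params": ssm, "lr": ssm_lr, "weight_decay": 0.0},
        # Everything else: larger learning rate, with weight decay.
        {"params": other, "lr": base_lr, "weight_decay": base_weight_decay},
    ]
```

The returned list of dicts follows the per-parameter-group shape that PyTorch optimizers accept, so with real tensors it could be passed to `torch.optim.AdamW(groups)`. Keeping the state space learning rate fixed would additionally require excluding that group from any learning-rate schedule, which this sketch does not cover.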