bghira/SimpleTuner

dadaptive lion?


coming over here from OneTrainer.
It supports LION and DADAPT-LION.
You support LION, and... quantized LION?
What about adaptive lion?

it would need to support stochastic rounding and/or Kahan summation. furthermore, any new optimiser needs to be proven in a toy ViT training session against AdamW bf16 and Adam fp32. if you are interested, please open a pull request with this data.
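for reference, the usual bf16 stochastic rounding trick looks something like the sketch below. this is a minimal example in plain PyTorch, not SimpleTuner's actual implementation, and the function name is illustrative:

```python
import torch

def add_stochastic_(param_bf16: torch.Tensor, update_fp32: torch.Tensor) -> None:
    """apply an fp32 update to a bf16 parameter with stochastic rounding.

    round-to-nearest silently drops updates once they fall below bf16
    precision; rounding up/down with probability proportional to the
    distance to each neighbour keeps the update unbiased in expectation.
    """
    # accumulate in fp32
    result_fp32 = param_bf16.to(torch.float32) + update_fp32

    # reinterpret the fp32 bits as int32 so the low mantissa bits can be manipulated
    bits = result_fp32.view(torch.int32)

    # bf16 keeps the top 16 bits of an fp32 value; add uniform noise to the
    # low 16 bits before masking them off so the truncation is stochastic
    noise = torch.randint_like(bits, 0, 1 << 16)
    rounded = (bits + noise) & -65536  # zero out the low 16 bits

    param_bf16.copy_(rounded.view(torch.float32).to(torch.bfloat16))
```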

fwiw, D-adaptation can only hit 80% accuracy on CIFAR-10 after 20 epochs. this doesn't bode well for the optimiser's performance, especially versus a more robust option like Adam(W).

i know that it isn't good for long runs because it over-adapts after maybe 100,000 steps.

my use case is to do two runs of training.
first run is with dadapt to find a reasonable LR value. i interrupt that run early.

then i do a full run with regular lion using that value.

[image: LR curve from a dadapt run]

the above is a typical LR curve for dadapt. I'm grabbing the value from the first plateau.
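for context, the value plotted there is essentially the product d * lr that the D-Adaptation optimizer maintains internally. below is a minimal sketch of logging it during the first run, assuming the facebookresearch/dadaptation package and that it stores the adapted d back into each param group (verify against the version you have installed):

```python
import torch
from dadaptation import DAdaptLion  # import path assumed; check your installed package

model = torch.nn.Linear(8, 8)
optimizer = DAdaptLion(model.parameters(), lr=1.0)  # lr acts as a multiplier on the adapted step size

for step in range(1_000):
    loss = model(torch.randn(4, 8)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # effective learning rate ~ d * lr; log it and read the value off the first plateau
    group = optimizer.param_groups[0]
    effective_lr = group.get("d", 0.0) * group["lr"]
    if step % 100 == 0:
        print(f"step {step}: effective lr ~ {effective_lr:.2e}")
```

the plateau value of that product is what would then be passed as the fixed lr for the full run with plain Lion.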

bghira commented

learning rates don't really transfer across optimisers this way, so i'm not sure that's a useful approach.

Eh... I'm sure there will still be some differences, but at least they are both lion.
This strategy is supposed to work for ADAMW, so why not lion?

Quote from a web search: "Yes, you can use 'Dadapt AdamW' to find a good learning rate for AdamW."

bghira commented

as mentioned before, if you want to see this in SimpleTuner you'll have to provide the implementation with the requirements met, as well as data indicating its bf16 training effectiveness against Adam(W) in fp32 and in bf16 w/ stochastic rounding.