See the original repo for detailed instructions and context. 🦁 Lion - PyTorch implementation courtesy of lucidrains.
So 🦁 beats AdamW in these runs: lower loss and higher iter/s. Somehow, though, (fused) AdamW burns less power and spends less time accessing memory, presumably thanks to its highly optimized fused kernel...?
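
For reference, here is a minimal pure-Python sketch of a single Lion update as the Lion paper describes it: one momentum buffer per parameter (AdamW keeps two), with only the sign of the interpolated update applied. This is not lucidrains' code. The names `lion_step` and `sign` are hypothetical, and the default hyperparameters are illustrative assumptions only.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=1e-4, betas=(0.9, 0.99), weight_decay=0.0):
    """One Lion update over plain lists of floats (illustrative sketch only).

    Returns new parameter and momentum lists; the inputs are not mutated.
    """
    beta1, beta2 = betas
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # Interpolate momentum and gradient, then keep only the sign.
        update = sign(beta1 * m + (1 - beta1) * g)
        # Decoupled weight decay, followed by the signed step.
        p = p * (1 - lr * weight_decay) - lr * update
        # The momentum buffer is updated with a separate coefficient, beta2.
        m = beta2 * m + (1 - beta2) * g
        new_params.append(p)
        new_momentum.append(m)
    return new_params, new_momentum
```

Because the step is a sign, every parameter moves by exactly `lr` (before weight decay) regardless of gradient magnitude, which is why Lion is typically run with a smaller learning rate than AdamW.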