Adafactor optimizer
isamu-isozaki opened this issue · 1 comment
isamu-isozaki commented
I'm planning to add the Adafactor optimizer used in the official implementation. The main benefit over Adam/AdamW is that we don't need 3x the VRAM of the model; with Adafactor, I think it's only a bit above 2x. I currently have the code up at https://github.com/isamu-isozaki/adafactor-pytorch, and after adding a Triton version I will open a PR here!
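For anyone curious where the memory savings come from, here's a rough sketch of the factored second-moment update. Names like `adafactor_step`, `row_state`, and `col_state` are illustrative, and the real algorithm also has relative step sizes, update clipping, and a fallback for 1-D parameters that this leaves out:

```python
# Minimal sketch of Adafactor's factored second moment, assuming a 2-D
# weight of shape (n, m). Instead of Adam's full (n, m) second-moment
# buffer, only a row vector and a column vector are stored.
import torch

def adafactor_step(param, grad, row_state, col_state,
                   lr=1e-3, beta2=0.999, eps=1e-30):
    # Exponential moving averages of the row/column means of grad^2.
    sq = grad * grad + eps
    row_state.mul_(beta2).add_(sq.mean(dim=1), alpha=1 - beta2)  # shape (n,)
    col_state.mul_(beta2).add_(sq.mean(dim=0), alpha=1 - beta2)  # shape (m,)
    # Rank-1 reconstruction of the second moment:
    # outer(row, col) / mean(row) recovers the paper's R C / sum(R).
    v_hat = torch.outer(row_state, col_state) / row_state.mean()
    param.sub_(lr * (grad / v_hat.sqrt()))

# Usage: the optimizer state is two vectors, not a full matrix.
w = torch.randn(64, 32)
g = torch.randn_like(w)
r, c = torch.zeros(64), torch.zeros(32)
adafactor_step(w, g, r, c)
```

Since the state is two vectors instead of an (n, m) matrix, the second-moment storage shrinks from O(nm) to O(n + m).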
isamu-isozaki commented
I finished at least the Python version! For Triton, it seems like multiplying a row matrix by a column matrix requires dimensions of at least 16, so the rank-1 outer product needs some more ideas.
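Assuming the constraint at play is `tl.dot`'s minimum dimension requirement (the inner dimension of an (n, 1) x (1, m) product is 1, well below 16), one possible workaround is to build the outer product with broadcasting instead of a matmul. This is only a sketch; `outer_kernel` and `outer_div` are illustrative names, not code from the repo:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def outer_kernel(r_ptr, c_ptr, out_ptr, n, m, r_sum,
                 BLOCK_N: tl.constexpr, BLOCK_M: tl.constexpr):
    pid_n = tl.program_id(0)
    pid_m = tl.program_id(1)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    r = tl.load(r_ptr + offs_n, mask=offs_n < n, other=0.0)
    c = tl.load(c_ptr + offs_m, mask=offs_m < m, other=0.0)
    # Broadcasting builds the (BLOCK_N, BLOCK_M) tile without tl.dot,
    # so block sizes are not tied to the >=16 matmul dimension rule.
    tile = r[:, None] * c[None, :] / r_sum
    ptrs = out_ptr + offs_n[:, None] * m + offs_m[None, :]
    mask = (offs_n[:, None] < n) & (offs_m[None, :] < m)
    tl.store(ptrs, tile, mask=mask)

def outer_div(r, c):
    # Computes outer(r, c) / sum(r), i.e. the rank-1 second-moment estimate.
    n, m = r.numel(), c.numel()
    out = torch.empty(n, m, device=r.device, dtype=r.dtype)
    grid = (triton.cdiv(n, 32), triton.cdiv(m, 32))
    outer_kernel[grid](r, c, out, n, m, r.sum().item(),
                       BLOCK_N=32, BLOCK_M=32)
    return out
```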