huggingface/open-muse

Adafactor optimizer

isamu-isozaki opened this issue · 1 comment

I'm planning to add the Adafactor optimizer used in the official implementation. The main benefit over Adam/AdamW is that we don't need ~3x the VRAM of the model; Adafactor needs only a bit above 2x, because it factors the second-moment estimate into a row vector and a column vector instead of storing a full matrix per parameter. I currently have the code up at https://github.com/isamu-isozaki/adafactor-pytorch, and after adding a Triton version I will open a PR here!
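For context, here is a minimal sketch of the factored second-moment update that gives Adafactor its memory savings (my own illustrative code, not the repo's actual API; the function name and argument layout are assumptions). For an n×m weight matrix it stores only a length-n and a length-m vector, and the full matrix is reconstructed transiently as a rank-1 outer product:

```python
import torch

def factored_second_moment(grad, row_state, col_state, beta2=0.999, eps=1e-30):
    """Hypothetical helper: one factored second-moment step for a 2-D param.

    row_state: shape (n,) -- EMA of squared grads, averaged over columns
    col_state: shape (m,) -- EMA of squared grads, averaged over rows
    """
    sq = grad.pow(2) + eps                                       # (n, m)
    row_state.mul_(beta2).add_(sq.mean(dim=1), alpha=1 - beta2)  # (n,)
    col_state.mul_(beta2).add_(sq.mean(dim=0), alpha=1 - beta2)  # (m,)
    # Reconstruct the full (n, m) second-moment estimate as a rank-1
    # outer product; it is never stored between steps.
    v_hat = torch.outer(row_state, col_state) / row_state.mean()
    return grad / v_hat.sqrt()                                   # preconditioned grad
```

The persistent optimizer state here is n + m floats instead of the n·m floats per moment buffer that Adam/AdamW keep, which is where the VRAM savings come from.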

I finished the Python version, at least! For Triton, it seems like `tl.dot` requires each matrix dimension to be at least 16, so multiplying the row vector by the column vector (a rank-1 outer product) needs some more ideas.
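One possible workaround, sketched below under my own assumptions (the kernel name and signature are made up): since the reconstructed matrix is rank 1, the row-by-column product doesn't need `tl.dot` at all and can be written as an elementwise broadcast, which has no minimum-dimension requirement:

```python
import triton
import triton.language as tl

@triton.jit
def rank1_outer_kernel(r_ptr, c_ptr, out_ptr, n, m,
                       BLOCK_N: tl.constexpr, BLOCK_M: tl.constexpr):
    # Each program computes one (BLOCK_N, BLOCK_M) tile of outer(r, c).
    pid_n = tl.program_id(0)
    pid_m = tl.program_id(1)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    r = tl.load(r_ptr + offs_n, mask=offs_n < n, other=0.0)
    c = tl.load(c_ptr + offs_m, mask=offs_m < m, other=0.0)
    # Rank-1 product via broadcasting -- no tl.dot, so no >=16 dim limit.
    tile = r[:, None] * c[None, :]
    out_ptrs = out_ptr + offs_n[:, None] * m + offs_m[None, :]
    tl.store(out_ptrs, tile,
             mask=(offs_n[:, None] < n) & (offs_m[None, :] < m))

# launch, e.g.:
# grid = (triton.cdiv(n, 64), triton.cdiv(m, 64))
# rank1_outer_kernel[grid](r, c, out, n, m, BLOCK_N=64, BLOCK_M=64)
```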