MoFHeka opened this issue a year ago · 0 comments
For example, on a single A100 machine, Llama2 13B training with TP2 DP4 + ZeRO-1 is faster than with FSDP.
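To make the layout concrete: TP2 with DP4 implies 2 × 4 = 8 GPUs in total. Below is a minimal sketch of how the ranks could be split into tensor-parallel and data-parallel groups. It assumes tensor-parallel ranks are contiguous, which is a common convention but not something confirmed in this issue. `build_groups` is a hypothetical helper written for illustration, not an API from any library.

```python
def build_groups(world_size, tp_size):
    """Split ranks into tensor-parallel (TP) and data-parallel (DP) groups.

    Assumption: TP ranks are contiguous (e.g. ranks 0 and 1 share a model
    replica). The actual mapping depends on the framework and launcher.
    """
    assert world_size % tp_size == 0, "world size must be divisible by TP size"
    dp_size = world_size // tp_size

    # Each TP group holds the shards of one model replica.
    tp_groups = [list(range(i * tp_size, (i + 1) * tp_size)) for i in range(dp_size)]

    # Each DP group holds the ranks that own the same shard across replicas;
    # ZeRO-1 would partition optimizer state within such a group.
    dp_groups = [list(range(j, world_size, tp_size)) for j in range(tp_size)]
    return tp_groups, dp_groups


tp_groups, dp_groups = build_groups(world_size=8, tp_size=2)
print("TP groups:", tp_groups)
print("DP groups:", dp_groups)
```

With this layout, tensor-parallel communication stays within pairs of GPUs, while gradient reduction and the ZeRO-1 optimizer-state partitioning span 4 ranks each.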