NUS-HPC-AI-Lab/VideoSys

oahzxl

Edwardmark opened this issue · 2 comments

Thanks for your great work. Will you open-source the test code used to run Megatron-SP and DeepSpeed-Ulysses in the DSP paper?
In Figure 1 of the DSP paper, the shape before the attention all-gather is [b, t/n, s, hn, hd], but after the attention reduce-scatter it is [b, t, s/n, hn, hd]. Since the two shapes differ, how is the residual add performed?
@fastalgo @eltociear @zhengzangw @FrankLeeeee @MaruyamaAya

answered in #127

@oahzxl Could you explain this more specifically?
In Figure 1 of the DSP paper, the shape before the attention all-gather is [b, t/n, s, hn, hd], while the shape after the attention reduce-scatter is [b, t, s/n, hn, hd], so they differ. What do you mean by "shard the t dimension before and after attention for megatron and ulysses"? Even if t is sharded after attention, the shape is still [b, t/n, s/n, hn, hd], which is not the same as the pre-attention shape [b, t/n, s, hn, hd].
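To make the shape question concrete, here is a minimal single-process NumPy sketch of the two layouts. It is only an illustration under stated assumptions: the sizes are arbitrary, ranks are simulated with Python lists, and `all_to_all_s_to_t` is a hypothetical helper, not code from VideoSys, Megatron-SP, or DeepSpeed-Ulysses. It shows that slicing an s-sharded tensor locally along t gives [b, t/n, s/n, hn, hd], whereas an all-to-all style exchange (split t, concatenate s) gives [b, t/n, s, hn, hd], which matches the pre-attention shape. Whether this is what the reply in #127 intends is not confirmed here.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
b, t, s, hn, hd = 2, 8, 12, 4, 16
n = 4  # number of simulated ranks

x = np.arange(b * t * s * hn * hd, dtype=np.float32).reshape(b, t, s, hn, hd)

# Layout A: the temporal dimension t is sharded, one shard per rank.
t_shards = np.split(x, n, axis=1)   # each [b, t/n, s, hn, hd]
# Layout B: the spatial dimension s is sharded, one shard per rank.
s_shards = np.split(x, n, axis=2)   # each [b, t, s/n, hn, hd]

# Purely local slicing of a layout-B shard along t shrinks both dimensions.
local_only = np.split(s_shards[0], n, axis=1)[0]   # [b, t/n, s/n, hn, hd]


def all_to_all_s_to_t(shards, n):
    """Simulated all-to-all (hypothetical helper).

    Each source rank splits its s-shard into n chunks along t and sends
    chunk `dst` to rank `dst`; each destination rank concatenates what it
    receives along s, in source-rank order.
    """
    chunks = [np.split(shard, n, axis=1) for shard in shards]  # chunks[src][dst]
    return [
        np.concatenate([chunks[src][dst] for src in range(n)], axis=2)
        for dst in range(n)
    ]


resharded = all_to_all_s_to_t(s_shards, n)   # each [b, t/n, s, hn, hd]

# After the exchange every rank holds exactly its layout-A shard,
# so an elementwise residual add is shape-compatible.
layouts_match = all(np.array_equal(resharded[r], t_shards[r]) for r in range(n))
residual = t_shards[0] + resharded[0]

print(t_shards[0].shape, s_shards[0].shape, local_only.shape, resharded[0].shape)
```

In this sketch the residual add only works after the exchange; adding `local_only` to `t_shards[0]` would fail because the s dimension differs.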