oahzxl

Question

Edwardmark opened this issue 9 months ago · 2 comments

Thanks for your great work. Will you open-source the test code for running Megatron-SP and DeepSpeed-Ulysses used in the DSP paper?
In Figure 1 of the DSP paper, the shape before the attention all-gather is [b, t/n, s, hn, hd], but the shape after the attention reduce-scatter is [b, t, s/n, hn, hd]. Since the two shapes differ, how is the residual add done?
@fastalgo @eltociear @zhengzangw @FrankLeeeee @MaruyamaAya

Answer 1 · 2024-05-30T09:21:39.000Z

answered in #127

Answer 2 · 2024-06-14T03:25:42.000Z

@oahzxl Could you explain it more specifically?
In Figure 1 of the DSP paper, the shape before the attention all-gather is [b, t/n, s, hn, hd], and the shape after the attention reduce-scatter is [b, t, s/n, hn, hd], so the two are different. What do you mean by "shard the t dimension before and after attention for megatron and ulysses"? Even if t is also sharded after attention, the shape is still [b, t/n, s/n, hn, hd], which is not the same as the shape before attention, [b, t/n, s, hn, hd].
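
To make the shape question concrete, here is a minimal sketch of the shape arithmetic only. It does not reproduce the DSP, Megatron-SP, or DeepSpeed-Ulysses code; the sizes and the helper names `shard_shape` and `residual_add_ok` are hypothetical and chosen just for illustration.

```python
# Hypothetical sizes: batch, temporal length, spatial length,
# number of heads, head dim, and number of devices n.
b, t, s, hn, hd, n = 2, 8, 16, 4, 8, 4

def shard_shape(shape, dim, n):
    """Return the per-device shape when `dim` is split evenly over n devices."""
    shape = list(shape)
    assert shape[dim] % n == 0, "dimension must be divisible by n"
    shape[dim] //= n
    return tuple(shape)

def residual_add_ok(x_shape, y_shape):
    """An elementwise residual add (without broadcasting) needs equal shapes."""
    return x_shape == y_shape

full = (b, t, s, hn, hd)

before_attn = shard_shape(full, 1, n)        # [b, t/n, s,   hn, hd]
after_attn = shard_shape(full, 2, n)         # [b, t,   s/n, hn, hd]
both_sharded = shard_shape(after_attn, 1, n) # [b, t/n, s/n, hn, hd]

print(before_attn, after_attn, both_sharded)
print(residual_add_ok(before_attn, after_attn))    # shapes as read from Figure 1
print(residual_add_ok(before_attn, both_sharded))  # t also sharded after attention
```

With these sizes, neither `after_attn` nor `both_sharded` equals `before_attn`, which is the mismatch being asked about.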