for-ai/parameter-efficient-moe

Some questions about Routing Strategy: Soft vs Discrete

pierowu opened this issue · 1 comment

Thank you for your enlightening work in the paper!

I have a question about the routing strategy. The paper says:

'Note that, although the computation is conditional to the top-k experts, the required memory depends on the total number of experts.'

which seems to imply that the discrete routing strategy has no advantage over soft merging in terms of memory cost.

But as far as I know, although the parameter memory depends on the total number of experts, discrete routing can still save memory because we don't need to store the gradients and activations of experts that are not activated.
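To make the point concrete, here is a minimal illustrative sketch of discrete top-1 routing in PyTorch. This is not the paper's implementation; the class name, shapes, and the use of plain linear experts are all invented for illustration. The key property is that experts receiving no tokens are never run, so autograd saves no activations (and later computes no gradients) for them:

```python
import torch
import torch.nn as nn

class DiscreteTop1MoE(nn.Module):
    """Hypothetical minimal MoE layer with discrete top-1 routing."""
    def __init__(self, dim, num_experts):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, x):  # x: (batch, dim)
        logits = self.router(x)
        idx = logits.argmax(dim=-1)          # discrete top-1 choice per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():                   # experts with no routed tokens are
                out[mask] = expert(x[mask])  # never executed, so no activations
        return out                           # are stored for them
```

Under soft merging, by contrast, every expert (or the merged combination of all of them) participates in the forward pass, so activation and gradient memory scale with the total number of experts, even though the parameter memory is the same in both cases.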

If we take the above into account, it seems unfair to compare only the number of trainable parameters among different PEFT methods, because the parameter count does not map exactly to memory usage.

Could you give some insight into how to calculate the memory cost in the MoE setting, and how to compare different methods fairly?

Thank you for your reply!

The straightforward way, I think, is to compare the GPU memory actually used. But that varies across platforms and even across torch versions.
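A rough sketch of that comparison, under the assumption that each method being compared is wrapped in a model you can run one training step on (`model`, `batch`, and `loss_fn` below are placeholders, not names from the paper):

```python
import torch

def peak_memory_of_step(model, batch, loss_fn, device="cuda"):
    """Peak allocated GPU memory (MiB) for one forward+backward step.

    Assumes `model` already lives on `device`.
    """
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)
    out = model(batch.to(device))
    loss = loss_fn(out)
    loss.backward()  # activations and gradients are now counted in the peak
    return torch.cuda.max_memory_allocated(device) / 2**20

def trainable_params(model):
    """Trainable-parameter count, for contrast with the measured memory."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Note that `torch.cuda.max_memory_allocated` only tracks tensors managed by PyTorch's caching allocator, so absolute numbers will still shift with the platform and torch version, as the comment above says; the relative ordering between methods on a fixed setup is usually the more meaningful signal.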