MengLcool/AdaViT

A question about Gumbel softmax in practice

TomerRonen34 opened this issue · 1 comment

At training time, once you apply Gumbel-Softmax to the patch selection logits, you have two numbers for each patch: the probability of keeping it and the probability of discarding it.
Do you actually drop patches during training? (Not sure that's differentiable, since it requires an argmax.)
Do you multiply the patch embeddings by the probability of keeping them? Or something else?
Thanks!

Thank you for your interest in our work! I am happy to answer your questions:

  1. We do not drop patches during training, so the computation stays parallel across the batch. Instead, we multiply the patch embeddings by the decision vector, where 0 means drop and 1 means keep. In addition, we mask the positions of the dropped patches in the self-attention logits to guarantee that they do not participate in the computation.
  2. We apply Gumbel-Softmax to generate the decision vector, a {0, 1}^N vector that remains differentiable via the straight-through estimator. The decision vector (not the keeping probability) is then multiplied with the patch embeddings as described above, so the gradient can backpropagate to the decision network; see the sketch below. You can refer to the official PyTorch implementation of gumbel_softmax for more details.
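For concreteness, here is a minimal sketch of the mechanism described in 1 and 2, assuming a selector head that outputs two logits per patch. The function and tensor names are illustrative, not the repository's actual code:

```python
import torch
import torch.nn.functional as F

def select_patches(patch_embed, selector_logits, tau=1.0):
    """Hypothetical sketch of the keep/drop mechanism described above.

    patch_embed:     (B, N, D) patch embeddings
    selector_logits: (B, N, 2) logits from the decision network
                     (index 0 = keep, index 1 = drop)
    """
    # Hard Gumbel-Softmax: the forward pass yields one-hot {0, 1} decisions,
    # while the backward pass uses the soft probabilities (straight-through).
    decision = F.gumbel_softmax(selector_logits, tau=tau, hard=True)  # (B, N, 2)
    keep = decision[..., 0]                                           # (B, N), values in {0, 1}

    # Multiply the embeddings by the decision so gradients reach the decision network.
    patch_embed = patch_embed * keep.unsqueeze(-1)

    # Build an additive mask for the attention logits (B, num_heads, N, N):
    # dropped keys get a large negative value so they vanish after the softmax.
    attn_mask = (1.0 - keep)[:, None, None, :] * torch.finfo(patch_embed.dtype).min
    return patch_embed, attn_mask
```

In this sketch, `attn_mask` would be added to the attention logits before the softmax, which is one common way to keep dropped patches from participating in the attention computation.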