taoyang1122/adapt-image-models

why the second dimension is n*b?

Closed this issue · 4 comments

xt = rearrange(x, 'n (b t) d -> t (b n) d', t=self.num_frames)

Hi, it is combining the spatial dimension with the batch-size dimension, so that the following self-attention layer applies self-attention along the temporal dimension.

Why not rearrange to `(b n) t d` instead?

Because the self-attention is applied over the first dimension: PyTorch's `nn.MultiheadAttention` expects sequence-first input `(L, N, E)` by default (`batch_first=False`), so the axis to attend over (here, time) must come first.

Thank you for your reply! It turns out I was careless when reading the API definition.