hamidkazemi22/vit-visualization

Some questions concerning the implementation

Hi, thanks for the interesting work and the beautiful visualisations.

I have looked at the implementation of the transforms in the code, and some of them seem to contradict the main text of the paper.

  1. For example, ColorShift is intended to perform the operation $\sigma x + \mu$. But according
    to the line, it performs the inverse operation, i.e. the normalisation $(x - \mu) / \sigma$.

  2. The total variation weight is $0.0005$ instead of $0.00005$ (i.e. 10 times larger) according to the line.

  3. The operation done here looks a bit strange. The exclusion of the [CLS] token is clear, but in the end the code, for some reason, restricts the batch size (0th dimension) to min(batch_size, feature_dim), where feature_dim is 4 * embed_dim in the case of transformers, then turns the resulting 1-d vector into a 2-d diagonal matrix and takes the mean. Are these steps really needed, or could one simply take the mean without transforming the vector into a matrix?

  4. The order of augmentations differs from the one in the paper: according to the line it is $Jitter(GS(CS(x)))$ instead of $GS(CS(Jitter(x)))$. Or does the order not affect performance much?
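
To make points 1 and 3 concrete, here is a minimal pure-Python sketch. It is a toy illustration under stated assumptions, not the repository's code: the helper names (`color_shift`, `inverse_color_shift`, `diag_then_mean`) and the values of `mu`, `sigma` and `feats` are hypothetical. It shows that the two ColorShift forms are inverses of each other rather than equivalent, and that putting a 1-d vector of length `n` on a diagonal before averaging only rescales the plain mean by `1 / n`.

```python
# Toy illustration only; names and values are hypothetical, not from the repository.

def color_shift(x, mu, sigma):
    """Form quoted from the paper in point 1: sigma * x + mu."""
    return [sigma * v + mu for v in x]


def inverse_color_shift(x, mu, sigma):
    """Form described in point 1 as found in the code: (x - mu) / sigma."""
    return [(v - mu) / sigma for v in x]


def mean(values):
    """Plain arithmetic mean of a 1-d list."""
    return sum(values) / len(values)


def diag_then_mean(vec):
    """Place vec on the diagonal of an n x n matrix, then average all n * n entries."""
    n = len(vec)
    matrix = [[vec[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return sum(sum(row) for row in matrix) / (n * n)


# Point 1: the two forms undo each other, so they are not interchangeable.
x = [0.25, 0.5, 1.0]
mu, sigma = 0.5, 2.0
shifted = color_shift(x, mu, sigma)
restored = inverse_color_shift(shifted, mu, sigma)
normalised = inverse_color_shift(x, mu, sigma)

# Point 3: averaging the diagonal matrix equals the plain mean divided by n.
feats = [1.0, 2.0, 3.0, 6.0]
plain = mean(feats)
via_diag = diag_then_mean(feats)

print(shifted, restored, normalised)
print(plain, via_diag, plain / len(feats))
```

If the tensor at that line really is 1-d, the diagonal step would therefore change only the scale of the loss, not its minimiser. One thing worth checking, stated here as an assumption about the linked line: `torch.diag` builds a diagonal matrix from a 1-d input but extracts the diagonal from a 2-d input, so on a (batch, feature) tensor it would pick feature `i` for image `i`, which would also explain the `min(batch_size, feature_dim)` slice.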

Thanks in advance for the response.