microsoft/BridgeTower

Pretraining Result of BridgeTower

jsg921019 opened this issue · 1 comment

Hello, I have implemented the BridgeTower architecture according to the paper and this issue, building on the METER GitHub code.

However, I was not able to get results that match the paper. Below are the validation epoch loss graphs for BridgeTower (blue) and METER (orange), for MLM and ITM respectively.

[Figure: validation loss curves, MLM (left) and ITM (right)]

The training curves for both models are similar, and even the downstream results on VQAv2 are close:

| Model | VQAv2 test-dev |
| --- | --- |
| METER | 77.65 |
| BridgeTower | 77.64 |

This is how I implemented BridgeTower:

  1. For the ImageEncoder (CLIP) and TextEncoder (RoBERTa), change `forward()` so that it returns the last 6 intermediate outputs instead of only the last one, so we have $[V_0, V_1, V_2, V_3, V_4, V_5]$ and $[T_0, T_1, T_2, T_3, T_4, T_5]$.
  2. For CLIP, these intermediate outputs are permuted from LND to NLD and normalized with `self.ln_post`.
  3. The newly added layers are BridgeLayers, which are 12 LayerNorms in total (6 for each modality); see the sketch after this list.
  4. Starting with $Z^T_0 = Z^V_0 = 0$, compute $\tilde{Z}^V_l = \mathrm{LayerNorm}(Z^V_l + V_l W_V + V_{type})$ and $\tilde{Z}^T_l = \mathrm{LayerNorm}(Z^T_l + T_l W_T + T_{type})$, where the LayerNorm is different for each layer, but the projections $W_V$, $W_T$ and the type embeddings $V_{type}$, $T_{type}$ are shared across layers.
  5. Then $Z^V_{l+1}, Z^T_{l+1} = \mathrm{Encoder}^Z_l(\tilde{Z}^V_l, \tilde{Z}^T_l)$, just as in METER.
  6. The learning rate for the new LayerNorms is 5x the base learning rate, and they have no weight decay (see the optimizer sketch below).
  7. The rest of the hyperparameters are the same as METER's.
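
As a sanity check on steps 3-5, here is a minimal, self-contained PyTorch sketch of the bridge computation. All names (`BridgeTowerSketch`, `CrossModalLayer`, etc.) are my own illustrative choices, not the official implementation, and the cross-modal stub only stands in for METER's co-attention layers:

```python
import torch
import torch.nn as nn


class CrossModalLayer(nn.Module):
    """Stand-in for one METER-style cross-modal encoder layer.
    The real layer uses self-attention + cross-attention + FFN per modality;
    this stub only keeps the sketch runnable."""

    def __init__(self, hidden):
        super().__init__()
        self.v_block = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        self.t_block = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)

    def forward(self, v, t):
        return self.v_block(v), self.t_block(t)


class BridgeTowerSketch(nn.Module):
    """Bridge computation from steps 3-5: per-layer LayerNorms,
    shared projections/type embeddings, and Z_0 = 0 initialization."""

    def __init__(self, hidden=768, num_layers=6):
        super().__init__()
        self.w_v = nn.Linear(hidden, hidden)             # shared W_V
        self.w_t = nn.Linear(hidden, hidden)             # shared W_T
        self.v_type = nn.Parameter(torch.zeros(hidden))  # shared V_type
        self.t_type = nn.Parameter(torch.zeros(hidden))  # shared T_type
        # 12 LayerNorms in total: one per layer per modality (step 3).
        self.ln_v = nn.ModuleList(nn.LayerNorm(hidden) for _ in range(num_layers))
        self.ln_t = nn.ModuleList(nn.LayerNorm(hidden) for _ in range(num_layers))
        self.cross_layers = nn.ModuleList(
            CrossModalLayer(hidden) for _ in range(num_layers))

    def forward(self, vis_feats, txt_feats):
        # vis_feats = [V_0..V_5], txt_feats = [T_0..T_5], each (B, L, D).
        z_v = torch.zeros_like(vis_feats[0])  # Z^V_0 = 0
        z_t = torch.zeros_like(txt_feats[0])  # Z^T_0 = 0
        for l in range(len(self.cross_layers)):
            # Step 4: bridge with per-layer LayerNorm, shared projection + type embedding.
            z_v_tilde = self.ln_v[l](z_v + self.w_v(vis_feats[l]) + self.v_type)
            z_t_tilde = self.ln_t[l](z_t + self.w_t(txt_feats[l]) + self.t_type)
            # Step 5: cross-modal encoder layer, as in METER.
            z_v, z_t = self.cross_layers[l](z_v_tilde, z_t_tilde)
        return z_v, z_t
```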

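And a hedged sketch of the optimizer grouping in step 6, assuming the module names from the sketch above (`ln_v` / `ln_t`); the `base_lr` and `weight_decay` values are placeholders, not the paper's settings:

```python
# Step 6 as AdamW param groups: 5x lr and zero weight decay for the new LayerNorms.
model = BridgeTowerSketch()
base_lr, weight_decay = 1e-5, 0.01

bridge_ln_params, other_params = [], []
for name, p in model.named_parameters():
    # Adapt this name filter to however the bridge LayerNorms are named in your code.
    (bridge_ln_params if name.startswith(("ln_v", "ln_t")) else other_params).append(p)

optimizer = torch.optim.AdamW([
    {"params": other_params, "lr": base_lr, "weight_decay": weight_decay},
    {"params": bridge_ln_params, "lr": base_lr * 5, "weight_decay": 0.0},
])
```
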
Is there anything wrong, or anything that I missed, in my implementation? Thanks in advance.

Hello, although I don't find any mistakes in your description, I notice that the mlm_val_loss in your implementation is higher than in our version (0.86~0.87). Our paper lists the pre-training and VQAv2 fine-tuning hyperparameters (Tables 10 & 11). Please check those settings and wait for our code & checkpoint release.