Pretraining Result of BridgeTower
jsg921019 opened this issue · 1 comment
Hello, I have implemented the BridgeTower architecture according to the paper and this issue, basing my code on the METER GitHub repository.
However, I was not able to get results that match the paper. Below are the validation epoch loss curves for BridgeTower (blue) and METER (orange), for MLM and ITM respectively.
*Validation loss curves: MLM (left) and ITM (right).*
The training curves for both models are similar, and even the downstream results on VQAv2 are similar:
Model | VQAv2 test-dev
---|---
METER | 77.65
BridgeTower | 77.64
This is how I implemented BridgeTower:
- For the ImageEncoder (CLIP) and TextEncoder (RoBERTa), change forward() so that each returns the last 6 intermediate layer outputs instead of only the final one, giving [V0, V1, V2, V3, V4, V5] and [T0, T1, T2, T3, T4, T5].
- For CLIP, these intermediate outputs are permuted from LND to NLD and normalized with self.ln_post.
- The newly added layers are the BridgeLayers, which consist of 12 LayerNorms (6 for each modality).
- Starting with $Z^V_0 = Z^T_0 = 0$, compute $\tilde{Z}^V_l = \mathrm{LayerNorm}(Z^V_l + V_l W_V + V_{type})$ and $\tilde{Z}^T_l = \mathrm{LayerNorm}(Z^T_l + T_l W_T + T_{type})$, where the LayerNorm is different for each layer, but the projections $W_V$, $W_T$ and the type embeddings $V_{type}$, $T_{type}$ are shared across layers.
- Then $Z^V_l, Z^T_l = \mathrm{Encoder}^Z_l(\tilde{Z}^V_l, \tilde{Z}^T_l)$, just as in METER (see the first sketch after this list).
- The learning rate for the new LayerNorms is 5x the base learning rate, with no weight decay (see the optimizer sketch after this list).
- The remaining hyperparameters are the same as in METER.
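For concreteness, here is a minimal PyTorch sketch of the bridge step as I described it above. The class and parameter names (BridgeLayer, BridgeTowerFusion, proj_v, type_v, cross_layer_fn, ...) are illustrative placeholders, not the actual code, and the cross-modal encoder layers are left abstract since they come from the METER codebase:

```python
import torch
import torch.nn as nn


class BridgeLayer(nn.Module):
    """One bridge step: add the projected uni-modal output to the running
    cross-modal state, then LayerNorm (a separate LayerNorm per layer and modality).
    Names and shapes are illustrative, not the released BridgeTower code."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.ln_v = nn.LayerNorm(hidden_size)  # per-layer LayerNorm, visual side
        self.ln_t = nn.LayerNorm(hidden_size)  # per-layer LayerNorm, text side

    def forward(self, z_v, z_t, v_l, t_l, proj_v, proj_t, type_v, type_t):
        # z_v / z_t: cross-modal states from the previous fusion layer (0 at l = 0)
        # v_l / t_l: intermediate outputs of the l-th CLIP / RoBERTa layer
        # proj_v / proj_t and type_v / type_t are shared across all bridge layers
        z_v_tilde = self.ln_v(z_v + proj_v(v_l) + type_v)
        z_t_tilde = self.ln_t(z_t + proj_t(t_l) + type_t)
        return z_v_tilde, z_t_tilde


class BridgeTowerFusion(nn.Module):
    """Stacks 6 bridge layers in front of 6 METER-style cross-modal encoder layers."""

    def __init__(self, hidden_size: int, num_layers: int = 6, cross_layer_fn=None):
        super().__init__()
        self.bridges = nn.ModuleList(BridgeLayer(hidden_size) for _ in range(num_layers))
        # shared projections and modality-type embeddings
        self.proj_v = nn.Linear(hidden_size, hidden_size)
        self.proj_t = nn.Linear(hidden_size, hidden_size)
        self.type_v = nn.Parameter(torch.zeros(hidden_size))
        self.type_t = nn.Parameter(torch.zeros(hidden_size))
        # cross_layer_fn(l, z_v, z_t) stands for METER's l-th co-attention layer
        self.cross_layer_fn = cross_layer_fn

    def forward(self, visual_feats, text_feats):
        # visual_feats / text_feats: lists [V0..V5] / [T0..T5] of the last 6
        # intermediate outputs of the uni-modal encoders, each of shape (N, L, D)
        z_v = torch.zeros_like(visual_feats[0])  # Z^V_0 = 0
        z_t = torch.zeros_like(text_feats[0])    # Z^T_0 = 0
        for l, bridge in enumerate(self.bridges):
            z_v_tilde, z_t_tilde = bridge(
                z_v, z_t, visual_feats[l], text_feats[l],
                self.proj_v, self.proj_t, self.type_v, self.type_t,
            )
            # Z^V_l, Z^T_l = Encoder^Z_l(Z~^V_l, Z~^T_l), just as in METER
            z_v, z_t = self.cross_layer_fn(l, z_v_tilde, z_t_tilde)
        return z_v, z_t
```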
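And a sketch of the optimizer grouping for the last bullet; the name filter used to pick out the new LayerNorms is hypothetical and depends on how the modules are actually named in the model:

```python
import torch


def build_optimizer(model, base_lr=1e-5, weight_decay=0.01):
    """Put the newly added bridge LayerNorms in their own parameter group with
    5x the base lr and no weight decay; everything else uses the defaults."""
    bridge_ln_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # "bridges", "ln_v", "ln_t" match the placeholder names in the sketch above
        if "bridges" in name and ("ln_v" in name or "ln_t" in name):
            bridge_ln_params.append(param)
        else:
            other_params.append(param)
    return torch.optim.AdamW(
        [
            {"params": other_params, "lr": base_lr, "weight_decay": weight_decay},
            {"params": bridge_ln_params, "lr": base_lr * 5, "weight_decay": 0.0},
        ]
    )
```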
Is there anything wrong, or anything I missed, in my implementation? Thanks in advance.
Hello, although I don't find any mistakes in your description, I notice that the mlm_val_loss in your implementation is higher than in our version (0.86~0.87). Our paper reports the pre-training and VQAv2 fine-tuning hyperparameters (Tables 10 & 11). Please check these settings and wait for our code & checkpoint release.