GenjiB/LAVISH

Can't reproduce the AVE accuracy reported in the paper (75.3%) with vit_base

Opened this issue · 3 comments

Lecooo commented

Hi,
We used this config to train the AVE task on a 3090, with the processed data you provided, but the accuracy we got is 73.31%:

python3 /code/AVE/main_trans.py --Adapter_downsample=8 --batch_size=4 --early_stop=5 --epochs=50 --is_audio_adapter_p1=1 --is_audio_adapter_p2=1 --is_audio_adapter_p3=0 --is_before_layernorm=1 --is_bn=1 --is_fusion_before=1 --is_gate=1 --is_post_layernorm=1 --is_vit_ln=0 --lr=5e-06 --lr_mlp=4e-06 --mode=train --num_conv_group=2 --num_tokens=2 --num_workers=8 --is_multimodal=1 --vis_encoder_type=vit

Also, the line

audio = audio[0]

isn't applied in forward_swin, so running that path raises a shape-mismatch error.
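A minimal, hypothetical sketch of this kind of failure (the names and shapes below are assumptions for illustration, not taken from the LAVISH code): if the dataloader yields the audio tensor wrapped one level deep, e.g. in a 1-element tuple, a forward pass expecting the bare (B, T, F) tensor breaks until the wrapper is indexed away with audio[0].

```python
import torch

# Hypothetical reproduction of the shape issue (names and shapes are
# assumptions, not taken from the LAVISH repo): the dataloader yields
# the audio tensor wrapped in a 1-element tuple, while forward_swin
# expects the bare (B, T, F) tensor.
def forward_swin(audio: torch.Tensor) -> torch.Tensor:
    # A downstream op that fails if audio still carries the wrapper.
    return audio.reshape(audio.shape[0], -1)

batch = (torch.randn(4, 10, 128),)  # audio arrives wrapped in a tuple

audio = batch[0]  # unwrap: now a (4, 10, 128) tensor forward_swin accepts
out = forward_swin(audio)
print(tuple(out.shape))  # (4, 1280)
```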

GenjiB commented

Thanks for pointing that out. Can you try these hyper-parameters? I used different parameters for ViT and Swin:

--batch_size=2 --early_stop=5 --epochs=50 --is_audio_adapter_p1=1 --is_audio_adapter_p2=1 --is_audio_adapter_p3=0 --is_before_layernorm=0 --is_bn=0 --is_fusion_before=1 --is_gate=1 --is_post_layernorm=0 --is_vit_ln=1 --lr=3e-05 --lr_mlp=6e-06 --mode=train --model=MMIL_Net --num_conv_group=4 --num_tokens=8

Lecooo commented

Thanks for your reply. I tried the hyper-parameters you provided, and the accuracy reached 75.2%.
It's interesting that the total parameter count under this setting is 105.5M, which is less than the 107.2M mentioned in your paper.
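For context, a total like 105.5M is usually obtained by summing element counts over model.parameters(); since adapter settings such as num_conv_group and num_tokens change the adapter sizes, different hyper-parameters can plausibly shift the count by a couple of million. A generic sketch of the counting itself (the toy model below is illustrative only, not MMIL_Net):

```python
import torch.nn as nn

# Generic way to count total parameters in a PyTorch model; the tiny
# model below is illustrative only, not a reconstruction of MMIL_Net.
def count_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 28))
print(count_params(model))  # (128*256 + 256) + (256*28 + 28) = 40220
```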

@Lecooo May I know how many epochs it took to achieve this result? Thanks.