NVlabs/RADIO

RADIO with LoRA


I used RADIO-L as the visual encoder for LLaVA and added LoRA to RADIO-L in both the pretraining and finetuning stages. However, we observed the following two intriguing results:

  1. When RADIO-L encodes images at 768 resolution with a center crop, the trained LLaVA model's results on evaluation sets such as MMBench are similar to those of LLaVA-1.5 with CLIP-L-336.
  2. When RADIO-L encodes images at 336 resolution, or at even smaller resolutions such as 224, with a center crop, LLaVA training is more likely to experience a sudden increase in loss, leading to abnormal training results.

I'm not certain whether the issue is caused by RADIO-L's sensitivity to resolution or by the way RADIO-L is integrated with LoRA. I look forward to discussing this in more depth with you.
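
For reference, here is a minimal sketch of how RADIO-L can be run standalone at a chosen input resolution, loosely following the repository README; the torch.hub version string and the image path are assumptions, so substitute the RADIO-L release and data you actually use.

import torch
import torch.nn.functional as F
from PIL import Image
from torchvision.transforms.functional import pil_to_tensor

# Version string is an assumption; pick the RADIO-L release you actually use.
model = torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2.5-l', progress=True)
model.cuda().eval()

x = pil_to_tensor(Image.open('example.jpg').convert('RGB')).float().div_(255.0)  # RGB in [0, 1]
x = x.unsqueeze(0).cuda()

# Resize to the resolution under test (e.g. 768, 336, or 224), snapped to a size the model supports.
res = model.get_nearest_supported_resolution(768, 768)
x = F.interpolate(x, res, mode='bilinear', align_corners=False)

summary, spatial_features = model(x)  # summary vector plus per-patch spatial features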

The detailed parameters of the LoRA modules attached to RADIO-L in our experiments are as follows:
visual_encoder.base_model.model.radio_model.model.blocks.0.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.0.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.0.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.0.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.0.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.0.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.0.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.0.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.1.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.1.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.1.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.1.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.1.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.1.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.1.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.1.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.2.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.2.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.2.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.2.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.2.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.2.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.2.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.2.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.3.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.3.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.3.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.3.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.3.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.3.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.3.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.3.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.4.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.4.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.4.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.4.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.4.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.4.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.4.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.4.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.5.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.5.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.5.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.5.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.5.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.5.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.5.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.5.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.6.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.6.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.6.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.6.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.6.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.6.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.6.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.6.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.7.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.7.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.7.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.7.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.7.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.7.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.7.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.7.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.8.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.8.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.8.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.8.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.8.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.8.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.8.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.8.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.9.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.9.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.9.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.9.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.9.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.9.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.9.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.9.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.10.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.10.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.10.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.10.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.10.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.10.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.10.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.10.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.11.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.11.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.11.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.11.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.11.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.11.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.11.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.11.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.12.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.12.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.12.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.12.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.12.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.12.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.12.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.12.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.13.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.13.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.13.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.13.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.13.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.13.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.13.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.13.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.14.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.14.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.14.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.14.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.14.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.14.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.14.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.14.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.15.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.15.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.15.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.15.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.15.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.15.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.15.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.15.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.16.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.16.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.16.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.16.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.16.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.16.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.16.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.16.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.17.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.17.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.17.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.17.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.17.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.17.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.17.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.17.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.18.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.18.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.18.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.18.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.18.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.18.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.18.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.18.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.19.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.19.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.19.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.19.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.19.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.19.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.19.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.19.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.20.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.20.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.20.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.20.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.20.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.20.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.20.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.20.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.21.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.21.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.21.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.21.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.21.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.21.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.21.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.21.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.22.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.22.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.22.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.22.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.22.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.22.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.22.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.22.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.23.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.23.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.23.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.23.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.23.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.23.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.23.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.23.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.patch_generator.embedder.lora_A.default.weight torch.Size([64, 768])
visual_encoder.base_model.model.radio_model.model.patch_generator.embedder.lora_B.default.weight torch.Size([1024, 64])
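
For reference, the shapes above are all consistent with rank-64 LoRA adapters on the attention qkv/proj linears, the MLP fc1/fc2 linears, and the patch-generator embedder (768 -> 1024). A minimal sketch of how such adapters might be attached with HF peft follows; the checkpoint path, lora_alpha, and lora_dropout are assumptions, since the issue does not report them.

from peft import LoraConfig, get_peft_model
from transformers import AutoModel

# Hypothetical checkpoint path; substitute your own visual_encoder_name_or_path.
radio = AutoModel.from_pretrained("nvidia/RADIO-L", trust_remote_code=True)

lora_cfg = LoraConfig(
    r=64,                # rank inferred from the [64, 1024] lora_A shapes above
    lora_alpha=128,      # assumption: not reported in the issue
    lora_dropout=0.05,   # assumption: not reported in the issue
    target_modules=[
        "attn.qkv", "attn.proj",     # ViT attention projections
        "mlp.fc1", "mlp.fc2",        # ViT MLP layers
        "patch_generator.embedder",  # patch embedding (768 -> 1024)
    ],
)
radio = get_peft_model(radio, lora_cfg)
radio.print_trainable_parameters()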

Hello, how are you pre-processing inputs into RADIO-L? Is the data passed as RGB values in a [0,1] range?

We initialized the image preprocessor in the following manner:

from transformers import CLIPImageProcessor

# Lazy-init config: the processor is built later by calling
# CLIPImageProcessor.from_pretrained(visual_encoder_name_or_path, ...).
dynamic_image_processor = dict(
    type=CLIPImageProcessor.from_pretrained,
    pretrained_model_name_or_path=visual_encoder_name_or_path,
    trust_remote_code=True,
    do_resize=True,
    do_center_crop=True,
)
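
For a quick check of the dynamic range asked about above, the same processor can be instantiated eagerly and inspected; this is only a sketch, and the checkpoint path below is a placeholder for whatever visual_encoder_name_or_path points to.

from PIL import Image
from transformers import CLIPImageProcessor

visual_encoder_name_or_path = "nvidia/RADIO-L"  # hypothetical; use your actual checkpoint path
processor = CLIPImageProcessor.from_pretrained(
    visual_encoder_name_or_path,
    trust_remote_code=True,
    do_resize=True,
    do_center_crop=True,
)
pixel_values = processor(images=Image.open("example.jpg"), return_tensors="pt").pixel_values
# Expect a (1, 3, H, W) float tensor; min/max reveal whether values stay in [0, 1].
print(pixel_values.shape, pixel_values.min().item(), pixel_values.max().item())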

Here is an example of an input tensor that we fed to the RADIO model:

tensor([[[[0.3412, 0.1765, 0.2118,  ..., 0.4275, 0.4353, 0.0863],
          [0.3255, 0.4980, 0.3176,  ..., 0.5804, 0.4314, 0.3608],
          [0.6784, 0.6275, 0.5020,  ..., 0.4510, 0.2863, 0.2353],
          ...,
          [0.5020, 0.4784, 0.4941,  ..., 0.6118, 0.6314, 0.6706],
          [0.4784, 0.4824, 0.5255,  ..., 0.6000, 0.6549, 0.6706],
          [0.4706, 0.4824, 0.4980,  ..., 0.6078, 0.6745, 0.6471]],

         [[0.7098, 0.5608, 0.5843,  ..., 0.7686, 0.9020, 0.7804],
          [0.6392, 0.8118, 0.6471,  ..., 0.9608, 0.9176, 0.9647],
          [0.8941, 0.8196, 0.7529,  ..., 0.8863, 0.7647, 0.6980],
          ...,
          [0.4549, 0.4353, 0.4510,  ..., 0.6275, 0.6392, 0.6784],
          [0.4314, 0.4353, 0.4784,  ..., 0.6157, 0.6627, 0.6824],
          [0.4235, 0.4353, 0.4510,  ..., 0.6235, 0.6824, 0.6588]],

         [[0.6588, 0.4706, 0.4667,  ..., 0.7843, 0.8980, 0.7373],
          [0.5961, 0.7373, 0.5333,  ..., 0.9608, 0.9059, 0.9294],
          [0.8667, 0.7647, 0.6627,  ..., 0.8667, 0.7373, 0.6706],
          ...,
          [0.4078, 0.3882, 0.4039,  ..., 0.7059, 0.7216, 0.7608],
          [0.3882, 0.3922, 0.4314,  ..., 0.6941, 0.7412, 0.7608],
          [0.3804, 0.3882, 0.4078,  ..., 0.7059, 0.7647, 0.7373]]]])

Hello, yes, the dynamic range of your inputs does look fine. In our LLaVA-1.5 experiments we don't apply a center crop. Instead, we resize the image so that its longest edge is 768 pixels, keeping the aspect ratio, and pad the shortest edge up to the nearest multiple of 16 pixels. We did not evaluate the model on MMBench, but our results on TextVQA, VQAv2, GQA, and POPE were very much in favor of RADIO-L (see the README at the root of this repository). We didn't use LoRA; instead, we kept RADIO-L frozen and trained only the projector and the LLM.
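
For concreteness, here is a minimal sketch of that resize-and-pad preprocessing (not the repository's exact code; zero padding on the bottom/right edge is an assumption):

import torch
import torch.nn.functional as F
from PIL import Image
from torchvision.transforms.functional import pil_to_tensor

def radio_preprocess(image: Image.Image, longest_edge: int = 768, patch_size: int = 16) -> torch.Tensor:
    """Resize so the longest edge equals longest_edge (keeping aspect ratio),
    then zero-pad height/width up to the next multiple of patch_size."""
    x = pil_to_tensor(image.convert("RGB")).float() / 255.0  # RGB in [0, 1], no normalization
    _, h, w = x.shape
    scale = longest_edge / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    x = F.interpolate(x[None], size=(new_h, new_w), mode="bilinear", align_corners=False)
    pad_h = (patch_size - new_h % patch_size) % patch_size
    pad_w = (patch_size - new_w % patch_size) % patch_size
    return F.pad(x, (0, pad_w, 0, pad_h))  # pad right/bottom with zeros

pixels = radio_preprocess(Image.open("example.jpg"))  # (1, 3, H, W), ready for the encoder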