This repo is the official PyTorch implementation of the paper:
Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Xiao Xu, Libo Qin, Wanxiang Che, Min-Yen Kan.
Under Review.
**Note**: This paper is an extension of the following paper:
ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
Xiao Xu, Bei Li, Chenfei Wu, Shao-Yen Tseng, Anahita Bhiwandiwalla, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan.
ACL 2023 (Oral) | Association for Computational Linguistics
Paper | arXiv | Model | Slides | Video(EN) | Video(CN) | Blog(CN) | Tweet(EN)
Two-Tower Vision-Language Models (VLMs) have demonstrated strong performance across various downstream VL tasks. While BridgeTower further improves performance by building bridges between encoders, it (i) suffers from ineffective layer-by-layer utilization of unimodal representations, (ii) restricts the flexible exploitation of different levels of unimodal semantic knowledge, and (iii) is limited to evaluation on traditional low-resolution datasets with the Two-Tower VLM architecture only. In this work, we propose Manager, a lightweight, efficient, and effective plugin that adaptively aggregates insights from different levels of pre-trained unimodal experts to facilitate more comprehensive VL alignment and fusion. First, under the Two-Tower VLM architecture, we introduce ManagerTower, a novel VLM that injects a manager into each cross-modal layer. Whether with or without VL pre-training, ManagerTower outperforms previous strong baselines and achieves superior performance on 4 downstream VL tasks. Moreover, we extend our exploration to the latest Multimodal Large Language Model (MLLM) architecture. We demonstrate that LLaVA-OV-Manager significantly boosts the zero-shot performance of LLaVA-OV across different categories of capabilities, images, and resolutions on 20 downstream datasets, whether or not the multi-grid algorithm is enabled. In-depth analysis reveals that both our manager and the multi-grid algorithm can be viewed as plugins that improve the visual representation by capturing more diverse visual details from two orthogonal perspectives (depth and width). Their synergy can mitigate the semantic ambiguity caused by the multi-grid algorithm and further improve performance.
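To make the aggregation idea concrete, below is a minimal, self-contained PyTorch sketch of what a manager does at one cross-modal layer, assuming learned softmax-normalized weights over the hidden states of multiple unimodal encoder layers. The class and parameter names are illustrative and do not mirror the actual implementation in this repo.

```python
# Conceptual sketch only, NOT the repository's actual Manager implementation.
import torch
import torch.nn as nn


class Manager(nn.Module):
    """Aggregates hidden states from N layers of a pre-trained unimodal
    encoder into a single representation via learned softmax weights."""

    def __init__(self, num_expert_layers: int, hidden_dim: int):
        super().__init__()
        # One learnable scalar per expert layer, softmax-normalized in forward().
        self.layer_logits = nn.Parameter(torch.zeros(num_expert_layers))
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, expert_states: torch.Tensor) -> torch.Tensor:
        # expert_states: (num_layers, batch, seq_len, hidden_dim)
        weights = torch.softmax(self.layer_logits, dim=0)           # (num_layers,)
        fused = torch.einsum("l,lbsh->bsh", weights, expert_states)  # weighted sum over layers
        return self.proj(fused)


# Usage: fuse the outputs of 12 vision-encoder layers (dummy features).
manager = Manager(num_expert_layers=12, hidden_dim=768)
states = torch.randn(12, 2, 197, 768)
out = manager(states)  # (2, 197, 768)
```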
We provide four model checkpoints for reproducing our results, including:

- Two reproduced versions of LLaVA-OneVision-Qwen2-0.5b-si, referred to as `LLaVA-OV` for short.
- Two versions of our LLaVA-OneVision-Manager, referred to as `LLaVA-OV-Manager` for short.

You can download them from here (see the download sketch after the list).

- Baseline: `LLaVA-OV` without the Multi-Grid algorithm during training.
- Baseline + Manager: `LLaVA-OV-Manager` without the Multi-Grid algorithm during training.
- Baseline + Grid: `LLaVA-OV` with the Multi-Grid algorithm during training, which can be seen as a reproduction of the original LLaVA-OneVision-Qwen2-0.5b-si.
- Baseline + Grid + Manager: `LLaVA-OV-Manager` with the Multi-Grid algorithm during training.
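If the checkpoints are hosted on the Hugging Face Hub, a snippet like the following can fetch one of them. The repo id below is a hypothetical placeholder, not the actual download link referenced above.

```python
# Hedged sketch: download a checkpoint with huggingface_hub.
# "your-org/LLaVA-OV-Manager" is a placeholder repo id; replace it
# with the real repository from the download link above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="your-org/LLaVA-OV-Manager",
    local_dir="checkpoints/LLaVA-OV-Manager",
)
print(f"Checkpoint files saved under: {local_dir}")
```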
- Run `setup.sh` to set up the environment and prepare the pre-trained models, data, and our checkpoints.
- [Optional] We use wandb to track experiments! Please remember to run `wandb login` and paste your token before running the scripts (see the sketch below).
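If you prefer to authenticate from Python rather than the shell, wandb exposes an equivalent call; the key string below is a placeholder.

```python
# Equivalent to running `wandb login` in the shell.
import wandb

wandb.login(key="YOUR_WANDB_API_KEY")  # placeholder; omit `key` to be prompted interactively
```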
```bash
cd LLaVA-NeXT

# Stage 1 for Baseline and Baseline + Grid
bash scripts/manager/main/stage1.sh
# Stage 1 for Baseline + Manager and Baseline + Grid + Manager
bash scripts/manager/main/stage1_manager.sh

# Baseline
bash scripts/manager/main/stage1.5_nogrid.sh
bash scripts/manager/main/stage2_nogrid.sh
# Baseline + Manager
bash scripts/manager/main/stage1.5_manager_nogrid.sh
bash scripts/manager/main/stage2_manager_nogrid.sh
# Baseline + Grid
bash scripts/manager/main/stage1.5.sh
bash scripts/manager/main/stage2.sh
# Baseline + Grid + Manager
bash scripts/manager/main/stage1.5_manager.sh
bash scripts/manager/main/stage2_manager.sh
```
```bash
cd lmms-eval

# For non-API tasks
bash lmms-eval/scripts/llava_ov_non_api_group.sh 8 # the number of GPUs to use
bash lmms-eval/scripts/llava_ov_reproduce_non_api_group.sh 8 "Baseline" # the number of GPUs to use, CHECKPOINT_NAME
bash lmms-eval/scripts/llava_ov_reproduce_non_api_group.sh 8 "Baseline + Grid"
bash lmms-eval/scripts/llava_ov_manager_non_api_group.sh 8 "Baseline + Manager"
bash lmms-eval/scripts/llava_ov_manager_non_api_group.sh 8 "Baseline + Grid + Manager"

# For API tasks
## Evaluate tasks one by one: predict first (e.g., on a server without network access), then evaluate on a server with network access (no GPU needed)
### Take the first task as an example; the task index ranges over [0, 6]
bash lmms-eval/scripts/llava_ov_api_predict_eval_from_log.sh 8 0 0 # the number of GPUs to use, the index of the task, 0 for predict only, 1 for evaluate only
bash lmms-eval/scripts/llava_ov_api_predict_eval_from_log.sh 8 0 1
bash lmms-eval/scripts/llava_ov_reproduce_api_predict_eval_from_log.sh 8 "Baseline" 0 0 # the number of GPUs to use, CHECKPOINT_NAME, the index of the task, 0 for predict only, 1 for evaluate only
bash lmms-eval/scripts/llava_ov_reproduce_api_predict_eval_from_log.sh 8 "Baseline" 0 1
bash lmms-eval/scripts/llava_ov_reproduce_api_predict_eval_from_log.sh 8 "Baseline + Grid" 0 0
bash lmms-eval/scripts/llava_ov_reproduce_api_predict_eval_from_log.sh 8 "Baseline + Grid" 0 1
bash lmms-eval/scripts/llava_ov_manager_api_predict_eval_from_log.sh 8 "Baseline + Manager" 0 0
bash lmms-eval/scripts/llava_ov_manager_api_predict_eval_from_log.sh 8 "Baseline + Manager" 0 1
bash lmms-eval/scripts/llava_ov_manager_api_predict_eval_from_log.sh 8 "Baseline + Grid + Manager" 0 0
bash lmms-eval/scripts/llava_ov_manager_api_predict_eval_from_log.sh 8 "Baseline + Grid + Manager" 0 1

## Directly evaluate all tasks
### Take the evaluation of the original model as an example
bash lmms-eval/scripts/llava_ov_api_direct.sh 8
```
This code repository is largely based on LLaVA-OneVision, lmms-eval, and ManagerTower.