
EgoPlan_Challenge_Team_AAILab

DPO-Finetuned Large Multi-Modal Planner with Retrieval Augmented Generation

Kwanghyeon Lee, Mina Kang, Hyungho Na, Heesun Bae, Byeonghu Na, Doyun Kwon, Seungjae Shin, Yeongmin Kim, Taewoo Kim, Seungmin Yun, and Il-Chul Moon

| [paper] |
We will upload our paper to arXiv soon.

Overview

Teaser image

Our method consists of two components: Direct Preference Optimization (DPO) and Retrieval-Augmented Generation (RAG). We use RAG to retrieve additional narrations from an action database and add them to the model input, and we train Multi-modal Large Language Models (MLLMs) with the DPO loss.
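
For reference, the sketch below shows the standard DPO objective (Rafailov et al., 2023) that the preference finetuning optimizes. It is a minimal sketch only: it assumes summed answer log-probabilities from the policy (finetuned) model and a frozen reference model are already available, and the function and variable names are illustrative rather than taken from this repository.

  # Minimal sketch of the standard DPO loss; names are illustrative, not from this repo.
  import torch.nn.functional as F

  def dpo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps, beta=0.1):
      # Implicit rewards: how much more the policy prefers each answer than the reference does.
      chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
      rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
      # Maximize the margin between the preferred and the dispreferred answer.
      return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()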

Dataset and Model Checkpoint

Egocentric Video Path Setting (EpicKitchens & Ego4D)

Since EpicKitchens and Ego4D are large datasets, you may want to download only what you need if your resources are limited. We follow the path setting from EgoPlan-Bench.

Download the RGB frames of EPIC-KITCHENS-100 and the videos of Ego4D. The folder structures of the two datasets are shown below:

  • EpicKitchens Dataset:
    EPIC-KITCHENS
    └── P01
        └── P01_01
            ├── frame_0000000001.jpg
            └── ...
    
  • Ego4D Dataset:
    Ego4D
    └── v1
        ├── 000786a7-3f9d-4fe6-bfb3-045b368f7d44.mp4
        └── ...
    

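To verify the layout before training, a quick check like the following can help. This is a minimal sketch: the two root paths are assumptions, so adjust them to wherever you downloaded the data.

  # Sanity-check the expected dataset layout; root paths are assumptions, adjust as needed.
  from pathlib import Path

  EPIC_ROOT = Path("EPIC-KITCHENS")   # EPIC-KITCHENS/P01/P01_01/frame_0000000001.jpg
  EGO4D_ROOT = Path("Ego4D/v1")       # Ego4D/v1/<video_uid>.mp4

  epic_frames = list(EPIC_ROOT.glob("P*/P*_*/frame_*.jpg"))
  ego4d_videos = list(EGO4D_ROOT.glob("*.mp4"))
  print(f"EPIC-KITCHENS RGB frames found: {len(epic_frames)}")
  print(f"Ego4D videos found: {len(ego4d_videos)}")
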
Reproduction

We provide the files and settings we used for reproduction.

  • Since we had some trouble downloading the EpicKitchens dataset, we also share the EpicKitchens video ID list file we used to check whether any videos are missing compared with the original EPIC-KITCHENS-100 (see the sketch after this list).
  • You can download our model configs to reproduce the models in the tables below. (The DPO-finetuned model checkpoint is here, with LoRA weights.)
    • Original Video-LLaMA, RAG X, DPO loss: link
    • DPO Finetuned Video-LLaMA, RAG X, DPO loss: link
    • DPO Finetuned Video-LLaMA, RAG O, Cont. loss: link
    • DPO Finetuned Video-LLaMA, RAG O, DPO loss: link
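
For the missing-video check mentioned above, a minimal sketch could look like the following; the ID list filename and the frame-folder root are assumptions, so adjust them to your local paths.

  # Compare the shared video ID list against the RGB-frame folders actually downloaded.
  from pathlib import Path

  ID_LIST = Path("epickitchens_video_ids.txt")  # hypothetical name; one video ID (e.g., P01_01) per line
  EPIC_ROOT = Path("EPIC-KITCHENS")

  expected = {line.strip() for line in ID_LIST.read_text().splitlines() if line.strip()}
  downloaded = {p.name for p in EPIC_ROOT.glob("P*/P*_*") if p.is_dir()}
  missing = sorted(expected - downloaded)
  print(f"{len(missing)} videos missing from the local copy: {missing}")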

Retrieval-Augmented Generation (RAG) Train / Test Dataset Generation

You can generate the training and test datasets with additional narrations retrieved by RAG from the action database (EpicKitchens training dataset + Ego4D generated dataset). A minimal retrieval sketch is given after the steps below.

  • Before generating the RAG train / test datasets, you should download our Ego4D-generated action database here.
  • RAG Training Dataset (/RAG_train):
    • Download the dataset for RAG training data.
    • Run the bash script:
    bash run.sh
  • RAG Test Dataset (/RAG_test):
    • Download the dataset for RAG test data.
    • Run each script in order:
    bash run.sh
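
For intuition, the retrieval step can be pictured as embedding a query (e.g., the task goal or recent narrations) and pulling the most similar narrations from the action database. The sketch below is only an illustration under assumed names: sentence-transformers, the model name, and the action_database.json schema are all assumptions, and the provided run.sh scripts remain the actual pipeline.

  # Illustrative nearest-neighbour retrieval of narrations from an action database.
  import json
  import numpy as np
  from sentence_transformers import SentenceTransformer  # assumed dependency

  model = SentenceTransformer("all-MiniLM-L6-v2")

  # action_database.json: list of {"narration": ...} entries (schema assumed for illustration).
  with open("action_database.json") as f:
      database = json.load(f)
  db_texts = [entry["narration"] for entry in database]
  db_emb = model.encode(db_texts, normalize_embeddings=True)

  def retrieve(query, k=3):
      q = model.encode([query], normalize_embeddings=True)[0]
      scores = db_emb @ q                # cosine similarity, since embeddings are normalized
      top = np.argsort(-scores)[:k]
      return [db_texts[i] for i in top]

  print(retrieve("open the fridge and take out the milk"))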

Finetuning, Evaluation & Testing Commands

Before finetuning or evaluating, you need to prepare a .yaml file to set the configuration. If you want to train on the dataset with RAG, you need to change the config key 'datasets.datasets.egoplan_contrastive.answer_type' to "egoplan_qa_with_narr".
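
Concretely, that dotted key corresponds to the following nesting inside the .yaml config (only the relevant fragment is shown; surrounding keys are omitted):

  datasets:
    datasets:
      egoplan_contrastive:
        answer_type: "egoplan_qa_with_narr"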

1) Finetuning

  • run
bash scripts/format_finetune.sh {config} {device} {node} {master port}
  • Ex.
bash scripts/format_finetune.sh Original_RAG_X_loss_DPO 0,1,2,3,4,5,6,7 8 26501

2) Evaluation & Test

  • run
bash scripts/format_eval.sh {config} {device} {RAG} {epoch}
bash scripts/format_test.sh {config} {device} {RAG} {epoch}
  • Ex.
bash scripts/format_eval.sh Original_RAG_X_loss_DPO 0 True 9
bash scripts/format_test.sh Original_RAG_X_loss_DPO 0 True 9

Experimental Results

1) Test accuracies with regard to our method components

  DPO loss    Test Acc. (%)
  Base        41.35
  Ours        53.98
  • The test accuracy of 53.98% for the DPO-finetuned model is achieved at epoch 9 (10/10).

2) Validation accuracies for various combinations of our method components

            Base            Loss type                  RAG   Valid Acc. (%) / Approx. Training Time
  Baseline  Original        -                          -     30.44† / Given Pre-trained Model
                            Contrastive                      44.42† / Given Pre-trained Model
  Ours      Original        DPO                              60.24 / 0.5 days
            DPO-Finetuned   Contrastive (Iterative)          46.05 / 0.5 days
                            DPO (Iterative)                  61.11 / 0.5 days
                            DPO (Iterative)                  60.24 / 0.5 days

Note that Base indicates the initial checkpoint from which the model is fine-tuned.

  • The validation accuracy of 60.24% for the DPO-finetuned model is achieved at epoch 8 (9/10).

Acknowledgement

This repository benefits from Epic-Kitchens, Ego4D, EgoPlan, Video-LLaMA, LLaMA, MiniGPT-4, LLaVA, and VideoChat. Thanks for their wonderful work!