
EgoPlan_Challenge_Team_AAILab

DPO-Finetuned Large Multi-Modal Planner with Retrieval Augmented Generation

Kwanghyeon Lee, Mina Kang, Hyungho Na, Heesun Bae, Byeonghu Na, Doyun Kwon, Seungjae Shin, Yeongmin Kim, Taewoo Kim, Seungmin Yun, and Il-Chul Moon

| [paper] |
We will upload our paper to arXiv soon.

Overview

Teaser image

Our method consists of two components: Direct Preference Optimization (DPO) and Retrieval-Augmented Generation (RAG). We use RAG to retrieve additional narrations from an action database and add them to the model input, and we train Multi-modal Large Language Models (MLLMs) with the DPO loss.
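
For reference, the sketch below shows the standard DPO objective (Rafailov et al., 2023) that the preference finetuning optimizes. It is a minimal sketch only: it assumes summed answer log-probabilities from the policy (finetuned) model and a frozen reference model are already available, and the function and variable names are illustrative rather than taken from this repository.

  # Minimal sketch of the standard DPO loss; names are illustrative, not from this repo.
  import torch.nn.functional as F

  def dpo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps, beta=0.1):
      # Implicit rewards: how much more the policy prefers each answer than the reference does.
      chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
      rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
      # Maximize the margin between the preferred and the dispreferred answer.
      return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()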

Dataset and Model Checkpoint

Egocentric Video Path Setting (EpicKitchens & Ego4D)

Since EpicKitchens and Ego4D are large datasets, you may want to download only what you need if your resources are limited. We follow the path setting from EgoPlan-Bench.

Download the RGB frames of EPIC-KITCHENS-100 and the videos of Ego4D. The folder structures of the two datasets are shown below:

  • EpicKitchens Dataset:
    EPIC-KITCHENS
    └── P01
        └── P01_01
            ├── frame_0000000001.jpg
            └── ...
    
  • Ego4D Dataset:
    Ego4D
    └── v1
        ├── 000786a7-3f9d-4fe6-bfb3-045b368f7d44.mp4
        └── ...
    

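To verify the layout before training, a quick check like the following can help. This is a minimal sketch: the two root paths are assumptions, so adjust them to wherever you downloaded the data.

  # Sanity-check the expected dataset layout; root paths are assumptions, adjust as needed.
  from pathlib import Path

  EPIC_ROOT = Path("EPIC-KITCHENS")   # EPIC-KITCHENS/P01/P01_01/frame_0000000001.jpg
  EGO4D_ROOT = Path("Ego4D/v1")       # Ego4D/v1/<video_uid>.mp4

  epic_frames = list(EPIC_ROOT.glob("P*/P*_*/frame_*.jpg"))
  ego4d_videos = list(EGO4D_ROOT.glob("*.mp4"))
  print(f"EPIC-KITCHENS RGB frames found: {len(epic_frames)}")
  print(f"Ego4D videos found: {len(ego4d_videos)}")
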
Reproduction

We provide the files and settings we used for reproduction.

  • Since we had some trouble downloading the EpicKitchens dataset, we also share the EpicKitchens video ID list file we used to check whether any videos are missing compared with the original EPIC-KITCHENS-100 (see the sketch after this list).
  • You can download our model configs to reproduce the models in the tables below. (The DPO-finetuned model checkpoint is here, with LoRA weights.)
    • Original Video-LLaMA, RAG X, DPO loss: link
    • DPO Finetuned Video-LLaMA, RAG X, DPO loss: link
    • DPO Finetuned Video-LLaMA, RAG O, Cont. loss: link
    • DPO Finetuned Video-LLaMA, RAG O, DPO loss: link
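
For the missing-video check mentioned above, a minimal sketch could look like the following; the ID list filename and the frame-folder root are assumptions, so adjust them to your local paths.

  # Compare the shared video ID list against the RGB-frame folders actually downloaded.
  from pathlib import Path

  ID_LIST = Path("epickitchens_video_ids.txt")  # hypothetical name; one video ID (e.g., P01_01) per line
  EPIC_ROOT = Path("EPIC-KITCHENS")

  expected = {line.strip() for line in ID_LIST.read_text().splitlines() if line.strip()}
  downloaded = {p.name for p in EPIC_ROOT.glob("P*/P*_*") if p.is_dir()}
  missing = sorted(expected - downloaded)
  print(f"{len(missing)} videos missing from the local copy: {missing}")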

Retrieval-Augmented Generation (RAG) Train / Test Dataset Generation

You can generate the training and test datasets with additional narrations retrieved by RAG from the action database (EpicKitchens training dataset + Ego4D generated dataset). A minimal retrieval sketch is given after the steps below.

  • Before generating the RAG train / test datasets, you should download our Ego4D-generated action database here.
  • RAG Training Dataset (/RAG_train):
    • Download the dataset for RAG training data.
    • Run the bash script:
    bash run.sh
  • RAG Test Dataset (/RAG_test):
    • Download the dataset for RAG test data.
    • Run each script in order:
    bash run.sh
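
For intuition, the retrieval step can be pictured as embedding a query (e.g., the task goal or recent narrations) and pulling the most similar narrations from the action database. The sketch below is only an illustration under assumed names: sentence-transformers, the model name, and the action_database.json schema are all assumptions, and the provided run.sh scripts remain the actual pipeline.

  # Illustrative nearest-neighbour retrieval of narrations from an action database.
  import json
  import numpy as np
  from sentence_transformers import SentenceTransformer  # assumed dependency

  model = SentenceTransformer("all-MiniLM-L6-v2")

  # action_database.json: list of {"narration": ...} entries (schema assumed for illustration).
  with open("action_database.json") as f:
      database = json.load(f)
  db_texts = [entry["narration"] for entry in database]
  db_emb = model.encode(db_texts, normalize_embeddings=True)

  def retrieve(query, k=3):
      q = model.encode([query], normalize_embeddings=True)[0]
      scores = db_emb @ q                # cosine similarity, since embeddings are normalized
      top = np.argsort(-scores)[:k]
      return [db_texts[i] for i in top]

  print(retrieve("open the fridge and take out the milk"))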

Finetuning, Evaluation & Testing Commands

Before finetuning or evaluating, you need to prepare a .yaml file to set the configuration. If you want to train on the dataset with RAG, you need to change the config key 'datasets.datasets.egoplan_contrastive.answer_type' to "egoplan_qa_with_narr".
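
Concretely, that dotted key corresponds to the following nesting inside the .yaml config (only the relevant fragment is shown; surrounding keys are omitted):

  datasets:
    datasets:
      egoplan_contrastive:
        answer_type: "egoplan_qa_with_narr"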

1) Finetuning

  • run
bash scripts/format_finetune.sh {config} {device} {node} {master port}
  • Ex.
bash scripts/format_finetune.sh Original_RAG_X_loss_DPO 0,1,2,3,4,5,6,7 8 26501

2) Evaluation & Test

  • run
bash scripts/format_eval.sh {config} {device} {RAG} {epoch}
bash scripts/format_test.sh {config} {device} {RAG} {epoch}
  • Ex.
bash scripts/format_eval.sh Original_RAG_X_loss_DPO 0 True 9
bash scripts/format_test.sh Original_RAG_X_loss_DPO 0 True 9

Experimental Results

1) Test accuracies with regard to our method components

  DPO loss    Test Acc. (%)
  Base        41.35
  Ours        53.98
  • The test accuracy of 53.98% for the DPO-finetuned model is achieved at epoch 9 (10/10).

2) Validation accuracies for various combinations of our method components

            Base            Loss type                  RAG   Valid Acc. (%) / Approx. Training Time
  Baseline  Original        -                          -     30.44† / Given Pre-trained Model
                            Contrastive                      44.42† / Given Pre-trained Model
  Ours      Original        DPO                              60.24 / 0.5 days
            DPO-Finetuned   Contrastive (Iterative)          46.05 / 0.5 days
                            DPO (Iterative)                  61.11 / 0.5 days
                            DPO (Iterative)                  60.24 / 0.5 days

Note that Base indicates the initial checkpoint from which the model is fine-tuned.

  • The validation accuracy of 60.24% for the DPO-finetuned model is achieved at epoch 8 (9/10).

Acknowledgement

This repository benefits from Epic-Kitchens, Ego4D, EgoPlan, Video-LLaMA, LLaMA, MiniGPT-4, LLaVA, and VideoChat. Thanks for their wonderful work!