Authors: Jaehong Yoon*, Shoubin Yu*, Mohit Bansal
- (Optional) Create a conda environment
conda create -n RACCooN python=3.10.13
conda activate RACCooN
- Build from source
pip install -r requirements.txt
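As an optional sanity check (not part of the official setup), you can confirm that PyTorch was installed with GPU support before moving on to feature extraction or training:

```bash
# Optional sanity check: print the installed PyTorch version and whether a GPU is visible.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```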
Our VPLM dataset is built on ROVI videos; please refer to the ROVI project page to download the raw and inpainted videos.
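After downloading, one possible way to organize the files is sketched below; the directory names are our own illustrative choice, not a layout required by the codebase, so adjust them to whatever your dataloader configuration expects.

```bash
# Hypothetical layout for the downloaded ROVI data (example paths only).
mkdir -p data/rovi/raw_videos data/rovi/inpainted_videos
# Place the downloaded raw videos under data/rovi/raw_videos/
# and the corresponding inpainted videos under data/rovi/inpainted_videos/
```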
Visual Encoder: we adopt the pre-trained ViT-G (1B); the codebase downloads the model automatically.
Video-LLM: we build our MLLM based on PG-Video-LLaVA; please refer to the project homepage to set up the Video-LLM.
Diffusion Model: we fine-tune our video inpainting model based on Stable Diffusion 2.0-inpainting; please download the model to further fine-tune it as described in our paper.
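One way to fetch the inpainting backbone is to pull the `stabilityai/stable-diffusion-2-inpainting` weights from Hugging Face, as sketched below; the target directory is an arbitrary example and may differ from the checkpoint path expected by the fine-tuning scripts.

```bash
# Sketch: download the Stable Diffusion 2.0 inpainting weights from Hugging Face.
# The --local-dir value is an arbitrary example, not a path required by RACCooN.
pip install -U "huggingface_hub[cli]"
huggingface-cli download stabilityai/stable-diffusion-2-inpainting --local-dir checkpoints/sd2-inpainting
```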
| Dataset | Types |
|---|---|
| VPLM | Multi-Object Description |
| VPLM | Single-Object Description |
| VPLM | Layout Prediction |
| Dataset | Types |
|---|---|
| VPLM | Video Generation |
We test our model on:
We provide RACCooN training and inference script examples as follows.
cd v2p
sh scripts/v2p/finetune/vplm.sh
cd v2p
sh scripts/v2p/inference/vplm.sh
We provide RACCooN training and inference script examples as follows. Our code is built upon MGIE; please set up the environment following the MGIE instructions.
cd p2v
sh train.sh
We provide Jupyter notebook scripts for P2V inference.
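A minimal way to open the notebooks, assuming Jupyter is available in the RACCooN environment:

```bash
# Launch Jupyter from the P2V directory and open the provided inference notebook.
pip install notebook   # only if Jupyter is not already installed
cd p2v
jupyter notebook
```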
- Release ckpts and VPLM dataset
- Video Feature Extraction Examples
- V2P and P2V training
- Incorporate Grounding Modules
The code is built upon PG-Video-LLaVA, MGIE, GroundingDINO, and LGVI.
Please cite our paper if you use our models in your work:
@article{yoon2024raccoon,
title={RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives},
author={Yoon, Jaehong and Yu, Shoubin and Bansal, Mohit},
journal={arXiv preprint arXiv:2405.18406},
year={2024}
}