
$\boldsymbol{R^2}$-Tuning


Installation | Dataset | Training | Evaluation | Model Zoo

This repository maintains the official implementation of the paper $\boldsymbol{R^2}$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding by Ye Liu, Jixuan He, Wanhua Li, Junsik Kim, Donglai Wei, Hanspeter Pfister, and Chang Wen Chen.

🔥 News

  • [2024.7.2] Our paper has been accepted by ECCV 2024.
  • [2024.6.16] Check out our online demo on 🤗 Hugging Face Spaces.
  • [2024.6.15] Add support for single video inference.
  • [2024.4.16] Code and dataset release.
  • [2024.3.31] Our tech report is available on arXiv.

🔨 Installation

Please refer to the environment settings we use, listed below. You may need to install these packages manually if you encounter problems during the automatic installation. A quick way to verify the setup is shown after the list.

  • CUDA 12.1
  • FFmpeg 6.0
  • Python 3.12.2
  • PyTorch 2.2.1
  • NNCore 0.4.2
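
A minimal sanity check (not part of the official setup) to confirm that the key components above are visible in your environment:

# Report the PyTorch version and whether CUDA is usable
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

# Report the installed NNCore and FFmpeg versions
pip show nncore | grep Version
ffmpeg -version | head -n 1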

Install from source

  1. Clone the repository from GitHub.
git clone https://github.com/yeliudev/R2-Tuning.git
cd R2-Tuning
  2. Initialize conda environment.
conda create -n r2-tuning python=3.12 -y
conda activate r2-tuning
  3. Install dependencies.
pip install -r requirements.txt

🔖 Dataset

Option 1 [Recommended]: Download pre-extracted features from HuggingFace Hub directly.

# Prepare datasets in one command
bash tools/prepare_data.sh

Option 2: Reproduce our data pre-processing pipeline by following the steps below; a worked example of the full pipeline appears after the argument lists.

  1. Download videos from the following links and place them into data/{dataset}/videos.
  2. Extract and compress video frames at a fixed frame rate.
# For QVHighlights, Ego4D-NLQ, TACoS, and TVSum
python tools/extract_frames.py <path-to-videos>

# For Charades-STA
python tools/extract_frames.py <path-to-videos> --fps 1.0

# For YouTube Highlights
python tools/extract_frames.py <path-to-videos> --anno_path data/youtube/youtube_anno.json
Arguments of tools/extract_frames.py
  • video_dir Path to the videos folder
  • --anno_path Path to the annotation file (only for YouTube Highlights to compute frame rates)
  • --frame_dir Path to the output extracted frames
  • --size Side length of the cropped video frames
  • --fps Frame rate to be used
  • --max_len The maximum length of each video segment
  • --workers Number of processes
  • --chunksize The chunk size for each process
  3. Extract features from video frames.
python tools/extract_feat.py <path-to-anno> <path-to-frames>
Arguments of tools/extract_feat.py
  • anno_path Path to the annotation file
  • frame_dir Path to the extracted frames
  • --video_feat_dir Path to the output video features
  • --query_feat_dir Path to the output query features
  • --arch CLIP architecture to use (ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14-336px)
  • --k Save the last k layers features
  • --batch_size The batch size to use
  • --workers Number of workers for data loader
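
For illustration only, a hypothetical end-to-end run of this pipeline on Charades-STA might look as follows. The paths and flag values are examples chosen to match the directory layout below, not required settings.

# Step 2 (example): extract frames at 1 fps with 4 worker processes
python tools/extract_frames.py data/charades/videos --frame_dir data/charades/frames_224_1.0fps --fps 1.0 --size 224 --workers 4

# Step 3 (example): extract CLIP ViT-B/32 features, keeping the last 4 layers
python tools/extract_feat.py data/charades/charades_train.jsonl data/charades/frames_224_1.0fps --video_feat_dir data/charades/clip_b32_vid_k4 --query_feat_dir data/charades/clip_b32_txt_k4 --arch ViT-B/32 --k 4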

The prepared dataset should be in the following structure.

R2-Tuning
├── configs
├── datasets
├── models
├── tools
├── data
│   ├── qvhighlights
│   │   ├── frames_224_0.5fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── qvhighlights_{train,val,test}.jsonl
│   ├── ego4d
│   │   ├── frames_224_0.5fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── nlq_{train,val}.jsonl
│   ├── charades
│   │   ├── frames_224_1.0fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── charades_{train,test}.jsonl
│   ├── tacos
│   │   ├── frames_224_0.5fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── {train,val,test}.jsonl
│   ├── youtube
│   │   ├── frames_224_auto (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── youtube_anno.json
│   └── tvsum
│       ├── frames_224_0.5fps (optional)
│       ├── clip_b32_{vid,txt}_k4
│       └── tvsum_anno.json
├── README.md
├── setup.cfg
└── ···

🔮 Training

Use the following commands to train a model with a specified config.

# Single GPU
python tools/launch.py <path-to-config>

# Multiple GPUs on a single node (elastic)
torchrun --nproc_per_node=<num-gpus> tools/launch.py <path-to-config>

# Multiple GPUs on multiple nodes (slurm)
srun <slurm-args> python tools/launch.py <path-to-config>
Arguments of tools/launch.py
  • config The config file to use
  • --checkpoint The checkpoint file to load from
  • --resume The checkpoint file to resume from
  • --work_dir Working directory
  • --eval Evaluation only
  • --dump Dump inference outputs
  • --seed The random seed to use
  • --amp Whether to use automatic mixed precision training
  • --debug Debug mode (detect nan during training)
  • --launcher The job launcher to use
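
For example, the optional flags can be combined as below. The config placeholder and the checkpoint file name are illustrative.

# Train with AMP on 2 GPUs, using a fixed seed and a custom working directory
torchrun --nproc_per_node=2 tools/launch.py <path-to-config> --amp --seed 42 --work_dir work_dirs/my_run

# Resume the same run from a previously saved checkpoint
python tools/launch.py <path-to-config> --resume work_dirs/my_run/latest.pth --amp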

Please refer to the configs folder for detailed settings of each model.

🏆 Evaluation

Use the following command to test a model and evaluate results.

python tools/launch.py <path-to-config> --checkpoint <path-to-checkpoint> --eval

For QVHighlights, you may also dump inference outputs on val and test splits.

python tools/launch.py <path-to-config> --checkpoint <path-to-checkpoint> --dump

Then you can pack the hl_{val,test}_submission.jsonl files and submit them to CodaLab.
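
As a sketch of the packing step, assuming the dumped files are in your working directory (the archive name is arbitrary, and zip -j stores only the file names without directory prefixes):

# Pack the dumped predictions into a single archive for upload
zip -j submission.zip <work-dir>/hl_val_submission.jsonl <work-dir>/hl_test_submission.jsonl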

💻 Single Video Inference

Warning

This feature is only compatible with nncore==0.4.4.

Use the following command to perform moment retrieval using your own videos and queries.

# Make sure you are using the correct version
pip install nncore==0.4.4

python tools/inference.py <path-to-video> <query> [--config <path-to-config> --checkpoint <path-to-checkpoint>]

The checkpoint trained on QVHighlights using this config will be downloaded by default.
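
A hypothetical invocation with a local video and a free-form query (both the file path and the query text are examples), relying on the default config and checkpoint mentioned above:

# Retrieve moments matching a natural-language query in a local video
python tools/inference.py assets/demo.mp4 "a person is playing with a dog"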

📦 Model Zoo

We provide multiple pre-trained models and training logs here. All models were trained on a single NVIDIA A100 (80GB) GPU and evaluated using the default metrics of each dataset.

| Dataset | Config | R1@0.3 | R1@0.5 | R1@0.7 | MR mAP | HD mAP | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| QVHighlights | Default | 78.71 | 67.74 | 51.87 | 47.86 | 39.45 | model \| log |
| Ego4D-NLQ | Default | 7.18 | 4.54 | 2.25 | — | — | model \| log |
| Charades-STA | Default | 70.91 | 60.48 | 38.66 | — | — | model \| log |
| TACoS | Default | 50.96 | 40.69 | 25.69 | — | — | model \| log |
| YouTube Highlights | Dog | — | — | — | — | 74.26 | model \| log |
| YouTube Highlights | Gymnastics | — | — | — | — | 72.07 | model \| log |
| YouTube Highlights | Parkour | — | — | — | — | 81.02 | model \| log |
| YouTube Highlights | Skating | — | — | — | — | 76.26 | model \| log |
| YouTube Highlights | Skiing | — | — | — | — | 74.36 | model \| log |
| YouTube Highlights | Surfing | — | — | — | — | 82.76 | model \| log |
| TVSum | BK | — | — | — | — | 91.23 | model \| log |
| TVSum | BT | — | — | — | — | 92.35 | model \| log |
| TVSum | DS | — | — | — | — | 80.88 | model \| log |
| TVSum | FM | — | — | — | — | 75.61 | model \| log |
| TVSum | GA | — | — | — | — | 89.51 | model \| log |
| TVSum | MS | — | — | — | — | 85.01 | model \| log |
| TVSum | PK | — | — | — | — | 82.82 | model \| log |
| TVSum | PR | — | — | — | — | 90.39 | model \| log |
| TVSum | VT | — | — | — | — | 89.81 | model \| log |
| TVSum | VU | — | — | — | — | 85.90 | model \| log |
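
As a usage sketch, a downloaded checkpoint can be evaluated with the launcher described above. The checkpoint file name below is illustrative; pair each checkpoint with its corresponding config.

# Evaluate a downloaded checkpoint with its matching config
python tools/launch.py <path-to-config> --checkpoint checkpoints/r2_tuning_qvhighlights.pth --eval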

📖 Citation

Please kindly cite our paper if you find this project helpful.

@inproceedings{liu2024tuning,
  title={$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding},
  author={Liu, Ye and He, Jixuan and Li, Wanhua and Kim, Junsik and Wei, Donglai and Pfister, Hanspeter and Chen, Chang Wen},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2024}
}