TimeChat: A Python repository from tongda

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, Lu Hou

News

[24.01.09] Release TimeChat-7b 🤗 checkpoint and local demo.
[23.12.27] 🤗 Release the instruction-tuning dataset of TimeIT.
[23.12.06] Release the initial version of TimeChat.

Introduction

TimeChat is a time-sensitive multimodal large language model specifically designed for long video understanding. Our model incorporates two key architectural contributions:
- (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame
- (2) a sliding video Q-Former that produces a video token sequence of varying lengths to accommodate videos of various durations.
We also construct an instruction-tuning dataset named TimeIT, encompassing 6 tasks and a total of 125K instances, to further enhance TimeChat's instruction-following performance.
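The two components above can be illustrated with a small, framework-free sketch. This is not the repository's implementation: the function names, the timestamp sentence template, and the window, stride, and tokens-per-window values are all assumptions chosen for illustration. It only shows how each sampled frame can be paired with a timestamp description, and how a sliding window yields a token count that grows with the number of frames.

```python
from typing import List, Tuple


def frame_timestamps(duration_s: float, num_frames: int) -> List[float]:
    """Uniformly sample `num_frames` timestamps (in seconds) over a video."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    step = duration_s / num_frames
    # Use the start of each equal-length segment as the frame timestamp.
    return [round(i * step, 2) for i in range(num_frames)]


def timestamp_prompts(timestamps: List[float]) -> List[str]:
    """Build one text description per frame (hypothetical template)."""
    return [f"This frame is sampled at {t:.1f}s." for t in timestamps]


def sliding_windows(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges covering all frames."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows = []
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start += stride
    return windows


def video_token_count(num_frames: int, window: int, stride: int,
                      tokens_per_window: int) -> int:
    """Total video tokens if every window is compressed to a fixed number."""
    return len(sliding_windows(num_frames, window, stride)) * tokens_per_window


if __name__ == "__main__":
    stamps = frame_timestamps(duration_s=10.0, num_frames=5)
    print(timestamp_prompts(stamps))
    # Assumed values: 32-frame windows, stride 32, 32 tokens per window.
    print(video_token_count(96, window=32, stride=32, tokens_per_window=32))
    print(video_token_count(192, window=32, stride=32, tokens_per_window=32))
```

Under these assumed settings, doubling the number of sampled frames doubles the length of the video token sequence, which is the property the sliding design is meant to provide.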

Example Outputs

An illustration of temporal localization capability of TimeChat

Examples for dense video captioning (left), temporal video grounding (middle), and video highlight detection (right)

Fine-tuned Checkpoints

The following checkpoints store only the learnable parameters (positional embedding layers, Time-aware Frame Encoder, Sliding Video Q-Former, linear projection layers, and LoRA).

Checkpoint	LLM backbone	Link	Note
TimeChat-2-7B-Finetuned	LLaMA-2 7B	link	Fine-tuned on the instruction-tuning data from TimeIT-104K (ASR version) and Valley-73K (the previous version of the current Valley-65K)

Usage

Environment Preparation

First, install ffmpeg.

apt update
apt install ffmpeg
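To confirm that the installation succeeded before moving on, a quick check such as the following can help. It is a minimal sketch using only the Python standard library; the helper name `find_ffmpeg` is hypothetical and not part of this repository.

```python
import shutil
from typing import Optional


def find_ffmpeg() -> Optional[str]:
    """Return the path of the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


if __name__ == "__main__":
    path = find_ffmpeg()
    print(f"ffmpeg found at {path}" if path else "ffmpeg not found on PATH")
```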

Then, create a conda environment:

conda env create -f environment.yml
conda activate timechat
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113

Prerequisites

Before fine-tuning your own model (or reproducing our TimeChat model), make sure you have obtained the following checkpoints:

Pre-trained Image Encoder (EVA ViT-g)

wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth

Pre-trained Image Q-Former (InstructBLIP Q-Former)

wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/InstructBLIP/instruct_blip_vicuna7b_trimmed.pth

Pre-trained Language Decoder (LLaMA-2-7B) and Video Encoder (Video Q-Former of Video-LLaMA)

Use git-lfs to download weights of Video-LLaMA (7B):

git lfs install
git clone https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-2-7B-Finetuned

Instruction-tuned TimeChat-7B

git lfs install
git clone https://huggingface.co/ShuhuaiRen/TimeChat-7b

The file structure should look like this:

ckpt/
|-- Video-LLaMA-2-7B-Finetuned/
    |-- llama-2-7b-chat-hf/
    |-- VL_LLaMA_2_7B_Finetuned.pth
|-- instruct-blip/
    |-- instruct_blip_vicuna7b_trimmed.pth
|-- eva-vit-g/
    |-- eva_vit_g.pth
|-- timechat/
    |-- timechat_7b.pth
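The layout above can be verified before training or running the demo. The following is a minimal sketch, assuming the checkpoints live under a `ckpt/` directory relative to the current working directory; the helper `missing_checkpoints` is hypothetical and not part of this repository.

```python
from pathlib import Path
from typing import List

# Paths relative to the checkpoint root, taken from the layout above.
EXPECTED_FILES = [
    "Video-LLaMA-2-7B-Finetuned/llama-2-7b-chat-hf",
    "Video-LLaMA-2-7B-Finetuned/VL_LLaMA_2_7B_Finetuned.pth",
    "instruct-blip/instruct_blip_vicuna7b_trimmed.pth",
    "eva-vit-g/eva_vit_g.pth",
    "timechat/timechat_7b.pth",
]


def missing_checkpoints(root: str = "ckpt") -> List[str]:
    """Return the expected entries that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:")
        for rel in missing:
            print(f"  ckpt/{rel}")
    else:
        print("All expected checkpoints are in place.")
```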

How to Run Demo Locally

Please refer to our Jupyter Demo here.

Instruction-Tuning

Data

For now, the fine-tuning dataset consists of:

104K time-sensitive instructions from TimeIT [link]
- see DATA.md
73K (now 65K) video-based instructions from Valley [link]

Script

Tuning

Configure the checkpoint and dataset paths in stage2_finetune_time104k_valley72k.yaml.

conda activate timechat
torchrun --nproc_per_node=8 train.py --cfg-path train_configs/stage2_finetune_time104k_valley72k.yaml
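If your machine has a different number of GPUs, the same launch command can be assembled programmatically. This is a minimal sketch: the helper `build_torchrun_command` is hypothetical, and it uses only the flags shown in the command above. Note that changing the number of processes may change the effective batch size, so the training config may need to be adjusted accordingly.

```python
import shlex
from typing import List

DEFAULT_CFG = "train_configs/stage2_finetune_time104k_valley72k.yaml"


def build_torchrun_command(num_gpus: int = 8, cfg_path: str = DEFAULT_CFG) -> List[str]:
    """Assemble the torchrun argument list for a single-node launch."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "torchrun",
        f"--nproc_per_node={num_gpus}",
        "train.py",
        "--cfg-path",
        cfg_path,
    ]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run(...) to launch it.
    print(shlex.join(build_torchrun_command()))
```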

Evaluation

Configure the checkpoint and dataset paths in timechat.yaml.

Configure the downstream task in eval.sh.

bash eval.sh

Recommended GPUs

Instruction-tuning: 8xV100 (32G)
Inference: 1xA100 (40G/80G) or 1xA6000

Acknowledgement

We are grateful to the following awesome projects, on which TimeChat is built:

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
FastChat: An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
EVA-CLIP: Improved Training Techniques for CLIP at Scale
LLaMA: Open and Efficient Foundation Language Models
VideoChat: Chat-Centric Video Understanding
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

Terms of Use

Our TimeChat is just a research preview intended for non-commercial use only. You must NOT use our TimeChat for any illegal, harmful, violent, racist, or sexual purposes. You are strictly prohibited from engaging in any activity that will potentially violate these guidelines.

Citation

If you find our project useful, please consider starring our repo and citing our paper as follows:

@article{Ren2023TimeChat,
  title={TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding},
  author={Shuhuai Ren and Linli Yao and Shicheng Li and Xu Sun and Lu Hou},
  journal={ArXiv},
  year={2023},
  volume={abs/2312.02051},
}
