Streaming Video Model
Yucheng Zhao, Chong Luo, Chuanxin Tang, Dongdong Chen, Noel Codella, Zheng-Jun Zha
CVPR 2023
The streaming video model is a general video model that applies to a broad range of video understanding tasks. Traditionally, such tasks have been handled by two separate architectures, each tailored to one type of task: sequence-based tasks such as action recognition and frame-based tasks such as multiple object tracking (MOT). The streaming video model is the first deep learning architecture that unifies them. We build an instance of the streaming video model, the streaming video Transformer (S-ViT). S-ViT first produces frame-level features with a memory-enabled, temporally-aware spatial encoder to serve frame-based video tasks. The frame features are then fed into a task-related temporal decoder to obtain spatiotemporal features for sequence-based tasks. The efficiency and efficacy of S-ViT are demonstrated by state-of-the-art accuracy on the sequence-based action recognition task and a competitive advantage over conventional architectures on the frame-based MOT task.
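For intuition, here is a toy PyTorch sketch of the two-stage streaming design described above. It is a minimal illustration rather than the released model: the module names (SpatialEncoderWithMemory, TemporalDecoder) and their internals are assumptions chosen for readability; only the data flow, per-frame features computed against a memory of past frames, followed by a temporal decoder that pools them into a sequence-level prediction, mirrors the description.

# Minimal sketch of the streaming design (assumed module names; not the
# released S-ViT implementation).
import torch
import torch.nn as nn

class SpatialEncoderWithMemory(nn.Module):
    """Frame-level encoder that also attends to a memory of past frames."""
    def __init__(self, dim=768):
        super().__init__()
        self.backbone = nn.Linear(3 * 224 * 224, dim)  # stand-in for a ViT
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, frame, memory):
        # frame: (B, 3, 224, 224) -> one frame token of shape (B, 1, dim)
        feat = self.backbone(frame.flatten(1)).unsqueeze(1)
        if memory:
            context = torch.cat(memory, dim=1)           # (B, T_mem, dim)
            feat, _ = self.fuse(feat, context, context)  # temporally aware
        return feat

class TemporalDecoder(nn.Module):
    """Task-related decoder turning frame features into a clip-level output."""
    def __init__(self, dim=768, num_classes=400):
        super().__init__()
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frame_feats):
        clip_feat = torch.cat(frame_feats, dim=1).mean(dim=1)  # pool over time
        return self.head(clip_feat)

encoder, decoder = SpatialEncoderWithMemory(), TemporalDecoder()
memory, frame_feats = [], []
for frame in torch.randn(8, 1, 3, 224, 224):      # a toy 8-frame stream
    feat = encoder(frame, memory)                 # frame-based feature
    memory.append(feat.detach())                  # update the memory
    frame_feats.append(feat)
logits = decoder(frame_feats)                     # sequence-based output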
Clone the repo and install requirements:
conda create -n svm python=3.7 -y
conda activate svm
conda install pytorch==1.12.0 torchvision==0.13.0 cudatoolkit=11.3 -c pytorch
pip install git+https://github.com/JonathonLuiten/TrackEval.git
pip install mmcv-full==1.7.0 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.12/index.html
pip install mmdet==2.26.0
pip install -r requirements/build.txt
pip install --user -v -e .
pip install einops
pip install future tensorboard
pip install -U fvcore
pip install click imageio[ffmpeg] path
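Optionally, you can verify that the pinned dependencies resolved correctly with a short Python check (a convenience snippet, not part of the repository):

# Environment sanity check; versions should match the pins above.
import torch, torchvision, mmcv, mmdet

print("torch:", torch.__version__)              # expect 1.12.0
print("torchvision:", torchvision.__version__)  # expect 0.13.0
print("mmcv:", mmcv.__version__)                # expect 1.7.0
print("mmdet:", mmdet.__version__)              # expect 2.26.0
print("CUDA available:", torch.cuda.is_available())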
Download the MOT17, CrowdHuman, and MOTSynth datasets and put them under the data directory, structured as follows:
data
|-- crowdhuman
│ ├── annotation_train.odgt
│ ├── annotation_val.odgt
│ ├── train
│ │ ├── Images
│ │ ├── CrowdHuman_train01.zip
│ │ ├── CrowdHuman_train02.zip
│ │ ├── CrowdHuman_train03.zip
│ ├── val
│ │ ├── Images
│ │ ├── CrowdHuman_val.zip
|-- MOT17
│ ├── train
│ ├── test
|-- MOTSynth
| ├── videos
│ ├── annotations
Then, convert all datasets to COCO format. We provide scripts for this:
# crowdhuman
python ./tools/convert_datasets/crowdhuman2coco.py -i ./data/crowdhuman -o ./data/crowdhuman/annotations
# MOT17
python ./tools/convert_datasets/mot2coco.py -i ./data/MOT17/ -o ./data/MOT17/annotations --split-train --convert-det
# MOTSynth
python ./tools/convert_datasets/extract_motsynth.py --input_dir_path ./data/MOTSynth/video --out_dir_path ./data/MOTSynth/train/
python ./tools/convert_datasets/motsynth2coco.py --anns ./data/MOTSynth/annotations --out ./data/MOTSynth/all_cocoformat.json
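To confirm that a conversion succeeded, the generated annotation files can be inspected with a few lines of Python; the snippet below is an illustrative check using one of the output files listed above:

# Inspect a converted COCO-format annotation file (illustrative check).
import json

with open("./data/MOTSynth/all_cocoformat.json") as f:
    coco = json.load(f)

# A valid COCO-format file contains these three top-level lists.
print("images:", len(coco["images"]))
print("annotations:", len(coco["annotations"]))
print("categories:", [c["name"] for c in coco["categories"]])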
After conversion, the data directory will be structured as follows:
data
|-- crowdhuman
│ ├── annotation_train.odgt
│ ├── annotation_val.odgt
│ ├── train
│ │ ├── Images
│ │ ├── CrowdHuman_train01.zip
│ │ ├── CrowdHuman_train02.zip
│ │ ├── CrowdHuman_train03.zip
│ ├── val
│ │ ├── Images
│ │ ├── CrowdHuman_val.zip
| ├── annotations
│ │ ├── crowdhuman_train.json
│ │ ├── crowdhuman_val.json
|-- MOT17
│ ├── train
│ │ ├── MOT17-02-DPM
│ │ ├── ...
│ ├── test
│ ├── annotations
│ │ ├── half-train_cocoformat.json
│ │ ├── ...
|-- MOTSynth
| ├── videos
│ ├── annotations
│ ├── train
│ │ ├── 000
│ │ │ ├── img1
│ │ │ │ ├── 000001.jpg
│ │ │ │ ├── ...
│ │ ├── ...
│ ├── all_cocoformat.json
We use CLIP-pretrained ViT models. You can download them from here and put them under the pretrain directory.
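Alternatively, the ViT-B/16 weights can be fetched with the official openai/CLIP package, assuming the expected checkpoint is the standard OpenAI release; clip.load caches it as ViT-B-16.pt under the given directory:

# Download the OpenAI CLIP ViT-B/16 checkpoint into ./pretrain
# (assumes the standard openai/CLIP release; the package is not in the
# install list above: pip install git+https://github.com/openai/CLIP.git)
import clip

clip.load("ViT-B/16", device="cpu", download_root="./pretrain")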
Training on a single node
bash ./tools/dist_train.sh configs/mot/svm/svm_base.py 8 --cfg-options \
model.detector.backbone.pretrain=./pretrain/ViT-B-16.pt
Evaluation on the MOT17 half validation set
bash ./tools/dist_test.sh configs/mot/svm/svm_test.py 8 \
--eval bbox track --checkpoint svm_motsync_ch_mot17half.pth
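Since the codebase builds on MMTracking, a trained checkpoint can also be applied to an arbitrary video through MMTracking's Python API. The sketch below is a minimal example assuming the standard mmtrack.apis entry points work with this config; the video path is hypothetical:

# Minimal inference sketch using MMTracking's high-level API
# (assumes the standard mmtrack.apis entry points accept this config).
import mmcv
from mmtrack.apis import inference_mot, init_model

config = "configs/mot/svm/svm_test.py"
checkpoint = "svm_motsync_ch_mot17half.pth"
model = init_model(config, checkpoint, device="cuda:0")

video = mmcv.VideoReader("demo.mp4")  # any video file (hypothetical path)
for frame_id, frame in enumerate(video):
    result = inference_mot(model, frame, frame_id=frame_id)
    # result holds per-frame detection and track bounding boxes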
Method | Dataset | Train Data | MOTA | HOTA | IDF1 | URL |
---|---|---|---|---|---|---|
SVM | MOT17 | MOT17 half-train + crowdhuman + MOTSynth | 79.7 | 68.1 | 80.9 | model |
If you find this work useful in your research, please consider citing:
@InProceedings{Zhao_2023_CVPR,
author = {Zhao, Yucheng and Luo, Chong and Tang, Chuanxin and Chen, Dongdong and Codella, Noel and Zha, Zheng-Jun},
title = {Streaming Video Model},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {14602-14612}
}
Our code is built on top of MMTracking and CLIP. Many thanks for their wonderful work.