Streaming Video Model
Yucheng Zhao, Chong Luo, Chuanxin Tang, Dongdong Chen, Noel Codella, Zheng-Jun Zha
CVPR 2023
The streaming video model is a general video model that applies to a broad range of video understanding tasks. Traditionally, such tasks have been handled by two separate architectures, each tailored to one type of task: sequence-based tasks such as action recognition and frame-based tasks such as multiple object tracking (MOT). The streaming video model is the first deep learning architecture that unifies them. We build an instance of the streaming video model, the streaming video Transformer (S-ViT). S-ViT first produces frame-level features with a memory-enabled, temporally-aware spatial encoder to serve frame-based video tasks. The frame features are then fed into a task-related temporal decoder to obtain spatiotemporal features for sequence-based tasks. The efficiency and efficacy of S-ViT are demonstrated by state-of-the-art accuracy on the sequence-based action recognition task and a competitive advantage over conventional architectures on the frame-based MOT task.
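For intuition, here is a toy PyTorch sketch of the two-stage streaming design described above. It is a minimal illustration rather than the released model: the module names (SpatialEncoderWithMemory, TemporalDecoder) and their internals are assumptions chosen for readability; only the data flow, per-frame features computed against a memory of past frames, followed by a temporal decoder that pools them into a sequence-level prediction, mirrors the description.

# Minimal sketch of the streaming design (assumed module names; not the
# released S-ViT implementation).
import torch
import torch.nn as nn

class SpatialEncoderWithMemory(nn.Module):
    """Frame-level encoder that also attends to a memory of past frames."""
    def __init__(self, dim=768):
        super().__init__()
        self.backbone = nn.Linear(3 * 224 * 224, dim)  # stand-in for a ViT
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, frame, memory):
        # frame: (B, 3, 224, 224) -> one frame token of shape (B, 1, dim)
        feat = self.backbone(frame.flatten(1)).unsqueeze(1)
        if memory:
            context = torch.cat(memory, dim=1)           # (B, T_mem, dim)
            feat, _ = self.fuse(feat, context, context)  # temporally aware
        return feat

class TemporalDecoder(nn.Module):
    """Task-related decoder turning frame features into a clip-level output."""
    def __init__(self, dim=768, num_classes=400):
        super().__init__()
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frame_feats):
        clip_feat = torch.cat(frame_feats, dim=1).mean(dim=1)  # pool over time
        return self.head(clip_feat)

encoder, decoder = SpatialEncoderWithMemory(), TemporalDecoder()
memory, frame_feats = [], []
for frame in torch.randn(8, 1, 3, 224, 224):      # a toy 8-frame stream
    feat = encoder(frame, memory)                 # frame-based feature
    memory.append(feat.detach())                  # update the memory
    frame_feats.append(feat)
logits = decoder(frame_feats)                     # sequence-based output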
Clone the repo and install requirements:
conda create -n svm python=3.7 -y
conda activate svm
conda install pytorch==1.12.0 torchvision==0.13.0 cudatoolkit=11.3 -c pytorch
pip install git+https://github.com/JonathonLuiten/TrackEval.git
pip install mmcv-full==1.7.0 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.12/index.html
pip install mmdet==2.26.0
pip install -r requirements/build.txt
pip install --user -v -e .
pip install einops
pip install future tensorboard
pip install -U fvcore
pip install click imageio[ffmpeg] path
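Optionally, you can verify that the pinned dependencies resolved correctly with a short Python check (a convenience snippet, not part of the repository):

# Environment sanity check; versions should match the pins above.
import torch, torchvision, mmcv, mmdet

print("torch:", torch.__version__)              # expect 1.12.0
print("torchvision:", torchvision.__version__)  # expect 0.13.0
print("mmcv:", mmcv.__version__)                # expect 1.7.0
print("mmdet:", mmdet.__version__)              # expect 2.26.0
print("CUDA available:", torch.cuda.is_available())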
Download the MOT17, CrowdHuman, and MOTSynth datasets and put them under the data directory, structured as follows:
data
|-- crowdhuman
│ ├── annotation_train.odgt
│ ├── annotation_val.odgt
│ ├── train
│ │ ├── Images
│ │ ├── CrowdHuman_train01.zip
│ │ ├── CrowdHuman_train02.zip
│ │ ├── CrowdHuman_train03.zip
│ ├── val
│ │ ├── Images
│ │ ├── CrowdHuman_val.zip
|-- MOT17
│ ├── train
│ ├── test
|-- MOTSynth
| ├── videos
│ ├── annotations
Then, convert all datasets to COCO format. We provide scripts for this:
# crowdhuman
python ./tools/convert_datasets/crowdhuman2coco.py -i ./data/crowdhuman -o ./data/crowdhuman/annotations
# MOT17
python ./tools/convert_datasets/mot2coco.py -i ./data/MOT17/ -o ./data/MOT17/annotations --split-train --convert-det
# MOTSynth
python ./tools/convert_datasets/extract_motsynth.py --input_dir_path ./data/MOTSynth/video --out_dir_path ./data/MOTSynth/train/
python ./tools/convert_datasets/motsynth2coco.py --anns ./data/MOTSynth/annotations --out ./data/MOTSynth/all_cocoformat.json
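To confirm that a conversion succeeded, the generated annotation files can be inspected with a few lines of Python; the snippet below is an illustrative check using one of the output files listed above:

# Inspect a converted COCO-format annotation file (illustrative check).
import json

with open("./data/MOTSynth/all_cocoformat.json") as f:
    coco = json.load(f)

# A valid COCO-format file contains these three top-level lists.
print("images:", len(coco["images"]))
print("annotations:", len(coco["annotations"]))
print("categories:", [c["name"] for c in coco["categories"]])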
After conversion, the data directory will be structured as follows:
data
|-- crowdhuman
│ ├── annotation_train.odgt
│ ├── annotation_val.odgt
│ ├── train
│ │ ├── Images
│ │ ├── CrowdHuman_train01.zip
│ │ ├── CrowdHuman_train02.zip
│ │ ├── CrowdHuman_train03.zip
│ ├── val
│ │ ├── Images
│ │ ├── CrowdHuman_val.zip
| ├── annotations
│ │ ├── crowdhuman_train.json
│ │ ├── crowdhuman_val.json
|-- MOT17
│ ├── train
│ │ ├── MOT17-02-DPM
│ │ ├── ...
│ ├── test
│ ├── annotations
│ │ ├── half-train_cocoformat.json
│ │ ├── ...
|-- MOTSynth
| ├── videos
│ ├── annotations
│ ├── train
│ │ ├── 000
│ │ │ ├── img1
│ │ │ │ ├── 000001.jpg
│ │ │ │ ├── ...
│ │ ├── ...
│ ├── all_cocoformat.json
We use CLIP-pretrained ViT models. You can download them from here and put them under the pretrain directory.
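Alternatively, the ViT-B/16 weights can be fetched with the official openai/CLIP package, assuming the expected checkpoint is the standard OpenAI release; clip.load caches it as ViT-B-16.pt under the given directory:

# Download the OpenAI CLIP ViT-B/16 checkpoint into ./pretrain
# (assumes the standard openai/CLIP release; the package is not in the
# install list above: pip install git+https://github.com/openai/CLIP.git)
import clip

clip.load("ViT-B/16", device="cpu", download_root="./pretrain")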
Training on a single node
bash ./tools/dist_train.sh configs/mot/svm/svm_base.py 8 --cfg-options \
model.detector.backbone.pretrain=./pretrain/ViT-B-16.pt
Evaluation on the MOT17 half validation set
bash ./tools/dist_test.sh configs/mot/svm/svm_test.py 8 \
--eval bbox track --checkpoint svm_motsync_ch_mot17half.pth
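Since the codebase builds on MMTracking, a trained checkpoint can also be applied to an arbitrary video through MMTracking's Python API. The sketch below is a minimal example assuming the standard mmtrack.apis entry points work with this config; the video path is hypothetical:

# Minimal inference sketch using MMTracking's high-level API
# (assumes the standard mmtrack.apis entry points accept this config).
import mmcv
from mmtrack.apis import inference_mot, init_model

config = "configs/mot/svm/svm_test.py"
checkpoint = "svm_motsync_ch_mot17half.pth"
model = init_model(config, checkpoint, device="cuda:0")

video = mmcv.VideoReader("demo.mp4")  # any video file (hypothetical path)
for frame_id, frame in enumerate(video):
    result = inference_mot(model, frame, frame_id=frame_id)
    # result holds per-frame detection and track bounding boxes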
Method | Dataset | Train Data | MOTA | HOTA | IDF1 | URL |
---|---|---|---|---|---|---|
SVM | MOT17 | MOT17 half-train + crowdhuman + MOTSynth | 79.7 | 68.1 | 80.9 | model |
If you find this work useful in your research, please consider citing:
@InProceedings{Zhao_2023_CVPR,
author = {Zhao, Yucheng and Luo, Chong and Tang, Chuanxin and Chen, Dongdong and Codella, Noel and Zha, Zheng-Jun},
title = {Streaming Video Model},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {14602-14612}
}
Our code is built on top of MMTracking and CLIP. Many thanks for their wonderful work.