
[ECCV2022] PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection



This repo is the official implementation of PTSEFormer, which was accepted to ECCV 2022.

Citing PTSEFormer

Please consider citing our paper if you find it useful:

    Author = {Han, Wang and Jun, Tang and Xiaodong, Liu and Shanyan, Guan and Rong, Xie and Li, Song},
    Title = {PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection},
    Conference = {ECCV},
    Year = {2022}



Requirements

  • Linux, CUDA>=9.2, GCC>=5.4

  • Python>=3.7

    Create a conda environment, for example:

    conda create -n PTSEFormer python=3.7 pip

    Then, activate the environment:

    conda activate PTSEFormer

  • PyTorch>=1.5.1, torchvision>=0.6.1

    For example, with CUDA 9.2:

    conda install pytorch=1.5.1 torchvision=0.6.1 cudatoolkit=9.2 -c pytorch
  • Other requirements

    pip install -r requirements.txt
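
After installation, it can help to confirm that the environment meets the minimum versions listed above. The following is a minimal sketch, not part of this repo: the names `MINIMUMS`, `version_tuple`, `meets_minimum` and `check_environment` are hypothetical helpers, and the script only compares version numbers (it does not check CUDA or GCC).

```python
import importlib
import re
import sys

# Minimum versions copied from the requirements list above.
MINIMUMS = {"torch": "1.5.1", "torchvision": "0.6.1"}


def version_tuple(version):
    """Turn '1.5.1+cu92' into (1, 5, 1); local suffixes such as '+cu92' are ignored."""
    parts = []
    for piece in version.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment():
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 7):
        problems.append("Python >= 3.7 is required")
    for name, minimum in MINIMUMS.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            problems.append(f"{name} is not installed")
            continue
        if not meets_minimum(module.__version__, minimum):
            problems.append(f"{name} {module.__version__} < {minimum}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks OK"]:
        print(line)
```

Run it inside the activated `PTSEFormer` environment; any line other than `environment looks OK` points at a package to install or upgrade.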

Data preparation

Please download the ILSVRC2015 DET and ILSVRC2015 VID datasets from here and organize them as follows.



The inference command line for testing on the validation dataset:

python -m torch.distributed.launch --nproc_per_node=8 tools/test.py --config-file experiments/PTSEFormer_r101_8gpus.yaml

A pretrained model can be found here.


The command line for training on the combined VID and DET datasets:

python -m torch.distributed.launch --nproc_per_node=8 tools/train.py --config-file experiments/PTSEFormer_r101_8gpus.yaml
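
The test and train commands differ only in the script name, and `--nproc_per_node` sets how many processes (normally one per GPU) the launcher starts. The sketch below shows one way to assemble the same command from Python with a different process count. `build_launch_command` is a hypothetical helper, not part of this repo, and whether the 8-GPU config file works unchanged with fewer GPUs is an assumption this README does not confirm.

```python
import shlex
import sys


def build_launch_command(script, config_file, nproc_per_node=8):
    """Build the torch.distributed.launch command from this README as an argv list."""
    if nproc_per_node < 1:
        raise ValueError("nproc_per_node must be at least 1")
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        script,
        "--config-file", config_file,
    ]


if __name__ == "__main__":
    cmd = build_launch_command(
        "tools/train.py",
        "experiments/PTSEFormer_r101_8gpus.yaml",
        nproc_per_node=8,
    )
    # Print the command for inspection instead of running it.
    print(" ".join(shlex.quote(part) for part in cmd))
    # To actually launch it: import subprocess; subprocess.run(cmd, check=True)
```

Passing `"tools/test.py"` as the script yields the inference command instead.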