Daniel Cores, Víctor M. Brea, Manuel Mucientes
We present a new network architecture able to take advantage of spatio-temporal information available in videos to boost object detection precision. First, box features are associated and aggregated by linking proposals that come from the same *anchor box* in nearby frames. Then, we design a new attention module that aggregates the short-term enhanced box features to exploit long-term spatio-temporal information. This module takes advantage of geometrical features in the long term for the first time in the video object detection domain. Finally, a spatio-temporal double head is fed with both the spatial information from the reference frame and the aggregated information that takes into account the short- and long-term temporal context. We have tested our proposal on five video object detection datasets with very different characteristics in order to prove its robustness across a wide range of scenarios. Non-parametric statistical tests show that our approach outperforms the state of the art.
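The short-term linking step can be pictured with a minimal sketch, purely illustrative and not the paper's learned aggregation: proposals generated from the same anchor index in neighbouring frames are linked, and their box features are aggregated before feeding the long-term attention module.

```python
import torch

# Illustrative sketch only: box_feats holds per-frame, per-anchor proposal features
# with shape (T, A, C) = (frames, anchors, channels). Proposals coming from the same
# anchor index in nearby frames are linked, so a naive short-term aggregation is a
# mean over the temporal axis (the paper uses a learned aggregation instead).
def aggregate_linked_proposals(box_feats: torch.Tensor) -> torch.Tensor:
    return box_feats.mean(dim=0)  # (A, C) short-term enhanced box features
```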
This implementation is based on Detectron2.
We provide the models and configuration files to reproduce the results obtained in the paper.
| Method | Mode | mAP@0.5 | Model | Config |
|---|---|---|---|---|
| FPN-X101 baseline | Sequential | 78.6 | model | config |
| SLTnet FPN-X101 | Sequential | 81.3 | model | config |
| SLTnet FPN-X101 | Symmetric | 81.9 | model | config |
We provide a Docker image definition to run our algorithm. The image can be built as follows:
docker build -t detectron2-st:pytorch-cuda10.1-cudnn7 docker/detectron2_spatiotemporal
To train and test our network, ImageNet VID and ImageNet DET datasets are required. VID and DET annotations in a format compatible with our implementation can be downloaded from:
- vid_val
- det_train_subsampled
- det_train_subsampled (images converted into short static videos)
- vid_train_split0
- vid_train_split1
- vid_train_split2
- vid_train_split3
To train the spatio-temporal network, we reuse the baseline weights, keeping them frozen. Therefore, we first need to train our baseline by running:
cd SLTnet
docker run --gpus all --rm -it -v $PWD:/workspace/detectron -v $datasets_dir:/datasets -v $models_dir:/models detectron2-st:pytorch-cuda10.1-cudnn7 python3 /workspace/detectron/tools/train_net.py --config-file $CONFIG_FILE OUTPUT_DIR /models/$DIRECTORY
Dataset definitions can be changed in spatiotemporal/data/dataset.py to set the correct image root directory and annotation paths. The final model can be found in OUTPUT_DIR/model_final.pth. However, this checkpoint also contains the iteration number and other training state besides the weights. To initialize the spatio-temporal network, we need to generate a new file that contains only the model weights, which is then passed as $BASELINE_MODEL in the next command (see facebookresearch/detectron2#429):
import torch
checkpoint = torch.load('OUTPUT_DIR/model_final.pth')               # full training checkpoint
torch.save(checkpoint['model'], 'OUTPUT_DIR/baseline_weights.pth')  # weights only (output filename is just an example)
Finally, the spatio-temporal network can be trained by running:
docker run --gpus all --rm -it -v $PWD:/workspace/detectron -v $datasets_dir:/datasets -v $models_dir:/models detectron2-st:pytorch-cuda10.1-cudnn7 python3 /workspace/detectron/tools/train_net.py --config-file $CONFIG_FILE MODEL.WEIGHTS /models/$BASELINE_MODEL OUTPUT_DIR /models/$DIRECTORY SPATIOTEMPORAL.NUM_FRAMES 3 SPATIOTEMPORAL.FORWARD_AGGREGATION true
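The trailing key-value pairs are standard Detectron2 command-line overrides. The sketch below shows how custom SPATIOTEMPORAL keys can be attached to a Detectron2 config and overridden programmatically; the default values here are illustrative assumptions, not the repository's actual defaults (those live in the provided config files).

```python
from detectron2.config import CfgNode as CN, get_cfg

# Sketch: attach custom SPATIOTEMPORAL options to a Detectron2 config.
# The default values below are assumptions for illustration only.
cfg = get_cfg()
cfg.SPATIOTEMPORAL = CN()
cfg.SPATIOTEMPORAL.NUM_FRAMES = 3
cfg.SPATIOTEMPORAL.FORWARD_AGGREGATION = True

# Command-line pairs such as "SPATIOTEMPORAL.NUM_FRAMES 3" are applied via merge_from_list.
cfg.merge_from_list(["SPATIOTEMPORAL.NUM_FRAMES", "3"])
```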
To evaluate the network on the test subset, use:
docker run --gpus all --rm -it -v $PWD:/workspace/detectron -v $datasets_dir:/datasets -v $models_dir:/models detectron2-st:pytorch-cuda10.1-cudnn7 python3 /workspace/detectron/tools/train_net.py --eval-only --config-file $CONFIG_FILE MODEL.WEIGHTS /models/$WEIGHTS_DIRECTORY/model_final.pth OUTPUT_DIR /models/$DIRECTORY
Our implementation reports the COCO-style AP. To calculate the AP with the official ImageNet Development Kit, the output results can be converted by running (inside a Docker container):
python3 tools/convert_output_to_vid.py
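Conceptually, the conversion flattens the COCO-style detection dump into one plain-text line per detection, which is the kind of input the devkit expects. The sketch below only illustrates that idea; the exact column order expected by the devkit is an assumption here, and tools/convert_output_to_vid.py remains the authoritative converter.

```python
import json

# Illustrative sketch only. Assumes COCO-style detections
# [{'image_id', 'category_id', 'score', 'bbox': [x, y, w, h]}, ...]
# (e.g. Detectron2's coco_instances_results.json) and emits one line per detection.
def coco_detections_to_vid_lines(result_file):
    with open(result_file) as f:
        detections = json.load(f)
    lines = []
    for det in detections:
        x, y, w, h = det['bbox']
        lines.append(f"{det['image_id']} {det['category_id']} {det['score']:.4f} "
                     f"{x:.1f} {y:.1f} {x + w:.1f} {y + h:.1f}")
    return lines
```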
A new dataset can be registered in spatiotemporal/data/dataset.py by adding a new entry to splits with the following format:
_DATA_DIR = "/datasets"
...
"vid_val": (                                  # dataset name
    "vid/ILSVRC/Data/VID",                    # image root directory, relative to _DATA_DIR
    "vid/annotations_pytorch/vid_val.json",   # JSON annotations file, relative to _DATA_DIR
),
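For example, a custom split could be registered like this (hypothetical name and paths, assuming an ST-COCO annotation file has already been generated for it):

```python
"my_vid_val": (                         # hypothetical dataset name
    "my_dataset/frames",                # image root directory, relative to _DATA_DIR
    "my_dataset/annotations/val.json",  # ST-COCO annotation file, relative to _DATA_DIR
),
```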
We use a modified version of the COCO dataset format, called ST-COCO, to support video datasets. The main differences are:
- images: images are ordered by video and frame number in the annotation file, and each image entry includes two extra fields:
  - video: video to which the image belongs.
  - frame_number: frame number within the video.
- annotations: each annotation entry includes an extra field:
  - id_track: 'trackid' field in the original ImageNet VID annotation files.
ST-COCO example:
{
    'info': {},
    'images': [
        {
            'file_name': 'val/ILSVRC2015_val_00051001/000000.JPEG',
            'frame_number': 0,
            'height': 720,
            'id': 0,
            'video': 'val/ILSVRC2015_val_00051001',
            'width': 1280
        },
        ...
    ],
    'annotations': [
        {
            'area': 410130,
            'bbox': [0, 85, 651, 630],
            'category_id': 8,
            'id': 0,
            'id_track': '0',
            'ignore': 0,
            'image_id': 0,
            'iscrowd': 0,
            'occluded': '0'
        },
        ...
    ],
    'categories': [
        {'id': 0, 'name': 'airplane', 'supercategory': 'airplane'},
        ...
    ]
}
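As a quick sanity check, an ST-COCO file can be read with the standard json module and its images grouped per video using the fields described above (a minimal sketch; the annotation path matches the vid_val entry registered earlier):

```python
import json
from collections import defaultdict

# Minimal sketch: group ST-COCO images by 'video' and order them by 'frame_number'.
with open('/datasets/vid/annotations_pytorch/vid_val.json') as f:
    dataset = json.load(f)

frames_per_video = defaultdict(list)
for image in dataset['images']:
    frames_per_video[image['video']].append(image)

for video, frames in frames_per_video.items():
    frames.sort(key=lambda img: img['frame_number'])
    print(video, len(frames), 'frames')
```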
@article{CORES2021104179,
title = {Short-term anchor linking and long-term self-guided attention for video object detection},
journal = {Image and Vision Computing},
pages = {104179},
year = {2021},
issn = {0262-8856},
doi = {10.1016/j.imavis.2021.104179},
author = {Daniel Cores and Víctor M. Brea and Manuel Mucientes}
}
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.