Daniel Cores, Víctor M. Brea, Manuel Mucientes
We present a new network architecture able to take advantage of spatio-temporal information available in videos to boost object detection precision. First, box features are associated and aggregated by linking proposals that come from the same *anchor box* in nearby frames. Then, we design a new attention module that aggregates the short-term enhanced box features to exploit long-term spatio-temporal information. This module takes advantage of geometrical features in the long term for the first time in the video object detection domain. Finally, a spatio-temporal double head is fed with both the spatial information from the reference frame and the aggregated information that takes into account the short- and long-term temporal context. We have tested our proposal on five video object detection datasets with very different characteristics in order to prove its robustness across a wide range of scenarios. Non-parametric statistical tests show that our approach outperforms the state of the art.
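The short-term linking step can be pictured with a minimal sketch, purely illustrative and not the paper's learned aggregation: proposals generated from the same anchor index in neighbouring frames are linked, and their box features are aggregated before feeding the long-term attention module.

```python
import torch

# Illustrative sketch only: box_feats holds per-frame, per-anchor proposal features
# with shape (T, A, C) = (frames, anchors, channels). Proposals coming from the same
# anchor index in nearby frames are linked, so a naive short-term aggregation is a
# mean over the temporal axis (the paper uses a learned aggregation instead).
def aggregate_linked_proposals(box_feats: torch.Tensor) -> torch.Tensor:
    return box_feats.mean(dim=0)  # (A, C) short-term enhanced box features
```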
This implementation is based on Detectron2.
We provide the models and configuration files to reproduce the results obtained in the paper.
| Method | Mode | mAP@0.5 | Model | Config |
|---|---|---|---|---|
| FPN-X101 baseline | Sequential | 78.6 | model | config |
| SLTnet FPN-X101 | Sequential | 81.3 | model | config |
| SLTnet FPN-X101 | Symmetric | 81.9 | model | config |
We provide a Docker image definition to run our algorithm. The image can be built as follows:
docker build -t detectron2-st:pytorch-cuda10.1-cudnn7 docker/detectron2_spatiotemporal
To train and test our network, ImageNet VID and ImageNet DET datasets are required. VID and DET annotations in a format compatible with our implementation can be downloaded from:
- vid_val
- det_train_subsampled
- det_train_subsampled (images converted into short static videos)
- vid_train_split0
- vid_train_split1
- vid_train_split2
- vid_train_split3
To train the spatio-temporal network, we reuse the baseline weights, keeping them frozen. Therefore, we first need to train our baseline by running:
cd SLTnet
docker run --gpus all --rm -it -v $PWD:/workspace/detectron -v $datasets_dir:/datasets -v $models_dir:/models detectron2-st:pytorch-cuda10.1-cudnn7 python3 /workspace/detectron/tools/train_net.py --config-file $CONFIG_FILE OUTPUT_DIR /models/$DIRECTORY
Dataset definitions can be changed in spatiotemporal/data/dataset.py to set the correct image root directory and annotation paths. The final model can be found in OUTPUT_DIR/model_final.pth. However, this checkpoint also contains the iteration number and other training state besides the weights. To initialize the spatio-temporal network, we need to generate a new file that contains only the model weights, which is then passed as $BASELINE_MODEL in the next command (see facebookresearch/detectron2#429):
import torch
checkpoint = torch.load('OUTPUT_DIR/model_final.pth')               # full training checkpoint
torch.save(checkpoint['model'], 'OUTPUT_DIR/baseline_weights.pth')  # weights only (output filename is just an example)
Finally, the spatio-temporal network can be trained by running:
docker run --gpus all --rm -it -v $PWD:/workspace/detectron -v $datasets_dir:/datasets -v $models_dir:/models detectron2-st:pytorch-cuda10.1-cudnn7 python3 /workspace/detectron/tools/train_net.py --config-file $CONFIG_FILE MODEL.WEIGHTS /models/$BASELINE_MODEL OUTPUT_DIR /models/$DIRECTORY SPATIOTEMPORAL.NUM_FRAMES 3 SPATIOTEMPORAL.FORWARD_AGGREGATION true
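The trailing key-value pairs are standard Detectron2 command-line overrides. The sketch below shows how custom SPATIOTEMPORAL keys can be attached to a Detectron2 config and overridden programmatically; the default values here are illustrative assumptions, not the repository's actual defaults (those live in the provided config files).

```python
from detectron2.config import CfgNode as CN, get_cfg

# Sketch: attach custom SPATIOTEMPORAL options to a Detectron2 config.
# The default values below are assumptions for illustration only.
cfg = get_cfg()
cfg.SPATIOTEMPORAL = CN()
cfg.SPATIOTEMPORAL.NUM_FRAMES = 3
cfg.SPATIOTEMPORAL.FORWARD_AGGREGATION = True

# Command-line pairs such as "SPATIOTEMPORAL.NUM_FRAMES 3" are applied via merge_from_list.
cfg.merge_from_list(["SPATIOTEMPORAL.NUM_FRAMES", "3"])
```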
To evaluate the network on the test subset, use:
docker run --gpus all --rm -it -v $PWD:/workspace/detectron -v $datasets_dir:/datasets -v $models_dir:/models detectron2-st:pytorch-cuda10.1-cudnn7 python3 /workspace/detectron/tools/train_net.py --eval-only --config-file $CONFIG_FILE MODEL.WEIGHTS /models/$WEIGHTS_DIRECTORY/model_final.pth OUTPUT_DIR /models/$DIRECTORY
Our implementation reports the COCO-style AP. To calculate the AP with the official ImageNet Development Kit, the output results can be converted by running (inside a Docker container):
python3 tools/convert_output_to_vid.py
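Conceptually, the conversion flattens the COCO-style detection dump into one plain-text line per detection, which is the kind of input the devkit expects. The sketch below only illustrates that idea; the exact column order expected by the devkit is an assumption here, and tools/convert_output_to_vid.py remains the authoritative converter.

```python
import json

# Illustrative sketch only. Assumes COCO-style detections
# [{'image_id', 'category_id', 'score', 'bbox': [x, y, w, h]}, ...]
# (e.g. Detectron2's coco_instances_results.json) and emits one line per detection.
def coco_detections_to_vid_lines(result_file):
    with open(result_file) as f:
        detections = json.load(f)
    lines = []
    for det in detections:
        x, y, w, h = det['bbox']
        lines.append(f"{det['image_id']} {det['category_id']} {det['score']:.4f} "
                     f"{x:.1f} {y:.1f} {x + w:.1f} {y + h:.1f}")
    return lines
```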
A new dataset can be registered in spatiotemporal/data/dataset.py by adding a new entry to splits with the following format:
_DATA_DIR = "/datasets"
...
"vid_val": (                                  # dataset name
    "vid/ILSVRC/Data/VID",                    # image root directory, relative to _DATA_DIR
    "vid/annotations_pytorch/vid_val.json",   # JSON annotations file, relative to _DATA_DIR
),
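For example, a custom split could be registered like this (hypothetical name and paths, assuming an ST-COCO annotation file has already been generated for it):

```python
"my_vid_val": (                         # hypothetical dataset name
    "my_dataset/frames",                # image root directory, relative to _DATA_DIR
    "my_dataset/annotations/val.json",  # ST-COCO annotation file, relative to _DATA_DIR
),
```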
We use a modified version of the COCO dataset format, called ST-COCO, to support video datasets. The main differences are:
- images: images are ordered by video and frame number in the annotation file, and each image entry includes two extra fields:
  - video: video to which the image belongs.
  - frame_number: frame number within the video.
- annotations: each annotation entry includes an extra field:
  - id_track: 'trackid' field in the original ImageNet VID annotation files.
ST-COCO example:
{
    'info': {},
    'images': [
        {
            'file_name': 'val/ILSVRC2015_val_00051001/000000.JPEG',
            'frame_number': 0,
            'height': 720,
            'id': 0,
            'video': 'val/ILSVRC2015_val_00051001',
            'width': 1280
        },
        ...
    ],
    'annotations': [
        {
            'area': 410130,
            'bbox': [0, 85, 651, 630],
            'category_id': 8,
            'id': 0,
            'id_track': '0',
            'ignore': 0,
            'image_id': 0,
            'iscrowd': 0,
            'occluded': '0'
        },
        ...
    ],
    'categories': [
        {'id': 0, 'name': 'airplane', 'supercategory': 'airplane'},
        ...
    ]
}
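As a quick sanity check, an ST-COCO file can be read with the standard json module and its images grouped per video using the fields described above (a minimal sketch; the annotation path matches the vid_val entry registered earlier):

```python
import json
from collections import defaultdict

# Minimal sketch: group ST-COCO images by 'video' and order them by 'frame_number'.
with open('/datasets/vid/annotations_pytorch/vid_val.json') as f:
    dataset = json.load(f)

frames_per_video = defaultdict(list)
for image in dataset['images']:
    frames_per_video[image['video']].append(image)

for video, frames in frames_per_video.items():
    frames.sort(key=lambda img: img['frame_number'])
    print(video, len(frames), 'frames')
```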
@article{CORES2021104179,
title = {Short-term anchor linking and long-term self-guided attention for video object detection},
journal = {Image and Vision Computing},
pages = {104179},
year = {2021},
issn = {0262-8856},
doi = {10.1016/j.imavis.2021.104179},
author = {Daniel Cores and Víctor M. Brea and Manuel Mucientes}
}
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.