TASED-Net

TASED-Net: Temporally-Aggregating Spatial Encoder-Decoder Network for Video Saliency Detection (ICCV 2019)

Overview

TASED-Net is a novel fully-convolutional network architecture for video saliency detection. The main idea is simple but effective: spatially decode 3D video features while jointly aggregating all the temporal information. TASED-Net significantly outperforms previous state-of-the-art approaches on all three major large-scale video saliency datasets: DHF1K, Hollywood2, and UCFSports. We observe that our model is particularly good at attending to salient moving objects.
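
To make the idea concrete, here is a minimal PyTorch sketch of the encode-then-spatially-decode scheme. The layer sizes and the simple temporal mean pooling are purely illustrative; this is a toy example, not the released TASED-Net architecture.

# Toy illustration of spatially decoding 3D features while aggregating time.
# Layer sizes and the mean-pooling aggregation are made up for brevity.
import torch
import torch.nn as nn

class TinySpatioTemporalSaliency(nn.Module):
    def __init__(self):
        super().__init__()
        # 3D-convolutional encoder: extracts spatiotemporal features from a clip.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )
        # Decoder: upsample spatially only; the temporal dimension is later
        # aggregated so that a single saliency map is produced per clip.
        self.decoder = nn.Sequential(
            nn.Conv3d(64, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=(1, 2, 2), mode='trilinear', align_corners=False),
            nn.Conv3d(32, 1, kernel_size=1),
        )

    def forward(self, clip):                 # clip: (B, 3, T, H, W)
        feat = self.encoder(clip)            # (B, 64, T/2, H/4, W/4)
        out = self.decoder(feat)             # (B, 1, T/2, H/2, W/2)
        out = out.mean(dim=2)                # aggregate over time -> (B, 1, H/2, W/2)
        return torch.sigmoid(out)

sal = TinySpatioTemporalSaliency()(torch.randn(1, 3, 16, 112, 112))
print(sal.shape)                             # torch.Size([1, 1, 56, 56])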

TASED-Net is currently leading the leaderboard of the DHF1K online benchmark.

Model                  Year   NSS↑    CC↑     SIM↑    AUC-J↑   s-AUC↑
TASED-Net (updated)    2019   2.797   0.489   0.393   0.897    0.712
TASED-Net (reported)   2019   2.667   0.470   0.361   0.895    0.712
SalEMA                 2019   2.574   0.449   0.466   0.890    0.667
STRA-Net               2019   2.558   0.458   0.355   0.895    0.663
ACLNet                 2018   2.354   0.434   0.315   0.890    0.601
SalGAN                 2017   2.043   0.370   0.262   0.866    0.709
SALICON                2015   1.901   0.327   0.232   0.857    0.590
GBVS                   2007   1.474   0.283   0.186   0.828    0.554
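
For reference, two of the metrics in the table, NSS and CC, can be computed as in the sketch below. These are the standard definitions, not the evaluation code used by the DHF1K benchmark.

# Minimal NumPy sketch of two standard saliency metrics (NSS and CC).
import numpy as np

def nss(sal_map, fixation_map):
    # Normalized Scanpath Saliency: mean of the standardized prediction
    # at ground-truth fixation locations (fixation_map is binary).
    s = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-8)
    return s[fixation_map.astype(bool)].mean()

def cc(sal_map, gt_map):
    # Linear Correlation Coefficient between the predicted and the
    # ground-truth (blurred) saliency maps.
    a = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-8)
    b = (gt_map - gt_map.mean()) / (gt_map.std() + 1e-8)
    return (a * b).mean()

pred = np.random.rand(224, 384)
fix = np.random.rand(224, 384) > 0.999       # sparse binary fixations
print(nss(pred, fix), cc(pred, pred))        # cc of a map with itself ~ 1.0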

Video Saliency Detection

Video saliency detection aims to model the gaze fixation patterns of humans when viewing a dynamic scene. Because the predicted saliency map can be used to prioritize the video information across space and time, this task has a number of applications, such as video surveillance, video captioning, and video compression.

Examples

We compare our TASED-Net to ACLNet, the previous state-of-the-art method. As shown in the examples below, TASED-Net is better at attending to the salient information. We would also like to point out that TASED-Net has a much smaller network size (82 MB vs. 252 MB).

Code Usage

First, clone this repository and download this weight file. Then, just run the code using

$ python run_example.py

This will generate frame-wise saliency maps. You can also specify the input and output directories as command-line arguments. For example,

$ python run_example.py ./example ./output
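
Conceptually, frame-wise prediction buffers the most recent frames into a clip and produces one saliency map per frame, as in the sketch below. This only illustrates the sliding-window idea; the actual run_example.py may be organized differently, and `model` here is a random stand-in for the real network.

# Sliding-window, frame-wise saliency prediction over an input directory.
# `model` is a placeholder; the real script loads the released TASED-Net weights.
import os, glob
import numpy as np
from PIL import Image

CLIP_LEN = 32                                # frames fed to the model per prediction

def model(clip):
    # Stand-in for the real network: one saliency map per clip.
    return np.random.rand(*clip.shape[1:3])

def run(input_dir='./example', output_dir='./output'):
    os.makedirs(output_dir, exist_ok=True)
    frames = sorted(glob.glob(os.path.join(input_dir, '*.jpg')))
    buf = []
    for path in frames:
        buf.append(np.asarray(Image.open(path).convert('RGB'), dtype=np.float32) / 255.0)
        buf = buf[-CLIP_LEN:]                # keep only the most recent frames
        sal = model(np.stack(buf))           # predict saliency for the current frame
        out = Image.fromarray((sal * 255).astype(np.uint8))
        out.save(os.path.join(output_dir, os.path.basename(path)))

if __name__ == '__main__':
    run()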

Notes

  • The released model is a modified version with improved performance. The updated results are reported in the table above.

  • We recommend using PNG image files as input (although the examples in this repository are in JPEG format).

  • For the encoder of TASED-Net, we use the S3D network. We pretrained S3D on the Kinetics-400 dataset using PyTorch; it achieves 72.08% top-1 accuracy (top-5: 90.35%) on the validation set of the dataset. We release our S3D weight file together with this project. If you find it useful, please consider citing our work.

  • For training, we recommend using ViP, a general-purpose video platform for PyTorch. Otherwise, you can simply use run_train.py. Before running the training code, make sure to download our S3D weight file. A minimal sketch of a possible training loss is shown after this list.
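
A KL-divergence between the predicted and ground-truth saliency distributions is a common training loss for this task; a minimal version is sketched below. The commented-out model class name and weight-file path are placeholders, not necessarily the exact ones used by run_train.py.

# Minimal sketch of a KL-divergence saliency loss (illustrative only).
import torch

def kld_loss(pred, gt, eps=1e-8):
    # Normalize each map to a spatial probability distribution.
    pred = pred / (pred.sum(dim=(-2, -1), keepdim=True) + eps)
    gt = gt / (gt.sum(dim=(-2, -1), keepdim=True) + eps)
    # KL(gt || pred), averaged over the batch.
    return (gt * torch.log(gt / (pred + eps) + eps)).sum(dim=(-2, -1)).mean()

# Illustrative initialization (placeholder names; see run_train.py for the real ones):
# model = TASEDNet()
# model.load_state_dict(torch.load('s3d_weights.pt'), strict=False)  # init encoder only

pred = torch.rand(2, 1, 112, 192)
gt = torch.rand(2, 1, 112, 192)
print(kld_loss(pred, gt))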

Citation

@inproceedings{min2019tased,
  title={TASED-Net: Temporally-Aggregating Spatial Encoder-Decoder Network for Video Saliency Detection},
  author={Min, Kyle and Corso, Jason J},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision},
  pages={2394--2403},
  year={2019}
}