SpatioTemporal Annotations for 24 Classes of UCF101

Introduction

Bounding-box annotations of humans for 24 action classes of the UCF101 dataset are available in XML format on the download page of the THUMOS dataset. Parsing these annotations is not straightforward, which has led to different groups working on spatio-temporal action localisation for these 24 classes using different parsed versions and hence reporting inconsistent results.

We gathered three parsed versions of the above annotations:

  1. Saha et al. [1] provide their parsed annotations in .mat format, available here. These annotations were used in [1,4,5]. We keep them in this repository under the filename annotV5.mat
  2. We asked Weinzaepfel et al. [2] for their annotations, which were used in [2] and in the initial version of [4]; the current version of [4] uses the annotations provided by [1]. We keep these annotations under the filename annot_full_phillipe.mat
  3. Gemert et al. [3] provided their version of the parsed annotations (APT). It is kept under the filename annot_apt.mat

Each parsed version has its own problems and advantages. For instance, Saha's version has the largest number of action instances, but its temporal labelling and the consistency between different action instances are problematic. Gemert's version is very similar to Saha's. Weinzaepfel's version does not pick up all the action instances, and some videos do not contain even a single instance, but its bounding-box and temporal-labelling accuracy are slightly better than the other versions. Parsing the original XML annotations ourselves would have led to similar problems.

Corrections

So, I went through the pain of looking through the annotations of each video. I found that around 600 of the videos in Saha's version had problematic annotations; for about 300 of those, decent annotations were available in either Weinzaepfel's or Gemert's version, and the rest required a more careful combination of the three versions. In the end, we ended up with 3194 videos with good annotations, stored in finalAnnots.mat. We had to remove 9 videos for which we could not manage re-annotation; 4 were test videos and 6 were train videos. There may still be 5-10 videos with some errors in their annotations.

After feedback from Philippe Weinzaepfel, I found about 30 tubes with wrong bounding boxes or with a duration different from the number of boxes. I inspected those tubes visually and corrected the boxes manually. At present there are 3194 videos with correct annotations. All bounding boxes lie within the image boundaries, and all temporal durations lie within the temporal bounds of their videos.
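
Both invariants (boxes inside the image, tube length equal to the number of boxes) can be checked programmatically. Below is a rough sanity-check sketch over the Python annotations; the field names ('annotations', 'sf', 'ef', 'boxes', 'numf'), the [x, y, w, h] box convention and the 320x240 frame size are assumptions about the layout of pyannot.pkl, not a documented interface.

  # Sanity-check sketch (illustrative only): verify that every tube's boxes stay
  # inside the frame and that its length matches its number of boxes.
  import pickle

  FRAME_W, FRAME_H = 320, 240  # assumed UCF101 frame size

  with open('pyannot.pkl', 'rb') as f:
      annots = pickle.load(f)

  for video, entry in annots.items():
      for tube in entry['annotations']:
          boxes = tube['boxes']                    # assumed (num_boxes, 4) array of [x, y, w, h]
          duration = tube['ef'] - tube['sf'] + 1   # assumed inclusive start/end frame indices
          if len(boxes) != duration:
              print(f'{video}: {len(boxes)} boxes vs duration {duration}')
          for x, y, w, h in boxes:
              if x < 0 or y < 0 or x + w > FRAME_W or y + h > FRAME_H:
                  print(f'{video}: box outside the image boundaries')
          if tube['sf'] < 0 or tube['ef'] >= entry['numf']:
              print(f'{video}: tube outside the temporal bounds of the video')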

Download

This repository contains both the new and the old annotations in its root directory. The final corrected annotation file is named finalAnnots.mat. A Python version of the same annotations is now available as pyannot.pkl.
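
Loading either file takes only a few lines. Here is a minimal sketch, assuming scipy is available and that the .mat file is not stored in the v7.3 (HDF5) format; the internal structure of the loaded variables is not spelled out here.

  # Minimal loading sketch for the corrected annotations (both formats).
  import pickle
  from scipy.io import loadmat

  # MATLAB version: loadmat returns a dict of MATLAB variables
  final_annots = loadmat('finalAnnots.mat')
  print([k for k in final_annots if not k.startswith('__')])

  # Python version of the same annotations
  with open('pyannot.pkl', 'rb') as f:
      py_annots = pickle.load(f)
  print(type(py_annots))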

You can also download the results from the link below:

Results (100MB)

Performance Numbers

We have evaluated the approaches of [1] and [5] and report their performance on both the older annotations from [1] and the new corrected annotations. These results are produced on the 911 test videos of split 1.

Below is the table using the older annotations from [1]:

Method / IoU threshold          0.20     0.50     0.75     0.5:0.95
Peng et al. [4]  RGB+FLOW       73.55    30.87    01.01    07.11
Saha et al. [1]  RGB+FLOW       67.89    36.87    06.78    14.29
Singh et al. [5] RGB+FastFLOW   69.12    41.16    10.31    17.19
Singh et al. [5] RGB+FLOW       70.73    43.20    10.43    18.02

Below is the table using the new corrected annotations:

Method / IoU threshold          0.20     0.50     0.75     0.5:0.95
Peng et al. [4]  RGB+FLOW       73.67    32.07    00.85    07.26
Saha et al. [1]  RGB+FLOW       66.55    36.37    07.94    14.37
Singh et al. [5] RGB+FastFLOW   70.20    43.00    14.10    19.20
Singh et al. [5] RGB+FLOW       73.50    46.30    15.00    20.40

If you want to regenerate these numbers, please go to the Google Drive link above and download the results folder into the root directory of this repository. You can then run compute_mAPs.m from inside the evaluation folder.

Conclusion

The differences from the numbers above might seem small, but in my view having better annotations is good for the community. The corrected annotations also serve as a baseline for future work, so results of future works are directly comparable to the previous state-of-the-art methods [1,4,5]. We recommend using the provided evaluation script to evaluate your method. We will try to keep updating this page with additional results from other methods.

If you want your results to be included on this page, please send me your final results in the same format as those provided for [1] and [5], which is the same format as the annotations.

Citing

If you use the above annotations in your work, please cite the original UCF101 dataset:

  @article{soomro2012ucf101,
    title={UCF101: A dataset of 101 human actions classes from videos in the wild},
    author={Soomro, Khurram and Zamir, Amir Roshan and Shah, Mubarak},
    journal={arXiv preprint arXiv:1212.0402},
    year={2012}
  }

Please also consider citing the work below. The annotations were corrected by Gurkirt Singh while working on the real-time action detection pipeline described in the following paper:

  @inproceedings{singh2016online,
    title={Online Real time Multiple Spatiotemporal Action Localisation and Prediction},
    author={Singh, Gurkirt and Saha, Suman and Sapienza, Michael and Torr, Philip and Cuzzolin, Fabio},
    booktitle={ICCV},
    year={2017}
  }

References

  1. S. Saha, G. Singh, M. Sapienza, P. H. S. Torr, and F. Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. In British Machine Vision Conference (BMVC), 2016.
  2. P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Learning to track for spatio-temporal action localization. In IEEE International Conference on Computer Vision (ICCV), 2015.
  3. J. C. van Gemert, M. Jain, E. Gati, and C. G. Snoek. APT: Action localization proposals from dense trajectories. In British Machine Vision Conference (BMVC), 2015.
  4. X. Peng and C. Schmid. Multi-region two-stream R-CNN for action detection. In European Conference on Computer Vision (ECCV), 2016.
  5. G. Singh, S. Saha, M. Sapienza, P. H. S. Torr, and F. Cuzzolin. Online Real time Multiple Spatiotemporal Action Localisation and Prediction. In IEEE International Conference on Computer Vision (ICCV), 2017.