Computer Vision Literature Review

Jinwoo's literature review on computer vision and machine learning papers

Contents

Action Recognition

Spatio-Temporal Action Detection

Temporal Action Detection

Spatio-Temporal ConvNets

Action Classification

Video Representation

  • A Closer Look at Spatiotemporal Convolutions for Action Recognition - D. Tran et al., CVPR2018. [code]   "2D+1D separate convolution is better than 3D convolution"

    • 3D ConvNet architecture search on action classification task
    • Baselines implemented using a vanilla ResNet-like architecture (with skip connections)
      • fR2D: 2D convolutions over frames independently
      • R2D: 2D convolutions over the entire clip; reshape the 4D input tensor of shape L×H×W×3 into a 3L×H×W tensor
      • R3D: Use 3D convolutions
      • MCx: Use 3D convolutions in the first x layers, use 2D convolutions in the remaining layers
      • rMCx: Use 2D convolutions in the first x layers, use 3D convolutions in the remaining layers
      • R(2+1)D: Use 2D convolutions + 1D convolutions throughout the entire network. Note that R(2+1)D and R3D have roughly the same number of parameters and the same computational complexity (see the sketch after this list)
    • For all the baselines, a bunch of clips is sampled per video for video classification (average pooling is used to aggregate the clip-level predictions)
      • In contrast, I3D just uses a single clip of L=64 frames sampled at a random position (for both training and testing)
    • Datasets used:
      • Training from scratch: Sports 1M, Kinetics
      • Transfer learning: UCF101, HMDB51
    • Observations
      • 2D + 1D convolution is better than 3D convolution, 2D convolution, and mixed 3D/2D convolutions
      • Mixed 3D and 2D models: MCx (3D conv in early layers) is better than rMCx (3D conv in deeper layers)
        • Motion patterns are important in the earlier layers
        • This is the opposite of the observation by Xie et al.
      • Performance
      • For RGB only and flow only models, R(2+1)D is better than I3D
      • R(2+1)D two-stream model shows slightly worse performance than I3D two-stream model on Kinetics
        • Note that I3D is pretrained on ImageNet, while R(2+1)D is trained from scratch
      • Why is R(2+1)D better than R3D (single RGB/flow models)?
        • Double the number of non-linearities
        • Better for optimization (note that R(2+1)D shows lower training error than R3D)
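
    The (2+1)D factorization is easy to see in code. Below is a minimal PyTorch sketch (my own, not the authors' implementation): one block that replaces a t×d×d 3D convolution with a 1×d×d spatial convolution followed by a t×1×1 temporal convolution, with an extra BatchNorm+ReLU in between (this is where the doubled non-linearity comes from). The mid-channel rule roughly matches the parameter count of the full 3D convolution, as in the paper.

    ```python
    import torch
    import torch.nn as nn

    def mid_channels(in_ch, out_ch, t=3, d=3):
        # parameter-matching rule: choose M so the factorized block has roughly
        # as many weights as a full t x d x d 3D convolution
        return (t * d * d * in_ch * out_ch) // (d * d * in_ch + t * out_ch)

    class Conv2Plus1D(nn.Module):
        """One (2+1)D block: 1 x d x d spatial conv, then t x 1 x 1 temporal conv."""
        def __init__(self, in_ch, out_ch, t=3, d=3):
            super().__init__()
            m = mid_channels(in_ch, out_ch, t, d)
            self.spatial = nn.Conv3d(in_ch, m, kernel_size=(1, d, d),
                                     padding=(0, d // 2, d // 2), bias=False)
            self.bn = nn.BatchNorm3d(m)
            self.relu = nn.ReLU(inplace=True)
            self.temporal = nn.Conv3d(m, out_ch, kernel_size=(t, 1, 1),
                                      padding=(t // 2, 0, 0), bias=False)

        def forward(self, x):  # x: (N, C, L, H, W)
            return self.temporal(self.relu(self.bn(self.spatial(x))))

    x = torch.randn(2, 3, 8, 112, 112)   # a tiny clip: 8 frames of 112x112 RGB
    y = Conv2Plus1D(3, 64)(x)            # -> (2, 64, 8, 112, 112)
    ```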
  • Rethinking Spatiotemporal Feature Learning For Video Understanding - S. Xie et al., arXiv2017.

    "Improving I3D, called S3D-G" In this paper, I3D, which inflates all the 2D filters of the InceptionNet to 3D, is enhanced. First, we replace 3D convolutions in a bottom layers to 2D and get higher accuracy and computation efficiency and more compact model. Second, we separate temporal convolution from spatial convolution in every 3D convolution layer. This also makes higher accuracy, more compact model, and faster speed. Finally, spatiotemporal gating is introduced to further boost the accuracy. We show their model performance on the large scale Kinetics dataset for an ablation study. Also we show the proposed model, S3D-G, is generalizable to other tasks such as action classification and detection.

    • Action classification performance: 96.8% on UCF-101, 75.9% on HMDB-51 (pretrained on Kinetics)
    • Action detection performance: 80.1% on UCF-101, 72.1% on JHMDB (pretrained on Kinetics)
    • Most of the gains may come from the Kinetics pretraining.
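
    A minimal PyTorch sketch of the self-gating idea as I read it (where the gating is inserted in the network and other details are simplified): pool the feature map over space and time, map the pooled vector to per-channel weights with a learned transform and a sigmoid, then rescale the channels.

    ```python
    import torch
    import torch.nn as nn

    class FeatureGating(nn.Module):
        """Squeeze over space-time, produce per-channel gates, rescale channels."""
        def __init__(self, channels):
            super().__init__()
            self.fc = nn.Linear(channels, channels)

        def forward(self, x):                      # x: (N, C, L, H, W)
            w = x.mean(dim=(2, 3, 4))              # global space-time average -> (N, C)
            w = torch.sigmoid(self.fc(w))          # gates in [0, 1], one per channel
            return x * w[:, :, None, None, None]   # channel-wise rescaling

    feat = torch.randn(2, 192, 8, 28, 28)
    gated = FeatureGating(192)(feat)               # same shape, re-weighted channels
    ```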
  • ConvNet Architecture Search for Spatiotemporal Feature Learning - D. Tran et al., arXiv2017. Note: Aka Res3D. [code]: In the repository, C3D-v1.1 is the Res3D implementation.

    "3D version of ResNet" In this paper, a 3D version of Residual Network is introduced to better encode spatio-temporal information in a video by extensive experimental search. We fix the number of parameters to 33M and conduct extensive experiments to find an optimal architecture. The Res3D contains 1) skip connections, 2) using frame sampling rate of 2 or 4 (optimal on UCF-101), 3) spatial resolution 112x112, 4) layer depth 18. We also find that using 3D conv is better than using 2D conv or 2.5D conv (spatial and temporal conv separated). Shows higher accuracy than C3D on UCF101 and HMDB51. 85.8 vs. 82.3 and 54.9 vs. 51.6 respectively. 2 times faster speed and 2 times smaller model size.

  • Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks - Z. Qiu et al., ICCV2017. [code]

  • Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset - J. Carreira et al., CVPR2017. Note: Aka I3D. [code]: training code is not provided. [unofficial code]: training code is provided but not official.

  • Spatiotemporal Residual Networks for Video Action Recognition - C. Feichtenhofer et al., NIPS2016. [code]

  • Learning Spatiotemporal Features with 3D Convolutional Networks - D. Tran et al., ICCV2015. [the official Caffe code] [project web] Note: Aka C3D. [Python Wrapper] Note that the official Caffe does not support a Python wrapper. [TensorFlow], [TensorFlow + Keras], [Another TensorFlow Implementation], [Keras C3D Project web]: [Keras code], [Pretrained weights].

Miscellaneous

Action Recognition Datasets

Video Annotation

Object Recognition

Object Detection (Image)

Video Object Detection

  • [Detect to Track and Track to Detect] - C. Feichtenhofer et al., ICCV2017. [code], [project web]

    "Video Object Detection and Tracking using R-FCN"

    • Built on top of two frame-level ConvNets: one for frame t and the other for frame t + $\tau$
    • Propose a multi-task objective consisting of 1) classification loss, 2) bbox regression loss, 3) tracking loss
      • The tracking loss is smooth L1 loss between ground truth and a "tracking regression value" for frame t + $\tau$
      • A correlation feature map between the detections at frame t and the search candidates at frame t + $\tau$ is computed (a rough sketch follows after this list)
      • RoI Pooling operation is applied to the correlation feature map
    • Evaluation on the ImageNet VID dataset
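
    A minimal PyTorch sketch of the correlation feature map (naive loops for clarity; the paper uses an efficient correlation layer restricted to a local displacement neighbourhood). It correlates the frame-t features with spatially shifted copies of the frame t + τ features, producing one output channel per displacement; RoI pooling would then be applied to this map.

    ```python
    import torch
    import torch.nn.functional as F

    def local_correlation(feat_t, feat_tau, max_disp=8):
        """One output channel per displacement in [-max_disp, max_disp]^2."""
        n, c, h, w = feat_t.shape
        padded = F.pad(feat_tau, [max_disp] * 4)       # pad left/right/top/bottom
        maps = []
        for dy in range(2 * max_disp + 1):
            for dx in range(2 * max_disp + 1):
                shifted = padded[:, :, dy:dy + h, dx:dx + w]
                # channel-wise dot product at every spatial position
                maps.append((feat_t * shifted).sum(dim=1, keepdim=True) / c)
        return torch.cat(maps, dim=1)                  # (N, (2*max_disp+1)^2, H, W)

    f_t, f_tau = torch.rand(1, 256, 38, 50), torch.rand(1, 256, 38, 50)
    corr = local_correlation(f_t, f_tau)               # (1, 289, 38, 50)
    ```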
  • [Flow-Guided Feature Aggregation for Video Object Detection] - X. Zhu et al., ICCV2017. [code], aka FGFA

    "Using optical flow to guide the temporal feature aggregation for frame-level detection"

    • Temporally aggregating the frame-level features
      • Use FlowNet to estimate the motion between reference frame and nearby frames
      • Warp the nearby frames' feature maps to the reference frame with a bilinear warping function
      • Temporally aggregate the feature map of the reference frame and the feature maps of the warped nearby frames
        • Use element-wise summation with adaptive weights for the aggregation
        • Adaptive weights are computed by a cosine similarity measure between the reference frame feature and the nearby frame feature (see the sketch after this list)
    • Apply temporal dropout during training
      • Randomly drop nearby frames, e.g., drop 3 frames when the testing frame range is 5 and the training frame range is 2
      • This means long-term temporal context can be incorporated by using a long frame range at test time, while only a training frame range of 2 is used to reduce the computation/memory requirements during training
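
    A minimal PyTorch sketch of the warping-plus-aggregation step as I read it (the paper computes the adaptive weights on embedded features and estimates the flow with FlowNet; here raw feature maps and a dummy flow stand in): warp each nearby feature map to the reference frame with a flow field, then sum all maps with per-position weights from cosine similarity to the reference feature.

    ```python
    import torch
    import torch.nn.functional as F

    def warp(feat, flow):
        """Bilinearly warp feat (N, C, H, W) with a flow field (N, 2, H, W) of (dx, dy)."""
        n, _, h, w = feat.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        gx = 2 * (xs.float() + flow[:, 0]) / (w - 1) - 1    # normalize x to [-1, 1]
        gy = 2 * (ys.float() + flow[:, 1]) / (h - 1) - 1    # normalize y to [-1, 1]
        grid = torch.stack((gx, gy), dim=-1)                # (N, H, W, 2)
        return F.grid_sample(feat, grid, align_corners=True)

    def aggregate(ref_feat, nearby_feats, flows):
        """Warp nearby feature maps, then sum with cosine-similarity adaptive weights."""
        feats = [warp(f, fl) for f, fl in zip(nearby_feats, flows)] + [ref_feat]
        sims = [F.cosine_similarity(f, ref_feat, dim=1).unsqueeze(1) for f in feats]
        weights = torch.softmax(torch.cat(sims, dim=1), dim=1)   # normalize over frames
        return (weights.unsqueeze(2) * torch.stack(feats, dim=1)).sum(dim=1)

    ref = torch.rand(1, 256, 36, 64)
    near = [torch.rand(1, 256, 36, 64) for _ in range(2)]
    flows = [torch.zeros(1, 2, 36, 64) for _ in range(2)]          # e.g. FlowNet output
    agg = aggregate(ref, near, flows)                              # (1, 256, 36, 64)
    ```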

Video Object Detection Datasets

Pose Estimation

Pose Estimation

  • Detect-and-Track: Efficient Pose Estimation in Videos - R. Girdhar et al., arXiv2017.

    "Pose tracking by 3D Mask R-CNN"

    • Two-stage approach: 1)dense prediction, 2)link (track) afterwards
    • Use 3D Mask R-CNN to detect body keypoints every frame
      • Convert the 2D convolutions of ResNet to 3D convolutions
      • First show that using the 2D Mask R-CNN achieves state-of-the-art performance
      • Then show that the proposed "inflated" 3D Mask R-CNN performs better than its 2D counterpart when using the same backbone architecture
      • Propose tube proposal network which regresses tube anchors
        • Tube anchors are nothing but spatial anchors duplicated in time
    • Use bipartite matching to link the predictions over time (see the matching sketch after this list)
    • Evaluate on PoseTrack dataset
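
    A minimal sketch of the final linking step (my own simplification; the paper experiments with several matching costs such as box overlap, pose distance, and a learned metric): detections in consecutive frames are associated by solving a bipartite matching, here with a 1 − IoU cost and SciPy's Hungarian solver.

    ```python
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def iou(a, b):
        """IoU of two boxes given as [x1, y1, x2, y2]."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    def link_frames(prev_boxes, curr_boxes):
        """Match detections in consecutive frames by minimizing total (1 - IoU) cost."""
        cost = np.array([[1.0 - iou(p, c) for c in curr_boxes] for p in prev_boxes])
        rows, cols = linear_sum_assignment(cost)      # Hungarian algorithm
        return list(zip(rows.tolist(), cols.tolist()))

    prev = [[10, 10, 50, 80], [60, 20, 100, 90]]
    curr = [[62, 22, 102, 92], [12, 11, 52, 82]]
    print(link_frames(prev, curr))                    # [(0, 1), (1, 0)]
    ```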
  • OpenPose Library - Caffe based realtime pose estimation library from CMU.

  • Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields - Z. Cao et al., CVPR2017. [code] depends on the [caffe RT pose] - Earlier version of OpenPose from CMU

Licenses

License

CC0

To the extent possible under law, Jinwoo Choi has waived all copyright and related or neighboring rights to this work.

Contributing

Please read the contribution guidelines. Then please feel free to send me pull requests or email (jinchoi@vt.edu) to add links.