Scripts for video recognition
Test acc = video classification accuracy on the UCF-101 [1] test set (split 1), %
Model, fine tuning layers | Test acc | Parameters |
---|---|---|
VGG-16, last layer | 73.6 | default |
VGG-16, fc layers | 76.5 | default |
VGG-16, fc layers | 77.0 | dropout=0.8 |
VGG-16, fc layers | 77.3 | dropout=0.8, weight decay=5e-4 |
VGG-16, all layers | 75.3 | batch size=32, max_iter=40k, step_size=20k |
ResNet-50, last layer | 76.5 | weight decay=5e-4 |
ResNet-50, all layers | 79.5 | batch size=32, max_iter=step_size=20k, weight decay=5e-4 |
ResNet-50, last layer | 78.5 | pool5 layer modified, weight decay=5e-4 |
ResNet-50, last layer | 79.4 | pool5 layer modified, weight decay=5e-4, dropout=0.5 |
ResNet-50, all layers | 80.4 (81.4)* | batch size=32, max_iter=step_size=20k, pool5 layer modified, weight decay=5e-4, dropout=0.5 |
*all video frames used for prediction (be default only 25 frames with stride 3 are used for prediction) |
Model, fine tuning layers | Test acc | Parameters |
---|---|---|
ResNet-50, LSTM+last layer (code) | 71.7 (73.1) | 512 units in LSTM, 25 frames for training, 25 (75) frames for prediction |
ResNet-50, bnorm+LSTM+last layer (code) | 74.0 | 512 units in LSTM, 25 frames for training, 75 frames for prediction |
Model, fine tuning layers | Test acc* |
---|---|
CNN-M-2048, last layer | 72.7 [2] |
Improved DT+FV | 85.9 [3] |
State of the art | 95.6 [4] |
*accuracy averaged over 3 splits in some works |
[1] UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild, 2012
[2] Two-Stream Convolutional Networks for Action Recognition in Videos, 2014
[3] Action recognition with improved trajectories, 2013
[4] Deep Temporal Linear Encoding Networks, 2016