
Implementation of:
Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri. Learning Spatiotemporal Features With 3D Convolutional Networks. Proceedings of the IEEE International Conference on Computer Vision. 2015. full article

Updated to Keras 2.2.4


You can download the weights of the model trained on the original dataset (Sports-1M) here:

To run c3d/example.py, place the weights in a models directory, or change the path inside example.py accordingly.

The weights were converted from the Caffe format with the code and instructions in this project:

How to run

  1. Build a Docker image:
    Run make build from the top directory.

  2. Check if everything works:
    Run make example from the top directory.

You should see this message if everything is OK: [image: success message]

Model's architecture

Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 16, 112, 112, 3)   0         
conv1 (Conv3D)               (None, 16, 112, 112, 64)  5248      
pool1 (MaxPooling3D)         (None, 16, 56, 56, 64)    0         
conv2 (Conv3D)               (None, 16, 56, 56, 128)   221312    
pool2 (MaxPooling3D)         (None, 8, 28, 28, 128)    0         
conv3a (Conv3D)              (None, 8, 28, 28, 256)    884992    
conv3b (Conv3D)              (None, 8, 28, 28, 256)    1769728   
pool3 (MaxPooling3D)         (None, 4, 14, 14, 256)    0         
conv4a (Conv3D)              (None, 4, 14, 14, 512)    3539456   
conv4b (Conv3D)              (None, 4, 14, 14, 512)    7078400   
pool4 (MaxPooling3D)         (None, 2, 7, 7, 512)      0         
conv5a (Conv3D)              (None, 2, 7, 7, 512)      7078400   
conv5b (Conv3D)              (None, 2, 7, 7, 512)      7078400   
zeropad5 (ZeroPadding3D)     (None, 2, 8, 8, 512)      0         
pool5 (MaxPooling3D)         (None, 1, 4, 4, 512)      0         
flatten_2 (Flatten)          (None, 8192)              0         
fc6 (Dense)                  (None, 4096)              33558528  
dropout_1 (Dropout)          (None, 4096)              0         
fc7 (Dense)                  (None, 4096)              16781312  
dropout_2 (Dropout)          (None, 4096)              0         
fc8 (Dense)                  (None, 487)               1995239   
Total params: 79,991,015
Trainable params: 79,991,015
Non-trainable params: 0
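
The parameter counts in the summary above can be checked by hand. The following is a small sketch, not part of this repository: it assumes every Conv3D layer uses a 3x3x3 kernel with a bias term (consistent with the 5248 parameters of conv1 and with the C3D paper), and the helper names conv3d_params and dense_params are made up for illustration.

```python
# Sanity check of the "Param #" column, assuming 3x3x3 kernels with biases.
# Helper names are illustrative only; they do not exist in this repository.

def conv3d_params(in_channels, out_channels, kernel=(3, 3, 3)):
    """Weights (kd*kh*kw*in*out) plus one bias per output channel."""
    kd, kh, kw = kernel
    return kd * kh * kw * in_channels * out_channels + out_channels


def dense_params(in_units, out_units):
    """Weight matrix (in*out) plus one bias per output unit."""
    return in_units * out_units + out_units


# (name, input channels, output channels), read off the summary table.
conv_layers = [
    ("conv1", 3, 64),
    ("conv2", 64, 128),
    ("conv3a", 128, 256),
    ("conv3b", 256, 256),
    ("conv4a", 256, 512),
    ("conv4b", 512, 512),
    ("conv5a", 512, 512),
    ("conv5b", 512, 512),
]

# (name, input units, output units); 8192 = 1 * 4 * 4 * 512 after pool5.
dense_layers = [
    ("fc6", 8192, 4096),
    ("fc7", 4096, 4096),
    ("fc8", 4096, 487),
]

counts = {name: conv3d_params(i, o) for name, i, o in conv_layers}
counts.update({name: dense_params(i, o) for name, i, o in dense_layers})
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:7s} {n:>10,d}")
print(f"total   {total:>10,d}")
```

Pooling, padding, flatten and dropout layers have no trainable parameters, so they do not contribute to the total.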

Feature extraction

To extract features from a video:

  1. Divide the video into 16-frame chunks, with consecutive chunks overlapping by 8 frames (each chunk has dim=(16, 112, 112, 3)).
  2. Use the output of the first fully connected layer, fc6, as the features of a single chunk (implemented in sport1m_model.create_features_exctractor).
  3. Average the feature vectors extracted from all chunks to form a single vector (dim=4096).
  4. L2-normalize the vector.
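
Steps 1, 3 and 4 can be sketched in plain Python as follows. This is an illustration under assumptions, not the repository's code: the function names are hypothetical, extract_fc6 stands in for whatever callable maps one 16-frame chunk to its fc6 vector, and trailing frames that do not fill a whole chunk are simply dropped (the steps above do not say how to handle them).

```python
import math

# Hypothetical helpers; none of these names exist in this repository.

def chunk_starts(num_frames, chunk_len=16, overlap=8):
    """Start indices of full chunks; the stride is chunk_len - overlap."""
    stride = chunk_len - overlap
    # Assumption: leftover frames that do not fill a chunk are dropped.
    return list(range(0, num_frames - chunk_len + 1, stride))


def average_features(per_chunk):
    """Element-wise mean of equally sized feature vectors."""
    n = len(per_chunk)
    return [sum(column) / n for column in zip(*per_chunk)]


def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit Euclidean length (eps guards against a zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def video_descriptor(frames, extract_fc6, chunk_len=16, overlap=8):
    """Chunk the frames, extract one vector per chunk, average, normalize.

    extract_fc6 is a placeholder for a function that returns the fc6
    activations (a sequence of floats) for one chunk of frames.
    """
    starts = chunk_starts(len(frames), chunk_len, overlap)
    if not starts:
        raise ValueError("video is shorter than one chunk")
    feats = [extract_fc6(frames[s:s + chunk_len]) for s in starts]
    return l2_normalize(average_features(feats))


# Toy usage with a fake extractor that returns 2 values instead of 4096.
fake_frames = list(range(32))
descriptor = video_descriptor(fake_frames, lambda chunk: [float(chunk[0]), 1.0])
print(chunk_starts(len(fake_frames)))
print(descriptor)
```

With real data the same steps would more naturally be written with NumPy arrays; the list-based version is used here only to keep the sketch dependency-free.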

Note: The authors used 2-second clips during training.