This repo implements the network structure of P3D [1] in PyTorch. The pre-trained model weights are converted from the caffemodel provided in the author's repo.
- pytorch
- numpy
In the author's official repo, only P3D-199 is released. Besides this deepest P3D-199, I also implement P3D-63 and P3D-131, which are modified from ResNet50-3D and ResNet101-3D respectively. These two smaller nets may be more convenient for users whose GPUs have limited memory.
(Pretrained weights of P3D63 and P3D131 are not yet supported)
(Tips: I am sorry that I had to hide the download URLs of the pretrained weights for private reasons. For more information, you can email me.) (New tips: the model weights are now available.)
1. P3D-199 trained on the Kinetics dataset:
2. P3D-199 trained on Kinetics Optical Flow (TVL1):
3. P3D-199 trained on Kinetics600, RGB, 224 & 299:
BaiduYun url, Google Drive (change the GAP kernel size from 5 to 7 for 224 input, or to 9 for 299 input)
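The kernel sizes mentioned above can be kept in a small lookup. This is only a sketch: `GAP_KERNEL_BY_INPUT` and `gap_kernel_for` are hypothetical names that are not part of this repo, and the 160 entry is inferred from the usage example, which feeds 160x160 input to the default kernel of 5.

```python
# Hypothetical helper: maps the spatial input size to the GAP (global average
# pooling) kernel size listed in this README. Not part of p3d_model.
GAP_KERNEL_BY_INPUT = {
    160: 5,  # inferred: the usage example feeds 160x160 input to the default kernel
    224: 7,  # stated in the README
    299: 9,  # stated in the README
}


def gap_kernel_for(input_size):
    """Return the GAP kernel size for a square input of the given side length."""
    if input_size not in GAP_KERNEL_BY_INPUT:
        raise ValueError("no GAP kernel size documented for input size %r" % (input_size,))
    return GAP_KERNEL_BY_INPUT[input_size]


if __name__ == "__main__":
    for size in (160, 224, 299):
        print(size, "->", gap_kernel_for(size))
```

The returned value is what you would write in place of 5 where the pooling layer is defined in the model code.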
from __future__ import print_function
from p3d_model import *
import torch
model = P3D199(pretrained=True, num_classes=400)
model = model.cuda()
data = torch.autograd.Variable(torch.rand(10, 3, 16, 160, 160)).cuda()  # input is (batch, channels, frames, height, width); if modality=='Flow', change the 2nd dimension from 3 to 2
out = model(data)
print(out.size(), out)
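The input layout used above can be written down as a small helper. This is a minimal sketch: `p3d_input_shape` is a hypothetical name, not a function from this repo, and its defaults simply mirror the example (batch of 10, 16 frames, 160x160 pixels).

```python
# Hypothetical helper: builds the 5-D input shape expected by the example,
# laid out as (batch, channels, frames, height, width).
CHANNELS_BY_MODALITY = {
    "RGB": 3,   # three colour channels
    "Flow": 2,  # two optical-flow channels, as noted in the example's comment
}


def p3d_input_shape(modality="RGB", batch=10, frames=16, size=160):
    """Return the input shape tuple for the given modality."""
    if modality not in CHANNELS_BY_MODALITY:
        raise ValueError("modality must be 'RGB' or 'Flow', got %r" % (modality,))
    return (batch, CHANNELS_BY_MODALITY[modality], frames, size, size)


if __name__ == "__main__":
    print(p3d_input_shape("RGB"))
    print(p3d_input_shape("Flow"))
```

A random test input for the flow model could then be created with `torch.rand(*p3d_input_shape("Flow"))`.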
- ST-Structures:
All P3D models in this repo support various combinations of ST-Structures, such as ('A','B','C'), ('A','B') and ('A'). The code is as follows.
model = P3D63(ST_struc=('A','B'))
model = P3D131(ST_struc=('C'))
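To illustrate what a mixed `ST_struc` could mean for a stack of residual blocks, here is a minimal sketch. It assumes the block types are assigned cyclically in the given order, as the P3D paper describes for its mixed network; whether this repo assigns them in exactly this way is an assumption, and `assign_st_types` is a hypothetical helper.

```python
# Hypothetical helper: assigns one ST-Structure type per residual block by
# cycling through ST_struc in order (assumed behaviour, see the text above).
def assign_st_types(st_struc, num_blocks):
    """Return a list with the block type ('A', 'B' or 'C') of each block."""
    # ('C') is just the string 'C' in Python; tuple() handles both a
    # single-letter string and a real tuple such as ('A', 'B').
    types = tuple(st_struc)
    if not types:
        raise ValueError("st_struc must contain at least one block type")
    return [types[i % len(types)] for i in range(num_blocks)]


if __name__ == "__main__":
    print(assign_st_types(('A', 'B', 'C'), 5))
    print(assign_st_types(('A', 'B'), 5))
    print(assign_st_types(('C'), 3))
```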
- Flow and RGB models:
Set the parameter modality='RGB' for the RGB model, or modality='Flow' for the flow model. The flow model is trained on TVL1 optical flow images.
model = P3D199(pretrained=True, modality='Flow')
- Finetune the model:
When fine-tuning the models on your custom dataset, use get_optim_policies() to set different learning rates for different layers. For example, when the dataset is small, only the several deepest layers need to be trained: set slow_rate=0.8 in the code and adjust the lr_mult and decay_mult values that follow it.
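The idea of giving shallow and deep layers different learning rates can be sketched without PyTorch. This is not the repo's `get_optim_policies()`: the helper names, the reading of `slow_rate` as the fraction of shallowest layers placed in the slow group, and the multiplier values are all assumptions made for illustration.

```python
# Hypothetical sketch of layer-wise learning-rate policies.
def split_policies(layer_names, slow_rate=0.8, slow_lr_mult=0.0, fast_lr_mult=1.0):
    """Split layers (ordered shallow to deep) into a slow and a fast group."""
    n_slow = int(len(layer_names) * slow_rate)  # assumed meaning of slow_rate
    return [
        {"name": "slow", "params": layer_names[:n_slow],
         "lr_mult": slow_lr_mult, "decay_mult": 1.0},
        {"name": "fast", "params": layer_names[n_slow:],
         "lr_mult": fast_lr_mult, "decay_mult": 1.0},
    ]


def effective_hparams(policies, base_lr, base_weight_decay):
    """Return {group name: (learning rate, weight decay)} after scaling."""
    return {
        group["name"]: (base_lr * group["lr_mult"],
                        base_weight_decay * group["decay_mult"])
        for group in policies
    }


if __name__ == "__main__":
    layers = ["layer%d" % i for i in range(10)]  # placeholder layer names
    policies = split_policies(layers, slow_rate=0.8)
    print(policies[1]["params"])  # the deepest layers, which keep the full rate
    print(effective_hparams(policies, base_lr=0.01, base_weight_decay=5e-4))
```

With a real model, each group's `params` would hold parameter tensors rather than names, and the optimizer's per-group learning rate would be set to the base rate times `lr_mult`.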
Please cite this repo if you make use of it.
Some of these results outperform the state of the art.
- Action recognition (mean accuracy on UCF101):
modality/model | RGB | Flow | Fusion |
---|---|---|---|
P3D199 (Sports-1M) | 88.5% | - | - |
P3D199 (Kinetics) | 91.2% | 92.4% | 98.3% |
- Action localization (mAP on Thumos14):
Step | per-frame | localization |
---|---|---|
P3D199 (Sports-1M) | 0.451 | 0.25 |
P3D199 (Kinetics) | 0.569 (fused) | 0.307 |
Reference:
[1] Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks, ICCV 2017