Implementation of a new video augmentation strategy for video action recognition with 3D CNNs.
- References
  - VideoMix paper: https://arxiv.org/abs/2012.03457
  - pytorch-i3d (basic video transformation code): https://github.com/piergiaj/pytorch-i3d
Tubemix is a 'video-in-video' augmentation, and Stackmix is a 'video-to-video' augmentation. Use videomix.py
to apply Tubemix and Stackmix.
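The two operations can be illustrated with a minimal NumPy sketch. It is not the code in videomix.py. It assumes that 'video-in-video' means pasting one spatial box from a second clip into every frame of the first (a tube), and that 'video-to-video' means taking the leading frames from one clip and the remaining frames from the other. The function names, the `(C, T, H, W)` clip layout, and the CutMix-style box sampling are assumptions made for illustration.

```python
import numpy as np


def rand_box(height, width, lam, rng):
    """Sample a CutMix-style box covering roughly (1 - lam) of the frame (assumed scheme)."""
    cut_ratio = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(height * cut_ratio), int(width * cut_ratio)
    cy, cx = rng.integers(height), rng.integers(width)
    # Clip the box to the frame, so the real area can be smaller than requested.
    y1, y2 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, height)
    x1, x2 = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, width)
    return y1, y2, x1, x2


def tubemix(clip_a, clip_b, lam, rng):
    """Paste the same spatial box from clip_b into every frame of clip_a.

    Clips are arrays of shape (C, T, H, W). Returns the mixed clip and the
    fraction of pixels that still come from clip_a.
    """
    _, _, h, w = clip_a.shape
    y1, y2, x1, x2 = rand_box(h, w, lam, rng)
    mixed = clip_a.copy()
    mixed[:, :, y1:y2, x1:x2] = clip_b[:, :, y1:y2, x1:x2]
    lam_adj = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)
    return mixed, lam_adj


def stackmix(clip_a, clip_b, lam):
    """Keep the first round(lam * T) frames of clip_a, then continue with clip_b.

    Returns the mixed clip and the fraction of frames taken from clip_a.
    """
    t = clip_a.shape[1]
    t_a = int(round(lam * t))
    mixed = clip_b.copy()
    mixed[:, :t_a] = clip_a[:, :t_a]
    return mixed, t_a / t
```

With either function, the returned fraction can be used to mix the one-hot labels of the two clips as `lam * y_a + (1 - lam) * y_b`, as is usual for CutMix-style augmentation.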
Example clips (images not shown): RGB and optical-flow (u component) frames from "TableTennis" and "Archery", each followed by the corresponding Stackmix and Tubemix results.
Training on I3D with Stackmix and Tubemix augmentation.
* RCRF denotes random crop plus random flipping. The probability of applying Stackmix or Tubemix is fixed at p=0.5.
I've also explored how the beta distribution affects training accuracy. Figure below shows the PDF of $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$.
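As a rough illustration of the sampling described above, the sketch below draws a mixing ratio with probability p and reports how α changes the spread of Beta(α, α). The function names are hypothetical and this is not the repository's code; it only uses the Python standard library.

```python
import random


def sample_mix_lambda(alpha, p=0.5, rng=random):
    """Return lambda in [0, 1], or None when the clip is left unmixed."""
    if rng.random() >= p:
        return None  # skip Stackmix/Tubemix for this sample
    return rng.betavariate(alpha, alpha)


def beta_variance(alpha):
    """Variance of the symmetric Beta(alpha, alpha) distribution."""
    return 1.0 / (4.0 * (2.0 * alpha + 1.0))


# The mean is always 0.5; only the spread around it depends on alpha.
for alpha in (0.4, 1, 2, 8):
    print(f"alpha={alpha}: variance={beta_variance(alpha):.4f}")
```

A small α (such as 0.4) pushes λ toward 0 or 1, so one clip usually dominates the mix. A large α (such as 8) concentrates λ near 0.5, so the two clips contribute about equally.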
- UCF-101
Augmentation method | hyper-parameter | Spatial Stream | Temporal Stream |
---|---|---|---|
Baseline(RCRF) | - | 93.23% | 91.36% |
Tubemix | α=8 | 93.97% | 92.23% |
Tubemix | α=0.4 | 93.74% | 93.02% |
Tubemix | α=2 | 93.52% | 92.55% |
Stackmix | α=8 | 94.23% | 92.68% |
Stackmix | α=0.4 | 94.29% | 93.05% |
Stackmix | α=2 | 94.00% | 92.84% |
Stackmix | α=1 | 93.97% | 93.34% |
- HMDB-51
Augmentation method | hyper-parameter | Spatial Stream | Temporal Stream |
---|---|---|---|
Baseline(RCRF) | - | 74.97% | 75.62% |
Tubemix | α=8 | 74.05% | 76.80% |
Tubemix | α=0.4 | 74.31% | 77.12% |
Tubemix | α=2 | 73.79% | 76.99% |
Tubemix | α=1 | 74.71% | 77.06% |
Stackmix | α=8 | 73.59% | 76.80% |
Stackmix | α=0.4 | 74.05% | 77.25% |
Stackmix | α=2 | 74.58% | 76.21% |
Stackmix | α=1 | 73.99% | 75.88% |
On UCF-101, Stackmix and Tubemix improved accuracy in both streams, by up to about 1 point on the spatial stream and about 2 points on the temporal stream. On HMDB-51, however, temporal-stream accuracy improved (by up to about 1.6 points) while spatial-stream accuracy dropped slightly for every setting.
- Speed jittering augmentation is a work in progress.