Video action recognition with a two-stream CNN on the MS-ASL dataset.
The model consists of two CNNs: one takes a single RGB frame, the other a stack of grayscale optical-flow images computed from the video. The two streams are fused before the last fully-connected layer (early fusion).
A detailed description of the dataset can be found in the official MS-ASL paper.
- Learning rate: 0.001
- Number of epochs: 32
- Batch size: 64
- Loss function: Cross entropy loss
- Optimizer: Adam
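The hyperparameters above could be wired up in PyTorch roughly as follows. This is a sketch: `model` is a stand-in placeholder, not the actual two-stream network used in this project.

```python
import torch
import torch.nn as nn

NUM_EPOCHS = 32   # from the list above
BATCH_SIZE = 64   # from the list above

# Placeholder model; in this project it would be the two-stream CNN.
model = nn.Linear(10, 5)

criterion = nn.CrossEntropyLoss()                          # cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001) # Adam, lr = 0.001
```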
The two-stream CNN architecture is the one proposed in the paper by Karen Simonyan and Andrew Zisserman; the only difference is the use of early fusion instead of late fusion.
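A minimal PyTorch sketch of the early-fusion idea is shown below. The layer sizes, input resolution, and the number of stacked flow channels are illustrative assumptions, not the values used in this repository; the point is only that the two feature vectors are concatenated before a single shared fully-connected classifier.

```python
import torch
import torch.nn as nn

class TwoStreamEarlyFusion(nn.Module):
    """Two-stream network with early fusion: spatial (RGB) and temporal
    (stacked optical-flow) features are concatenated BEFORE the final
    fully-connected layer. Layer sizes here are illustrative only."""

    def __init__(self, num_classes, flow_channels=20):
        super().__init__()
        # Spatial stream: a single RGB frame (3 channels).
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Temporal stream: stacked grayscale optical-flow images.
        self.temporal = nn.Sequential(
            nn.Conv2d(flow_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Early fusion: concatenated features feed one shared classifier.
        self.classifier = nn.Linear(16 + 16, num_classes)

    def forward(self, rgb, flow):
        fused = torch.cat([self.spatial(rgb), self.temporal(flow)], dim=1)
        return self.classifier(fused)

model = TwoStreamEarlyFusion(num_classes=100)
rgb = torch.randn(2, 3, 224, 224)    # batch of single RGB frames
flow = torch.randn(2, 20, 224, 224)  # batch of 20 stacked flow images
out = model(rgb, flow)
print(out.shape)  # torch.Size([2, 100])
```

In late fusion, by contrast, each stream would have its own classifier and only the class scores would be averaged at the end.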