Speed prediction from video input

Uses a 3D convolutional network over 20 timesteps, as described in "Learning Spatiotemporal Features with 3D Convolutional Networks". Predictions are inaccurate at first but improve after 9 minutes. Better hyperparameter tuning or a different network choice is still needed.
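
To make the core operation concrete, here is a minimal pure-Python sketch of a single 3D convolution applied to a 20-timestep clip. It is not this project's code. It assumes each timestep is one single-channel frame, and the 4x4 frame size, the all-ones values, and the names `conv3d_valid`, `clip`, and `kernel` are hypothetical. The 3x3x3 kernel size follows the cited paper. A real network would stack many such layers with learned weights in a deep learning framework.

```python
def conv3d_valid(clip, kernel):
    """Naive 3D convolution with no padding and stride 1.

    Computed as cross-correlation (no kernel flip), which is the
    convention deep learning libraries use.
    clip:   nested list indexed [t][y][x] (time, height, width)
    kernel: nested list indexed [dt][dy][dx]
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # Sum over a small block spanning time as well as space.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Hypothetical toy input: 20 frames of 4x4 pixels, all ones.
TIMESTEPS = 20
clip = [[[1.0] * 4 for _ in range(4)] for _ in range(TIMESTEPS)]

# 3x3x3 kernel of ones (a learned kernel would hold trained weights).
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]

features = conv3d_valid(clip, kernel)
print(len(features), len(features[0]), len(features[0][0]))  # prints: 18 2 2
```

Because the kernel spans three consecutive frames, each output value mixes information across time, which is what lets such a network pick up motion cues relevant to speed. Without padding, the 20 timesteps shrink to 18 after one layer.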