In this work, I built a model which can detect and then label the objects in videos and images. The resnet-101 model is pre-trained on the COCO dataset. The technique of masking is used along with object labelling using Mask-RCNN, instance segmentation, and Computer Vision. Samples folder contains the video file on which detection and segmentation is done.
Download the pretrained model weights from here https://github.com/matterport/Mask_RCNN/releases