Machine Learning Papers

Introduction

Microsoft COCO: Common Objects in Context: New image recognition, segmentation, and captioning dataset. Link to COCO Dataset
- Image annotation with Amazon's Mechanical Turk
- Bounding Box Detection with DPMv5-P/DPMv5-C
YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video: New large-scale data set of video URLs with densely sampled object bounding box annotations (approximately 380,000 video segments, each about 19 s long).
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions: New video dataset. Every person is localized using a bounding box and the attached labels correspond to actions being performed by the person. There is one action corresponding to the pose of the person (whether he or she is standing, sitting, walking, swimming etc.) and there may be additional actions corresponding to interactions with objects or human-human interactions. The main differences with existing video datasets are:
- the definition of atomic visual actions, which avoids collecting data for each and every complex action
- precise spatio-temporal annotations with possibly multiple annotations for each human
- the use of diverse, realistic video material (movies)
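The annotation structure described above (one bounding box per person, exactly one pose action, and optional interaction actions) can be sketched as a small data structure. This is only an illustration: the class name, field names, and the box coordinate convention are assumptions and do not reflect the actual AVA file format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PersonAnnotation:
    """Hypothetical container for one annotated person in one frame."""

    # Assumed convention: (x_min, y_min, x_max, y_max), normalized to [0, 1].
    box: Tuple[float, float, float, float]
    # Exactly one action describing the pose (standing, sitting, ...).
    pose_action: str
    # Zero or more actions for object or human-human interactions.
    interaction_actions: List[str] = field(default_factory=list)

    def labels(self) -> List[str]:
        """Return all action labels attached to this person's box."""
        return [self.pose_action] + self.interaction_actions


person = PersonAnnotation(
    box=(0.1, 0.2, 0.5, 0.9),
    pose_action="sitting",
    interaction_actions=["talking to another person"],
)
print(person.labels())
```

Keeping the pose action as a required field and the interactions as an optional list mirrors the "one pose action plus possibly additional actions" rule in the description.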

Google research blog: Supercharge your Computer Vision models with the TensorFlow Object Detection API: Google won the COCO detection challenge with their in-house object detection system. This system is available via the TensorFlow Object Detection API.
SSD: Single Shot MultiBox Detector: Object detection with a single deep neural network. GitHub code
Speed/accuracy trade-offs for modern convolutional object detectors: Guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. The authors compare the SSD, Faster R-CNN, and R-FCN meta-architectures under different architectural configurations, such as the feature extractor, matching, and box encoding.
List of feature extractors:
- Inception Resnet V2: Deep convolutional networks. Blog article
- Inception architecture v3: Deep convolutional networks. TensorFlow GitHub code
- VGG-16
- MobileNet
- Inception V2
- ResNet-101
Spatially Adaptive Computation Time for Residual Networks: This paper proposes a deep learning architecture based on Residual Networks that dynamically adjusts the number of executed layers for different regions of the image.
TensorFlow Object Detection API example
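To make the idea of "a different number of executed layers per image region" concrete, here is a minimal sketch of an adaptive-computation-time-style halting rule applied independently at each spatial position. It assumes each residual unit emits a halting score in [0, 1] per position and that a position stops once its cumulative score reaches 1 - eps. The function names, the `eps` value, and the nested-list layout are illustrative assumptions, not the paper's implementation.

```python
def units_executed(halting_scores, eps=0.01):
    """Count how many units run at one position before halting.

    halting_scores: per-unit scores in [0, 1] for a single position.
    A position halts at the first unit where the running sum >= 1 - eps;
    if that never happens, all units are executed.
    """
    total = 0.0
    for n, score in enumerate(halting_scores, start=1):
        total += score
        if total >= 1.0 - eps:
            return n
    return len(halting_scores)


def executed_units_map(score_maps, eps=0.01):
    """Apply the halting rule at every position.

    score_maps[u][y][x] is the (assumed) halting score of unit u at (y, x).
    Returns a 2D list with the number of executed units per position.
    """
    height = len(score_maps[0])
    width = len(score_maps[0][0])
    return [
        [units_executed([m[y][x] for m in score_maps], eps) for x in range(width)]
        for y in range(height)
    ]


# Three units, a 1x2 "image": the left position halts early, the right one late.
scores = [
    [[0.995, 0.2]],
    [[0.0, 0.3]],
    [[0.0, 0.6]],
]
print(executed_units_map(scores))
```

In this toy input the left position stops after the first unit while the right position needs all three, which is the kind of per-region variation the paper's summary describes.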

FaceNet: A Unified Embedding for Face Recognition and Clustering: Directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. The benefit of this approach is much greater representational efficiency: the authors achieve state-of-the-art face recognition performance using only 128 bytes per face.
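Once faces are mapped to such an embedding space, comparing two faces reduces to a distance computation. The sketch below illustrates this with plain Python lists standing in for embeddings. The helper names and the threshold value are hypothetical choices for illustration, and the toy vectors are not real model outputs.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def same_identity(embedding_a, embedding_b, threshold=1.1):
    """Decide whether two embeddings likely show the same face.

    The threshold is an illustrative assumption; in practice it would be
    tuned on a validation set for the specific model.
    """
    a = l2_normalize(embedding_a)
    b = l2_normalize(embedding_b)
    return squared_distance(a, b) <= threshold


# Toy 2-D vectors for readability; real embeddings would be 128-dimensional.
print(same_identity([1.0, 0.0], [0.9, 0.1]))
print(same_identity([1.0, 0.0], [-1.0, 0.0]))
```

Because the decision depends only on distances, the same embeddings can also be fed to a standard clustering or nearest-neighbor routine without retraining anything.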