/mot-papers

A collection of Multiple Object Tracking (MOT) papers from recent years, with notes.

ALL MOT CORE PAPERS

CVPR 2017

CVPR 2016

CVPR 2015

CVPR 2014

CVPR 2013

ICCV 2015

ICCV 2013

Others 2017

Others 2016

Surveys

MOT method rankings

Ordered based on their overall performance ranking on MOT challenges.

MOT 2017

  1. (FWT, Arxiv 2017) Improvements to Frank-Wolfe optimization for multi-detector multi-object tracking

  2. (jCC, Arxiv 2016) A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects

  3. (MHT_DAM, ICCV 2015) Multiple Hypothesis Tracking Revisited

  4. (EDMT17, CVPRw 2017) Enhancing Detection Model for Multiple Hypothesis Tracking

  5. (IOU17, AVSS 2017) High-Speed Tracking-by-Detection Without Using Image Information

MOT 2016

  1. (LMP, CVPR 2017) Multiple People Tracking by Lifted Multicut and Person Re-identification

  2. (FWT, Arxiv 2017) Improvements to Frank-Wolfe optimization for multi-detector multi-object tracking

  3. (NLLMPa, CVPR 2017) Joint Graph Decomposition and Node Labeling: Problem, Algorithms, Applications

  4. (AMIR, Arxiv 2017) Tracking The Untrackable: Learning To Track Multiple Cues with Long-Term Dependencies

  5. (MCjoint, CoRR 2016) A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects

  6. (NOMT, ICCV 2015) Near-Online Multi-target Tracking with Aggregated Local Flow Descriptor

  7. (JMC, BMTT 2016) Multi-Person Tracking by Multicuts and Deep Matching

  8. (STAM16, Arxiv 2017) Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism

  9. (MHT_DAM, ICCV 2015) Multiple Hypothesis Tracking Revisited

  10. (EDMT, CVPRw 2017) Enhancing Detection Model for Multiple Hypothesis Tracking

  11. (QuadMOT16, CVPR 2017) Multi-Object Tracking with Quadruplet Convolutional Neural Networks

  12. (oICF, AVSS 2016) Online multi-person tracking using Integral Channel Features

MOT 2015

  1. (AMIR15, Arxiv 2017) Tracking The Untrackable: Learning To Track Multiple Cues with Long-Term Dependencies

  2. (JointMC, CoRR 2016) A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects

  3. (HybridDAT, TIP 2016) A Hybrid Data Association Framework for Robust Online Multi-Object Tracking

  4. (AM, Arxiv 2017) Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism

  5. (TSMLCDEnew, Arxiv 2015) Tracklet Association by Online Target-Specific Metric Learning and Coherent Dynamics Estimation

  6. (QuadMOT, CVPR 2017) Multi-Object Tracking with Quadruplet Convolutional Neural Networks

  7. (NOMT, ICCV 2015) Near-Online Multi-target Tracking with Aggregated Local Flow Descriptor

  8. (TDAM, CVIU 2016) Temporal Dynamic Appearance Modeling for Online Multi-Person Tracking

  9. (MHT_DAM, ICCV 2015) Multiple Hypothesis Tracking Revisited

  10. (MDP, ICCV 2015) Learning to Track: Online Multi-Object Tracking by Decision Making

  11. (CNNTCM, CVPRw 2016) Joint Learning of Siamese CNNs and Temporally Constrained Metrics for Tracklet Association

  12. (SCEA, CVPR 2016) Online Multi-Object Tracking via Structural Constraint Event Aggregation

  13. (SiameseCNN, CVPRw 2016) Learning by tracking: Siamese CNN for robust target association

  14. (TBX, Arxiv 2016) Tracking with multi-level features

  15. (oICF, AVSS 2016) Online multi-person tracking using Integral Channel Features

  16. (TO, WACV 2016) Leveraging single for multi-target tracking using a novel trajectory overlap affinity measure

References

This repository contains references for papers and code for the Multiple Object Tracking 2017 (MOT17) project. To reduce the repository size, most documents are provided as links.

Papers

Most Relevant

  • (Arxiv17) Simple Online and Realtime Tracking with a Deep Association Metric. [pdf] [code]

  • (Arxiv17) NoScope: 1000x Faster Deep Learning Queries over Video. [project] [pdf] [code]

  • (ICCV17) Focal Loss for Dense Object Detection. [pdf]

  • (Arxiv16) On The Stability of Video Detection and Tracking. [pdf]

  • (Arxiv17) Optimizing Deep CNN-Based Queries over Video Streams at Scale. [pdf] [code]

  • (CVPR13) Visual Tracking via Locality Sensitive Histograms. [project] [pdf] [code]

CVPR 2017

  • (CVPR17) A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects. [pdf]

  • (CVPR17) Joint Graph Decomposition and Node Labeling: Problem, Algorithms, Applications. [pdf]

  • (CVPR17) Multiple People Tracking by Lifted Multicut and Person Re-identification. [pdf]

  • (ICML17) Analysis and Optimization of Graph Decompositions by Lifted Multicuts. [pdf]

  • (CVPR17) Densely Connected Convolutional Networks. [pdf] [code]

  • (CVPR17) Feature Pyramid Networks for Object Detection. [pdf]

Notes on Multiple Object Tracking: Part 1

Given a video that contains moving objects of a specific class (e.g., pedestrians, vehicles), the task of multiple object tracking (MOT) is to locate all the objects of interest and associate them across time.

Tracking-by-detection has recently been the most successful paradigm among MOT methods. The paradigm separates tracking into two stages. First, an object detector is applied to each video frame. Second, a tracker associates these detections into tracks. This note surveys tracking-by-detection methods, where the input is a video together with all its detections, and the output is the tracking result.

Let's first look at a very simple baseline implementation of multiple object tracking, see what problems it produces, and then try to improve it.

(Note: For clarity, we call the objects we have tracked over time, i.e. frame 1~t, tracks, and the unassociated ones detected in a new frame, i.e. frame t+1, detections.)

1. A Basic Tracker

IOU17 Principle

A simple and intuitive idea is to associate the detections in consecutive frames by their spatial overlap between time steps. The detections with the highest IOU (Intersection-Over-Union) most likely belong to the same object. This assumption holds better as the frame rate increases and as the detector becomes more reliable.
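As a minimal sketch of the overlap measure, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (a convention chosen here for illustration, not taken from the paper), IOU could be computed as follows. The function name `iou` is hypothetical.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The same 10x10 object in two consecutive frames, shifted by 1 pixel in x:
# intersection 9*10 = 90, union 100 + 100 - 90 = 110.
print(iou((0, 0, 10, 10), (1, 0, 11, 10)))
```

A small shift between frames keeps the IOU high (90/110 here), which is why a high frame rate makes this criterion more trustworthy.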

With the basic idea, we complete the implementation by answering the following questions.

  1. How to find the correspondence between last tracks and new detections?

    A greedy method. For a track, we compute its IOUs with all the detections. If the max IOU is bigger than a threshold (e.g. 0.5), we add the detection to the track, then remove it from the detections set. We loop over all the tracks to find their corresponding detections.

  2. How to determine the initialization and the termination of a track?

    An unassociated detection is initialized as a new track, and a track without corresponding detection will be removed (i.e. a termination).

  3. How to filter out the false positives in detections?

    By removing: 1. short tracks (filtering out all tracks with a length shorter than a number, e.g. 3). 2. low scoring tracks (remove all tracks without at least one detection with a score above a number, e.g. 0.3).

  4. How to improve the completeness of a track?

    The key is the use of low scoring detections. "Requiring a track to have at least one high-scoring detection ensures that the track belongs to a true object of interest while benefiting from lowscoring detections for the completeness of the track."
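The four answers above can be put together in a short sketch. This is an illustrative reading of the description in this note, not the authors' released code: the function names, the (x1, y1, x2, y2) box format and the data layout (one list of (box, score) pairs per frame) are assumptions, and the default thresholds are the example values 0.5, 3 and 0.3 quoted above.

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def track_iou(frames, iou_thresh=0.5, min_len=3, high_score=0.3):
    """frames: one list per frame, each holding (box, score) detections."""
    active, finished = [], []
    for detections in frames:
        remaining = list(detections)
        still_active = []
        for track in active:
            last_box = track[-1][0]
            # Greedy step: take the remaining detection with the highest IOU.
            best = max(remaining, key=lambda d: iou(last_box, d[0]), default=None)
            if best is not None and iou(last_box, best[0]) > iou_thresh:
                track.append(best)
                remaining.remove(best)
                still_active.append(track)
            else:
                finished.append(track)  # no match: the track terminates
        # Every unassociated detection starts a new track.
        active = still_active + [[det] for det in remaining]
    finished.extend(active)
    # Drop short tracks and tracks that never had a high-scoring detection.
    return [t for t in finished
            if len(t) >= min_len and max(score for _, score in t) >= high_score]


# One object drifting right (with two low-scoring detections in the middle)
# plus a single-frame false positive in the second frame.
frames = [
    [((0, 0, 10, 10), 0.9)],
    [((1, 0, 11, 10), 0.2), ((50, 50, 60, 60), 0.9)],
    [((2, 0, 12, 10), 0.2)],
    [((3, 0, 13, 10), 0.8)],
]
tracks = track_iou(frames)
print(len(tracks), [score for _, score in tracks[0]])
```

In this toy input the false positive is removed by the length filter, while the two low-scoring detections are kept because the track also contains high-scoring ones.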

IOU17 Algorithm

This simple implementation is the subject of the first paper we review, whose code is publicly available:

  • E. Bochinski, V. Eiselein and T. Sikora, ``High-speed tracking-by-detection without using image information'', AVSS 2017. [pdf] [code]

Despite its simplicity, it runs very fast (100K fps) and still achieves an average rank of 7.2 on MOT17 with a MOTA score of 45.5 (with the EB detector, see here for details).

2. Appearance and Motion Models

The basic tracker is efficient but fragile: an occlusion or a missing detection will terminate a track immediately, and when the objects or the camera move fast, or when the frame rate is low, the IOUs between corresponding detections become small or even close to 0, so the costs/similarity scores become less reliable. Moreover, the greedy assignment process is problematic when interactions or mutual occlusions happen among close objects.

Revisiting the basic tracker, we can identify several algorithm modules: the initialization and termination processes, a pair-wise cost function (the IOU criterion) and an assignment process (a greedy method for finding the correspondences). Before we start designing a better tracking algorithm, let's list the modules we'd like to improve:

We'd like to:

  • reduce the false positives (wrong detections) in the initialization process, and the false negatives (occlusions or missing detections) in the termination process. (- by lazy evaluation)

  • improve the pair-wise cost/similarity functions. (- by introducing appearance and motion models)

  • choose a better optimizer for the assignment problem instead of the greedy solver. (- by the Hungarian algorithm)
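To see why the optimizer matters, the sketch below compares a per-track greedy matcher with a compact pure-Python version of the Hungarian algorithm (the standard potentials formulation) on a made-up 2x2 IOU matrix. The function names `greedy` and `hungarian`, the toy similarity values and the 0.5 gate are assumptions for illustration only; in practice a library routine would normally be used instead.

```python
def hungarian(cost):
    """Minimum-cost assignment for a rectangular cost matrix (list of rows).

    Returns a sorted list of (row, col) pairs.
    """
    if not cost or not cost[0]:
        return []
    n, m = len(cost), len(cost[0])
    if n > m:  # the core loop needs rows <= cols, so solve the transpose
        transposed = [list(col) for col in zip(*cost)]
        return sorted((r, c) for c, r in hungarian(transposed))
    INF = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)   # row / column potentials
    p, way = [0] * (m + 1), [0] * (m + 1)     # p[j]: row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [INF] * (m + 1), [False] * (m + 1)
        while True:  # grow an alternating path until a free column is reached
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j])


def greedy(sim, thresh):
    """Per-track greedy matching on a similarity matrix (rows are tracks)."""
    taken, pairs = set(), []
    for t, row in enumerate(sim):
        free = [d for d in range(len(row)) if d not in taken]
        if free:
            d = max(free, key=lambda k: row[k])
            if row[d] > thresh:
                taken.add(d)
                pairs.append((t, d))
    return pairs


# Toy IOU matrix: rows are tracks, columns are detections.
sim = [[0.60, 0.55],
       [0.58, 0.10]]
cost = [[1.0 - s for s in row] for row in sim]
optimal = [(t, d) for t, d in hungarian(cost) if sim[t][d] > 0.5]
print(greedy(sim, 0.5), optimal)
```

Here the greedy matcher gives detection 0 to the first track and leaves the second track unmatched, while the optimal assignment matches both tracks above the gate.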

Now let's start designing.

  1. Initialization.

    A new track starts as Tentative. After 3 frames of detections, it changes to Active.

  2. Termination.

    A track is marked as lost only after 30 frames without a detection.

  3. Appearance Model.

    Siamese CNN.

  4. Motion Model.

    Kalman Filters.

  5. Pair-wise Similarity.

    Siamese CNN score. IOU with Kalman predicted box.

  6. Assignment Problem Solver.

    Hungarian Algorithm.
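The initialization and termination rules in the list above can be sketched as a small state machine. This is a minimal sketch, not the deepsort code: the class name `Track`, its method names and the `Deleted` state (dropping a Tentative track on its first miss) are assumptions made for illustration; the thresholds 3 and 30 come from the list.

```python
class Track:
    """Track lifecycle with lazy confirmation and lazy termination."""

    def __init__(self, n_init=3, max_age=30):
        self.state = "Tentative"
        self.hits = 1        # frames with an associated detection (creation counts)
        self.misses = 0      # consecutive frames without one
        self.n_init = n_init
        self.max_age = max_age

    def mark_matched(self):
        if self.state in ("Lost", "Deleted"):
            return           # terminal states in this sketch
        self.hits += 1
        self.misses = 0
        if self.state == "Tentative" and self.hits >= self.n_init:
            self.state = "Active"

    def mark_missed(self):
        if self.state in ("Lost", "Deleted"):
            return
        self.misses += 1
        if self.state == "Tentative":
            # Assumption: an unconfirmed track is dropped on its first miss.
            self.state = "Deleted"
        elif self.misses >= self.max_age:
            self.state = "Lost"


track = Track()
track.mark_matched()
track.mark_matched()
print(track.state)  # confirmed after 3 frames of detections
```

Delaying both decisions is what the "lazy evaluation" remark refers to: a short burst of false positives never becomes Active, and a short occlusion does not end an Active track.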

The above summarizes the deepsort algorithm from the following paper (source code available):

  • N. Wojke, A. Bewley and D. Paulus, ``Simple Online and Realtime Tracking with a Deep Association Metric'', Arxiv 2017. [pdf] [code]

The deepsort algorithm runs at approximately 40 Hz and achieves a MOTA of 61.4 with high-performance detections (see here).
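For the motion model in the design above, a constant-velocity Kalman filter is a common choice. The sketch below is a 1-D simplification written for this note (deepsort itself tracks a higher-dimensional bounding-box state); the class name `ConstantVelocityKF` and the noise parameters `r`, `q` and `v_var` are hypothetical values picked for illustration.

```python
class ConstantVelocityKF:
    """1-D constant-velocity Kalman filter over the state (position, velocity)."""

    def __init__(self, z0, r=1.0, q=0.01, v_var=100.0):
        self.p, self.v = float(z0), 0.0
        # Covariance [[P00, P01], [P10, P11]]; the velocity starts very uncertain.
        self.P = [[r, 0.0], [0.0, v_var]]
        self.r, self.q = r, q

    def predict(self, dt=1.0):
        (a, b), (c, d) = self.P
        self.p += dt * self.v
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I.
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.p

    def update(self, z):
        (a, b), (c, d) = self.P
        s = a + self.r              # innovation variance (H = [1, 0])
        k0, k1 = a / s, c / s       # Kalman gain
        y = z - self.p              # innovation
        self.p += k0 * y
        self.v += k1 * y
        # P <- (I - K H) P
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]


# A coordinate that moves by 1 pixel per frame, observed for frames 0..9.
kf = ConstantVelocityKF(z0=0.0)
for z in range(1, 10):
    kf.predict()
    kf.update(float(z))
predicted = kf.predict()
print(round(predicted, 2), round(kf.v, 2))
```

Running one such filter per box coordinate gives a predicted box for the next frame, and the IOU between that predicted box and each detection can then serve as the motion-based similarity.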

3. Rethinking The State Transitions of An Object

So far, we have treated the initialization, termination, loss and rediscovery of objects as trivial problems solved by simple hand-crafted tricks. Let's revisit the state changes of an object from a different perspective.

There are only four possible states for an object during tracking: appearance, disappearance, tracked, lost. An object that is lost might be rediscovered in later frames, while one that is determined to have disappeared is considered never to return to the scene. Therefore, the lifetime of an object can be modeled as transitions among its four possible states, which is exactly a Markov Decision Process (MDP).

Markov Decision Process

The MDP consists of the tuple (S, A, T, R), where S is the state set, A is the action set, T: S x A -> S is the state transition function that describes the effect of each action in each state, and R: S x A -> ℝ is the real-valued reward function that defines the immediate reward received after executing action a in state s. We rename the "appearance" and "disappearance" states to "Active" and "Inactive"; the possible transitions between the four states are depicted in the figure above. Only seven transitions/actions are possible.

In an MDP, a policy π is a mapping from the state space S to the action space A, i.e., π : S -> A. So the remaining task is to define the policies in three states (excluding the Inactive state, since an inactive object will never appear again).

  1. Policy in an Active State

  2. Policy in a Tracked State

  3. Policy in a Lost State
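To make the formulation concrete, here is a toy sketch of the transition function and a hand-written policy. Everything in it is an assumption for illustration: the action labels `a1`..`a7`, the seven (source, target) pairs (one consistent reading of the four states described above, to be checked against the paper's figure), and the rule-based `toy_policy`, which stands in for the policies that the paper defines per state.

```python
# The seven actions, written as (source state, target state).
TRANSITIONS = {
    "a1": ("Active", "Tracked"),
    "a2": ("Active", "Inactive"),
    "a3": ("Tracked", "Tracked"),
    "a4": ("Tracked", "Lost"),
    "a5": ("Lost", "Lost"),
    "a6": ("Lost", "Tracked"),
    "a7": ("Lost", "Inactive"),
}


def step(state, action):
    """Deterministic transition function T: S x A -> S."""
    source, target = TRANSITIONS[action]
    if source != state:
        raise ValueError(f"action {action} is not available in state {state}")
    return target


def toy_policy(state, evidence, lost_frames, max_lost=30):
    """Pick an action from the state and a boolean 'target supported' signal."""
    if state == "Active":
        return "a1" if evidence else "a2"
    if state == "Tracked":
        return "a3" if evidence else "a4"
    if state == "Lost":
        if evidence:
            return "a6"
        return "a7" if lost_frames >= max_lost else "a5"
    raise ValueError("no action is available in the Inactive state")


def rollout(evidence_per_frame, max_lost=30):
    """Follow the toy policy over a sequence of per-frame evidence flags."""
    state, lost_frames, visited = "Active", 0, ["Active"]
    for evidence in evidence_per_frame:
        if state == "Inactive":
            break
        state = step(state, toy_policy(state, evidence, lost_frames, max_lost))
        lost_frames = lost_frames + 1 if state == "Lost" else 0
        visited.append(state)
    return visited


print(rollout([True, True, False, False, True]))
```

The rollout shows a target that is confirmed, tracked, briefly lost and then rediscovered, without ever reaching the Inactive state.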

The idea originates from the following paper, whose source code is publicly available:

  • Y. Xiang, A. Alahi and S. Savarese, ``Learning to Track: Online Multi-Object Tracking by Decision Making'', ICCV 2015. [pdf] [code]

4. Rethinking The Features

How many features can we extract and utilize to associate tracks with detections, or to determine the pair-wise similarity/cost between a track and a detection?

  1. Appearance Features

  2. Motion Features

  3. Interaction Features

MDPNN

  • A. Sadeghian, A. Alahi and S. Savarese, ``Tracking The Untrackable: Learning To Track Multiple Cues with Long-Term Dependencies'', ICCV 2017. [pdf] [project]

5. Rethinking The Assignment Problem

MultiCut. Linear Programming + xxx. Node labeling and graph decomposition. etc.

6. Tracklet Association

7. An End-to-End Fashion

Notes on Multiple Object Tracking: Part 2

Random notes.

Tracking The Untrackable: Learning to Track Multiple Cues with Long-Term Dependencies (Arxiv 2017)

Overview of the paper

Overview

Features used in multiple object tracking (MOT) include no more than appearance, motion and interaction features. Two questions need to be asked before using them in MOT:

  1. How to combine the features?
  2. How to model long term dependencies?

This paper studies the above problems.

Note that in MOT there are objects that we have continuously tracked through frames 1~t, and detections in frame t+1 that we wish to assign to the tracked objects.

This paper models the appearance, motion and interaction cues independently as RNNs (called module RNNs), then feeds the outputs of the three RNNs to yet another RNN (called the target RNN) to generate similarity scores between objects and detections. With the similarity (cost) matrix obtained, the assignment problem is solved by the Hungarian algorithm.

  1. The module RNN solves the problem of long term dependencies.
  2. The target RNN solves the problem of feature selection/combination.

Details

All the following RNNs are LSTMs.

  1. Appearance Model

Basic input feature: raw content.

CNN followed by an RNN for object feature extraction. The object feature is then concatenated with the detection's CNN feature and fed to fully connected layers to generate a final k-dimensional appearance feature.

  2. Motion Model

Basic input feature: velocity vector (vx, vy).

An RNN takes the velocity vectors as input and extracts the H-dimensional object feature, and a fully connected layer extracts the detection feature. The two features are concatenated and fed to an FCN layer to generate a final k-dimensional motion feature.
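As a small illustration of the basic input feature, the velocity vectors can be obtained from consecutive box centers of a track. This is only a sketch of that preprocessing step under the assumption that velocity is measured in pixels per frame; the function name `velocities` is hypothetical and the RNN itself is not shown.

```python
def velocities(centers):
    """Per-step velocity vectors (vx, vy) from a track's box centers."""
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(centers, centers[1:])]


# Three frames of one track: the sequence of (vx, vy) is what the RNN consumes.
print(velocities([(0, 0), (2, 1), (5, 1)]))
```

The candidate velocity for a detection can be computed the same way, from the track's last center to the detection's center.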

  3. Interaction Model

Basic input feature: flattened occupancy grid. The image is divided into equal grid cells, and the cells neighboring an object are marked 1 if they contain another target and 0 otherwise.

An RNN takes the flattened occupancy grids as input and extracts the H-dimensional object feature, and an FCN layer extracts the detection feature. The two features are concatenated and fed to an FCN layer to generate a final k-dimensional interaction feature.
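The occupancy grid input can be sketched as follows. This is one possible reading of the description, with several assumptions: the grid is centered on the target, other targets are represented by their box centers, and the cell size `cell` and grid width `n` are hypothetical parameters.

```python
def occupancy_grid(target_center, other_centers, cell=20.0, n=5):
    """Flattened n x n occupancy grid centered on the target (row-major)."""
    grid = [0] * (n * n)
    half = n * cell / 2.0
    tx, ty = target_center
    for ox, oy in other_centers:
        # Offset of the other target from the grid's top-left corner.
        dx, dy = ox - tx + half, oy - ty + half
        col, row = int(dx // cell), int(dy // cell)
        if 0 <= col < n and 0 <= row < n:
            grid[row * n + col] = 1
    return grid


# A 3x3 grid around a target at (100, 100): one neighbor to the right,
# one above, and one far away that falls outside the grid.
print(occupancy_grid((100, 100), [(125, 100), (100, 75), (300, 300)], cell=20.0, n=3))
```

One such flattened vector per frame forms the sequence that the interaction RNN consumes.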

  4. Target Model

An RNN followed by FCN layers that takes the concatenated 3k-dimensional feature as input and outputs the similarity score between an object and a detection.

  5. Training Process

First, each RNN as well as the CNN is pre-trained separately with a standard softmax classifier and cross-entropy loss, where a positive label indicates a matched object and detection and a negative label indicates otherwise. Second, the target RNN is jointly trained end-to-end with the component RNNs.
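For reference, the softmax cross-entropy loss used in the pre-training step can be written in a few lines. This is a generic sketch of the loss for one (object, detection) pair with two classes (index 0 for "not matched", index 1 for "matched", an indexing chosen here for illustration); it is not the paper's training code.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of a softmax over `logits` against the integer class `label`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


# Equal logits mean a 50/50 prediction, so the loss is ln(2).
print(softmax_cross_entropy([0.0, 0.0], 1))
```

The loss is small when the logit of the correct class dominates and large when the classifier is confidently wrong.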

Effectiveness

MOTA 47.2 on MOT16 and 37.6 on 2DMOT15, runs at 1 Hz.

The paper reports that the history (long-term dependencies in LSTMs) helps, that the combination of cues helps, and that each cue matters.

Learning to Track: Online Multi-Object Tracking by Decision Making (MDP, ICCV 2015)

Overview

Each object in MOT may be in one of four states: active, inactive, tracked, lost.

  • Active: the initial state of any target, entered whenever an object is detected.
  • Tracked: the target is confirmed as a true positive from the object detector.
  • Lost: the target is lost for reasons such as occlusion, leaving the field of view, or disappearance.
  • Inactive: the target is confirmed lost and stays inactive forever.

The paper formulates the MOT as decision making in Markov Decision Processes (MDPs).

  • State Space: the four states active, inactive, tracked, lost.
  • Action Space: feasible transitions from one state to another.
  • Transition Function: describes the effect of each action in each state.
  • Reward Function: defines the reward received after executing an action in a state.

The rest of the paper designs reward functions (sometimes trainable models) for the seven possible transitions in the state space.

The main contribution of the paper is a framework for modeling object states and their transitions.