Efficient Multi-Object Tracking for Edge devices

Note: The arXiv version of the paper can be found here.

Object tracking is a core computer vision task that detects objects and follows them throughout a video. Multi-object tracking (MOT) is a subgroup of object tracking that tracks multiple objects belonging to one or more categories by identifying their trajectories as they move through consecutive video frames. MOT is widely applied in autonomous driving, surveillance with security cameras, and activity recognition. A popular paradigm for MOT is tracking-by-detection, which first detects objects by marking them with object class labels and bounding boxes, then computes the similarity between detections, and finally associates detections with tracklets by assigning the same ID to detections and tracklets that belong to the same object.

Online object tracking aims to process incoming video frames in real time as they are captured. However, multi-object trackers powered by deep neural networks (DNNs) are compute-intensive; for example, they rely on convolutional neural networks (CNNs) such as YOLOv3 and Faster R-CNN to detect objects. When such a tracker is deployed on a resource-constrained edge device, its frame processing rate may not keep pace with the incoming video frame rate. This mismatch can cause lag or dropped frames, ultimately degrading online tracking quality.
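As a rough illustration of the association step in tracking-by-detection, the sketch below greedily matches existing tracklets to new detections by intersection-over-union (IoU). It is a simplified stand-in, not this repository's or FairMOT's implementation (FairMOT also uses learned appearance embeddings and motion cues), and the names `iou`, `associate` and the `min_iou=0.3` cutoff are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, min_iou=0.3):
    """Hypothetical greedy matcher: pair tracklets and detections by descending IoU.

    tracks: dict mapping track ID -> last known box.
    detections: list of boxes detected in the current frame.
    Returns a dict mapping track ID -> index into `detections`.
    """
    candidates = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    matches, used_dets = {}, set()
    for score, tid, j in candidates:
        if score < min_iou:
            break  # all remaining pairs overlap even less
        if tid in matches or j in used_dets:
            continue  # each tracklet and each detection is used at most once
        matches[tid] = j
        used_dets.add(j)
    return matches


tracks = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}
detections = [(51, 51, 61, 61), (1, 0, 11, 10), (100, 100, 110, 110)]
print(associate(tracks, detections))  # {1: 1, 2: 0}; detection 2 is left unmatched
```

Unmatched detections would typically start new tracklets; a real tracker would also use an optimal assignment (e.g. the Hungarian algorithm) rather than this greedy pass.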

In this project, we focus on reducing the computational cost of multi-object tracking by selectively skipping detections while still delivering comparable tracking quality. First, we analyze how periodically skipping detections at different rates affects detection, localization, and association accuracy on different types of videos. Second, we introduce a context-aware skipping approach that dynamically decides on which frames to skip detection and accurately predicts the next locations of tracked objects. Third, we conduct a systematic experimental evaluation on the MOTChallenge datasets, which demonstrates that the proposed approach effectively reduces the computational cost of multi-object tracking by skipping detections while maintaining tracking quality comparable to the no-skipping baseline.

Sample Results

When the incoming frame rate exceeds the tracker's processing rate, one option is to skip detection periodically and reuse the detections from previous frames. The results of doing so with FairMOT on a few videos from the MOT15 dataset are shown below. The negative effects of skipping are more noticeable when the people in the scene move faster between successive frames.

[Video comparison: no frames skipped | every alternate frame (1/2) skipped | 2 of every 3 frames skipped]
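To see why fast motion hurts when boxes are reused, consider a box of width w whose object moves horizontally by v pixels per frame. After k skipped frames the reused box is offset by d = k·v, so the intersection is (w - d)·h and the union is (w + d)·h, giving IoU = (w - d)/(w + d). The sketch below evaluates this; the box width and speeds are hypothetical numbers, not measurements from these videos.

```python
def shifted_box_iou(width, shift):
    """IoU between a box and the same box shifted horizontally by `shift` pixels.

    Intersection is (width - d) * h and union is (width + d) * h, so the height cancels.
    """
    d = min(abs(shift), width)  # beyond `width` pixels the boxes no longer overlap
    return (width - d) / (width + d)


def stale_box_iou(width, speed_px_per_frame, frames_since_detection):
    """IoU of a reused box after the object kept moving during the skipped frames."""
    return shifted_box_iou(width, speed_px_per_frame * frames_since_detection)


# Hypothetical 50-pixel-wide pedestrian box reused for 2 frames (2 of every 3 skipped).
for speed in (2, 10):
    print(speed, round(stale_box_iou(50, speed, 2), 3))
# 2 0.852
# 10 0.429
```

With these assumed numbers, a slow walker keeps an IoU of about 0.85 with its reused box, while a fast one drops to about 0.43, below the 0.5 IoU threshold commonly used to match predictions to ground truth.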

Rather than skipping detection periodically, an eigenvalue-based similarity between successive frames can be computed and used to decide whether detection on a particular frame can be skipped without much impact or must be performed. Below are the results when detection is skipped on frames that are very similar to the previously detected frame.

[Video comparison: no frames skipped | skipping detection using eigenvalue-based similarity]
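The exact eigenvalue-based similarity is not specified in this README. One plausible reading, sketched below purely as an assumption, compares the leading eigenvalues of FᵀF (the squared singular values of a grayscale frame F) for two frames using cosine similarity. The helper names and the `k=16` and `threshold=0.9` defaults are hypothetical.

```python
import numpy as np


def eigen_signature(frame, k=16):
    """Unit-normalized top-k eigenvalues of F^T F for a grayscale frame F (assumed approach)."""
    f = np.asarray(frame, dtype=np.float64)
    # eigvalsh is for symmetric matrices and returns eigenvalues in ascending order.
    eig = np.linalg.eigvalsh(f.T @ f)[::-1][:k]
    eig = np.clip(eig, 0.0, None)  # F^T F is positive semidefinite; clip round-off
    norm = np.linalg.norm(eig)
    return eig / norm if norm > 0 else eig


def eigen_similarity(frame_a, frame_b, k=16):
    """Cosine similarity of two eigenvalue signatures; in [0, 1] since entries are >= 0."""
    return float(eigen_signature(frame_a, k) @ eigen_signature(frame_b, k))


def skip_detection(last_detected_frame, frame, threshold=0.9):
    """Hypothetical rule: skip detection when the frame closely matches the last detected one."""
    return eigen_similarity(last_detected_frame, frame) >= threshold


rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(120, 160)).astype(np.float64)
noisy = frame + rng.normal(0.0, 2.0, size=frame.shape)
print(round(eigen_similarity(frame, noisy), 3), skip_detection(frame, noisy))
```

Such a global signature ignores where content sits in the frame (it is unchanged if the rows of F are permuted), so it is a coarse measure; localized comparisons such as NCC or HOG are more sensitive to object motion.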

Combining the two approaches avoids the accumulation of errors that occurs when many consecutive frames are skipped: detection is skipped on frames that are similar to the last detected frame, but it is always performed on at least one of every given number of frames. The results are shown below.

[Video comparison: no frames skipped | skipping detection using eigenvalue-based frame similarity while enforcing detection on at least one of every 4 frames]
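The combined policy can be sketched as a small loop: a frame skips detection only if it is similar enough to the last frame that was actually detected and fewer than `interval - 1` frames have been skipped in a row. The function name, the toy similarity function, and the scalar stand-in "frames" are hypothetical; any frame similarity (eigenvalue-based, NCC, or HOG) could be plugged in.

```python
def plan_detections(frames, similarity, threshold=0.8, interval=4):
    """Hypothetical planner: True where the detector runs, False where it is skipped.

    A frame is skipped only if it is similar enough to the last *detected* frame and
    fewer than interval - 1 frames have been skipped in a row, so at least one of
    every `interval` frames is always detected.
    """
    plan, last_detected, skipped_in_a_row = [], None, 0
    for frame in frames:
        forced = last_detected is None or skipped_in_a_row >= interval - 1
        if forced or similarity(last_detected, frame) < threshold:
            plan.append(True)
            last_detected, skipped_in_a_row = frame, 0
        else:
            plan.append(False)
            skipped_in_a_row += 1
    return plan


# Toy example: each "frame" is a single brightness value; similarity is 1 - |a - b|.
frames = [0.0, 0.05, 0.1, 0.12, 0.14, 0.5, 0.52]
print(plan_detections(frames, lambda a, b: 1.0 - abs(a - b)))
# [True, False, False, False, True, True, False]
```

Frame 4 is detected because three frames were already skipped in a row, and frame 5 is detected because it differs too much from frame 4.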

Qualitative results for context-aware skipping (Kalman filter-based estimation combined with similarity based on normalized cross-correlation (NCC) or Histogram of Oriented Gradients (HOG) features) will be added soon.

Usage Instructions

Environment setup:

conda create --name <env_name> python=3.7
conda activate <env_name>

git clone https://github.com/git-disl/EMO.git
cd EMO
pip install -r requirements.txt

git clone -b pytorch_1.7 https://github.com/ifzhang/DCNv2.git
cd DCNv2
sh make.sh

A trained FairMOT model from https://github.com/ifzhang/FairMOT#pretrained-models-and-baseline-model can be used for the inference commands below.

Scripts to run evaluation

Periodic skipping + reuse previous bounding box

python src/track_periodic.py mot --load_model <path_to_FairMOT_trained_model> --conf_thres 0.6 <--val_mot15 VAL_MOT15 / --val_mot17 VAL_MOT17> --data_path <path_to_MOT15/MOT17_dataset> --detect_frame_interval x

A detect_frame_interval of x means that x consecutive frames are skipped between successive detections, i.e., detection is skipped on x out of every (x+1) frames.
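The schedule described above, together with reuse of the previous bounding boxes, can be sketched as follows. This illustrates the stated semantics of detect_frame_interval; it is not the code of track_periodic.py, and the function names and `fake_detect` stand-in are hypothetical.

```python
def periodic_schedule(num_frames, detect_frame_interval):
    """True where detection runs: one detection, then `detect_frame_interval` skipped frames."""
    period = detect_frame_interval + 1
    return [i % period == 0 for i in range(num_frames)]


def track_with_reuse(frames, schedule, detect):
    """Hypothetical loop: call `detect` only on scheduled frames and reuse its boxes otherwise."""
    outputs, last_boxes = [], []
    for frame, run_detector in zip(frames, schedule):
        if run_detector:
            last_boxes = detect(frame)  # the expensive detector runs only here
        outputs.append(last_boxes)      # skipped frames inherit the last boxes
    return outputs


calls = []


def fake_detect(frame):
    """Stand-in detector that records which frames it was called on."""
    calls.append(frame)
    return [f"box@{frame}"]


schedule = periodic_schedule(7, detect_frame_interval=2)
print(schedule)  # [True, False, False, True, False, False, True]
outputs = track_with_reuse(range(7), schedule, fake_detect)
print(calls)     # [0, 3, 6]
```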

Periodic skipping + estimation

python src/track.py mot --load_model <path_to_FairMOT_trained_model> --conf_thres 0.6 <--val_mot15 VAL_MOT15 / --val_mot17 VAL_MOT17> --data_path <path_to_MOT15/MOT17_dataset> --detect_frame_interval 4 --similarity_computation 'no' --adaptive_freq_forced_detection 'False'
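Here "estimation" refers to predicting where tracked objects are on frames where detection is skipped. Assuming a constant-velocity Kalman filter with a DeepSORT-style state (box center x, y, aspect ratio a, height h, and their velocities), similar to the filter used in FairMOT's tracker, the predict and update steps look roughly like the sketch below. The state layout, noise matrices, and numbers are illustrative assumptions, not values from this repository.

```python
import numpy as np

NDIM = 4  # assumed measured quantities: center x, center y, aspect ratio, height


def transition(dt=1.0):
    """Constant-velocity model: each measured quantity advances by velocity * dt."""
    F = np.eye(2 * NDIM)
    F[:NDIM, NDIM:] = dt * np.eye(NDIM)
    return F


H = np.hstack([np.eye(NDIM), np.zeros((NDIM, NDIM))])  # boxes are observed, velocities are not


def predict(mean, cov, F, Q):
    """Kalman predict step; on frames where detection is skipped, only this runs."""
    return F @ mean, F @ cov @ F.T + Q


def update(mean, cov, z, R):
    """Kalman update step with a detected box z = (cx, cy, a, h)."""
    S = H @ cov @ H.T + R             # innovation covariance
    K = cov @ H.T @ np.linalg.inv(S)  # Kalman gain
    mean = mean + K @ (z - H @ mean)
    cov = (np.eye(2 * NDIM) - K @ H) @ cov
    return mean, cov


# Illustrative noise levels and an object moving right (+5 px/frame) and up (-2 px/frame).
F, Q, R = transition(), 0.01 * np.eye(2 * NDIM), 0.1 * np.eye(NDIM)
mean = np.array([100.0, 50.0, 0.5, 80.0, 5.0, -2.0, 0.0, 0.0])
cov = np.eye(2 * NDIM)
for _ in range(3):  # e.g. three frames on which detection is skipped
    mean, cov = predict(mean, cov, F, Q)
print(mean[:NDIM])  # center advances to (115, 44); aspect ratio and height unchanged
```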

Context-aware skipping (NCC/HOG + estimation)

python src/track.py mot --load_model <path_to_FairMOT_trained_model> --conf_thres 0.6 <--val_mot15 VAL_MOT15 / --val_mot17 VAL_MOT17> --data_path <path_to_MOT15/MOT17_dataset> --similarity_threshold <0-1, recommended range 0.7-0.9> --detect_frame_interval 4 --similarity_computation <'hog'/'ncc'> --adaptive_freq_forced_detection 'False'
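A minimal sketch of NCC-based skipping follows, assuming whole grayscale frames are compared against the last detected frame and detection is skipped when the score reaches the value passed as --similarity_threshold. The repository may instead compare regions around tracked objects or normalize differently, and the function names are hypothetical; HOG-based similarity would follow the same pattern with HOG feature vectors in place of raw pixels.

```python
import numpy as np


def ncc(frame_a, frame_b):
    """Zero-mean normalized cross-correlation of two same-sized grayscale frames, in [-1, 1]."""
    a = np.asarray(frame_a, dtype=np.float64)
    b = np.asarray(frame_b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("frames must have the same shape")
    a = (a - a.mean()).ravel()
    b = (b - b.mean()).ravel()
    denom = np.sqrt((a @ a) * (b @ b))
    if denom == 0:
        # At least one frame is perfectly flat; call them similar only if both are.
        return 1.0 if (a @ a) == 0 and (b @ b) == 0 else 0.0
    return float(a @ b / denom)


def should_skip_detection(last_detected_frame, frame, similarity_threshold=0.8):
    """Hypothetical rule: skip the detector when the frame correlates strongly with the last detected one."""
    return ncc(last_detected_frame, frame) >= similarity_threshold


rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(120, 160))
print(should_skip_detection(frame, 0.5 * frame + 20))  # True: NCC ignores brightness/contrast changes
print(should_skip_detection(frame, rng.integers(0, 256, size=(120, 160))))  # False: unrelated content
```

Subtracting the mean and dividing by the norms makes the score insensitive to global brightness and contrast changes, which is why a threshold in the recommended 0.7 to 0.9 range can be used across videos.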

To run the evaluation on a single video, use the --custom_video and --seq_name arguments with the track and track_periodic scripts. To compute HOTA metrics, use src/metrics/run_mot_challenge.py after updating src/metrics/mot_challenge_2d_box.py with the correct parameters.
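For reference, the standard HOTA definition is summarized below; the file names above suggest these scripts come from the TrackEval toolkit, which reports these quantities, but that is an assumption. For a localization threshold α, TP_α, FN_α and FP_α are the matched, missed and spurious detections, and S(c) is the IoU of a matched pair c. For a true positive c, TPA(c) counts matches with the same ground-truth and predicted IDs, while FNA(c) and FPA(c) count the detections of that ground-truth ID or predicted ID that are unmatched or matched to a different ID. DetA, LocA and AssA correspond to the detection, localization and association accuracy discussed above.

```latex
\begin{aligned}
\mathrm{DetA}_\alpha &= \frac{|\mathrm{TP}_\alpha|}{|\mathrm{TP}_\alpha| + |\mathrm{FN}_\alpha| + |\mathrm{FP}_\alpha|},
\qquad
\mathrm{LocA}_\alpha = \frac{1}{|\mathrm{TP}_\alpha|} \sum_{c \in \mathrm{TP}_\alpha} \mathcal{S}(c), \\
\mathcal{A}(c) &= \frac{|\mathrm{TPA}(c)|}{|\mathrm{TPA}(c)| + |\mathrm{FNA}(c)| + |\mathrm{FPA}(c)|},
\qquad
\mathrm{AssA}_\alpha = \frac{1}{|\mathrm{TP}_\alpha|} \sum_{c \in \mathrm{TP}_\alpha} \mathcal{A}(c), \\
\mathrm{HOTA}_\alpha &= \sqrt{\mathrm{DetA}_\alpha \cdot \mathrm{AssA}_\alpha},
\qquad
\mathrm{HOTA} = \frac{1}{19} \sum_{\alpha \in \{0.05,\, 0.10,\, \ldots,\, 0.95\}} \mathrm{HOTA}_\alpha .
\end{aligned}
```

DetA, LocA and AssA are reported the same way, averaged over the 19 values of α.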


The inference-time optimizations in this repository are built on top of the following repositories:

