track_and_count

Track and count using deep learning

The electronic observer project (track & count)

Project objective and description

Voting is a key procedure enabling society to select its representatives and to hold them accountable for their performance in office. It plays a vital role in communicating society's needs and demands directly to political institutions. Both citizens and politicians are interested in a transparent and trustworthy procedure guaranteeing the legitimacy of the choice made. One of the mechanisms to ensure the fairness of the procedure is observation. Usually, observers are people representing different political parties and public organizations whose primary task is to monitor the fairness of the election procedure as well as the correct counting of votes. Good observation prevents, or at least limits, fraud and increases the legitimacy of the result. Here we propose a computer vision algorithm that counts the number of unique people voting during election day: in short, an electronic observer. At the end of the day, the counted number of votes can be compared with the official turnout at the polling station. A large discrepancy between the two is a signature of fraud and signals that the video should be examined more carefully by independent observers to look for evidence of ballot stuffing.

Traditional methods used to detect electoral fraud are based on a statistical analysis of irregularities in vote-turnout distributions. Among the most commonly observed anomalies are coarse vote percentages, zero-dispersion distributions of votes within one electorate, and a peak in the distribution of votes at high turnout rates for one candidate. Such electoral statistical methods are well developed in Russia, where a large array of data is collected and analyzed by Dr. Sergey Shpilkin (rus. Сергей Шпилькин). However, statistical analysis is relatively difficult to explain to the general public, whose level of mathematical education varies widely. Our algorithm, in turn, provides visual and easily interpretable results: a demonstration of ballot stuffing on video is a clear argument that is difficult to reject. Importantly, our algorithm does not gather any personal information since it does not use face recognition technology. We test our algorithm on short video samples publicly available on YouTube. These samples were recorded at polling stations in Russia, where video cameras were installed in 2012.

Gif example

Important notes on implementation

This is an offline algorithm, meaning that counting is done on a recorded video sample rather than on a live stream. In general, the task is highly challenging because the camera model, settings, and viewpoint vary a lot and must be taken as they are. The algorithm is split into 3 stages to ensure reliable operation and to separate time-consuming video processing from relatively light post-processing. First, all urns are detected from a set of screenshots, and their coordinates are saved to a file. Since an urn is a stationary object (i.e. its position is not supposed to change over time), one can save a lot of computational time by detecting it only once during pre-processing. Moreover, this approach lets us avoid running expensive video processing if urn detection fails. Second, we count unique voters using a set of pre-defined criteria to recognize their actions. All people in a video are tracked, with a unique ID number assigned to each person. If a person is counted, the crops of their image, the coordinates of their skeleton, and their appearance features are saved for each frame into a separate folder named after the ID for further post-processing. This is the most time-consuming step, and it is important to save as much relevant information as possible for the analysis. Finally, the appearance features can be analyzed to compare the tracklets and identify the same persons. Moreover, the skeleton trajectories can be analyzed to filter out the most anomalous voters who were counted by mistake. The algorithm uses several external codes as libraries: YOLOv5 (for object detection), DeepSORT (for tracking), and AlphaPose (for pose/skeleton estimation). The code can be easily built, run, and deployed using the Dockerfile provided with the repository.

How to detect urns

  1. Extract some snapshot frames into the snapshot_frames folder

    python3 utils/extract_frames.py --source video_examples/election_2018_sample_1.mp4 --destination snapshot_frames --start 1 --end 10000 --step 1000

  2. Run the detector, which saves the coordinates into a .txt file in the urn_coordinates folder

    python3 yolov5/detect.py --weights urn_detection_yolov5/weights_best_urn.pt --img 416 --conf 0.2 --source snapshot_frames --output urn_coordinates --save-txt
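
Each line of the resulting .txt file follows the standard YOLOv5 --save-txt label format: a class index followed by the normalized box centre, width, and height. A minimal sketch of reading the urn coordinates back for the later stages might look as follows (the file name, class index, and frame resolution are assumptions for illustration):

    # Minimal sketch: read urn boxes back from YOLOv5 --save-txt output.
    # Assumed format per line: "<class> <x_center> <y_center> <width> <height>", normalized to [0, 1].
    from pathlib import Path

    URN_CLASS = 0              # assumed class index of the urn in the custom model
    IMG_W, IMG_H = 1280, 720   # assumed frame resolution of the source video

    def load_urn_boxes(txt_path):
        """Return urn boxes as (x1, y1, x2, y2) in pixel coordinates."""
        boxes = []
        for line in Path(txt_path).read_text().splitlines():
            cls, xc, yc, w, h = (float(v) for v in line.split()[:5])
            if int(cls) == URN_CLASS:
                boxes.append(((xc - w / 2) * IMG_W, (yc - h / 2) * IMG_H,
                              (xc + w / 2) * IMG_W, (yc + h / 2) * IMG_H))
        return boxes

    print(load_urn_boxes("urn_coordinates/election_2018_sample_1_frame0.txt"))  # hypothetical file name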

How to run the trackers

  1. Follow the installation steps described in INSTALL.md

  2. Run the script counting the number of unique people approaching an urn.

    python3 track_yolov5_counter.py --source video_examples/election_2018_sample_1.mp4 --weights yolov5/weights/yolov5s.pt

Other scripts:

  1. Run the tracker: YOLOv5 + (SORT or Deep SORT)

    python3 track_yolov5_sort.py --source example/running.mp4 --weights yolov5/weights/yolov5s.pt --conf 0.4 --max_age 50 --min_hits 10 --iou_threshold 0.3

    python3 track_yolov5_deepsort.py --source example/running.mp4 --weights yolov5/weights/yolov5s.pt

  2. Run the tracker with pose estimation

    python3 track_yolov5_pose.py --source example/running.mp4 --weights yolov5/weights/yolov5s.pt

  3. Run the feature extractor

    python3 deepsort_features.py --source example/running.mp4 --weights yolov5/weights/yolov5s.pt

Output description

Custom object detection

The implementation of custom object detection can be found in the folder urn_detection_yolov5. First, a dataset of urn pictures was collected (see urn_detection_yolov5/collecting_urn_dataset.doc for details). Note that the dataset has already been augmented with different brightness levels to simulate the effect of illumination in a room and/or bad camera settings. The dataset can be downloaded with curl. Then, the YOLOv5 detector is applied with 2 classes of objects specified: an urn (a custom object) and a person (a COCO object). The neural network is fine-tuned to learn the custom object class. Finally, inference is run on a subset of the data and the result is visualized.

Example of urn detection with YOLOv5
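
For reference, fine-tuning YOLOv5 on the two classes boils down to a single training command pointed at a dataset config; the yaml path, image size, batch size, and epoch count below are illustrative assumptions rather than the exact settings used in this repository:

    python3 yolov5/train.py --img 416 --batch 16 --epochs 100 --data urn_detection_yolov5/urn_data.yaml --weights yolov5s.pt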

Tracking

In the second part of the project, we track people in a room using the tracking-by-detection paradigm. As in the custom object detection section, YOLOv5 performs person detection on every single video frame. Then, the detections from different frames must be associated with each other to re-identify the same person. The SORT tracker combines a linear Kalman filter to predict the state of the object (the motion model) with the Hungarian algorithm to associate objects from previous frames with objects in the current frame. The tracker does not consider any details of the object's appearance. My implementation of the SORT tracker inside the YOLOv5 inference script can be found in track_yolov5_sort.py. The Jupyter notebook colabs/run_sort_tracker_on_colab.ipynb shows how to run the tracker on Google Colab.

Example of tracking in a room using SORT and YOLOv5

Gif example
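
To make the association step concrete, below is a minimal, self-contained sketch of matching existing tracks to new detections by IoU with the Hungarian algorithm; it is illustrative only and not the repository's SORT implementation (function names and the threshold value are assumptions):

    # Illustrative sketch of the SORT association step (not the repository's implementation):
    # match tracks to detections by maximizing IoU using the Hungarian algorithm.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def iou(a, b):
        """IoU of two boxes given as (x1, y1, x2, y2)."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    def associate(tracks, detections, iou_threshold=0.3):
        """Return (track_idx, detection_idx) pairs whose IoU exceeds the threshold."""
        if len(tracks) == 0 or len(detections) == 0:
            return []
        cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
        rows, cols = linear_sum_assignment(cost)   # minimizes the total (1 - IoU)
        return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_threshold]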

A nice alternative to the SORT tracker is Deep SORT. Deep SORT extends SORT by adding a deep association metric to build an appearance model in addition to the motion model. According to the authors, this extension makes it possible to track objects through longer periods of occlusion, effectively reducing the number of identity switches. My implementation of the tracker inside the YOLOv5 inference script can be found in track_yolov5_deepsort.py. The Jupyter notebook colabs/run_deepsort_tracker_on_colab.ipynb shows how to run the tracker on Google Colab.
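
As a rough sketch of the idea (again illustrative, not the repository's code), Deep SORT combines a motion term and an appearance term into a single association cost and gates away physically implausible matches:

    # Illustrative sketch (not the repository's code) of Deep SORT's combined association cost:
    # a weighted sum of a motion term (squared Mahalanobis distance to the Kalman prediction)
    # and an appearance term (cosine distance between feature vectors), with motion gating.
    import numpy as np

    GATING_COST = 1e5      # large cost that effectively forbids an assignment
    CHI2_95_4DOF = 9.4877  # 95% chi-square quantile for a 4-dimensional measurement space

    def combined_cost(mahalanobis_sq, cosine, lam=0.5, gate=CHI2_95_4DOF):
        """Both inputs are (num_tracks, num_detections) cost matrices."""
        mahalanobis_sq = np.asarray(mahalanobis_sq, dtype=float)
        cost = lam * mahalanobis_sq + (1.0 - lam) * np.asarray(cosine, dtype=float)
        cost[mahalanobis_sq > gate] = GATING_COST   # gate out implausible motion
        return cost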

Count

Since our primary task is to count the number of unique voters rather than the total number of people in a room (e.g. kids who just accompany the adults), it is important to define the voting act more precisely. Both the urn and the voters are identified using the YOLOv5 detector, which puts a bounding box around each of them. To vote, a person must come close to an urn and spend a certain amount of time nearby (i.e. the distance between the object centroids must stay within a certain critical radius). This "certain amount of time" is necessary to distinguish people who merely pass by from those who vote. This approach requires two predefined parameters (see the sketch after the list below):

  • Critical radius
  • Minimum interaction time
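
As a simplified sketch of this criterion (not the exact logic of track_yolov5_counter.py; the parameter values are assumptions), a track is counted once its bounding-box centroid has stayed within the critical radius of the urn centroid for at least the minimum interaction time:

    # Simplified sketch of the counting criterion (not the exact logic of track_yolov5_counter.py).
    # A track is counted once its centroid stays within CRITICAL_RADIUS of the urn centroid
    # for at least MIN_INTERACTION_FRAMES consecutive frames.
    import math
    from collections import defaultdict

    CRITICAL_RADIUS = 100         # pixels, assumed value
    MIN_INTERACTION_FRAMES = 25   # roughly one second at 25 fps, assumed value

    frames_near_urn = defaultdict(int)   # track_id -> consecutive frames spent near the urn
    counted_ids = set()                  # track IDs already counted as voters

    def centroid(box):
        x1, y1, x2, y2 = box
        return (x1 + x2) / 2.0, (y1 + y2) / 2.0

    def update_counter(track_id, person_box, urn_box):
        """Call once per frame for every tracked person; returns True when the person gets counted."""
        (px, py), (ux, uy) = centroid(person_box), centroid(urn_box)
        if math.hypot(px - ux, py - uy) <= CRITICAL_RADIUS:
            frames_near_urn[track_id] += 1
        else:
            frames_near_urn[track_id] = 0
        if frames_near_urn[track_id] >= MIN_INTERACTION_FRAMES and track_id not in counted_ids:
            counted_ids.add(track_id)
            return True
        return False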

The person whose motion satisfies the conditions defined above can then be tracked until he/she disappears from the camera view. The tracking is necessary in case the person stays in the room hanging around for a while. To further ensure that we count unique people only, one can save an image of each tracked person inside the bounding box, building a database of voters in a video. Once the dataset of voter images is built, one can run a neural network to find the unique voters based on the similarity of their appearance.

Reidentification

Both trackers listed above possess only short-term memory. The object's track is erased from memory after max_age frames without associated detections. Typically, max_age is around 10-100 frames. If a person leaves the room and comes back a while later, the tracker will not re-identify the person and will assign a new ID instead. To solve this issue, one needs long-term memory. Here we implement long-term memory by means of the appearance features from the Deep SORT algorithm. An appearance feature vector is a 1D array with 512 components. For each track ID we create a separate folder into which we write the feature vectors. Feature vector files are labeled with the index of the frame in which the object was detected. When a new track is identified, one can compute the cosine distance between this track and all saved tracks in appearance space. If the distance is smaller than some threshold value, the old ID can be reassigned to the new object. Long-term memory enables us to exclude the security guards or the election board members who approach an urn frequently.
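
A minimal sketch of this comparison in appearance space (illustrative; the threshold value and the per-track averaging are assumptions):

    # Illustrative sketch of re-identification in appearance space: a new track is matched
    # to a stored track if the cosine distance between their averaged 512-dimensional
    # appearance features falls below a threshold.
    import numpy as np

    REID_THRESHOLD = 0.2   # assumed cosine-distance threshold

    def cosine_distance(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    def find_previous_id(new_features, saved_tracks):
        """new_features: list of 512-d vectors of the new track;
        saved_tracks: dict mapping old track ID -> list of 512-d vectors.
        Returns the best matching old ID, or None if no distance is below the threshold."""
        new_mean = np.mean(new_features, axis=0)
        best_id, best_dist = None, REID_THRESHOLD
        for track_id, feats in saved_tracks.items():
            dist = cosine_distance(new_mean, np.mean(feats, axis=0))
            if dist < best_dist:
                best_id, best_dist = track_id, dist
        return best_id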

The feature extractor script is deepsort_features.py. Besides the standard output video file, it also saves the features and the corresponding cropped images of tracked objects into the inference/features and inference/image_crops folders, respectively. The log file with the dictionary storing the history of object detections is inference/features/log_detection.txt. The keys of this dictionary are track IDs, and the values are lists of frame numbers in which the corresponding track was registered. Moreover, we save the frames-per-second rate, which makes it possible to recover the time (instead of the frame number) at which a track was detected.
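
For instance, the detection history can be turned into timestamps roughly as follows (a sketch; the exact on-disk format of log_detection.txt and the fps value are assumed here):

    # Sketch of post-processing the detection log: convert frame numbers into timestamps.
    # Assumes inference/features/log_detection.txt stores a Python dict literal
    # {track_id: [frame_number, ...]}; the real on-disk format may differ.
    import ast

    FPS = 25.0   # assumed frames-per-second rate saved alongside the log

    with open("inference/features/log_detection.txt") as f:
        detections = ast.literal_eval(f.read())

    for track_id, frames in detections.items():
        first_seen, last_seen = min(frames) / FPS, max(frames) / FPS
        print(f"track {track_id}: visible from {first_seen:.1f} s to {last_seen:.1f} s")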

Content:

  • track_yolov5_sort.py implements the SORT tracker in YOLOv5
  • track_yolov5_deepsort.py implements the Deep SORT tracker in YOLOv5
  • colabs/run_sort_tracker_on_colab.ipynb and colabs/run_deepsort_tracker_on_colab.ipynb show how to run the trackers on Google Colab
  • track_yolov5_counter.py runs a counter
  • deepsort_features.py implements the feature extractor
  • the folder 'theory' contains slides with a summary of the theoretical approaches

Future work and implementations

Literature

Codes

Habr