Name | Contact | Student ID |
---|---|---|
Jose | j.c.padillacancio@student.tudelft.nl | 5224969 |
Mitali | m.s.patil@student.tudelft.nl | 5934060 |
Dean | d.polimac@student.tudelft.nl | 5060699 |
Nils | n.vanveen-3@student.tudelft.nl | 4917863 |
The paper "Data-Driven Feature Tracking for Event Cameras" [2] by Nico Messikommer et al. addresses the advantages of event cameras, such as their high temporal resolution and resilience to motion blur, which make them ideal for low-latency and low-bandwidth feature tracking, especially in challenging scenarios. However, existing feature tracking methods for event cameras often require extensive parameter tuning, are sensitive to noise, and lack generalization to different scenarios. To overcome these shortcomings, the authors introduce the first data-driven feature tracker for event cameras. Leveraging low-latency events to track features detected in a grayscale frame, their approach achieves performance through a novel frame attention module, enabling information sharing across feature tracks. By transferring knowledge from synthetic to real data and employing a self-supervision strategy, their tracker outperforms existing methods in relative feature age, maintaining the lowest latency, highlighting significant advancements in event camera feature tracking. While our experiments are not supporting the findings of the authors and this reproduction is mainly focused on setting up the authors code, we outline how their method works, what we tried to reproduce as well as a division of tasks among the group.
Event cameras, also known as dynamic vision sensors (DVS), are sensors that detect changes in brightness (events) asynchronously, unlike traditional cameras that capture frames at fixed intervals.
Event cameras offer advantages like high temporal resolution and low latency, but traditional feature tracking methods designed for frame-based cameras struggle to adapt to the asynchronous nature of event data.
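As a minimal, self-contained illustration (not taken from the authors' code), an event stream can be stored as an array of (x, y, t, p) tuples; the structured-array layout and helper function below are our own assumptions, chosen for clarity.

```python
import numpy as np

# A hypothetical, minimal event-stream container: each event is (x, y, t, p),
# i.e. pixel location, timestamp in seconds, and polarity (+1 / -1).
event_dtype = np.dtype([("x", np.uint16), ("y", np.uint16),
                        ("t", np.float64), ("p", np.int8)])

def events_in_window(events: np.ndarray, t_start: float, t_end: float) -> np.ndarray:
    """Return all events with timestamps in [t_start, t_end)."""
    mask = (events["t"] >= t_start) & (events["t"] < t_end)
    return events[mask]

# Example: three synthetic events
events = np.array([(10, 12, 0.001, 1), (11, 12, 0.002, -1), (10, 13, 0.004, 1)],
                  dtype=event_dtype)
print(events_in_window(events, 0.0, 0.003))  # first two events
```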
Messikommer et al. propose a novel approach for feature tracking with event cameras that leverages the unique characteristics of event data. Their method is data-driven, meaning it learns directly from event data without relying on predefined features or handcrafted algorithms.
- Events are represented using a spatiotemporal feature representation capturing location, time, and polarity (one possible such representation is sketched after this list).
- This representation facilitates accurate feature tracking by providing a comprehensive understanding of each event's spatial and temporal context.
- Deep neural networks learn feature descriptors directly from event data, enabling robust matching across frames.
- These descriptors encode unique characteristics of each feature, including spatial patterns, temporal dynamics, and brightness changes.
- Temporal relationships between consecutive events are leveraged to ensure feature tracking consistency over time.
- Incorporating temporal information enables the model to maintain accurate correspondences between features across frames, even in challenging scenarios.
- Spatial displacement of features between consecutive frames is quantified to determine feature motion and trajectory over time.
- Techniques like optical flow estimation or feature matching are employed to compute displacement distances accurately.
- Features detected in one frame are projected onto subsequent frames to establish correspondences.
- This enables continuous tracking of features over time by associating them with their counterparts in consecutive frames.
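Picking up the forward reference in the first bullet above, the sketch below accumulates events into a generic voxel-grid style spatiotemporal representation, reusing the structured-array layout from the earlier snippet. It illustrates the general idea of a location/time/polarity representation and is not necessarily the exact representation used by the authors.

```python
import numpy as np

def events_to_voxel_grid(events: np.ndarray, num_bins: int,
                         height: int, width: int) -> np.ndarray:
    """Accumulate event polarities into a (num_bins, H, W) spatiotemporal grid.

    This is a generic voxel-grid representation often used in learning-based
    event processing; it is meant as an illustration, not as the representation
    used in the paper.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return grid
    t = events["t"]
    # Normalize timestamps to [0, num_bins - 1] and round to the nearest temporal bin.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    bins = np.round(t_norm).astype(int)
    # Accumulate each event's polarity at its (bin, y, x) location.
    np.add.at(grid, (bins, events["y"].astype(int), events["x"].astype(int)),
              events["p"].astype(np.float32))
    return grid
```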
The network architecture of the proposed event tracker comprises two main components: a feature network that predicts per-feature displacements from the input patches, and a frame attention module that fuses information across feature tracks. Both components are essential for accurately tracking features in dynamic environments.
- The event tracker receives as input a reference patch $P_0$ from a grayscale image $I_0$ and an event patch $P_j$ generated from an event stream $E_j$ at timestep $t_j$.
- Its primary objective is to predict the relative feature displacement $\Delta \hat{f}_j$ between the reference patch and the event patch.
- Individual feature processing is handled by a feature network, which integrates a ConvLSTM layer with a state $F$ to ensure temporal consistency.
- By leveraging a correlation map $C_j$, derived from the template feature vector $R_0$ of the template patch encoder and the feature map of the event patch, the feature network accurately predicts the displacement.
- Introducing a novel frame attention module significantly enhances tracking performance by sharing information across different feature tracks within an image.
- This module combines the processed feature vectors of all tracks in the image using self-attention and a temporal state $S$.
- Leveraging self-attention enables the model to prioritize relevant features across different tracks, resulting in improved tracking accuracy.
- The temporal state $S$ captures dependencies between feature tracks over time, facilitating the consideration of feature evolution.
- The fused information guides the computation of the final displacement $\Delta \hat{f}_j$, ensuring consistent and accurate feature tracking across frames.
The integration of individual feature processing with information fusion through the frame attention module enables the proposed event tracker to track features reliably in dynamic environments.
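To make the correlation step concrete, the sketch below shows one plausible way to compute a correlation map $C_j$ between a template feature vector $R_0$ and the feature map of the event patch. The tensor shapes, dimensions, and names are our assumptions, not the authors' implementation.

```python
import torch

def correlation_map(template_vec: torch.Tensor, event_feat: torch.Tensor) -> torch.Tensor:
    """Correlate a template feature vector R_0 of shape (C,) with an event-patch
    feature map of shape (C, H, W), yielding a single-channel map C_j of shape (H, W).

    Shapes and naming are illustrative assumptions, not the authors' exact code.
    """
    # Inner product between the template descriptor and every spatial location.
    return torch.einsum("c,chw->hw", template_vec, event_feat)

# Example with random tensors: a 128-dimensional descriptor and a 31x31 patch.
R0 = torch.randn(128)
event_features = torch.randn(128, 31, 31)
Cj = correlation_map(R0, event_features)
print(Cj.shape)  # torch.Size([31, 31])
```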
Figure 2: Overview of the Event Tracker Model Architecture
Feature tracking algorithms aim to track a given point in a reference frame in subsequent timesteps. They usually do this by extracting appearance information around the feature location in the reference frame, which is then matched and localized in subsequent ones. Following this pipeline, an image patch $P_0$ is extracted around the feature location in the reference frame $I_0$.
To localize the template patch $P_0$ in subsequent timesteps, the network predicts the relative feature displacement $\Delta \hat{f}_j$ from the corresponding event patch $P_j$.
To share information between features in the same image, a novel frame attention module is introduced. Since points on a rigid body exhibit correlated motion in the image plane, there is a substantial benefit in sharing information between features across the image. The frame attention module takes the feature vectors of all patches at the current timestep $t_j$ and fuses them using self-attention together with a temporal state $S$.
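The sketch below illustrates the idea of fusing per-track feature vectors with self-attention and a recurrent temporal state. The module structure, layer sizes, and state update are assumptions made for clarity; they do not reproduce the authors' exact frame attention module.

```python
import torch
import torch.nn as nn

class FrameAttentionSketch(nn.Module):
    """Illustrative fusion of per-track feature vectors via self-attention.

    The layer sizes and the state update are simplifying assumptions; this is
    not the authors' exact frame attention module.
    """
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.state_gate = nn.Linear(2 * dim, dim)

    def forward(self, track_feats: torch.Tensor, state: torch.Tensor):
        # track_feats, state: (num_tracks, dim) -- one vector per feature track.
        x = torch.cat([track_feats, state], dim=-1)
        fused_in = torch.tanh(self.state_gate(x)).unsqueeze(0)  # (1, N, dim)
        fused, _ = self.attn(fused_in, fused_in, fused_in)
        fused = fused.squeeze(0)
        # Return the fused vectors; here we also reuse them as the updated
        # temporal state (a simplification).
        return fused, fused

# Example: 16 feature tracks with 128-dimensional descriptors.
module = FrameAttentionSketch()
feats = torch.randn(16, 128)
state = torch.zeros(16, 128)
out, new_state = module(feats, state)
print(out.shape)  # torch.Size([16, 128])
```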
The network is first trained on synthetic data from the Multiflow dataset, which contains frames, synthetically generated events, and ground truth pixel flow. A loss based on the L1 distance is directly applied for each prediction step $t_j$.
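As a rough sketch, a per-step L1 supervision on the predicted displacements could look like the following; the tensor shapes are assumptions, and the authors' loss may include additional details such as truncation of large errors.

```python
import torch

def l1_displacement_loss(pred_disp: torch.Tensor, gt_disp: torch.Tensor) -> torch.Tensor:
    """L1 loss between predicted and ground-truth feature displacements.

    pred_disp, gt_disp: (num_steps, num_tracks, 2) displacement vectors.
    A plain L1 formulation for illustration only.
    """
    return (pred_disp - gt_disp).abs().sum(dim=-1).mean()

# Example: 5 prediction steps, 16 tracks
loss = l1_displacement_loss(torch.randn(5, 16, 2), torch.randn(5, 16, 2))
print(loss.item())
```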
A novel pose supervision loss, based solely on ground truth poses of a calibrated camera, is introduced to adapt the network to real events. Ground truth poses can be obtained for sparse timesteps using structure-from-motion algorithms or external motion capture systems. The supervision strategy relies on the triangulation of 3D points from these poses and is therefore applicable only in static scenes. For each predicted track, the corresponding 3D point is computed using the direct linear transform. The final pose supervision loss is constructed from the predicted feature and the reprojected feature for each available camera pose at timestep $t_j$.
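The sketch below outlines the underlying geometry: triangulate a 3D point from the predicted track using the direct linear transform (DLT), reproject it with each available camera pose, and measure the distance to the predicted feature. The function names and matrix conventions are standard multi-view geometry written for illustration, not the authors' code.

```python
import numpy as np

def triangulate_dlt(projections, points_2d):
    """Triangulate a 3D point from 2D observations via the direct linear transform.

    projections: list of 3x4 camera projection matrices P_i = K [R_i | t_i]
    points_2d:   list of corresponding (u, v) feature locations
    Standard multi-view geometry, used here only for illustration.
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X / X[3]  # homogeneous 3D point

def reprojection_residuals(projections, points_2d):
    """L2 distance between each observed feature and the reprojected 3D point."""
    X = triangulate_dlt(projections, points_2d)
    residuals = []
    for P, (u, v) in zip(projections, points_2d):
        x = P @ X
        residuals.append(np.hypot(x[0] / x[2] - u, x[1] / x[2] - v))
    return residuals
```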
We managed to reproduce some of the results of the paper, namely the benchmark results for the fine-tuned EC dataset. For this we used the authors' model checkpoint that had been fine-tuned on the EC dataset, as well as the preprocessed evaluation data they provide. This was not without complications, which are explained in the following section; the results obtained are discussed in a later section.
We also attempted to recreate the fine-tuning itself, taking their checkpoint trained on the synthetic dataset, fine-tuning it on a (smaller) sample of the EC dataset [1], and then evaluating it. This we did not manage to reproduce, as we ran into a number of issues with the pre-processing pipeline and the training script.
The initial step of the project was to try to reproduce the results on the EC dataset reported in the paper, shown in the tables below. This was done by using the pretrained weights provided in the GitHub repository and running the model on the provided datasets. Based on the results we achieved, shown in the feature age (FA) table below, our results largely verify those reported in the original paper. There is a slight variance in the results for Boxes Translation and Boxes Rotation, but it is not significant. In terms of inlier ratio, the second table shows no difference, with the exception of Boxes Rotation, where our run of the pre-trained model performs slightly better than the reported value.
Sequence Name | FA (Ours) | FA (Paper) |
---|---|---|
Shapes Translation | 0.855 | 0.856 |
Shapes Rotation | 0.793 | 0.793 |
Shapes 6DOF | 0.878 | 0.882 |
Boxes Translation | 0.844 | 0.869 |
Boxes Rotation | 0.700 | 0.691 |
Sequence Name | Inlier Ratio (Ours) | Inlier Ratio (Paper) |
---|---|---|
Shapes Translation | 0.962 | 0.962 |
Shapes Rotation | 0.950 | 0.950 |
Shapes 6DOF | 0.946 | 0.946 |
Boxes Translation | 0.980 | 0.980 |
Boxes Rotation | 0.951 | 0.949 |
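For context on the metrics above, feature age measures how long a predicted track stays close to the ground truth. The sketch below shows one plausible way to compute a normalized (relative) feature age; it is our own simplified formulation, not the exact benchmark implementation behind the numbers in the tables.

```python
import numpy as np

def relative_feature_age(pred_track: np.ndarray, gt_track: np.ndarray,
                         thresh_px: float = 5.0) -> float:
    """Simplified, hypothetical feature-age metric (not the exact benchmark code).

    pred_track, gt_track: (T, 2) arrays of feature locations at matching timesteps.
    Returns the fraction of the ground-truth track over which the prediction
    stays within `thresh_px` pixels of the ground truth.
    """
    errors = np.linalg.norm(pred_track - gt_track, axis=-1)
    exceeded = np.nonzero(errors > thresh_px)[0]
    tracked_until = exceeded[0] if exceeded.size else len(errors)
    return tracked_until / len(errors)

# Example: a 100-step track that drifts away after step 80
gt = np.zeros((100, 2))
pred = gt + np.concatenate([np.zeros((80, 2)), np.full((20, 2), 10.0)])
print(relative_feature_age(pred, gt))  # 0.8
```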
Install the dependencies with `pip install -r requirements.txt`. This won't work for torch; you need to point pip at the correct index for that dependency, so run:
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
All large files used in this project are in this Google Drive folder: here. Simply download it and extract its contents into the root directory (i.e. just the loose files and directories; they will match the required structure).
- Update all instances of `<path_to_repo>` in `configs.eval_real_defaults.yaml` with the absolute path to the root directory of the repository.
- Update the `<path>` string in `evaluate_real.py`.
- Run `python evaluate_real.py`.
- Results of inference should be in `correlation3_unscaled/timestamp/`. N.B. We provide our results, so this step is not necessary.
- Move the results into `gt/network_pred/` (we provide our own results in the drive).
- Run `python -m scripts.benchmark`.
- Results will be written to `out/benchmarking_results.csv`.
- During data preprocessing, some time query gives an error for being out of the pose data range (this can be fixed by removing the first entry, '0.000000', from the respective images.txt file; a small sketch of such a filter follows after this list).
- When running train.py, the same out-of-range query issue (with a different timestamp) is encountered. This was the blocking issue while running train.py; we could not resolve it, as we could not find an entry matching the query time.
- The COLMAP instructions in the GitHub README of the original code were sometimes wrong, and we had to refer to the COLMAP documentation (we suggest the reader do the same).
- Pre-processing steps such as feature extraction were very slow and took up to an hour to run.
- We also experienced dependency issues during setup. Not all dependencies listed in the requirements.txt file are functional, and some are missing; e.g. torch needs to be installed manually as per the official documentation. Additionally, certain parts of the code rely on deprecated functionality, necessitating the downgrading of dependency versions.
- The preprocessed data for training on EC is not provided. We had to download it here.
- The instructions to import and export the model in COLMAP are unclear, especially regarding which image folder is to be imported (we imported images_corrected).
- The intermediate files generated during the preprocessing stage are considerable in size, upwards of 3 gigabytes per sequence.
- The train.py script contained a bug that required rewriting. Our fix involves ensuring that a method is called on the class itself rather than on an instance of the class, although it is possible that our correction is incorrect.
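As referenced in the first issue above, a small helper like the one below could be used to drop pose entries whose timestamps fall outside the valid range. The assumed file layout (one entry per line, timestamp as the first whitespace-separated field) and the function name are our own; this is a hypothetical sketch, not part of the original pipeline.

```python
# Hypothetical helper for the out-of-range pose query issue: drop entries in an
# images.txt-style file whose timestamp (assumed to be the first field on each
# line) lies outside a valid range. The file layout is an assumption.
def filter_pose_entries(in_path: str, out_path: str, t_min: float, t_max: float) -> None:
    with open(in_path) as f_in, open(out_path, "w") as f_out:
        for line in f_in:
            fields = line.split()
            if not fields:
                continue
            try:
                t = float(fields[0])
            except ValueError:
                f_out.write(line)  # keep header / comment lines untouched
                continue
            if t_min <= t <= t_max:
                f_out.write(line)

# Example usage (paths and bounds are placeholders):
# filter_pose_entries("images.txt", "images_filtered.txt", 1e-6, 1e12)
```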
While the idea presented in the paper is interesting, reproducing it turned out to be difficult. Verifying the results using the pre-trained model was a success, but training a model and testing it proved very difficult. The lack of proper documentation, instructions, and data in the paper and the GitHub repository created difficult hurdles. A lot of time was spent debugging issues with missing files/directories, trying to find datasets online, and getting COLMAP to work as intended.
Member | Code | Blogpost |
---|---|---|
Jose | Reproduce results using fine-tuned checkpoint | Reproduction (sans Results) |
Mitali | Data pre-processing for POSE EC Dataset | Issues Encountered + Team details |
Dean | Debugging path & dependency issues when trying to train the model | Results + Conclusion |
Nils | Debugging the code when trying to train the model | Introduction + Method |
[1] E. Mueggler, H. Rebecq, G. Gallego, T. Delbruck, and D. Scaramuzza, “The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and SLAM,” The International Journal of Robotics Research, vol. 36, no. 2, pp. 142–149, Feb. 2017, doi: https://doi.org/10.1177/0278364917691115.
[2] N. Messikommer, C. Fang, M. Gehrig, and D. Scaramuzza, “Data-driven Feature Tracking for Event Cameras,” arXiv (Cornell University), Jan. 2022, doi: https://doi.org/10.48550/arxiv.2211.12826.