(https://ieeexplore.ieee.org/document/10490096)
This work builds upon TSformer-VO; we thank its authors for their contribution: (https://github.com/aofrancani/TSformer-VO)
Traditional deep-learning-based visual odometry methods typically involve depth point clouds, optical flow, images, and manually designed geometric constraints. They exploit both temporal and spatial dimensions to regress pose, relying heavily on data-driven techniques and manual engineering. The aim of this paper is a simpler end-to-end approach: the problem is recast as a purely image-based one, and 6-DoF visual odometry is regressed from consecutive grayscale images alone while reaching state-of-the-art performance. The paper introduces a novel monocular visual odometry network, SWformer-VO, which uses the Swin Transformer as its backbone. It directly estimates the six-degrees-of-freedom camera pose under monocular camera conditions from a modest volume of image sequence data in an end-to-end manner. SWformer-VO introduces an embedding module called "Mixture Embed," which fuses each pair of consecutive images into a single frame and converts it into tokens that are passed to the backbone network, replacing traditional temporal-sequence schemes by addressing the problem at the image level. Building on this foundation, the paper further improves and optimizes the backbone's parameters, and experiments explore how different layer configurations and depths of the backbone affect accuracy. On the KITTI dataset, SWformer-VO demonstrates superior accuracy compared to common deep-learning-based methods introduced in recent years, such as SFMlearner, Deep-VO, TSformer-VO, Depth-VO-Feat, GeoNet, and Masked GANs. The effectiveness of SWformer-VO is also validated on our self-collected dataset of nine indoor corridor routes for visual odometry.
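To make the "Mixture Embed" idea concrete, the following is a minimal PyTorch sketch of how two consecutive grayscale frames could be fused into one multi-channel input and turned into patch tokens for a Swin-style backbone. It is an assumption-based illustration, not the paper's actual module; the class name, patch size, and embedding dimension are all hypothetical.

```python
import torch
import torch.nn as nn

class MixtureEmbedSketch(nn.Module):
    """Illustrative sketch (not the paper's code): fuse two consecutive
    grayscale frames into a single 2-channel "image" and convert it into
    patch tokens, as a Swin-style backbone would consume them."""

    def __init__(self, patch_size=4, embed_dim=96):
        super().__init__()
        # A strided convolution is the standard way to produce patch tokens.
        self.proj = nn.Conv2d(2, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, frame_t, frame_t1):
        # frame_t, frame_t1: (B, 1, H, W) grayscale images at times t and t+1
        pair = torch.cat([frame_t, frame_t1], dim=1)   # (B, 2, H, W)
        tokens = self.proj(pair)                       # (B, C, H/ps, W/ps)
        return tokens.flatten(2).transpose(1, 2)       # (B, N_tokens, C)
```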
Download the KITTI odometry dataset (grayscale).
Create a symbolic link (Windows) or a soft link (Linux) to the downloaded dataset in the `data` folder:
- On Windows:
mklink /D <path_to_your_project>\code_root_dir\data <path_to_your_downloaded_data>
- On Linux:
ln -s <path_to_your_downloaded_data> <path_to_your_project>/code_root_dir/data
The data structure should be as follows:
```
|---code_root_dir
    |---data
        |---sequences_jpg
            |---00
                |---image_0
                    |---000000.png
                    |---000001.png
                    |---...
                |---image_1
                    |---...
                |---image_2
                    |---...
                |---image_3
                    |---...
            |---01
            |---...
        |---poses
            |---00.txt
            |---01.txt
            |---...
```
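If useful, the layout above can be sanity-checked with a small script like the one below. The script name and helper function are hypothetical and not part of the repository; only the folder names come from the structure shown above.

```python
# check_dataset.py (hypothetical helper): verify the expected KITTI layout
# under code_root_dir/data before training.
import os

DATA_ROOT = "data"  # the symlink created above

def check_layout(sequences=("00", "01")):
    for seq in sequences:
        img_dir = os.path.join(DATA_ROOT, "sequences_jpg", seq, "image_0")
        pose_file = os.path.join(DATA_ROOT, "poses", f"{seq}.txt")
        n_frames = len(os.listdir(img_dir)) if os.path.isdir(img_dir) else 0
        print(f"sequence {seq}: {n_frames} frames, "
              f"poses file {'found' if os.path.isfile(pose_file) else 'MISSING'}")

if __name__ == "__main__":
    check_layout()
```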
- Create a virtual environment using Anaconda and activate it:
conda create -n vo python==3.8.0
conda activate vo
- Install dependencies (with environment activated):
pip install -r requirements.txt
PS: For now, settings and hyperparameters are changed directly in the variables and dictionaries. As future work, we will use pre-set configurations with the argparse module to provide a user-friendly interface.
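For reference, such an argparse-based interface could look roughly like the sketch below. The flag names and defaults are purely illustrative assumptions and are not part of the current code base.

```python
# Hypothetical sketch of a future argparse interface; flag names are illustrative.
import argparse

def get_args():
    parser = argparse.ArgumentParser(description="SWformer-VO training")
    parser.add_argument("--data_dir", default="data", help="path to the KITTI data symlink")
    parser.add_argument("--epochs", type=int, default=100)
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--checkpoint_path", default="checkpoints/Model1")
    return parser.parse_args()

if __name__ == "__main__":
    args = get_args()
    print(vars(args))  # would be forwarded to the training loop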
In `train.py`:
- Manually set the configuration in `args` (Python dict);
- Manually set the model hyperparameters in `model_params` (Python dict) (an illustrative sketch of both dictionaries is shown below);
- Save and run `train.py`.
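As an illustration of the step above, the two dictionaries might look roughly like the following. The exact keys and default values in `train.py` may differ; everything here except the `checkpoints/Model1` path (used later for inference) is an assumption.

```python
# Illustrative only: the exact keys in train.py may differ.
args = {
    "data_dir": "data",                    # symlink created in the setup step
    "checkpoint_path": "checkpoints/Model1",
    "epochs": 100,
    "batch_size": 4,
    "lr": 1e-4,
}

model_params = {
    "img_size": (192, 640),   # input resolution fed to the backbone (assumed)
    "patch_size": 4,
    "embed_dim": 96,
    "depths": (2, 2, 6, 2),   # Swin-style stage depths (assumed)
    "num_heads": (3, 6, 12, 24),
    "num_outputs": 6,         # 6-DoF pose (3 translation + 3 rotation)
}
```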
In `predict_poses.py`:
- Manually set the variables to read the checkpoint and sequences (an example follows the table below).
| Variables | Info |
|---|---|
| `checkpoint_path` | String with the path to the trained model you want to use for inference. Ex: `checkpoint_path = "checkpoints/Model1"` |
| `checkpoint_name` | String with the name of the desired checkpoint (name of the `.pth` file). Ex: `checkpoint_name = "checkpoint_model2_exp19"` |
| `sequences` | List of strings representing the KITTI sequences. Ex: `sequences = ["03", "04", "10"]` |
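Putting the examples from the table together, the relevant variables in `predict_poses.py` would be set like this:

```python
# Values taken from the examples in the table above:
checkpoint_path = "checkpoints/Model1"        # folder with the trained model
checkpoint_name = "checkpoint_model2_exp19"   # name of the .pth checkpoint file
sequences = ["03", "04", "10"]                # KITTI sequences to run inference on
```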
In `plot_results.py`:
- Manually set the variables for the checkpoint and desired sequences, similarly to the inference step above.
The evaluation is done with the KITTI odometry evaluation toolbox. Please go to the evaluation repository to see more details about the evaluation metrics and how to run the toolbox.
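The official metrics come from the KITTI toolbox linked above. As a quick, informal sanity check only (not a replacement for the toolbox), the snippet below loads KITTI-format pose files (12 values per line, forming a 3x4 pose matrix) and compares the translation components of two trajectories; the file paths are hypothetical.

```python
# Quick sanity check only (NOT the KITTI toolbox metric): compare the
# translations of predicted and ground-truth KITTI-format pose files.
import numpy as np

def load_kitti_poses(path):
    # Each line holds 12 floats: a row-major 3x4 camera pose matrix.
    return np.loadtxt(path).reshape(-1, 3, 4)

def mean_translation_error(gt_file, pred_file):
    gt, pred = load_kitti_poses(gt_file), load_kitti_poses(pred_file)
    n = min(len(gt), len(pred))
    # The translation is the last column of each 3x4 matrix.
    return float(np.mean(np.linalg.norm(gt[:n, :, 3] - pred[:n, :, 3], axis=1)))

# Example (paths are hypothetical):
# print(mean_translation_error("data/poses/03.txt", "results/03_pred.txt"))
```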
Please cite our paper if you find this research useful in your work:
```
@ARTICLE{10490096,
  author={Wu, Zhigang and Zhu, Yaohui},
  journal={IEEE Robotics and Automation Letters},
  title={SWformer-VO: A Monocular Visual Odometry Model Based on Swin Transformer},
  year={2024},
  volume={9},
  number={5},
  pages={4766-4773},
  keywords={Transformers;Visual odometry;Cameras;Training;Odometry;Image segmentation;Deep learning;Deep learning;monocular visual odometry;transformer},
  doi={10.1109/LRA.2024.3384911}}
```