VehicleMAE




Official PyTorch implementation of Structural Information Guided Multimodal Pre-training for Vehicle-centric Perception, Xiao Wang, Wentao Wu, Chenglong Li, Zhicheng Zhao, Zhe Chen, Yukai Shi, Jin Tang, AAAI-2024 [arXiv] [Poster]

Abstract

Understanding vehicles in images is important for various applications such as intelligent transportation and self-driving systems. Existing vehicle-centric works typically pre-train models on large-scale classification datasets and then fine-tune them for specific downstream tasks. However, they neglect the specific characteristics of vehicle perception in different tasks and might thus lead to sub-optimal performance. To address this issue, we propose a novel vehicle-centric pre-training framework called VehicleMAE, which incorporates structural information, including the spatial structure from vehicle profile information and the semantic structure from informative high-level natural language descriptions, for effective masked vehicle appearance reconstruction. To be specific, we explicitly extract the sketch lines of vehicles as a form of spatial structure to guide vehicle reconstruction. The more comprehensive knowledge distilled from the large CLIP model, based on the similarity between paired/unpaired vehicle image-text samples, is further taken into consideration to help achieve a better understanding of vehicles. A large-scale dataset, termed Autobot1M, is built to pre-train our model; it contains about 1M vehicle images and 12,693 text descriptions. Extensive experiments on four vehicle-based downstream tasks fully validate the effectiveness of our VehicleMAE.

[Overview figure]

Video Tutorial

  • The video tutorial for this work can be found by clicking the image below:

[Video tutorial thumbnail]

Our Proposed Framework VehicleMAE

[Figure: the VehicleMAE framework]

Environment Setting

Configure the environment by installing the dependencies listed in the requirements.txt file.
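
As a minimal sketch, assuming a conda setup (the environment name and Python version below are illustrative assumptions, not prescribed by the repository):

# Create and activate a fresh environment, then install the listed dependencies
conda create -n vehiclemae python=3.8 -y
conda activate vehiclemae
pip install -r requirements.txt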

Dataset Download

[Figure: an overview of the Autobot1M dataset]

Baidu Netdisk link: download

Extraction code: tpds

Pre-trained Model Download

Pre-trained model: ViT-Base
Pre-trained checkpoint: download
Extraction code: 6zkx
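
After downloading, a quick way to confirm the checkpoint loads is the one-liner below; the checkpoint file name is an assumption, so replace it with the actual name of the downloaded file.

# Check that the checkpoint loads and inspect its top-level keys (file name is an assumption)
python -c "import torch; ckpt = torch.load('vit_base_vehiclemae.pth', map_location='cpu'); print(type(ckpt), list(ckpt)[:10])"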

Training

# To pre-train VehicleMAE on a single GPU, run:
CUDA_VISIBLE_DEVICES=0 python main.py
# To pre-train VehicleMAE on multiple GPUs, run:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 main.py
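
On recent PyTorch releases, torch.distributed.launch is deprecated in favor of torchrun. Assuming main.py reads the local rank from the LOCAL_RANK environment variable (which torchrun sets), an equivalent multi-GPU launch would be:

# torchrun alternative to torch.distributed.launch (assumes main.py reads LOCAL_RANK from the environment)
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 main.py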

Experimental Results

We used full fine-tuning to test the pre-trained model on four downstream tasks. The results are shown in the table below.

| Method     | Dataset     | VAR mA | VAR Acc | VAR F1 | V-Reid mAP | V-Reid R1 | VFR Acc | VPS mIoU | VPS mAcc |
|------------|-------------|--------|---------|--------|------------|-----------|---------|----------|----------|
| Scratch    | -           | 84.67  | 80.86   | 84.90  | 35.3       | 57.3      | 24.8    | 49.36    | 59.22    |
| MoCov3     | ImageNet-1K | 90.38  | 93.88   | 95.33  | 75.5       | 94.4      | 91.3    | 73.17    | 78.60    |
| DINO       | ImageNet-1K | 89.92  | 91.09   | 93.11  | 64.3       | 91.5      | -       | 68.43    | 73.37    |
| IBOT       | ImageNet-1K | 89.51  | 90.17   | 92.37  | 68.9       | 92.6      | 81.1    | 66.03    | 71.06    |
| MAE        | ImageNet-1K | 89.69  | 93.60   | 95.08  | 76.7       | 95.8      | 91.2    | 69.54    | 75.36    |
| MAE        | Autobot1M   | 90.19  | 94.06   | 95.43  | 75.5       | 95.4      | 91.3    | 69.00    | 75.36    |
| VehicleMAE | Autobot1M   | 92.21  | 94.91   | 96.17  | 85.6       | 97.9      | 94.5    | 73.29    | 80.22    |

The four downstream tasks are vehicle attribute recognition (VAR), vehicle re-identification (V-Reid), vehicle fine-grained recognition (VFR), and vehicle partial segmentation (VPS).

Visual Results

[Figure: masked vehicle image reconstruction results]

[Figure: attention map visualizations]

Acknowledgement

[MAE] [BDCN] [CLIP]

Citation

If you find this work helpful for your research, please cite the following paper and give us a star.

@misc{wang2023structural,
      title={Structural Information Guided Multimodal Pre-training for Vehicle-centric Perception}, 
      author={Xiao Wang and Wentao Wu and Chenglong Li and Zhicheng Zhao and Zhe Chen and Yukai Shi and Jin Tang},
      year={2023},
      eprint={2312.09812},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

If you have any problems with this work, please open an issue.