Awesome-VLM-AD-ITS

This repository collects research papers on large Vision Language Models in Autonomous Driving and Intelligent Transportation Systems. It will be continuously updated to track the latest work.


Vision Language Models in Autonomous Driving: A Survey and Outlook 🚘


This repository collects the research papers covered in Vision Language Models in Autonomous Driving: A Survey and Outlook. The repo, maintained by TUM-AIR, will be continuously updated to track the latest work in the community.

Keywords: Vision Language Model, Large Language Model, Autonomous Driving, Intelligent Vehicle, Conditional Data Generation, Decision Making, Language-guided Navigation, End-to-End Autonomous Driving

🔥 Update

🤝 Citation

Please visit Vision Language Models in Autonomous Driving: A Survey and Outlook for more details and comprehensive information. If you find our paper and repo helpful, please consider citing it as follows:

@ARTICLE{10531702,
  author={Zhou, Xingcheng and Liu, Mingyu and Yurtsever, Ekim and Zagar, Bare Luka and Zimmer, Walter and Cao, Hu and Knoll, Alois C.},
  journal={IEEE Transactions on Intelligent Vehicles}, 
  title={Vision Language Models in Autonomous Driving: A Survey and Outlook}, 
  year={2024},
  pages={1-20},
  keywords={Autonomous vehicles;Task analysis;Planning;Data models;Surveys;Computational modeling;Visualization;Vision Language Model;Large Language Model;Autonomous Driving;Intelligent Vehicle;Conditional Data Generation;Decision Making;Language-guided Navigation;End-to-End Autonomous Driving},
  doi={10.1109/TIV.2024.3402136}}

📃 Introduction

The applications of Vision-Language Models (VLMs) in the fields of Autonomous Driving (AD) and Intelligent Transportation Systems (ITS) have attracted widespread attention due to their outstanding performance and their ability to leverage Large Language Models (LLMs). By integrating language data, vehicles and transportation systems can understand real-world environments more deeply, improving driving safety and efficiency.

🌟 Large VLMs in Autonomous Driving

Perception and Understanding

| Method | Year | Task | Code Link |
| --- | --- | --- | --- |
| The Traffic Scene Understanding and Prediction Based on Image Captioning | 2020 | Image Captioning | |
| VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision | 2023 | Pedestrian Detection | GitHub |
| Unsupervised Multi-view Pedestrian Detection | 2023 | Pedestrian Detection | |
| Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving | 2023 | Single Object Referring | |
| Referring Multi-Object Tracking | 2023 | Multiple Objects Referring and Tracking | GitHub |
| Language Prompt for Autonomous Driving | 2023 | Multiple Objects Referring and Tracking | GitHub |
| OpenScene: 3D Scene Understanding with Open Vocabularies | 2023 | Open-Voc 3D Semantic Segmentation | GitHub |
| CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP | 2023 | Open-Voc 3D Semantic Segmentation | GitHub |
| Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving | 2023 | Open-Voc 3D Object Detection and Tracking | |
| Zelda: Video Analytics using Vision-Language Models | 2023 | Language-guided Video Retrieval | |
| NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario | 2023 | Visual Question Answering | GitHub |
| Talk2BEV: Language-Enhanced Bird's Eye View (BEV) Maps | 2023 | Visual Spatial Reasoning, Open-loop Decision-Making | GitHub |
| Semantic Anomaly Detection with Large Language Models | 2023 | Semantic Anomaly Detection | |
| Embodied Understanding of Driving Scenarios | 2024 | Visual Spatial Reasoning | GitHub |
| MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding | 2024 | Visual Spatial Reasoning | GitHub |

Navigation and Planning

| Method | Year | Task | Code Link |
| --- | --- | --- | --- |
| Talk to the vehicle: Language conditioned autonomous navigation of self driving car | 2019 | Language-Guided Navigation | |
| Ground then Navigate: Language-guided Navigation in Dynamic Scenes | 2022 | Language-Guided Navigation | |
| ALT-Pilot: Autonomous navigation with Language augmented Topometric maps | 2023 | Vision-Language Localization, Language-Guided Navigation | Page |
| GPT-Driver: Learning to Drive with GPT | 2023 | Motion Planning | GitHub |
| Can you text what is happening? Integrating pre-trained language encoders into trajectory prediction models for autonomous driving | 2023 | Trajectory Prediction | |
| DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | 2024 | Trajectory Prediction, Motion Planning | GitHub |
| Text-to-Drive: Diverse Driving Behavior Synthesis via Large Language Models | 2024 | Trajectory Prediction, Motion Planning | GitHub |

Decision-Making and Control

| Method | Year | Task | Code Link |
| --- | --- | --- | --- |
| Advisable Learning for Self-driving Vehicles by Internalizing Observation-to-Action Rules | 2020 | Open-loop Decision-Making | |
| LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving | 2023 | Open-loop Decision-Making | |
| Receive, Reason, and React: Drive as You Say with Large Language Models in Autonomous Vehicles | 2023 | Open-loop Decision-Making, Motion Planning | |
| Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving | 2023 | Open-loop Control, Visual Spatial Reasoning | GitHub |
| DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models | 2023 | Closed-loop Decision-Making | |
| SurrealDriver: Designing Generative Driver Agent Simulation Framework in Urban Contexts based on Large Language Model | 2023 | Closed-loop Decision-Making | |
| Drive Like a Human: Rethinking Autonomous Driving with Large Language Models | 2024 | Closed-loop Decision-Making | GitHub |

End-to-End Autonomous Driving

| Method | Year | Task | Code Link |
| --- | --- | --- | --- |
| DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | 2023 | Open-loop Control, Visual Question Answering | |
| ADAPT: Action-aware Driving Caption Transformer | 2023 | Open-loop Decision-Making, Visual Spatial Reasoning | GitHub |
| DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving | 2023 | Closed-loop Control | GitHub |
| VLP: Vision Language Planning for Autonomous Driving | 2023 | Open-loop Control, 3D Object Detection and Tracking | |

Data Generation

| Method | Year | Task | Code Link |
| --- | --- | --- | --- |
| DriveGAN: Towards a Controllable High-Quality Neural Simulation | 2021 | Conditional Video Generation | Page |
| GAIA-1: A Generative World Model for Autonomous Driving | 2023 | Conditional Video Generation | Page |
| DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving | 2023 | Conditional Video Generation | GitHub |
| DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model | 2023 | Conditional Multi-view Video Generation | GitHub |
| BEVControl: Accurately Controlling Street-view Elements with Multi-perspective Consistency via BEV Sketch Layout | 2023 | Conditional Image Generation | |
| DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation | 2023 | Conditional Video Generation | GitHub |

🌟 Large VLMs in Intelligent Transportation Systems

ITS Perception and Understanding

| Method | Year | Task | Code Link |
| --- | --- | --- | --- |
| A Multi-granularity Retrieval System for Natural Language-based Vehicle Retrieval | 2022 | Language-Guided Vehicle Retrieval | Page |
| Tracked-Vehicle Retrieval by Natural Language Descriptions With Multi-Contextual Adaptive Knowledge | 2023 | Language-Guided Vehicle Retrieval | Page |
| A Unified Multi-modal Structure for Retrieving Tracked Vehicles through Natural Language Descriptions | 2023 | Language-Guided Vehicle Retrieval | Page |
| Traffic-Domain Video Question Answering with Automatic Captioning | 2023 | Image Captioning, Visual Question Answering | |
| Causality-aware Visual Scene Discovery for Cross-Modal Question Reasoning | 2023 | Visual Question Answering | |
| Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer | 2023 | Visual Question Answering | GitHub |
| Delving into CLIP latent space for Video Anomaly Recognition | 2023 | Video Anomaly Recognition | GitHub |

ITS Management System

| Method | Year | Task | Code Link |
| --- | --- | --- | --- |
| LLM Powered Sim-to-real Transfer for Traffic Signal Control | 2023 | Traffic Signal Control | |

🌟 Dataset

Autonomous Driving Datasets

| Dataset | Year | Task | Data Link |
| --- | --- | --- | --- |
| Pedestrian Detection: A Benchmark | 2009 | 2D OD | Link |
| Vision meets robotics: The KITTI dataset | 2012 | 2D/3D OD, SS, OT | Link |
| The Cityscapes Dataset for Semantic Urban Scene Understanding | 2016 | 2D/3D OD, SS | Link |
| CityPersons: A diverse dataset for pedestrian detection | 2017 | 2D OD | Link |
| SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences | 2019 | 3D SS | Link |
| CityFlow: A city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification | 2019 | OT, ReID | Link |
| nuScenes: A multimodal dataset for autonomous driving | 2020 | 2D/3D OD, 2D/3D SS, OT, MP | Link |
| BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning | 2020 | 2D OD, 2D SS, OT | Link |
| Scalability in Perception for Autonomous Driving: Waymo Open Dataset | 2020 | 2D/3D OD, 2D/3D SS, OT | Link |

Abbreviations: OD = Object Detection, SS = Semantic Segmentation, OT = Object Tracking, ReID = Re-Identification, MP = Motion Prediction.

Language-Enhanced Autonomous Driving Datasets

| Dataset | Year | Task | Data Link |
| --- | --- | --- | --- |
| Textual explanations for self-driving vehicles | 2018 | Textual Explanation | Link |
| Object referring in videos with language and human gaze | 2018 | Object Detection | |
| Touchdown: Natural language navigation and spatial reasoning in visual street environments | 2019 | Visual-Spatial Reasoning, Vision-Language Navigation | Link |
| Talk to the vehicle: Language conditioned autonomous navigation of self driving cars | 2019 | Vision-Language Navigation | |
| Grounding human-to-vehicle advice for self-driving vehicles | 2019 | Human-to-Vehicle Advice | Link |
| Talk2Car: Taking control of your self-driving car | 2020 | Single Object Referring | Link |
| CityFlow-NL: Tracking and retrieval of vehicles at city scale by natural language descriptions | 2021 | Vehicle Retrieval, Object Tracking | |
| Ground then navigate: Language-guided navigation in dynamic scenes | 2022 | Vision-Language Navigation | |
| Language prompt for autonomous driving | 2023 | Object Tracking | Link |
| NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario | 2023 | Visual Question Answering | Link |
| Referring multi-object tracking | 2023 | Object Tracking | Link |
| Talk2BEV: Language-enhanced Bird's-eye View Maps for Autonomous Driving | 2023 | Visual-Spatial Reasoning, Decision Making | Link |
| Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving | 2023 | Visual Question Answering | Link |
| DRAMA: Joint Risk Localization and Captioning in Driving | 2023 | Image Captioning, Visual Question Answering | Link |
| Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning | 2023 | Importance Ranking, Visual-Spatial Reasoning | |

Language-Enhanced Intelligent Transportation Systems Datasets

| Dataset | Year | Task | Data Link |
| --- | --- | --- | --- |
| Future Frame Prediction for Anomaly Detection – A New Baseline | 2018 | Anomaly Detection | Link |
| Real-world Anomaly Detection in Surveillance Videos | 2018 | Anomaly Detection | Link |
| SUTD-TrafficQA: A question answering benchmark and an efficient network for video reasoning over traffic events | 2021 | Visual Question Answering | Link |
| AerialVLN: Vision-and-Language Navigation for UAVs | 2023 | Vision-Language Navigation | Link |

License

This repository is released under the Apache 2.0 license.