DMR

This repository contains the official code for "DMR: Decomposed Multi-Modality Representations for Frames and Events Fusion in Visual Reinforcement Learning".

  • [2024/02/26]: DMR is accepted to CVPR 2024.
  • [2023/11/18]: DMR is currently under review for CVPR 2024.

BibTeX

If this work is helpful for your research, please consider citing it with the following BibTeX entry.

@inproceedings{xu2024dmr,
  title={DMR: Decomposed Multi-Modality Representations for Frames and Events Fusion in Visual Reinforcement Learning},
  author={Xu, Haoran and Peng, Peixi and Tan, Guang and Li, Yuan and Xu, Xinhai and Tian, Yonghong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={26508--26518},
  year={2024}
}

Abstract

We explore visual reinforcement learning (RL) using two complementary visual modalities: a frame-based RGB camera and an event-based Dynamic Vision Sensor (DVS). Existing multi-modality visual RL methods often struggle to extract task-relevant information from multiple modalities while suppressing the increased noise, since they rely only on indirect reward signals rather than pixel-level supervision. To tackle this, we propose a Decomposed Multi-Modality Representation (DMR) framework for visual RL. It explicitly decomposes the inputs into three distinct components: combined task-relevant features (co-features), RGB-specific noise, and DVS-specific noise. The co-features represent the full information from both modalities that is relevant to the RL task; the two noise components, each constrained by a data reconstruction loss to avoid information leak, are contrasted with the co-features to maximize their difference.

Overview of the DMR learning framework:

[figure: DMR framework overview]
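
To make the decomposition concrete, below is a minimal PyTorch-style sketch of the objective described in the abstract: co-features plus two modality-specific noise branches, reconstruction losses that constrain the noise, and a contrastive term that pushes the noise away from the co-features. All module names, tensor shapes, and the use of a generic InfoNCE loss (the requirements include info-nce-pytorch) are illustrative assumptions, not the repository's actual API.

# Conceptual sketch of the DMR objective (illustrative only; names/shapes are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F
from info_nce import InfoNCE  # provided by the info-nce-pytorch package in the requirements

class DMRSketch(nn.Module):
    def __init__(self, feat_dim=128, rgb_shape=(3, 84, 84), dvs_shape=(2, 84, 84)):
        super().__init__()
        # One encoder for the combined task-relevant co-features, two modality-specific noise encoders.
        self.co_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        self.rgb_noise_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        self.dvs_noise_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        # Decoders reconstruct each modality from co-features plus that modality's noise,
        # which constrains the noise branches and avoids information leak.
        self.rgb_decoder = nn.LazyLinear(rgb_shape[0] * rgb_shape[1] * rgb_shape[2])
        self.dvs_decoder = nn.LazyLinear(dvs_shape[0] * dvs_shape[1] * dvs_shape[2])
        self.contrast = InfoNCE()  # generic InfoNCE; the paper's exact contrastive term may differ

    def forward(self, rgb, dvs):
        both = torch.cat([rgb.flatten(1), dvs.flatten(1)], dim=1)
        z_co = self.co_encoder(both)          # combined task-relevant co-features
        n_rgb = self.rgb_noise_encoder(rgb)   # RGB-specific noise
        n_dvs = self.dvs_noise_encoder(dvs)   # DVS-specific noise

        # Reconstruction losses for both modalities.
        rgb_hat = self.rgb_decoder(torch.cat([z_co, n_rgb], dim=1))
        dvs_hat = self.dvs_decoder(torch.cat([z_co, n_dvs], dim=1))
        loss_rec = F.mse_loss(rgb_hat, rgb.flatten(1)) + F.mse_loss(dvs_hat, dvs.flatten(1))

        # Contrast the co-features against both noise embeddings to maximize their difference;
        # the positive pair here is a lightly perturbed view (a stand-in for the paper's scheme).
        z_co_pos = self.co_encoder(both + 0.01 * torch.randn_like(both))
        loss_con = self.contrast(z_co, z_co_pos, torch.cat([n_rgb, n_dvs], dim=0))
        return loss_rec, loss_con

# Example: loss_rec, loss_con = DMRSketch()(torch.rand(8, 3, 84, 84), torch.rand(8, 2, 84, 84))

In the actual framework these representation losses are optimized jointly with the TD objective of the RL agent.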

Several typical visual examples

  • Illustration of our motivation
[figure: motivation]
(i) In the first row, insufficient ambient light causes RGB underexposure, so the pedestrian ahead is overlooked and the resulting forward policy along the lane direction could lead to a collision.

(ii) In the second row, the lack of texture in DVS causes the person to blend into the background, leading to a left-turn policy that avoids the highlighted area on the right.

(iii) In contrast, our method (third row) fully exploits both RGB and DVS, extracting task-relevant information and discarding task-irrelevant, noisy information through joint TD and DMR learning, thereby obtaining an optimal evasion policy.
  • Illustration of the decomposition capability of DMR
(i) The first row depicts the original observations and the corresponding CAMs of DMR. In extremely low-light conditions, DVS can capture the front pedestrian, while the RGB camera suffers from exposure failure.


(ii) The second row shows that RGB noise highlights the high beam region on the road, while DVS noise is activated across a broader region, with the highest activation on the building.


(iii) The co-features in the third row attend to both the pedestrian and the right roadside, which are crucial for driving decision-making.

  • A long sequence demonstration
Time    | RGB Frame | DVS Events | RGB Noise | DVS Noise | Co-features on RGB | Co-features on DVS
Time #1 | [image]   | [image]    | [image]   | [image]   | [image]            | [image]
Time #2 | [image]   | [image]    | [image]   | [image]   | [image]            | [image]
Time #3 | [image]   | [image]    | [image]   | [image]   | [image]            | [image]

The table above illustrates a vehicle with high-beam headlights approaching from far to near in the opposite lane at three time instances, Time #1, #2, and #3. The RGB noise emphasizes the vehicle's high-beam headlights and the buildings on the right, whereas the DVS noise focuses on the dense event region on the right. Both types of noise contain a substantial amount of task-irrelevant information, covering unnecessarily broad areas. In contrast, the co-features generate a more focused, RL-relevant area by excluding irrelevant regions; this area precisely covers the vehicle in the opposite lane and the right roadside, which are crucial cues for driving policies.

The variations in Class Activation Mapping (CAM) closely mirror the changes in the real scene throughout the sequence. As the vehicle approaches, the RGB noise broadens due to illumination changes and the co-features focus more on the vehicle. Within the co-features, emphasis on the left roadside gradually increases, and the CAM uniformly covers the right roadside.
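
The heatmaps referred to above are standard CAM-style visualizations. The snippet below is a generic sketch of how such a map can be produced from a convolutional feature map of any branch (co-features, RGB noise, or DVS noise); it is not the repository's actual visualization code, and the tensor names are hypothetical.

# Generic CAM-style heatmap from a convolutional feature map (illustrative; not the repo's code).
import torch
import torch.nn.functional as F

def cam_heatmap(feature_map, weights, out_hw=(84, 84)):
    # feature_map: (C, H, W) activations of one branch (co-features, RGB noise, or DVS noise).
    # weights: (C,) per-channel importance, e.g. spatially pooled gradients as in Grad-CAM.
    cam = torch.relu((weights[:, None, None] * feature_map).sum(dim=0))   # (H, W)
    cam = F.interpolate(cam[None, None], size=out_hw, mode="bilinear",
                        align_corners=False)[0, 0]                        # resize to the input size
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)             # normalize to [0, 1] for overlay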

Repository requirements

  • Create a Python environment with conda and install the dependencies (a quick sanity check follows the commands below):
conda create -n carla-py37 python=3.7 -y
conda activate carla-py37
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -U gym==0.17.3 cloudpickle==1.5.0 numba==0.51.2 wincertstore==0.2 tornado==4.5.3 msgpack-python==0.5.6 msgpack-rpc-python==0.4.1 stable-baselines3==0.8.0 opencv-python==4.7.0.72 imageio[ffmpeg]==2.28.0 dotmap==1.3.30 termcolor==2.3.0 matplotlib==3.5.3 seaborn-image==0.4.4 scipy==1.7.3 info-nce-pytorch==0.1.4 spikingjelly cupy-cuda117 scikit-image tensorboard kornia timm einops -i https://pypi.tuna.tsinghua.edu.cn/simple
cd carla_root_directory/PythonAPI/carla/dist
pip install carla-0.9.13-cp37-cp37m-manylinux_2_27_x86_64.whl
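
A quick sanity check such as the following can confirm that PyTorch sees the GPU and that the CARLA Python client connects. It assumes a CARLA 0.9.13 server is already running on RPC port 12121, matching the launch command in the next section.

# Environment sanity check (assumes a CARLA 0.9.13 server is running on RPC port 12121).
import torch
import carla

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

client = carla.Client("localhost", 12121)  # match the -carla-rpc-port passed to CarlaUE4.sh
client.set_timeout(10.0)
print("connected to map:", client.get_world().get_map().name)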

DMR training & evaluation

  • Run the CARLA server:
DISPLAY= ./CarlaUE4.sh -opengl -RenderOffScreen -carla-rpc-port=12121  # headless mode
  • Run DMR:
bash auto_run_batch_modal.sh
  • Key parameter choices in train_testm.py (see the example launcher after this list):
    • selected_scenario: 'jaywalk', 'highbeam'
    • selected_weather: 'midnight', 'hard_rain'
    • perception_type:
      • single-modality perception: 'RGB-Frame', 'DVS-Frame', 'DVS-Voxel-Grid', 'LiDAR-BEV', 'Depth-Frame'
      • multi-modality perception: 'RGB-Frame+DVS-Frame', 'RGB-Frame+DVS-Voxel-Grid', 'RGB-Frame+Depth-Frame', 'RGB-Frame+LiDAR-BEV'
    • encoder_type:
      • single-modality encoder: 'pixelCarla098'
      • multi-modality encoder: 'DMR_CNN', 'pixelEFNet', 'pixelFPNNet', 'pixelRENet', ...
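
As an example, a single run for one scenario/weather/modality combination might be launched roughly as follows. The command-line flag names are assumptions derived from the parameter names above; consult train_testm.py or auto_run_batch_modal.sh for the actual interface.

# Illustrative single-run launcher; flag names are assumed from the parameters listed above --
# check train_testm.py / auto_run_batch_modal.sh for the real interface.
import subprocess

cmd = [
    "python", "train_testm.py",
    "--selected_scenario", "jaywalk",            # or 'highbeam'
    "--selected_weather", "midnight",            # or 'hard_rain'
    "--perception_type", "RGB-Frame+DVS-Frame",  # one of the perception choices above
    "--encoder_type", "DMR_CNN",                 # DMR's multi-modality encoder
]
subprocess.run(cmd, check=True)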