See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation, CoRL 2022 (Official Repo)

Abstract: We build a robot system that can see with a camera, hear with a contact microphone, and feel with a vision-based tactile sensor, with all three sensory modalities fused with a self-attention model. Results on two challenging tasks, dense packing and pouring, demonstrate the necessity and power of multisensory perception for robotic manipulation.

Project | Paper | Data | BibTeX

Getting Started

To clone this repo, run:

git clone https://github.com/JunzheJosephZhu/see_hear_feel.git
cd see_hear_feel

Install Dependencies

To set up the required libraries to train/test a model, run:

conda create -n "multimodal" python=3.7 -y && conda activate multimodal
pip install -r requirements.txt

Prepare dataset

You can download an example dataset here.

After downloading, unzip the archive, rename the folder to data, and place it under the project root.

To preprocess the data, run:
python utils/h5py_convert.py
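
If you want to spot-check the converted data, a minimal h5py sketch like the one below can help. The file path and dataset names here are placeholders, not the repo's actual output layout; check what utils/h5py_convert.py produces on your machine and adjust accordingly.

import h5py

# NOTE: the path below is a placeholder; utils/h5py_convert.py determines
# the actual output location and dataset keys.
with h5py.File("data/test_recordings/<episode>/converted.h5", "r") as f:
    # Recursively print every group/dataset name stored in the file.
    f.visit(print)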

To split the data into training and testing sets, run:
python split_train_val.py

Brief explanation of the example dataset: under data/test_recordings, each folder is one episode. Within an episode, timestamps.json contains the human demonstration actions and the robot's pose history, while each subfolder contains one stream of sensory inputs.
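
To get a feel for this layout, the short sketch below walks one episode folder and prints what it finds. It relies only on what is described above (a timestamps.json file plus one subfolder per sensory stream); the episode folder name is a placeholder you should replace with a real one, and the JSON key names are not assumed.

import json
from pathlib import Path

episode = Path("data/test_recordings") / "<episode_folder>"  # replace with a real episode

# timestamps.json holds the human demo actions and the robot's pose history.
with open(episode / "timestamps.json") as f:
    meta = json.load(f)
print("top-level keys in timestamps.json:", list(meta.keys()))

# Each subfolder is one stream of sensory inputs (camera frames, audio, tactile).
for stream in sorted(p for p in episode.iterdir() if p.is_dir()):
    num_files = sum(1 for _ in stream.iterdir())
    print("{}: {} files".format(stream.name, num_files))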

Train/test your own model

To train the ResNet encoder + MSA model described in the original paper, run:
python train_imitation.py --ablation vg_t_ah

Alternatively, we provide a modified implementation of TimeSformer that takes multimodal data as input. To train it, run:
python train_transformer.py --ablation vg_t_ah

Run ablation studies

To run ablation studies, change the --ablation argument. For example, to train a model with only vision + tactile inputs, run:
python train_imitation.py --ablation vg_t

Here is what each symbol means (a short parsing sketch follows the table):

Symbol  Description
vg      camera input from a gripper-mounted (first-person) camera
vf      camera input from a fixed-perspective camera
ah      audio input from a piezo-electric contact microphone attached to the platform (i.e., the peg-insertion base, or the tube for pouring)
ag      audio input from a piezo-electric contact microphone mounted on the gripper
t       GelSight tactile sensor input
Evaluate your results

To view your model's results, run:
tensorboard --logdir exp{data}{task}

Citation

If you find our work useful, please cite it using the following BibTeX entry:

@inproceedings{li2022seehearfeel,
    title={See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation},
    author={Hao Li and Yizhi Zhang and Junzhe Zhu and Shaoxiong Wang and Michelle A. Lee and Huazhe Xu and Edward Adelson and Li Fei-Fei and Ruohan Gao and Jiajun Wu},
    booktitle={CoRL},
    year={2022}
}

TODO (for Hao)

  • Add some demo videos
  • Test the setup commands locally
  • Provide a pretrained vg_t_ah model