Abstract: We build a robot system that can see with a camera, hear with a contact microphone, and feel with a vision-based tactile sensor, with all three sensory modalities fused by a self-attention model. Results on two challenging tasks, dense packing and pouring, demonstrate the necessity and power of multisensory perception for robotic manipulation.
To clone this repo, run:
git clone https://github.com/JunzheJosephZhu/see_hear_feel.git
cd see_hear_feel
To set up the required libraries to train/test a model, run:
conda create -n "multimodal" python=3.7 -y && conda activate multimodal
pip install -r requirements.txt
You can download an example dataset here. After downloading, unzip the archive, rename the extracted folder to data, and place it under the project folder.
To preprocess the data, run
python utils/h5py_convert.py
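If you want to sanity-check the converted files, the short sketch below walks an HDF5 file and prints every dataset's name, shape, and dtype. The example path and the internal layout are assumptions (they depend on what utils/h5py_convert.py actually writes), so treat it purely as an inspection aid.

```python
# Minimal sketch: walk a converted HDF5 file and print each dataset's name,
# shape, and dtype. The example path below is hypothetical -- point it at
# whatever utils/h5py_convert.py produced for your data.
import h5py

def print_h5_tree(path):
    with h5py.File(path, "r") as f:
        def visitor(name, obj):
            if isinstance(obj, h5py.Dataset):
                print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
        f.visititems(visitor)

if __name__ == "__main__":
    print_h5_tree("data/test_recordings/episode_0000/audio.h5")  # hypothetical file
```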
To split the training/testing dataset, run
python split_train_val.py
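For reference, a split at the episode level can be as simple as the sketch below; split_train_val.py is the authoritative implementation, and the folder layout, output file name (split.json), and validation ratio here are assumptions for illustration.

```python
# Reference-only sketch of an episode-level train/val split. The repo's
# split_train_val.py is the real implementation; the folder layout, output
# file name, and validation ratio below are assumptions.
import json
import os
import random

def split_episodes(data_root="data/test_recordings", val_ratio=0.2, seed=0):
    episodes = sorted(
        d for d in os.listdir(data_root)
        if os.path.isdir(os.path.join(data_root, d))
    )
    random.Random(seed).shuffle(episodes)
    n_val = max(1, int(len(episodes) * val_ratio))
    split = {"train": episodes[n_val:], "val": episodes[:n_val]}
    with open(os.path.join(data_root, "split.json"), "w") as f:
        json.dump(split, f, indent=2)
    return split

if __name__ == "__main__":
    print(split_episodes())
```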
Brief explanation of the example dataset: under data/test_recordings, each folder is an episode. Within each episode, timestamps.json contains the human demonstration actions and the robot's pose history, while each subfolder contains a stream of sensory inputs.
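A quick way to explore an episode is sketched below. The exact schema of timestamps.json is not spelled out here, so the code only lists its top-level structure and the sensory stream subfolders; the episode name is a hypothetical placeholder.

```python
# Sketch for inspecting a single episode folder. The schema of timestamps.json
# is not assumed beyond it being valid JSON, and the episode name below is a
# hypothetical placeholder.
import json
import os

def inspect_episode(episode_dir):
    with open(os.path.join(episode_dir, "timestamps.json")) as f:
        data = json.load(f)
    if isinstance(data, dict):
        print("timestamps.json keys:", list(data.keys()))
    else:
        print("timestamps.json entries:", len(data))
    streams = [
        d for d in sorted(os.listdir(episode_dir))
        if os.path.isdir(os.path.join(episode_dir, d))
    ]
    print("sensory streams:", streams)

if __name__ == "__main__":
    inspect_episode("data/test_recordings/episode_0000")  # hypothetical episode
```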
To train the ResNet encoder + MSA model described in the original paper, run
python train_imitation.py --ablation vg_t_ah
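For intuition, the sketch below shows the general "one encoder per modality, then self-attention fusion" pattern with ResNet-18 encoders and a single transformer layer. It is not the repo's actual model (see the training script and model code for that); the embedding size, action dimension, input shapes, and the choice to feed audio as a spectrogram image are all assumptions made for illustration, and it requires a reasonably recent PyTorch/torchvision.

```python
# Illustrative sketch only: one ResNet-18 encoder per modality, a learned
# modality embedding, a single self-attention (transformer encoder) layer to
# fuse the modality tokens, and a linear action head. Shapes, action dimension,
# and the use of an audio spectrogram image are assumptions; see the repo's
# model code for the real architecture.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class MultimodalFusionPolicy(nn.Module):
    def __init__(self, embed_dim=256, num_actions=10):
        super().__init__()
        # One randomly initialized ResNet-18 encoder per modality
        # (gripper camera, tactile image, audio spectrogram image).
        self.encoders = nn.ModuleDict(
            {name: self._make_encoder(embed_dim) for name in ("vg", "t", "ah")}
        )
        # Learned per-modality embeddings so attention can tell tokens apart.
        self.modality_embed = nn.ParameterDict(
            {name: nn.Parameter(torch.zeros(embed_dim)) for name in ("vg", "t", "ah")}
        )
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(embed_dim, num_actions)

    @staticmethod
    def _make_encoder(embed_dim):
        backbone = resnet18()  # random init; swap in pretrained weights if desired
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        return backbone

    def forward(self, inputs):
        # inputs: dict mapping modality name -> (B, 3, H, W) image batch.
        tokens = [
            self.encoders[name](x) + self.modality_embed[name]
            for name, x in inputs.items()
        ]
        tokens = torch.stack(tokens, dim=1)      # (B, num_modalities, embed_dim)
        fused = self.fusion(tokens).mean(dim=1)  # pool over modality tokens
        return self.head(fused)                  # predicted action logits

if __name__ == "__main__":
    model = MultimodalFusionPolicy()
    batch = {m: torch.randn(2, 3, 224, 224) for m in ("vg", "t", "ah")}
    print(model(batch).shape)  # torch.Size([2, 10])
```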
We also provide a modified implementation of TimeSformer that takes multimodal data as input. To train it, run
python train_transformer.py --ablation vg_t_ah
To run ablation studies, change the --ablation argument; a sketch of how such an argument maps to individual modalities follows the symbol table below. For example, to train a model with only vision + tactile inputs, run
python train_imitation.py --ablation vg_t
Here is what each symbol means:
| Symbol | Description |
|---|---|
| vg | camera input from a gripper-mounted (first-person) camera |
| vf | camera input from a fixed perspective |
| ah | input from a piezoelectric contact microphone attached to the platform (i.e., the peg-insertion base, or the tube for pouring) |
| ag | input from a piezoelectric contact microphone mounted on the gripper |
| t | GelSight tactile sensor input |
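As a rough illustration of how an ablation string such as vg_t_ah can map onto these symbols, here is a minimal parsing sketch; the repo's own argument handling may differ in detail.

```python
# Rough sketch of mapping an --ablation string to enabled modalities, assuming
# underscore-separated symbols from the table above; the repo's own parsing
# may differ.
VALID_MODALITIES = {"vg", "vf", "ah", "ag", "t"}

def parse_ablation(ablation):
    modalities = set(ablation.split("_"))
    unknown = modalities - VALID_MODALITIES
    if unknown:
        raise ValueError("Unknown modality symbols: {}".format(sorted(unknown)))
    return modalities

if __name__ == "__main__":
    print(parse_ablation("vg_t_ah"))  # gripper camera + tactile + platform microphone
    print(parse_ablation("vg_t"))     # vision + tactile only
```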
To view your model's results, run
tensorboard --logdir exp{data}{task}
If you find our work relevant, please cite us using the following BibTeX:
@inproceedings{li2022seehearfeel,
title={See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation},
author={Hao Li and Yizhi Zhang and Junzhe Zhu and Shaoxiong Wang and Michelle A. Lee and Huazhe Xu and Edward Adelson and Li Fei-Fei and Ruohan Gao and Jiajun Wu},
booktitle={CoRL},
year={2022}
}
TODOs:
- Add some demo videos
- Test the setup commands locally
- Provide a pretrained vg_t_ah model