EmbodiedSAM: Online Segment Any 3D Thing in Real Time

Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu

In this work, we present ESAM, an efficient framework that leverages vision foundation models for online, real-time, fine-grained, generalized and open-vocabulary 3D instance segmentation.

News

  • [2024/8/27]: Fix some bugs.
  • [2024/8/22]: Code and demo released.

Demo

Bedroom:

demo

Office:

demo

The demos are fairly large, so they may take a moment to load. See the project home page for more complete demos and a detailed introduction.

Method

Method Pipeline: overview

Getting Started

For environment setup and dataset preparation, please follow:

For training and evaluation, please follow:

Main Results

We provide the checkpoints for quick reproduction of the results reported in the paper.

Class-agnostic 3D instance segmentation results on ScanNet200 dataset:

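| Method | Type | VFM | AP | AP@50 | AP@25 | Speed (ms) | Download |
|---|---|---|---|---|---|---|---|
| SAMPro3D | Offline | SAM | 18.0 | 32.8 | 56.1 | -- | -- |
| SAI3D | Offline | SemanticSAM | 30.8 | 50.5 | 70.6 | -- | -- |
| SAM3D | Online | SAM | 20.6 | 35.7 | 55.5 | 1369+1518 | -- |
| ESAM | Online | SAM | 42.2 | 63.7 | 79.6 | 1369+80 | model |
| ESAM-E | Online | FastSAM | 43.4 | 65.4 | 80.9 | 20+80 | model |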
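
To make the Speed (ms) column easier to compare, the short sketch below converts each entry into a total per-frame latency and a frame rate. It assumes that the two terms are the per-frame time of the VFM and the time of the rest of the pipeline, and that they simply add up; this reading is an assumption and is not stated in the table itself. The helper name `frames_per_second` is only illustrative.

```python
def frames_per_second(vfm_ms: float, rest_ms: float) -> float:
    """Convert a per-frame latency split (in milliseconds) into frames per second.

    Assumption: the two terms run sequentially, so the total latency is their sum.
    """
    total_ms = vfm_ms + rest_ms
    return 1000.0 / total_ms


# Speed (ms) entries copied from the ScanNet200 table above.
speeds_ms = {
    "SAM3D": (1369, 1518),
    "ESAM": (1369, 80),
    "ESAM-E": (20, 80),
}

for method, (vfm_ms, rest_ms) in speeds_ms.items():
    fps = frames_per_second(vfm_ms, rest_ms)
    print(f"{method}: {vfm_ms + rest_ms} ms per frame, {fps:.2f} FPS")
```

Under this reading, ESAM-E takes 100 ms per frame, i.e. 10 FPS, which is consistent with the FPS reported for ESAM-E in the ScanNet table further down.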

Dataset transfer results from ScanNet200 to SceneNN and 3RScan:

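| Method | Type | ScanNet200→SceneNN AP | ScanNet200→SceneNN AP@50 | ScanNet200→SceneNN AP@25 | ScanNet200→3RScan AP | ScanNet200→3RScan AP@50 | ScanNet200→3RScan AP@25 |
|---|---|---|---|---|---|---|---|
| SAMPro3D | Offline | 12.6 | 25.8 | 53.2 | 3.9 | 8.0 | 21.0 |
| SAI3D | Offline | 18.6 | 34.7 | 65.7 | 5.4 | 11.8 | 27.4 |
| SAM3D | Online | 15.1 | 30.0 | 51.8 | 6.2 | 13.0 | 33.9 |
| ESAM | Online | 28.8 | 52.2 | 69.3 | 14.1 | 31.2 | 59.6 |
| ESAM-E | Online | 28.6 | 50.4 | 71.0 | 13.9 | 29.4 | 58.8 |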

3D instance segmentation results on ScanNet dataset:

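| Method | Type | ScanNet AP | ScanNet AP@50 | ScanNet AP@25 | SceneNN AP | SceneNN AP@50 | SceneNN AP@25 | FPS | Download |
|---|---|---|---|---|---|---|---|---|---|
| TD3D | Offline | 46.2 | 71.1 | 81.3 | -- | -- | -- | -- | -- |
| Oneformer3D | Offline | 59.3 | 78.8 | 86.7 | -- | -- | -- | -- | -- |
| INS-Conv | Online | -- | 57.4 | -- | -- | -- | -- | -- | -- |
| TD3D-MA | Online | 39.0 | 60.5 | 71.3 | 26.0 | 42.8 | 59.2 | 3.5 | -- |
| ESAM-E | Online | 41.6 | 60.1 | 75.6 | 27.5 | 48.7 | 64.6 | 10 | model |
| ESAM-E+FF | Online | 42.6 | 61.9 | 77.1 | 33.3 | 53.6 | 62.5 | 9.8 | model |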

Open-Vocabulary 3D instance segmentation results on ScanNet200 dataset:

| Method | AP | AP@50 | AP@25 |
|---|---|---|---|
| SAI3D | 9.6 | 14.7 | 19.0 |
| ESAM | 13.7 | 19.2 | 23.9 |
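
For readers unfamiliar with the AP, AP@50 and AP@25 columns in the tables above, here is a minimal sketch of how such numbers can be computed for a single scene. It assumes the common ScanNet-style convention: AP@50 and AP@25 use mask IoU thresholds of 0.5 and 0.25, and AP averages over thresholds from 0.5 to 0.95 in steps of 0.05. This convention is an assumption here, not something stated in this README. The code is a simplified illustration (greedy matching, non-interpolated precision) and not the evaluation script used in this repository; `iou` and `average_precision` are hypothetical helper names, and instances are represented as sets of point indices.

```python
def iou(a: set, b: set) -> float:
    """Intersection over union of two instance masks given as sets of point indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def average_precision(preds, gts, iou_threshold):
    """Simplified class-agnostic AP for one scene.

    preds: list of (confidence, set_of_point_indices)
    gts:   list of set_of_point_indices
    """
    matched = set()
    hits = []
    # Visit predictions from most to least confident and greedily match each
    # one to the best still-unmatched ground-truth instance.
    for _, mask in sorted(preds, key=lambda p: p[0], reverse=True):
        best_iou, best_gt = 0.0, None
        for gt_idx, gt in enumerate(gts):
            if gt_idx in matched:
                continue
            overlap = iou(mask, gt)
            if overlap > best_iou:
                best_iou, best_gt = overlap, gt_idx
        if best_gt is not None and best_iou >= iou_threshold:
            matched.add(best_gt)
            hits.append(True)
        else:
            hits.append(False)
    # Non-interpolated AP: sum precision at each true positive, divide by #GT.
    ap, true_positives = 0.0, 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_positives += 1
            ap += true_positives / rank
    return ap / len(gts) if gts else 0.0


# Toy scene: two ground-truth instances, three predictions.
gts = [{0, 1, 2, 3}, {4, 5, 6, 7}]
preds = [(0.9, {0, 1, 2, 3}), (0.8, {4, 5, 9}), (0.3, {8})]

ap_25 = average_precision(preds, gts, 0.25)
ap_50 = average_precision(preds, gts, 0.50)
thresholds = [0.5 + 0.05 * i for i in range(10)]
ap = sum(average_precision(preds, gts, t) for t in thresholds) / len(thresholds)
print(ap_25, ap_50, ap)
```

In the toy scene, the second prediction overlaps its ground-truth instance with IoU 0.4, so it counts as a match at the 0.25 threshold but not at 0.5. This is why AP@25 is typically the highest of the three columns and AP the lowest.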

TODO List

  • Release code and checkpoints.
  • Release the demo code to directly run ESAM on streaming RGB-D video.

Contributors

Both students below contributed equally and the order is determined by random draw.

Both are advised by Jiwen Lu.

Acknowledgement

We are grateful for the flexible codebases of Oneformer3D and Online3D, as well as the valuable datasets provided by ScanNet, SceneNN and 3RScan.

Citation

@article{xu2024esam, 
      title={EmbodiedSAM: Online Segment Any 3D Thing in Real Time}, 
      author={Xiuwei Xu and Huangxing Chen and Linqing Zhao and Ziwei Wang and Jie Zhou and Jiwen Lu},
      journal={arXiv preprint arXiv:2408.11811},
      year={2024}
}