Paper | Project Page | Video
EmbodiedSAM: Online Segment Any 3D Thing in Real Time
Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu
In this work, we present ESAM, an efficient framework that leverages vision foundation models for online, real-time, fine-grained, generalized, and open-vocabulary 3D instance segmentation.
- [2024/8/27]: Fix some bugs.
- [2024/8/22]: Code and demo released.
The demos are fairly large, so they may take a moment to load. See the project page for more complete demos and detailed introductions.
For environment setup and dataset preparation, please follow:
For training and evaluation, please follow:
We provide the checkpoints for quick reproduction of the results reported in the paper.
Class-agnostic 3D instance segmentation results on ScanNet200 dataset:
Method | Type | VFM | AP | AP@50 | AP@25 | Speed(ms) | Downloads |
---|---|---|---|---|---|---|---|
SAMPro3D | Offline | SAM | 18.0 | 32.8 | 56.1 | -- | -- |
SAI3D | Offline | SemanticSAM | 30.8 | 50.5 | 70.6 | -- | -- |
SAM3D | Online | SAM | 20.6 | 35.7 | 55.5 | 1369+1518 | -- |
ESAM | Online | SAM | 42.2 | 63.7 | 79.6 | 1369+80 | model |
ESAM-E | Online | FastSAM | 43.4 | 65.4 | 80.9 | 20+80 | model |
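The Speed(ms) cells above are written as two terms joined by `+`. A minimal sketch for turning such a cell into a total per-frame latency and a frame rate is given below. It assumes the first term is the VFM inference time and the second is the rest of the pipeline, which the table does not state explicitly; the helper names `parse_latency_ms` and `fps_from_latency` are hypothetical and not part of the ESAM codebase.

```python
def parse_latency_ms(cell: str) -> float:
    """Sum the '+'-separated millisecond terms of a Speed(ms) cell."""
    return sum(float(part) for part in cell.split("+"))


def fps_from_latency(cell: str) -> float:
    """Convert a Speed(ms) cell into frames per second."""
    return 1000.0 / parse_latency_ms(cell)


# Speed(ms) values copied from the ScanNet200 table above.
# Assumption: each cell is "VFM time + remaining pipeline time".
speeds = {"SAM3D": "1369+1518", "ESAM": "1369+80", "ESAM-E": "20+80"}

for method, cell in speeds.items():
    total_ms = parse_latency_ms(cell)
    print(f"{method}: {total_ms:.0f} ms per frame, {1000.0 / total_ms:.2f} FPS")
```

Under this reading, the 20+80 ms reported for ESAM-E totals 100 ms per frame, that is 10 FPS, while ESAM with SAM is dominated by the 1369 ms first term.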
Dataset transfer results from ScanNet200 to SceneNN and 3RScan:
Method | Type | SceneNN AP | SceneNN AP@50 | SceneNN AP@25 | 3RScan AP | 3RScan AP@50 | 3RScan AP@25 |
---|---|---|---|---|---|---|---|
SAMPro3D | Offline | 12.6 | 25.8 | 53.2 | 3.9 | 8.0 | 21.0 |
SAI3D | Offline | 18.6 | 34.7 | 65.7 | 5.4 | 11.8 | 27.4 |
SAM3D | Online | 15.1 | 30.0 | 51.8 | 6.2 | 13.0 | 33.9 |
ESAM | Online | 28.8 | 52.2 | 69.3 | 14.1 | 31.2 | 59.6 |
ESAM-E | Online | 28.6 | 50.4 | 71.0 | 13.9 | 29.4 | 58.8 |
3D instance segmentation results on ScanNet dataset:
Method | Type | ScanNet AP | ScanNet AP@50 | ScanNet AP@25 | SceneNN AP | SceneNN AP@50 | SceneNN AP@25 | FPS | Download |
---|---|---|---|---|---|---|---|---|---|
TD3D | Offline | 46.2 | 71.1 | 81.3 | -- | -- | -- | -- | -- |
Oneformer3D | Offline | 59.3 | 78.8 | 86.7 | -- | -- | -- | -- | -- |
INS-Conv | Online | -- | 57.4 | -- | -- | -- | -- | -- | -- |
TD3D-MA | Online | 39.0 | 60.5 | 71.3 | 26.0 | 42.8 | 59.2 | 3.5 | -- |
ESAM-E | Online | 41.6 | 60.1 | 75.6 | 27.5 | 48.7 | 64.6 | 10 | model |
ESAM-E+FF | Online | 42.6 | 61.9 | 77.1 | 33.3 | 53.6 | 62.5 | 9.8 | model |
Open-Vocabulary 3D instance segmentation results on ScanNet200 dataset:
Method | AP | AP@50 | AP@25 |
---|---|---|---|
SAI3D | 9.6 | 14.7 | 19.0 |
ESAM | 13.7 | 19.2 | 23.9 |
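For readers unfamiliar with the AP@50 and AP@25 columns in the tables above, the sketch below illustrates the matching criterion they are based on, assuming the usual convention that a predicted instance counts as correct when its IoU with a ground-truth instance reaches 0.5 or 0.25 respectively. It is not the evaluation code used in this repository; the functions `mask_iou` and `is_match` and the toy point indices are hypothetical and only illustrate the threshold test.

```python
def mask_iou(pred: set, gt: set) -> float:
    """IoU of two instance masks given as sets of point indices."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0


def is_match(pred: set, gt: set, threshold: float) -> bool:
    """True when the predicted mask overlaps the ground truth enough."""
    return mask_iou(pred, gt) >= threshold


# Toy example (hypothetical data): 2 shared points out of 8 in the union.
pred_points = {0, 1, 2, 3}
gt_points = {2, 3, 4, 5, 6, 7}

print(mask_iou(pred_points, gt_points))        # 0.25
print(is_match(pred_points, gt_points, 0.25))  # True: counts at the 0.25 threshold
print(is_match(pred_points, gt_points, 0.50))  # False: rejected at the 0.50 threshold
```

A full AP computation additionally ranks predictions by confidence and integrates precision over recall; that part is omitted here.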
- Release code and checkpoints.
- Release the demo code to directly run ESAM on streaming RGB-D video.
Both students below contributed equally; the order was determined by random draw.
- Xiuwei Xu
- Huangxing Chen
Both advised by Jiwen Lu.
We are grateful for the flexible codebases of Oneformer3D and Online3D, as well as the valuable datasets provided by ScanNet, SceneNN and 3RScan.
@article{xu2024esam,
title={EmbodiedSAM: Online Segment Any 3D Thing in Real Time},
author={Xiuwei Xu and Huangxing Chen and Linqing Zhao and Ziwei Wang and Jie Zhou and Jiwen Lu},
journal={arXiv preprint arXiv:2408.11811},
year={2024}
}