Paper | Project Page | Video
EmbodiedSAM: Online Segment Any 3D Thing in Real Time
Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu
In this work, we present ESAM, an efficient framework that leverages vision foundation models for online, real-time, fine-grained, generalized and open-vocabulary 3D instance segmentation.
- [2024/8/22]: Code and demo released.
The demos are fairly large, so please allow a moment for them to load. See the project page for more complete demos and a detailed introduction.
For environment setup and dataset preparation, please follow:
For training and evaluation, please follow:
We provide checkpoints for quick reproduction of the results reported in the paper. We have since made some modifications that further improve performance, so the results below are higher than those in the paper. See the changelog for details.
Class-agnostic 3D instance segmentation results on the ScanNet200 dataset:
Method | Type | VFM | AP | AP@50 | AP@25 | Speed(ms) | Downloads |
---|---|---|---|---|---|---|---|
SAMPro3D | Offline | SAM | 18.0 | 32.8 | 56.1 | -- | -- |
SAI3D | Offline | SemanticSAM | 30.8 | 50.5 | 70.6 | -- | -- |
SAM3D | Online | SAM | 20.6 | 35.7 | 55.5 | 1369+1518 | -- |
ESAM | Online | SAM | 42.2 | 63.7 | 79.6 | 1369+80 | model |
ESAM-E | Online | FastSAM | 43.4 | 65.4 | 80.9 | 20+80 | model |
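The Speed(ms) entries are written as two summed terms. The sketch below assumes the first term is the per-frame time of the vision foundation model and the second is the rest of the pipeline; this reading is an interpretation, not something stated in this README. The helper names `total_latency_ms` and `frames_per_second` are hypothetical and not part of the ESAM codebase. Under that assumption, the total per-frame latency and the implied frame rate can be computed as:

```python
def total_latency_ms(speed: str) -> float:
    """Sum the '+'-separated terms of a Speed(ms) entry such as '20+80'."""
    return sum(float(part) for part in speed.split("+"))


def frames_per_second(speed: str) -> float:
    """Convert a Speed(ms) entry into frames per second."""
    return 1000.0 / total_latency_ms(speed)


# Speed(ms) entries copied from the table above.
speeds = {"SAM3D": "1369+1518", "ESAM": "1369+80", "ESAM-E": "20+80"}

for method, speed in speeds.items():
    latency = total_latency_ms(speed)
    fps = frames_per_second(speed)
    print(f"{method}: {latency:.0f} ms per frame, {fps:.2f} FPS")
```

For ESAM-E this gives 100 ms per frame, i.e. 10 FPS, which agrees with the FPS reported for ESAM-E in the ScanNet table further down.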
Dataset transfer results from ScanNet200 to SceneNN and 3RScan:
Method | Type | SceneNN AP | SceneNN AP@50 | SceneNN AP@25 | 3RScan AP | 3RScan AP@50 | 3RScan AP@25 |
---|---|---|---|---|---|---|---|
SAMPro3D | Offline | 12.6 | 25.8 | 53.2 | 3.9 | 8.0 | 21.0 |
SAI3D | Offline | 18.6 | 34.7 | 65.7 | 5.4 | 11.8 | 27.4 |
SAM3D | Online | 15.1 | 30.0 | 51.8 | 6.2 | 13.0 | 33.9 |
ESAM | Online | 28.8 | 52.2 | 69.3 | 14.1 | 31.2 | 59.6 |
ESAM-E | Online | 28.6 | 50.4 | 71.0 | 13.9 | 29.4 | 58.8 |
3D instance segmentation results on the ScanNet dataset:
Method | Type | ScanNet AP | ScanNet AP@50 | ScanNet AP@25 | SceneNN AP | SceneNN AP@50 | SceneNN AP@25 | FPS | Download |
---|---|---|---|---|---|---|---|---|---|
TD3D | offline | 46.2 | 71.1 | 81.3 | -- | -- | -- | -- | -- |
Oneformer3D | offline | 59.3 | 78.8 | 86.7 | -- | -- | -- | -- | -- |
INS-Conv | online | -- | 57.4 | -- | -- | -- | -- | -- | -- |
TD3D-MA | online | 39.0 | 60.5 | 71.3 | 26.0 | 42.8 | 59.2 | 3.5 | -- |
ESAM-E | online | 41.6 | 60.1 | 75.6 | 27.5 | 48.7 | 64.6 | 10 | model |
ESAM-E+FF | online | 42.6 | 61.9 | 77.1 | 33.3 | 53.6 | 62.5 | 9.8 | model |
Open-vocabulary 3D instance segmentation results on the ScanNet200 dataset:
Method | AP | AP@50 | AP@25 |
---|---|---|---|
SAI3D | 9.6 | 14.7 | 19.0 |
ESAM | 13.7 | 19.2 | 23.9 |
We are grateful for the flexible codebases of Oneformer3D and Online3D, as well as the valuable datasets provided by ScanNet, SceneNN and 3RScan.
@article{xu2024esam,
title={EmbodiedSAM: Online Segment Any 3D Thing in Real Time},
author={Xiuwei Xu and Huangxing Chen and Linqing Zhao and Ziwei Wang and Jie Zhou and Jiwen Lu},
journal={arXiv preprint arXiv:2408.11811},
year={2024}
}