The first survey on SAM for videos: Segment Anything for Videos: A Systematic Survey. Chunhui Zhang, Yawen Cui, Weilin Lin, Guanjie Huang, Yan Rong, Li Liu, Shiguang Shan. [Paper][ResearchGate][Project]
Abstract: The recent wave of foundation models has witnessed tremendous success in computer vision (CV) and beyond, with the Segment Anything Model (SAM) having sparked a passion for exploring task-agnostic visual foundation models. Empowered by its remarkable zero-shot generalization, SAM is currently challenging numerous traditional paradigms in CV, delivering extraordinary performance not only in various image segmentation and multi-modal segmentation (e.g., text-to-mask) tasks, but also in the video domain. Additionally, the recently released SAM 2 is once again sparking research enthusiasm in the realm of promptable visual segmentation for both images and videos. However, existing surveys mainly focus on SAM in various image processing tasks; a comprehensive and in-depth review of the video domain is notably absent. To address this gap, this work conducts a systematic review of SAM for videos in the era of foundation models. As the first survey to review the progress of SAM for videos, this work focuses on its applications to various tasks, discussing its recent advances and the innovation opportunities for developing foundation models across broad applications. We begin with a brief introduction to the background of SAM and video-related research domains. Subsequently, we present a systematic taxonomy that categorizes existing methods into three key areas: video understanding, video generation, and video editing, analyzing and summarizing their advantages and limitations. Furthermore, comparative results of SAM-based methods against the current state of the art on representative benchmarks are offered, together with insightful analysis. Finally, we discuss the challenges faced by current research and envision several future research directions in the field of SAM for video and beyond.
This project will be continuously updated; we will keep adding state-of-the-art methods on SAM for videos.
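For readers unfamiliar with the promptable interface the abstract refers to, the sketch below shows single-point image prompting with the official `segment-anything` package. This is a minimal illustration, not part of the survey; the checkpoint path, input image, and click coordinates are placeholders.

```python
# Minimal point-prompted segmentation with the official SAM package
# (pip install segment-anything). Checkpoint path, image file, and the
# click location below are illustrative placeholders.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a pretrained ViT-H SAM checkpoint (downloaded separately).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB uint8 image of shape (H, W, 3).
image = cv2.cvtColor(cv2.imread("frame_0000.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground click (label 1) is enough to prompt a mask;
# multimask_output=True returns three candidate masks with quality scores.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # boolean (H, W) mask
```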
The first comprehensive SAM survey, A Comprehensive Survey on Segment Anything Model for Vision and Beyond, is available [here].
- 2024.07.31: The first survey on SAM for videos went online.
- 2024.07.30: SAM 2 was released.
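SAM 2 extends SAM's prompting interface from single images to video: a prompt placed on one frame is propagated into a spatio-temporal "masklet" across the whole clip. Below is a minimal sketch following the example published with the initial SAM 2 release (later versions renamed some calls, e.g. `add_new_points`); the config name, checkpoint path, frame directory, and click are placeholders.

```python
# Promptable video segmentation with SAM 2, following the example in the
# official release (github.com/facebookresearch/segment-anything-2).
# Config, checkpoint, frame directory, and the click below are placeholders.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # init_state pre-computes features for a directory of JPEG frames.
    state = predictor.init_state("./video_frames")

    # Prompt object 1 with a single foreground click on frame 0.
    _, object_ids, masks = predictor.add_new_points(
        state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the prompt through the video to get per-frame masklets.
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        pass  # e.g., threshold masks[i] > 0 and overlay on frame frame_idx
```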
If you find our work useful in your research, please consider citing:
@article{chunhui2024samforvideos,
  title={Segment Anything for Videos: A Systematic Survey},
  author={Chunhui Zhang and Yawen Cui and Weilin Lin and Guanjie Huang and Yan Rong and Li Liu and Shiguang Shan},
  journal={arXiv},
  year={2024}
}
Title | Github | Pub. & Date
---|---|---
SAM 2: Segment Anything in Images and Videos | github | arXiv-2024
Segment Anything in High Quality | github | NeurIPS-2023
High-Quality Entity Segmentation | github | ICCV-2023
Tracking Anything with Decoupled Video Segmentation | github | ICCV-2023
DSEC-MOS: Segment Any Moving Object with Moving Ego Vehicle | github | arXiv-2023
Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching | github | arXiv-2023
Personalize Segment Anything Model with One Shot | github | arXiv-2023
UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model | - | arXiv-2023
3rd Place Solution for PVUW2023 VSS Track: A Large Model for Semantic Segmentation on VSPW | - | arXiv-2023
Title | Github | Pub. & Date
---|---|---
Detect Any Shadow: Segment Anything for Video Shadow Detection | github | arXiv-2023
Title | Github | Pub. & Date
---|---|---
Detect Any Deepfakes: Segment Anything Meets Face Forgery Detection and Localization | github | arXiv-2023
Title | Github | Pub. & Date
---|---|---
RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation | github | arXiv-2023
Title | Github | Pub. & Date
---|---|---
Learning from SAM: Harnessing a Segmentation Foundation Model for Sim2Real Domain Adaptation through Regularization | - | arXiv-2023
SAM-DA: UAV Tracks Anything at Night with SAM-Powered Domain Adaptation | github | arXiv-2023
Title | Github | Pub. & Date
---|---|---
Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models | - | arXiv-2023
Title | Github | Pub. & Date
---|---|---
Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model | - | arXiv-2023
DisCo: Disentangled Control for Realistic Human Dance Generation | github | arXiv-2023
Title | Github | Pub. & Date
---|---|---
Can SAM Boost Video Super-Resolution? | - | arXiv-2023
Title | Github | Pub. & Date
---|---|---
SAM3D: Segment Anything in 3D Scenes | github | arXiv-2023
A One Stop 3D Target Reconstruction and Multilevel Segmentation Method | github | arXiv-2023
Title | Github | Pub. & Date
---|---|---
Scalable Mask Annotation for Video Text Spotting | github | arXiv-2023
Audio-Visual Instance Segmentation | - | arXiv-2023
Learning the What and How of Annotation in Video Object Segmentation | github | WACV-2023
Propagating Semantic Labels in Video Data | github | arXiv-2023
Stable Yaw Estimation of Boats from the Viewpoint of UAVs and USVs | - | arXiv-2023
Title | Github | Pub. & Date
---|---|---
Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts | github | arXiv-2023
Title | Github | Pub. & Date
---|---|---
CVPR 2023 Text Guided Video Editing Competition | github | arXiv-2023
Title | Github | Pub. & Date
---|---|---
OR-NeRF: Object Removing from 3D Scenes Guided by Multiview Segmentation with Neural Radiance Fields | - | arXiv-2023
This project is released under the MIT license. Please see the LICENSE file for more information.