The first survey on SAM for videos: Segment Anything for Videos: A Systematic Survey. Chunhui Zhang, Yawen Cui, Weilin Lin, Guanjie Huang, Yan Rong, Li Liu, Shiguang Shan. [Paper][ResearchGate][Project]
Abstract: The recent wave of foundation models has witnessed tremendous success in computer vision (CV) and beyond, with the Segment Anything Model (SAM) having sparked a passion for exploring task-agnostic visual foundation models. Empowered by its remarkable zero-shot generalization, SAM is currently challenging numerous traditional paradigms in CV, delivering extraordinary performance not only in various image segmentation and multi-modal segmentation (e.g., text-to-mask) tasks, but also in the video domain. Additionally, the recently released SAM 2 is once again sparking research enthusiasm in the realm of promptable visual segmentation for both images and videos. However, existing surveys mainly focus on SAM in various image processing tasks; a comprehensive and in-depth review of the video domain is notably absent. To address this gap, this work conducts a systematic review of SAM for videos in the era of foundation models. As the first survey to review the progress of SAM for videos, this work focuses on its applications to various tasks, discussing its recent advances and the innovation opportunities for developing foundation models across broad applications. We begin with a brief introduction to the background of SAM and video-related research domains. Subsequently, we present a systematic taxonomy that categorizes existing methods into three key areas: video understanding, video generation, and video editing, analyzing and summarizing their advantages and limitations. Furthermore, comparative results of SAM-based methods against the current state of the art on representative benchmarks are offered, together with insightful analysis. Finally, we discuss the challenges faced by current research and envision several future research directions in the field of SAM for video and beyond.
This project will be continuously updated; we will keep adding state-of-the-art methods on SAM for videos.
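For readers unfamiliar with the promptable interface the abstract refers to, the sketch below shows single-point image prompting with the official `segment-anything` package. This is a minimal illustration, not part of the survey; the checkpoint path, input image, and click coordinates are placeholders.

```python
# Minimal point-prompted segmentation with the official SAM package
# (pip install segment-anything). Checkpoint path, image file, and the
# click location below are illustrative placeholders.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a pretrained ViT-H SAM checkpoint (downloaded separately).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB uint8 image of shape (H, W, 3).
image = cv2.cvtColor(cv2.imread("frame_0000.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground click (label 1) is enough to prompt a mask;
# multimask_output=True returns three candidate masks with quality scores.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # boolean (H, W) mask
```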
The first comprehensive SAM survey, A Comprehensive Survey on Segment Anything Model for Vision and Beyond, is available [here].
- 2024.07.31: The first survey on SAM for videos went online.
- 2024.07.30: SAM 2 was released.
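SAM 2 extends SAM's prompting interface from single images to video: a prompt placed on one frame is propagated into a spatio-temporal "masklet" across the whole clip. Below is a minimal sketch following the example published with the initial SAM 2 release (later versions renamed some calls, e.g. `add_new_points`); the config name, checkpoint path, frame directory, and click are placeholders.

```python
# Promptable video segmentation with SAM 2, following the example in the
# official release (github.com/facebookresearch/segment-anything-2).
# Config, checkpoint, frame directory, and the click below are placeholders.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # init_state pre-computes features for a directory of JPEG frames.
    state = predictor.init_state("./video_frames")

    # Prompt object 1 with a single foreground click on frame 0.
    _, object_ids, masks = predictor.add_new_points(
        state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the prompt through the video to get per-frame masklets.
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        pass  # e.g., threshold masks[i] > 0 and overlay on frame frame_idx
```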
If you find our work useful in your research, please consider citing:
@article{chunhui2024samforvideos,
  title={Segment Anything for Videos: A Systematic Survey},
  author={Chunhui Zhang and Yawen Cui and Weilin Lin and Guanjie Huang and Yan Rong and Li Liu and Shiguang Shan},
  journal={arXiv},
  year={2024}
}
Title | Github | Pub. & Date
---|---|---
SAM 2: Segment Anything in Images and Videos | github | arXiv-2024
Segment Anything in High Quality | github | NeurIPS-2023
High-Quality Entity Segmentation | github | ICCV-2023
Tracking Anything with Decoupled Video Segmentation | github | ICCV-2023
DSEC-MOS: Segment Any Moving Object with Moving Ego Vehicle | github | arXiv-2023
Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching | github | arXiv-2023
Personalize Segment Anything Model with One Shot | github | arXiv-2023
UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model | - | arXiv-2023
3rd Place Solution for PVUW2023 VSS Track: A Large Model for Semantic Segmentation on VSPW | - | arXiv-2023
Title | Github | Pub. & Date
---|---|---
Detect Any Shadow: Segment Anything for Video Shadow Detection | github | arXiv-2023
Title | Github | Pub. & Date
---|---|---
Detect Any Deepfakes: Segment Anything Meets Face Forgery Detection and Localization | github | arXiv-2023
Title | Github | Pub. & Date
---|---|---
RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation | github | arXiv-2023
Title | Github | Pub. & Date
---|---|---
Learning from SAM: Harnessing a Segmentation Foundation Model for Sim2Real Domain Adaptation through Regularization | - | arXiv-2023
SAM-DA: UAV Tracks Anything at Night with SAM-Powered Domain Adaptation | github | arXiv-2023
Title | Github | Pub. & Date
---|---|---
Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models | - | arXiv-2023
Title | Github | Pub. & Date
---|---|---
Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model | - | arXiv-2023
DisCo: Disentangled Control for Realistic Human Dance Generation | github | arXiv-2023
Title | Github | Pub. & Date
---|---|---
Can SAM Boost Video Super-Resolution? | - | arXiv-2023
Title | Github | Pub. & Date
---|---|---
SAM3D: Segment Anything in 3D Scenes | github | arXiv-2023
A One Stop 3D Target Reconstruction and Multilevel Segmentation Method | github | arXiv-2023
Title | Github | Pub. & Date
---|---|---
Scalable Mask Annotation for Video Text Spotting | github | arXiv-2023
Audio-Visual Instance Segmentation | - | arXiv-2023
Learning the What and How of Annotation in Video Object Segmentation | github | WACV-2023
Propagating Semantic Labels in Video Data | github | arXiv-2023
Stable Yaw Estimation of Boats from the Viewpoint of UAVs and USVs | - | arXiv-2023
Title | Github | Pub. & Date
---|---|---
Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts | github | arXiv-2023
Title | Github | Pub. & Date
---|---|---
CVPR 2023 Text Guided Video Editing Competition | github | arXiv-2023
Title | Github | Pub. & Date
---|---|---
OR-NeRF: Object Removing from 3D Scenes Guided by Multiview Segmentation with Neural Radiance Fields | - | arXiv-2023
This project is released under the MIT license. Please see the LICENSE file for more information.