SAM-for-Videos

This repository is for the first survey on SAM for videos.

Segment Anything for Videos: A Systematic Survey

The first survey on SAM for videos: Segment Anything for Videos: A Systematic Survey. Chunhui Zhang, Yawen Cui, Weilin Lin, Guanjie Huang, Yan Rong, Li Liu, Shiguang Shan. [Paper][ResearchGate][Project]

Abstract: The recent wave of foundation models has witnessed tremendous success in computer vision (CV) and beyond, with the segment anything model (SAM) having sparked a passion for exploring task-agnostic visual foundation models. Empowered by its remarkable zero-shot generalization, SAM is currently challenging numerous traditional paradigms in CV, delivering extraordinary performance not only in various image segmentation and multi-modal segmentation (e.g., text-to-mask) tasks, but also in the video domain. Additionally, the recently released SAM 2 is once again sparking research enthusiasm in the realm of promptable visual segmentation for both images and videos. However, existing surveys mainly focus on SAM in various image processing tasks; a comprehensive and in-depth review in the video domain is notably absent. To address this gap, this work conducts a systematic review of SAM for videos in the era of foundation models. As the first to review the progress of SAM for videos, this work focuses on its applications to various tasks by discussing its recent advances and innovation opportunities for developing foundation models on broad applications. We begin with a brief introduction to the background of SAM and video-related research domains. Subsequently, we present a systematic taxonomy that categorizes existing methods into three key areas: video understanding, video generation, and video editing, analyzing and summarizing their advantages and limitations. Furthermore, comparative results of SAM-based and current state-of-the-art methods on representative benchmarks, as well as insightful analysis, are offered. Finally, we discuss the challenges faced by current research and envision several future research directions in the field of SAM for videos and beyond.
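SAM 2 extends this promptable workflow natively to video: a prompt (e.g., a click or box) on one frame is propagated through the clip by a streaming memory mechanism. Below is a minimal sketch of that workflow, following the usage pattern documented in the official facebookresearch/sam2 repository; the config name, checkpoint path, frame directory, and click coordinates are placeholders, and exact function names may vary across SAM 2 releases.

```python
# Minimal sketch of SAM 2's promptable video segmentation, based on the usage
# pattern shown in the official facebookresearch/sam2 repository. The config,
# checkpoint, frame directory, and click coordinates below are placeholders.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

model_cfg = "sam2_hiera_l.yaml"                 # placeholder config name
checkpoint = "checkpoints/sam2_hiera_large.pt"  # placeholder checkpoint path
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode():
    # The video is expected as a directory of JPEG frames (per the SAM 2 docs).
    state = predictor.init_state(video_path="./video_frames")

    # Prompt object 1 with a single positive click (x, y) on frame 0.
    points = np.array([[300, 250]], dtype=np.float32)  # placeholder coordinates
    labels = np.array([1], dtype=np.int32)             # 1 = positive click
    predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1, points=points, labels=labels
    )

    # Propagate the prompt through the whole video to get per-frame masks.
    video_masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        video_masks[frame_idx] = {
            obj_id: (mask_logits[i] > 0.0).cpu().numpy()
            for i, obj_id in enumerate(obj_ids)
        }
```

Many of the pre-SAM-2 works collected below follow a related recipe with the original, image-only SAM: prompt the first frame, then hand the resulting mask to a tracker or propagation module for the remaining frames.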

This project will be continuously updated. We expect to include more state-of-the-art works on SAM for videos.

The first comprehensive SAM survey, A Comprehensive Survey on Segment Anything Model for Vision and Beyond, is available [here].

🔥 Highlights

- 2024.07.31: The first survey on SAM for videos went online.
- 2024.07.30: SAM 2 was released.

Citation

If you find our work useful in your research, please consider citing:

@article{chunhui2024samforvideos,
  title={Segment Anything for Videos: A Systematic Survey},
  author={Chunhui Zhang and Yawen Cui and Weilin Lin and Guanjie Huang and Yan Rong and Li Liu and Shiguang Shan},
  journal={arXiv},
  year={2024}
}

Contents

Video Understanding

Video Object Segmentation

| Title | arXiv | Github | Pub. & Date |
|:------|:-----:|:------:|:-----------:|
| SAM 2: Segment Anything in Images and Videos | arXiv | github | arXiv-2024 |
| Segment Anything in High Quality | arXiv | github | NeurIPS-2023 |
| High-Quality Entity Segmentation | arXiv | github | ICCV-2023 |
| Tracking Anything with Decoupled Video Segmentation | arXiv | github | ICCV-2023 |
| DSEC-MOS: Segment Any Moving Object with Moving Ego Vehicle | arXiv | github | arXiv-2023 |
| Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching | arXiv | github | arXiv-2023 |
| Personalize Segment Anything Model with One Shot | arXiv | github | arXiv-2023 |
| UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model | arXiv | - | arXiv-2023 |
| 3rd Place Solution for PVUW2023 VSS Track: A Large Model for Semantic Segmentation on VSPW | arXiv | - | arXiv-2023 |

Video Object Tracking

| Title | arXiv | Github | Pub. & Date |
|:------|:-----:|:------:|:-----------:|
| Tracking Anything in High Quality | arXiv | github | arXiv-2023 |
| Tracking Anything with Decoupled Video Segmentation | arXiv | github | ICCV-2023 |
| Segment and Track Anything | arXiv | github | arXiv-2023 |
| Segment Anything Meets Point Tracking | arXiv | github | arXiv-2023 |
| Track Anything: Segment Anything Meets Videos | arXiv | github | arXiv-2023 |
| SAM-DA: UAV Tracks Anything at Night with SAM-Powered Domain Adaptation | arXiv | github | arXiv-2023 |
| Unifying Foundation Models with Quadrotor Control for Visual Tracking Beyond Object Categories | arXiv | - | arXiv-2023 |
| UniQuadric: A SLAM Backend for Unknown Rigid Object 3D Tracking and Light-Weight Modeling | arXiv | - | arXiv-2023 |
| Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models | arXiv | github | arXiv-2023 |
| Follow Anything: Open-set detection, tracking, and following in real-time | arXiv | github | arXiv-2023 |
| SAM for Poultry Science | arXiv | - | arXiv-2023 |
| ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: TREK-150 Single Object Tracking | arXiv | - | arXiv-2023 |
| CoDeF: Content Deformation Fields for Temporally Consistent Video Processing | arXiv | github | arXiv-2023 |

Video Shadow Detection

| Title | arXiv | Github | Pub. & Date |
|:------|:-----:|:------:|:-----------:|
| Detect Any Shadow: Segment Anything for Video Shadow Detection | arXiv | github | arXiv-2023 |

Deepfake

| Title | arXiv | Github | Pub. & Date |
|:------|:-----:|:------:|:-----------:|
| Detect Any Deepfakes: Segment Anything Meets Face Forgery Detection and Localization | arXiv | github | arXiv-2023 |

Miscellaneous

Audio-Visual Segmentation

| Title | arXiv | Github | Pub. & Date |
|:------|:-----:|:------:|:-----------:|
| AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation | arXiv | - | arXiv-2023 |
| Leveraging Foundation models for Unsupervised Audio-Visual Segmentation | arXiv | - | arXiv-2023 |
| Prompting Segmentation with Sound is Generalizable Audio-Visual Source Localizer | arXiv | - | arXiv-2023 |

Referring Video Object Segmentation

| Title | arXiv | Github | Pub. & Date |
|:------|:-----:|:------:|:-----------:|
| RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation | arXiv | github | arXiv-2023 |

Domain Specific

Medical Videos

| Title | arXiv | Github | Pub. & Date |
|:------|:-----:|:------:|:-----------:|
| Spatio-Temporal Analysis of Patient-Derived Organoid Videos Using Deep Learning for the Prediction of Drug Efficacy | arXiv | - | ICCV Workshop-2023 |
| SAM Meets Robotic Surgery: An Empirical Study on Generalization, Robustness and Adaptation | arXiv | - | MICCAI MedAGI Workshop-2023 |
| MediViSTA-SAM: Zero-shot Medical Video Analysis with Spatio-temporal SAM Adaptation | arXiv | github | arXiv-2023 |
| SAMSNeRF: Segment Anything Model (SAM) Guides Dynamic Surgical Scene Reconstruction by Neural Radiance Field (NeRF) | arXiv | github | arXiv-2023 |
| SuPerPM: A Large Deformation-Robust Surgical Perception Framework Based on Deep Point Matching Learned from Physical Constrained Simulation Data | arXiv | - | arXiv-2023 |
| SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation | arXiv | github | arXiv-2023 |

Domain Adaptation

| Title | arXiv | Github | Pub. & Date |
|:------|:-----:|:------:|:-----------:|
| Learning from SAM: Harnessing a Segmentation Foundation Model for Sim2Real Domain Adaptation through Regularization | arXiv | - | arXiv-2023 |
| SAM-DA: UAV Tracks Anything at Night with SAM-Powered Domain Adaptation | arXiv | github | arXiv-2023 |

Tool Software

| Title | arXiv | Github | Pub. & Date |
|:------|:-----:|:------:|:-----------:|
| Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models | arXiv | - | arXiv-2023 |

More Directions

| Title | arXiv | Github | Pub. & Date |
|:------|:-----:|:------:|:-----------:|
| Generative AI-driven Semantic Communication Framework for NextG Wireless Network | arXiv | - | arXiv-2023 |
| Learning from Human Videos for Robotic Manipulation | arXiv | github | arXiv-2023 |
| Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation | arXiv | - | arXiv-2023 |
| Pave the Way to Grasp Anything: Transferring Foundation Models for Universal Pick-Place Robots | arXiv | github | arXiv-2023 |
| ROSGPT_Vision: Commanding Robots Using Only Language Models' Prompts | arXiv | github | arXiv-2023 |
| SAMFlow: Eliminating Any Fragmentation in Optical Flow with Segment Anything Model | arXiv | - | arXiv-2023 |
| Virtual Augmented Reality for Atari Reinforcement Learning | arXiv | - | arXiv-2023 |

Video Generation

Video Synthesis

| Title | arXiv | Github | Pub. & Date |
|:------|:-----:|:------:|:-----------:|
| Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model | arXiv | - | arXiv-2023 |
| DisCo: Disentangled Control for Realistic Human Dance Generation | arXiv | github | arXiv-2023 |

Video Super-Resolution

| Title | arXiv | Github | Pub. & Date |
|:------|:-----:|:------:|:-----------:|
| Can SAM Boost Video Super-Resolution? | arXiv | | arXiv-2023 |

3D Reconstruction

| Title | arXiv | Github | Pub. & Date |
|:------|:-----:|:------:|:-----------:|
| SAM3D: Segment Anything in 3D Scenes | arXiv | github | arXiv-2023 |
| A One Stop 3D Target Reconstruction and multilevel Segmentation Method | arXiv | github | arXiv-2023 |

Video Dataset Annotation Generation

| Title | arXiv | Github | Pub. & Date |
|:------|:-----:|:------:|:-----------:|
| Scalable Mask Annotation for Video Text Spotting | arXiv | github | arXiv-2023 |
| Audio-Visual Instance Segmentation | arXiv | - | arXiv-2023 |
| Learning the What and How of Annotation in Video Object Segmentation | arXiv | github | WACV-2023 |
| Propagating Semantic Labels in Video Data | arXiv | github | arXiv-2023 |
| Stable Yaw Estimation of Boats from the Viewpoint of UAVs and USVs | arXiv | - | arXiv-2023 |
| | arXiv | github | arXiv-2023 |

Video Editing

Generic Video Editing

| Title | arXiv | Github | Pub. & Date |
|:------|:-----:|:------:|:-----------:|
| Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts | arXiv | github | arXiv-2023 |

Text Guided Video Editing

| Title | arXiv | Github | Pub. & Date |
|:------|:-----:|:------:|:-----------:|
| CVPR 2023 Text Guided Video Editing Competition | arXiv | github | arXiv-2023 |

Object Removal

| Title | arXiv | Github | Pub. & Date |
|:------|:-----:|:------:|:-----------:|
| OR-NeRF: Object Removing from 3D Scenes Guided by Multiview Segmentation with Neural Radiance Fields | arXiv | - | arXiv-2023 |

License

This project is released under the MIT license. Please see the LICENSE file for more information.