A curated list of awesome video object/instance segmentation resources.
Last updated: 2020/12/19
Video object segmentation is a binary labeling problem that aims to separate specific foreground object(s) from the background region of a video, with each object mask linked across frames.
Video instance segmentation not only needs to segment foreground object(s)/instance(s) but also has to identify the category of each object.
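The result tables in this list report J (region similarity), F (contour accuracy), and J&F. As a minimal sketch, assuming masks are given as plain nested lists of 0/1 values, J can be computed as the intersection-over-union of a predicted and a ground-truth binary mask, and J&F as the mean of J and F. The function names `jaccard`, `mean_jaccard`, and `j_and_f` are illustrative and do not come from any benchmark toolkit; F needs boundary matching and is not implemented here. Note that `jaccard` returns a value in [0, 1], while the tables list percentages.

```python
from typing import Sequence

# One frame: rows of 0 (background) / 1 (foreground). Hypothetical format.
Mask = Sequence[Sequence[int]]


def jaccard(pred: Mask, gt: Mask) -> float:
    """Region similarity J: intersection-over-union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_jaccard(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Average J over the frames of one sequence."""
    scores = [jaccard(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


def j_and_f(j: float, f: float) -> float:
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2.0


pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
print(jaccard(pred, gt))  # 0.5
print(round(j_and_f(79.8, 80.6), 1))  # 80.2
```

The last line reproduces the J&F value of the OSVOS row (J 79.8, F 80.6) in the first result table.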
- DAVIS
[Download]
- YouTube-VOS and -VIS
[Download]
- Cityscapes
[Download]
- CamVid
[Download]
- COCO
[Download]
- A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation | [CVPR' 16] |
[pdf]
- (OSVOS) One-Shot Video Object Segmentation | [CVPR' 17] |
[pdf]
|[code]
- (MaskTrack) Learning Video Object Segmentation from Static Images | [CVPR' 17] |
[pdf]
|[code]
- (OSVOS-S) Video object segmentation without temporal information | [TPAMI' 18] |
[pdf]
- (Lucid) Lucid Data Dreaming for Video Object Segmentation | [IJCV' 19] |
[pdf]
|[code]
- (PTSNet) Proposal, Tracking and Segmentation (PTS): A Cascaded Network for Video Object Segmentation | [arxiv' 19] |
[pdf]
|[code]
- (SiamMask) Fast Online Object Tracking and Segmentation: A Unifying Approach | [CVPR' 19] |
[pdf]
|[code]
- (RANet) RANet: Ranking Attention Network for Fast Video Object Segmentation | [ICCV' 19] |
[pdf]
|[code]
- (FEELVOS) FEELVOS: Fast End-to-End Embedding Learning for Video Object Segmentation | [CVPR' 19] |
[pdf]
|[code]
- (STCNN) Spatiotemporal CNN for Video Object Segmentation | [CVPR' 19] |
[pdf]
|[code]
- (MHP-VOS) MHP-VOS: Multiple Hypotheses Propagation for Video Object Segmentation | [CVPR' 19] |
[pdf]
|[code]
- (A-GAME) A Generative Appearance Model for End-to-end Video Object Segmentation | [CVPR' 19] |
[pdf]
|[code]
- (RVOS) RVOS: End-to-End Recurrent Network for Video Object Segmentation | [CVPR' 19] |
[pdf]
|[code]
- (CapsuleVOS) CapsuleVOS: Semi-Supervised Video Object Segmentation Using Capsule Routing | [ICCV' 19] |
[pdf]
|[code]
- (AGSS-VOS) AGSS-VOS: Attention Guided Single-Shot Video Object Segmentation | [ICCV' 19] |
[pdf]
|[code]
- (STM) Video Object Segmentation using Space-Time Memory Networks | [ICCV' 19] |
[pdf]
|[code]
- (e-OSVOS) Make One-Shot Video Object Segmentation Efficient Again | [NeurIPS' 20] |
[pdf]
|[code]
- (GC) Fast Video Object Segmentation using the Global Context Module | [CVPR' 20] |
[pdf]
- (FRTM-VOS) Learning Fast and Robust Target Models for Video Object Segmentation | [CVPR' 20] |
[pdf]
|[code]
- (PMVOS) PMVOS: Pixel-Level Matching-Based Video Object Segmentation | [arxiv' 20] |
[pdf]
- (CFBI) Collaborative Video Object Segmentation by Foreground-Background Integration | [ECCV' 20] |
[pdf]
|[code]
- (TVOS) A Transductive Approach for Video Object Segmentation | [CVPR' 20] |
[pdf]
|[code]
- (Siam R-CNN) Siam R-CNN: Visual Tracking by Re-Detection | [CVPR' 20] |
[pdf]
|[code]
- (MuG-W) Learning Video Object Segmentation from Unlabeled Videos | [CVPR' 20] |
[pdf]
|[code]
- (LWLVOS) Learning What to Learn for Video Object Segmentation | [CVPR' 20] |
[pdf]
- (FTMU) Fast Template Matching and Update for Video Object Tracking and Segmentation | [CVPR' 20] |
[pdf]
|[code]
- (SAT) State-Aware Tracker for Real-Time Video Object Segmentation | [CVPR' 20] |
[pdf]
|[code]
- (AFB-URR) Video Object Segmentation with Adaptive Feature Bank and Uncertain-Region Refinement | [NeurIPS' 20] |
[pdf]
|[code]
- (STM-cycle+GC) Delving into the Cyclic Mechanism in Semi-supervised Video Object Segmentation | [NeurIPS' 20] |
[pdf]
|[code]
- (KMN) Kernelized Memory Network for Video Object Segmentation | [ECCV' 20] |
[pdf]
- (GraphMemVOS) Video Object Segmentation with Episodic Graph Memory Networks | [ECCV' 20] |
[pdf]
|[code]
- (TAN-DTTM) Fast Video Object Segmentation with Temporal Aggregation Network and Dynamic Template Matching | [CVPR' 20] |
[pdf]
- (TTVOS) TTVOS: Lightweight Video Object Segmentation with Adaptive Template Attention Module and Temporal Consistency Loss | [arxiv' 20] |
[pdf]
- (F2Net) F2Net: Learning to Focus on the Foreground for Unsupervised Video Object Segmentation | [arxiv' 20] |
[pdf]
- (HS2S) Hybrid-S2S: Video Object Segmentation with Recurrent Networks and Correspondence Matching | [arxiv' 20] |
[pdf]
|[code]
- (STGNN) Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation | [arxiv' 20] |
[pdf]
- (DTMNet) Dual Temporal Memory Network for Efficient Video Object Segmentation | [arxiv' 20] |
[pdf]
FT==Online Fine-tuning/Learning
OF==Optical Flow
Method | year | Technique | J | F | J&F | FPS |
---|---|---|---|---|---|---|
OSVOS | 2017 | FT | 79.8 | 80.6 | 80.2 | 0.1 |
MaskTrack | 2017 | FT+OF | 79.7 | 75.4 | 77.6 | 0.08 |
OSVOS-S | 2018 | FT | 85.6 | 87.5 | 86.6 | 0.22 |
Lucid | 2019 | FT+OF | 83.9 | 82.0 | 83.0 | |
SiamMask | 2019 | | 71.7 | 67.8 | 69.8 | 55 |
RANet | 2019 | | 85.5 | 85.4 | 85.5 | 30 |
RANet | 2019 | FT | 86.6 | 87.6 | 87.1 | 0.25 |
FEELVOS | 2019 | | 81.1 | 82.2 | 81.7 | 2.2 |
STCNN | 2019 | | 83.8 | 83.8 | 83.8 | |
MHP-VOS | 2019 | | 87.6 | 89.5 | 88.6 | |
A-GAME | 2019 | | 81.5 | 82.2 | 81.9 | 15 |
STM | 2019 | | 88.7 | 90.1 | 89.4 | 6.25 |
MuG-W | 2020 | | 65.7 | 63.6 | 64.7 | |
Siam R-CNN | 2020 | | 76.8 | 80.4 | 78.6 | 4.2 |
e-OSVOS | 2020 | FT | 86.6 | 87.0 | 86.8 | 0.29 |
GC | 2020 | | 87.6 | 85.7 | 86.6 | 25 |
FRTM-VOS | 2020 | | | | 83.5 | 21.9 |
PMVOS | 2020 | | 86.1 | 85.1 | 85.6 | 54 |
TTVOS | 2020 | | | | 83.8 | 39.6 |
STGNN | 2020 | | 85.4 | 86.0 | 85.7 | 6 |
CFBI | 2020 | | 88.3 | 90.5 | 89.4 | 5 |
KMN | 2020 | | 89.5 | 91.5 | 90.5 | 8.3 |
FTMU | 2020 | | 77.5 | | 78.9 | 11.1 |
DTMNet | 2020 | | 85.9 | 84.9 | 85.4 | 8.3 |
SAT | 2020 | | 82.6 | 83.6 | 83.1 | 39 |
Method | year | Technique | J | F | J&F | FPS |
---|---|---|---|---|---|---|
OSVOS | 2017 | FT | 56.6 | 63.9 | 60.3 | 0.1 |
OSVOS-S | 2018 | FT | 64.7 | 71.3 | 68.0 | 0.22 |
SiamMask | 2019 | | 54.3 | 58.5 | 56.4 | 55 |
RANet | 2019 | | 63.2 | 68.2 | 65.7 | 30 |
FEELVOS | 2019 | | 69.1 | 74.0 | 71.6 | 2.2 |
STCNN | 2019 | | 58.7 | 64.6 | 61.7 | |
MHP-VOS | 2019 | | 73.4 | 78.9 | 76.2 | |
A-GAME | 2019 | | 68.5 | 73.6 | 71.1 | 15 |
AGSS-VOS | 2019 | | 64.9 | 69.9 | 67.4 | 10 |
RVOS | 2019 | | 57.5 | 63.6 | 60.6 | |
STM | 2019 | | 79.2 | 84.3 | 81.8 | 6.25 |
MuG-W | 2020 | | 54.1 | 58.0 | 56.1 | |
Siam R-CNN | 2020 | | 66.1 | 75.0 | 70.6 | 3.1 |
e-OSVOS | 2020 | FT | 74.4 | 80.0 | 77.2 | 0.29 |
GC | 2020 | | 69.3 | 73.5 | 71.4 | 25 |
FRTM-VOS | 2020 | | | | 76.7 | 21.9 |
PMVOS | 2020 | | 71.2 | 76.7 | 74.0 | 54 |
TVOS | 2020 | | 69.9 | 74.7 | 72.3 | 37 |
TTVOS | 2020 | | | | 67.8 | 39.6 |
STGNN | 2020 | | 71.5 | 77.9 | 74.7 | 6 |
AFB-URR | 2020 | | 73.0 | 76.1 | 74.6 | 4 |
STM-cycle+GC | 2020 | | 69.3 | 75.3 | 72.3 | 9.3 |
CFBI | 2020 | | 79.1 | 84.6 | 81.9 | 5 |
KMN | 2020 | | 80.0 | 85.6 | 82.8 | 8.3 |
GraphMemVOS | 2020 | | 80.2 | 85.2 | 82.8 | 5 |
TAN-DTTM | 2020 | | 72.3 | 79.4 | 75.9 | 7.1 |
FTMU | 2020 | | 69.1 | | 70.6 | 11.1 |
LWLVOS | 2020 | | 79.1 | 84.1 | 81.6 | |
DTMNet | 2020 | | 69.1 | 73.9 | 71.5 | 5.9 |
SAT | 2020 | | 68.6 | 76.0 | 72.3 | 39 |
Method | year | Technique | J | F | J&F | FPS |
---|---|---|---|---|---|---|
OSVOS | 2017 | FT | 47.0 | 54.8 | 50.9 | 0.1 |
OSVOS-S | 2018 | FT | 52.9 | 62.1 | 57.5 | 0.22 |
Lucid | 2019 | FT+OF | 63.4 | 69.9 | 66.6 | |
SiamMask | 2019 | | 40.6 | 45.8 | 43.2 | 55 |
RANet | 2019 | | 53.4 | 57.3 | 55.4 | 30 |
FEELVOS | 2019 | | 55.1 | 60.4 | 57.8 | 2.2 |
MHP-VOS | 2019 | | 66.4 | 72.7 | 69.5 | |
A-GAME | 2019 | | 49.2 | 55.3 | 52.3 | 15 |
AGSS-VOS | 2019 | | 54.8 | 59.7 | 57.2 | 10 |
CapsuleVOS | 2019 | | 47.4 | 55.2 | 51.3 | 13.5 |
RVOS | 2019 | | 47.9 | 52.6 | 50.3 | |
STM | 2019 | | 69.3 | 75.2 | 72.2 | 6.25 |
Siam R-CNN | 2020 | | 48.0 | 58.6 | 53.3 | 3.1 |
e-OSVOS | 2020 | FT | 60.9 | 68.6 | 64.8 | 0.29 |
PMVOS | 2020 | | 59.5 | 65.3 | 62.4 | 54 |
TVOS | 2020 | | 58.8 | 67.4 | 63.1 | 37 |
STGNN | 2020 | | 59.7 | 66.5 | 63.1 | 6 |
STM-cycle+GC | 2020 | | 55.3 | 62.0 | 58.6 | 6.9 |
CFBI | 2020 | | 71.1 | 78.5 | 74.8 | 5 |
KMN | 2020 | | 74.1 | 80.3 | 77.2 | 8.3 |
TAN-DTTM | 2020 | | 61.3 | 70.3 | 65.4 | 7.1 |
Method | year | Technique | Overall | FPS |
---|---|---|---|---|
OSVOS | 2017 | FT | 58.8 | 0.1 |
SiamMask | 2019 | | 52.8 | 55 |
CapsuleVOS | 2019 | | 62.3 | 13.5 |
AGSS-VOS | 2019 | | 71.3 | 12.5 |
PMVOS | 2020 | | 68.6 | 54 |
TVOS | 2020 | | 67.8 | 37 |
HS2S | 2020 | | 68.9 | |
e-OSVOS | 2020 | FT | 71.4 | 0.29 |
GC | 2020 | | 73.2 | 25 |
FRTM-VOS | 2020 | | 72.1 | 21.9 |
STGNN | 2020 | | 73.0 | 6 |
AFB-URR | 2020 | | 79.6 | 4 |
STM-cycle+GC | 2020 | | 70.8 | 13.8 |
CFBI | 2020 | | 81.4 | 5 |
KMN | 2020 | | 81.4 | 8.3 |
GraphMemVOS | 2020 | | 80.2 | 5 |
LWLVOS | 2020 | | 81.5 | |
DTMNet | 2020 | | 65.6 | |
SAT | 2020 | | 63.6 | 39 |
- (DeepSORT) Simple online and realtime tracking with a deep association metric | [ICIP' 17] |
[pdf]
|[code]
- (OSMN) Efficient video object segmentation via network modulation | [CVPR' 18] |
[pdf]
|[code]
- (MaskTrack R-CNN) Video instance segmentation | [ICCV' 19] |
[pdf]
|[code]
- (VIS2019 Winner) Video instance segmentation 2019: A winning approach for combined detection, segmentation, classification and tracking | [ICCV Workshops' 19] |
[pdf]
- (SipMask) SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation | [ECCV' 20] |
[pdf]
|[code]
- (STEm-Seg) STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos | [ECCV' 20] |
[pdf]
|[code]
- (MaskProp) Classifying, segmenting, and tracking object instances in video with mask propagation | [CVPR' 20] |
[pdf]
- (RGNN-VIS) Learning Video Instance Segmentation with Recurrent Graph Neural Networks | [arxiv' 20] |
[pdf]
- (CompFeat) CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation | [AAAI' 21] |
[pdf]
- (Transformer) End-to-End Video Instance Segmentation with Transformers | [arxiv' 20] |
[pdf]
Method | year | Technique | Overall | FPS |
---|---|---|---|---|
DeepSORT | 2017 | | 26.1 | |
OSMN | 2018 | | 27.5 | |
MaskTrack R-CNN | 2019 | | 30.3 | 20 |
VIS2019 Winner | 2019 | | 44.8 | <1 |
SipMask | 2020 | | 32.5 | 30 |
SipMask ms-train | 2020 | | 33.7 | 30 |
STEm-Seg | 2020 | | 34.6 | 7 |
MaskProp | 2020 | | 46.6 | <2 |
RGNN-VIS | 2020 | | 37.7 | 25 |
CompFeat | 2021 | | 35.3 | |
Transformer | 2020 | | 35.3 | 27.7/57.7 |
Method | year | Technique | Overall | FPS |
---|---|---|---|---|
DeepSORT | 2017 | | 27.2 | |
OSMN | 2018 | | 27.3 | |
MaskTrack R-CNN | 2019 | | 32.3 | 20 |
- (PFPN) Panoptic Feature Pyramid Networks | [CVPR' 19] |
[pdf]
- (AdaptIS) AdaptIS: Adaptive Instance Selection Network | [ICCV' 19] |
[pdf]
|[code]
- (VPS) Video Panoptic Segmentation | [CVPR' 20] |
[pdf]
|[code]
- (Axial-DeepLab) Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation | [ECCV' 20] |
[pdf]
|[code]
- (EfficientPS) EfficientPS: Efficient Panoptic Segmentation | [arxiv' 20] |
[pdf]
|[code]
- (Panoptic-DeepLab) Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation | [CVPR' 20] |
[pdf]
|[code]
Method | year | Technique | Cityscapes VAL/TEST PQ | COCO Panoptic PQ | KITTI Panoptic PQ |
---|---|---|---|---|---|
PFPN | 2019 | | 58.1/ | 43.0 | 39.3 |
AdaptIS | 2019 | | 62.0/ | | |
VPS | 2020 | | 62.2/ | | |
Axial-DeepLab | 2020 | | 68.5/66.6 | 43.9 | |
EfficientPS | 2020 | | 67.5/67.1 | | 43.7 |
Panoptic-DeepLab | 2020 | | 64.1/65.5 | | |
If you have any suggestions about papers, feel free to mail me :)