awesome video object/instance segmentation

A curated list of awesome video object/instance segmentation resources.

Pull requests are welcome to update this repo.

Last updated: 2020/12/19

What is video object segmentation and video instance segmentation?

Video object segmentation is a binary labeling problem aiming to separate specific foreground object(s) from the background region of a video, and each object mask should be linked across frames.

Video instance segmentation not only need to segment foreground object(s)/instance(s), but also have to identify the category of each object.

Dataset

Table of Contents

Video object segmentation

  • A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation | [CVPR' 16] |[pdf]
  • (OSVOS) One-Shot Video Object Segmentation | [CVPR' 17] |[pdf] | [code]
  • (MaskTrack) Learning Video Object Segmentation from Static Images | [CVPR' 17] |[pdf] | [code]
  • (OSVOS-S) Video object segmentation without temporal information | [TPAMI' 18] |[pdf]
  • (Lucid) Lucid Data Dreaming for Video Object Segmentation | [IJCV' 19] |[pdf] | [code]
  • (PTSNet) Proposal, Tracking and Segmentation (PTS): A Cascaded Network for Video Object Segmentation | [arxiv' 19] |[pdf] | [code]
  • (SiamMask) Fast Online Object Tracking and Segmentation: A Unifying Approach | [CVPR' 19] |[pdf] | [code]
  • (RANet) RANet: Ranking Attention Network for Fast Video Object Segmentation | [ICCV' 19] |[pdf] | [code]
  • (FEELVOS) Feelvos: Fast end-to-end embedding learning for video object segmentation | [CVPR' 19] |[pdf] | [code]
  • (STCNN) Spatiotemporal CNN for Video Object Segmentation | [CVPR' 19] |[pdf] | [code]
  • (MHP-VOS) MHP-VOS: Multiple Hypotheses Propagation for Video Object Segmentation | [CVPR' 19] |[pdf] | [code]
  • (A-GAME) A Generative Appearance Model for End-to-end Video Object Segmentation | [CVPR' 19] |[pdf] | [code]
  • (RVOS) RVOS: End-to-End Recurrent Network for Video Object Segmentation | [CVPR' 19] |[pdf] | [code]
  • (CapsuleVOS) CapsuleVOS: Semi-Supervised Video Object Segmentation Using Capsule Routing | [ICCV' 19] |[pdf] | [code]
  • (AGSS-VOS) AGSS-VOS: Attention Guided Single-Shot Video Object Segmentation | [ICCV' 19] |[pdf] | [code]
  • (STM) Video Object Segmentation using Space-Time Memory Networks | [ICCV' 19] |[pdf] | [code]
  • (e-OSVOS) Make One-Shot Video Object Segmentation Efficient Again | [NIPS' 20] |[pdf] | [code]
  • (GC) Fast Video Object Segmentation using the Global Context Module | [CVPR' 20] |[pdf]
  • (FRTM-VOS) Learning Fast and Robust Target Models for Video Object Segmentation | [CVPR' 20] |[pdf] | [code]
  • (PMVOS) PMVOS: Pixel-Level Matching-Based Video Object Segmentation | [arxiv' 20] |[pdf]
  • (CFBI) Collaborative Video Object Segmentation by Foreground-Background Integration | [ECCV' 20] |[pdf] | [code]
  • (TVOS) A Transductive Approach for Video Object Segmentation | [CVPR' 20] |[pdf] | [code]
  • (Siam R-CNN) Siam R-CNN: Visual Tracking by Re-Detection | [CVPR' 20] |[pdf] | [code]
  • (MuG-W) Learning Video Object Segmentation from Unlabeled Videos | [CVPR' 20] |[pdf] | [code]
  • (LWLVOS) Learning What to Learn for Video Object Segmentation | [CVPR' 20] |[pdf]
  • (FTMU) Fast Template Matching and Update for Video Object Tracking and Segmentation | [CVPR' 20] |[pdf] | [code]
  • (SAT) State-Aware Tracker for Real-Time Video Object Segmentation | [CVPR' 20] |[pdf] | [code]
  • (AFB-URR) Video Object Segmentation with Adaptive Feature Bank and Uncertain-Region Refinement | [NIPS' 20] |[pdf] | [code]
  • (STM-cycle+GC) Delving into the Cyclic Mechanism in Semi-supervised Video Object Segmentation | [NIPS' 20] |[pdf] | [code]
  • (KMN) Kernelized Memory Network for Video Object Segmentation | [ECCV' 20] |[pdf]
  • (GraphMemVOS) Video Object Segmentation with Episodic Graph Memory Networks | [ECCV' 20] |[pdf]) | [code]
  • (TAN-DTTM) Fast Video Object Segmentation with Temporal Aggregation Network and Dynamic Template Matching | [CVPR' 20] |[pdf]
  • (TTVOS) TTVOS: Lightweight Video Object Segmentation with Adaptive Template Attention Module and Temporal Consistency Loss | [arxiv' 20] |[pdf]
  • (F2Net) F2Net: Learning to Focus on the Foreground for Unsupervised Video Object Segmentation | [arxiv' 20] |[pdf]
  • (HS2S) Hybrid-S2S: Video Object Segmentation with Recurrent Networks and Correspondence Matching | [arxiv' 20] |[pdf] | [code]
  • (STGNN) Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation | [arxiv' 20] |[pdf]
  • (DTMNet) Dual Temporal Memory Network for Efficient Video Object Segmentation | [arxiv' 20] |[pdf]

Performance

FT==Online Fine-tuning/Learning

OF==Optical FLow

DAVIS16 VAL

Method year Technique J F J&F FPS
OSVOS 2017 FT 79.8 80.6 80.2 0.1
MaskTrack 2017 FT+OF 79.7 75.4 77.6 0.08
OSVOS-S 2018 FT 85.6 87.5 86.6 0.22
Lucid 2019 FT+OF 83.9 82.0 83.0
SiamMask 2019 71.7 67.8 69.8 55
RANet 2019 85.5 85.4 85.5 30
RANet 2019 FT 86.6 87.6 87.1 0.25
FEELVOS 2019 81.1 82.2 81.7 2.2
STCNN 2019 83.8 83.8 83.8
MHP-VOS 2019 87.6 89.5 88.6
A-GAME 2019 81.5 82.2 81.9 15
STM 2019 88.7 90.1 89.4 6.25
MuG-W 2020 65.7 63.6 64.7
Siam R-CNN 2020 76.8 80.4 78.6 4.2
e-OSVOS 2020 FT 86.6 87.0 86.8 0.29
GC 2020 87.6 85.7 86.6 25
FRTM-VOS 2020 83.5 21.9
PMVOS 2020 86.1 85.1 85.6 54
TTVOS 2020 83.8 39.6
STGNN 2020 85.4 86.0 85.7 6
CFBI 2020 88.3 90.5 89.4 5
KMN 2020 89.5 91.5 90.5 8.3
FTMU 2020 77.5 78.9 11.1
DTMNet 2020 85.9 84.9 85.4 8.3
SAT 2020 82.6 83.6 83.1 39

DAVIS17 VAL

Method year Technique J F J&F FPS
OSVOS 2017 FT 56.6 63.9 60.3 0.1
OSVOS-S 2018 FT 64.7 71.3 68.0 0.22
SiamMask 2019 54.3 58.5 56.4 55
RANet 2019 63.2 68.2 65.7 30
FEELVOS 2019 69.1 74.0 71.6 2.2
STCNN 2019 58.7 64.6 61.7
MHP-VOS 2019 73.4 78.9 76.2
A-GAME 2019 68.5 73.6 71.1 15
AGSS-VOS 2019 64.9 69.9 67.4 10
RVOS 2019 57.5 63.6 60.6
STM 2019 79.2 84.3 81.8 6.25
MuG-W 2020 54.1 58.0 56.1
Siam R-CNN 2020 66.1 75.0 70.6 3.1
e-OSVOS 2020 FT 74.4 80.0 77.2 0.29
GC 2020 69.3 73.5 71.4 25
FRTM-VOS 2020 76.7 21.9
PMVOS 2020 71.2 76.7 74.0 54
TVOS 2020 69.9 74.7 72.3 37
TTVOS 2020 67.8 39.6
STGNN 2020 71.5 77.9 74.7 6
AFB-URR 2020 73.0 76.1 74.6 4
STM-cycle+GC 2020 69.3 75.3 72.3 9.3
CFBI 2020 79.1 84.6 81.9 5
KMN 2020 80.0 85.6 82.8 8.3
GraphMemVOS 2020 80.2 85.2 82.8 5
TAN-DTTM 2020 72.3 79.4 75.9 7.1
FTMU 2020 69.1 70.6 11.1
LWLVOS 2020 79.1 84.1 81.6
DTMNet 2020 69.1 73.9 71.5 5.9
SAT 2020 68.6 76.0 72.3 39

DAVIS17 TEST

Method year Technique J F J&F FPS
OSVOS 2017 FT 47.0 54.8 50.9 0.1
OSVOS-S 2018 FT 52.9 62.1 57.5 0.22
Lucid 2019 FT+OF 63.4 69.9 66.6
SiamMask 2019 40.6 45.8 43.2 55
RANet 2019 53.4 57.3 55.4 30
FEELVOS 2019 55.1 60.4 57.8 2.2
MHP-VOS 2019 66.4 72.7 69.5
A-GAME 2019 49.2 55.3 52.3 15
AGSS-VOS 2019 54.8 59.7 57.2 10
CapsuleVOS 2019 47.4 55.2 51.3 13.5
RVOS 2019 47.9 52.6 50.3
STM 2019 69.3 75.2 72.2 6.25
Siam R-CNN 2020 48.0 58.6 53.3 3.1
e-OSVOS 2020 FT 60.9 68.6 64.8 0.29
PMVOS 2020 59.5 65.3 62.4 54
TVOS 2020 58.8 67.4 63.1 37
STGNN 2020 59.7 66.5 63.1 6
STM-cycle+GC 2020 55.3 62.0 58.6 6.9
CFBI 2020 71.1 78.5 74.8 5
KMN 2020 74.1 80.3 77.2 8.3
TAN-DTTM 2020 61.3 70.3 65.4 7.1

YouTube-VOS VAL

Method year Technique Overall FPS
OSVOS 2017 FT 58.8 0.1
SiamMask 2019 52.8 55
CapsuleVOS 2019 62.3 13.5
AGSS-VOS 2019 71.3 12.5
PMVOS 2020 68.6 54
TVOS 2020 67.8 37
HS2S 2020 68.9
e-OSVOS 2020 FT 71.4 0.29
GC 2020 73.2 25
FRTM-VOS 2020 72.1 21.9
STGNN 2020 73.0 6
AFB-URR 2020 79.6 4
STM-cycle+GC 2020 70.8 13.8
CFBI 2020 81.4 5
KMN 2020 81.4 8.3
GraphMemVOS 2020 80.2 5
LWLVOS 2020 81.5
DTMNet 2020 65.6
SAT 2020 63.6 39

Video instance segmentation

  • (DeepSORT) Simple online and realtime tracking with a deep association metric | [ICIP' 17] |[pdf] | [code]
  • (OSMN) Efficient video object segmentation via network modulation | [CVPR' 18] |[pdf] | [code]
  • (MaskTrack R-CNN) Video instance segmentation | [ICCV' 19] |[pdf] | [code]
  • (VIS2019 Winner) Video instance segmentation 2019: A winning approach for combined detection, segmentation, classification and tracking | [ICCV Workshops' 19] |[pdf]
  • (SipMask) SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation | [ECCV' 20] |[pdf] | [code]
  • (STEm-Seg) STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos | [ECCV' 20] |[pdf] | [code]
  • (MaskProp) Classifying, segmenting, and tracking object instances in video with mask propagation | [CVPR' 20] |[pdf]
  • (RGNN-VIS) Learning Video Instance Segmentation with Recurrent Graph Neural Networks | [arxiv' 20] |[pdf]
  • (CompFeat) CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation | [AAAI' 21] |[pdf]
  • (Transformer) End-to-End Video Instance Segmentation with Transformers | [arxiv' 20] |[pdf]

Performance

YouTube-VIS VAL

Method year Technique Overall FPS
DeepSORT 2017 26.1
OSMN 2018 27.5
MaskTrack R-CNN 2019 30.3 20
VIS2019 Winner 2019 44.8 <1
SipMask 2020 32.5 30
SipMask ms-train 2020 33.7 30
STEm-Seg 2020 34.6 7
MaskProp 2020 46.6 <2
RGNN-VIS 2020 37.7 25
CompFeat 2021 35.3
Transformer 2020 35.3 27.7/57.7

YouTube-VIS TEST

Method year Technique Overall FPS
DeepSORT 2017 27.2
OSMN 2018 27.3
MaskTrack R-CNN 2019 32.3 20

Video panoptic segmentation

  • (PFPN) Panoptic Feature Pyramid Networks | [CVPR' 19] |[pdf]
  • (AdaptIS) AdaptIS: Adaptive Instance Selection Network | [ICCV' 19] |[pdf] | [code]
  • (VPS) Video Panoptic Segmentation | [CVPR' 20] |[pdf] | [code]
  • (Axial-DeepLab) Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation | [ECCV' 20] |[pdf] | [code]
  • (EfficientPS) EfficientPS: Efficient Panoptic Segmentation | [arxiv' 20] |[pdf] | [code]
  • (Panoptic-DeepLab) Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation | [CVPR' 20] |[pdf] | [code]

Cityscapes VAL

Method year Technique Cityscapes VAL/TEST PQ COCO Panoptic PQ KITTI Panoptic PQ
PFPN 2019 58.1/ 43.0 39.3
AdaptIS 2019 62.0/
VPS 2020 62.2/
Axial-DeepLab 2020 68.5/66.6 43.9
EfficientPS 2020 67.5/67.1 43.7
Panoptic-DeepLab 2020 64.1/65.5

Contact & Feedback

If you have any suggestions about papers, feel free to mail me :)