awesome video object/instance segmentation

A curated list of awesome video object/instance segmentation resources.

Pull requests are welcome to update this repo.

Last updated: 2020/12/19

What is video object segmentation and video instance segmentation?

Video object segmentation is a binary labeling problem that aims to separate specific foreground object(s) from the background region of a video; each object's mask must also be linked across frames.

Video instance segmentation must not only segment the foreground object(s)/instance(s) but also identify the category of each one.
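The difference between the two tasks can be sketched as a data structure. This is a minimal illustration, assuming boolean NumPy masks; the `TrackedObject` container and its field names are hypothetical and do not come from any dataset toolkit.

```python
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class TrackedObject:
    """One object followed through a video (hypothetical container)."""
    track_id: int                   # identity that links the masks across frames
    masks: List[np.ndarray]         # one boolean H x W mask per frame
    category: Optional[str] = None  # None for VOS; a class label for VIS


# Toy 2-frame video, 4 x 4 pixels: the object moves one pixel to the right.
frame0 = np.zeros((4, 4), dtype=bool)
frame0[1:3, 0:2] = True
frame1 = np.zeros((4, 4), dtype=bool)
frame1[1:3, 1:3] = True

# VOS: masks linked across frames, no category required.
vos_result = TrackedObject(track_id=1, masks=[frame0, frame1])

# VIS: the same linked masks plus a predicted category.
vis_result = TrackedObject(track_id=1, masks=[frame0, frame1], category="person")
```

In both cases `track_id` is what ties the per-frame masks together; only the VIS result carries a class label.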

Dataset

DAVIS [Download]
YouTube-VOS and YouTube-VIS [Download]
Cityscapes [Download]
CamVid [Download]
COCO [Download]

Table of Contents

Video object segmentation
Video instance segmentation
Video panoptic segmentation

Video object segmentation

A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation | [CVPR' 16] |[pdf]
(OSVOS) One-Shot Video Object Segmentation | [CVPR' 17] |[pdf] | [code]
(MaskTrack) Learning Video Object Segmentation from Static Images | [CVPR' 17] |[pdf] | [code]
(OSVOS-S) Video object segmentation without temporal information | [TPAMI' 18] |[pdf]
(Lucid) Lucid Data Dreaming for Video Object Segmentation | [IJCV' 19] |[pdf] | [code]
(PTSNet) Proposal, Tracking and Segmentation (PTS): A Cascaded Network for Video Object Segmentation | [arxiv' 19] |[pdf] | [code]
(SiamMask) Fast Online Object Tracking and Segmentation: A Unifying Approach | [CVPR' 19] |[pdf] | [code]
(RANet) RANet: Ranking Attention Network for Fast Video Object Segmentation | [ICCV' 19] |[pdf] | [code]
(FEELVOS) FEELVOS: Fast End-to-End Embedding Learning for Video Object Segmentation | [CVPR' 19] |[pdf] | [code]
(STCNN) Spatiotemporal CNN for Video Object Segmentation | [CVPR' 19] |[pdf] | [code]
(MHP-VOS) MHP-VOS: Multiple Hypotheses Propagation for Video Object Segmentation | [CVPR' 19] |[pdf] | [code]
(A-GAME) A Generative Appearance Model for End-to-end Video Object Segmentation | [CVPR' 19] |[pdf] | [code]
(RVOS) RVOS: End-to-End Recurrent Network for Video Object Segmentation | [CVPR' 19] |[pdf] | [code]
(CapsuleVOS) CapsuleVOS: Semi-Supervised Video Object Segmentation Using Capsule Routing | [ICCV' 19] |[pdf] | [code]
(AGSS-VOS) AGSS-VOS: Attention Guided Single-Shot Video Object Segmentation | [ICCV' 19] |[pdf] | [code]
(STM) Video Object Segmentation using Space-Time Memory Networks | [ICCV' 19] |[pdf] | [code]
(e-OSVOS) Make One-Shot Video Object Segmentation Efficient Again | [NIPS' 20] |[pdf] | [code]
(GC) Fast Video Object Segmentation using the Global Context Module | [CVPR' 20] |[pdf]
(FRTM-VOS) Learning Fast and Robust Target Models for Video Object Segmentation | [CVPR' 20] |[pdf] | [code]
(PMVOS) PMVOS: Pixel-Level Matching-Based Video Object Segmentation | [arxiv' 20] |[pdf]
(CFBI) Collaborative Video Object Segmentation by Foreground-Background Integration | [ECCV' 20] |[pdf] | [code]
(TVOS) A Transductive Approach for Video Object Segmentation | [CVPR' 20] |[pdf] | [code]
(Siam R-CNN) Siam R-CNN: Visual Tracking by Re-Detection | [CVPR' 20] |[pdf] | [code]
(MuG-W) Learning Video Object Segmentation from Unlabeled Videos | [CVPR' 20] |[pdf] | [code]
(LWLVOS) Learning What to Learn for Video Object Segmentation | [CVPR' 20] |[pdf]
(FTMU) Fast Template Matching and Update for Video Object Tracking and Segmentation | [CVPR' 20] |[pdf] | [code]
(SAT) State-Aware Tracker for Real-Time Video Object Segmentation | [CVPR' 20] |[pdf] | [code]
(AFB-URR) Video Object Segmentation with Adaptive Feature Bank and Uncertain-Region Refinement | [NIPS' 20] |[pdf] | [code]
(STM-cycle+GC) Delving into the Cyclic Mechanism in Semi-supervised Video Object Segmentation | [NIPS' 20] |[pdf] | [code]
(KMN) Kernelized Memory Network for Video Object Segmentation | [ECCV' 20] |[pdf]
(GraphMemVOS) Video Object Segmentation with Episodic Graph Memory Networks | [ECCV' 20] |[pdf] | [code]
(TAN-DTTM) Fast Video Object Segmentation with Temporal Aggregation Network and Dynamic Template Matching | [CVPR' 20] |[pdf]
(TTVOS) TTVOS: Lightweight Video Object Segmentation with Adaptive Template Attention Module and Temporal Consistency Loss | [arxiv' 20] |[pdf]
(F2Net) F2Net: Learning to Focus on the Foreground for Unsupervised Video Object Segmentation | [arxiv' 20] |[pdf]
(HS2S) Hybrid-S2S: Video Object Segmentation with Recurrent Networks and Correspondence Matching | [arxiv' 20] |[pdf] | [code]
(STGNN) Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation | [arxiv' 20] |[pdf]
(DTMNet) Dual Temporal Memory Network for Efficient Video Object Segmentation | [arxiv' 20] |[pdf]

Performance

FT==Online Fine-tuning/Learning

OF==Optical Flow
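The J, F and J&F columns in the DAVIS tables follow the DAVIS benchmark convention: J is region similarity (the Jaccard index, i.e. mask IoU), F is contour accuracy (a boundary F-measure), and J&F is the mean of the two. A minimal sketch of J and J&F, assuming boolean NumPy masks of equal shape; the function names are illustrative, F is not computed here, and scoring two empty masks as 1.0 is an assumption of this sketch.

```python
import numpy as np


def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """Jaccard index (IoU) between two binary masks of the same shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks are empty: treated as a perfect match (assumption).
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def j_and_f(j: float, f: float) -> float:
    """J&F is the arithmetic mean of J and F."""
    return (j + f) / 2.0


# Toy masks: prediction covers 4 pixels, ground truth covers 6, overlap is 4.
pred = np.zeros((4, 4), dtype=bool)
pred[0:2, 0:2] = True
gt = np.zeros((4, 4), dtype=bool)
gt[0:2, 0:3] = True

print(region_similarity(pred, gt))    # intersection 4 / union 6
print(round(j_and_f(79.8, 80.6), 1))  # J and F of the OSVOS row on DAVIS16 VAL
```

Benchmark scores are averages of these per-frame values over objects and sequences, so a single mask pair only illustrates the definition.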

DAVIS16 VAL

Method	year	Technique	J	F	J&F	FPS
OSVOS	2017	FT	79.8	80.6	80.2	0.1
MaskTrack	2017	FT+OF	79.7	75.4	77.6	0.08
OSVOS-S	2018	FT	85.6	87.5	86.6	0.22
Lucid	2019	FT+OF	83.9	82.0	83.0
SiamMask	2019		71.7	67.8	69.8	55
RANet	2019		85.5	85.4	85.5	30
RANet	2019	FT	86.6	87.6	87.1	0.25
FEELVOS	2019		81.1	82.2	81.7	2.2
STCNN	2019		83.8	83.8	83.8
MHP-VOS	2019		87.6	89.5	88.6
A-GAME	2019		81.5	82.2	81.9	15
STM	2019		88.7	90.1	89.4	6.25
MuG-W	2020		65.7	63.6	64.7
Siam R-CNN	2020		76.8	80.4	78.6	4.2
e-OSVOS	2020	FT	86.6	87.0	86.8	0.29
GC	2020		87.6	85.7	86.6	25
FRTM-VOS	2020				83.5	21.9
PMVOS	2020		86.1	85.1	85.6	54
TTVOS	2020				83.8	39.6
STGNN	2020		85.4	86.0	85.7	6
CFBI	2020		88.3	90.5	89.4	5
KMN	2020		89.5	91.5	90.5	8.3
FTMU	2020		77.5		78.9	11.1
DTMNet	2020		85.9	84.9	85.4	8.3
SAT	2020		82.6	83.6	83.1	39

DAVIS17 VAL

Method	year	Technique	J	F	J&F	FPS
OSVOS	2017	FT	56.6	63.9	60.3	0.1
OSVOS-S	2018	FT	64.7	71.3	68.0	0.22
SiamMask	2019		54.3	58.5	56.4	55
RANet	2019		63.2	68.2	65.7	30
FEELVOS	2019		69.1	74.0	71.6	2.2
STCNN	2019		58.7	64.6	61.7
MHP-VOS	2019		73.4	78.9	76.2
A-GAME	2019		68.5	73.6	71.1	15
AGSS-VOS	2019		64.9	69.9	67.4	10
RVOS	2019		57.5	63.6	60.6
STM	2019		79.2	84.3	81.8	6.25
MuG-W	2020		54.1	58.0	56.1
Siam R-CNN	2020		66.1	75.0	70.6	3.1
e-OSVOS	2020	FT	74.4	80.0	77.2	0.29
GC	2020		69.3	73.5	71.4	25
FRTM-VOS	2020				76.7	21.9
PMVOS	2020		71.2	76.7	74.0	54
TVOS	2020		69.9	74.7	72.3	37
TTVOS	2020				67.8	39.6
STGNN	2020		71.5	77.9	74.7	6
AFB-URR	2020		73.0	76.1	74.6	4
STM-cycle+GC	2020		69.3	75.3	72.3	9.3
CFBI	2020		79.1	84.6	81.9	5
KMN	2020		80.0	85.6	82.8	8.3
GraphMemVOS	2020		80.2	85.2	82.8	5
TAN-DTTM	2020		72.3	79.4	75.9	7.1
FTMU	2020		69.1		70.6	11.1
LWLVOS	2020		79.1	84.1	81.6
DTMNet	2020		69.1	73.9	71.5	5.9
SAT	2020		68.6	76.0	72.3	39

DAVIS17 TEST

Method	year	Technique	J	F	J&F	FPS
OSVOS	2017	FT	47.0	54.8	50.9	0.1
OSVOS-S	2018	FT	52.9	62.1	57.5	0.22
Lucid	2019	FT+OF	63.4	69.9	66.6
SiamMask	2019		40.6	45.8	43.2	55
RANet	2019		53.4	57.3	55.4	30
FEELVOS	2019		55.1	60.4	57.8	2.2
MHP-VOS	2019		66.4	72.7	69.5
A-GAME	2019		49.2	55.3	52.3	15
AGSS-VOS	2019		54.8	59.7	57.2	10
CapsuleVOS	2019		47.4	55.2	51.3	13.5
RVOS	2019		47.9	52.6	50.3
STM	2019		69.3	75.2	72.2	6.25
Siam R-CNN	2020		48.0	58.6	53.3	3.1
e-OSVOS	2020	FT	60.9	68.6	64.8	0.29
PMVOS	2020		59.5	65.3	62.4	54
TVOS	2020		58.8	67.4	63.1	37
STGNN	2020		59.7	66.5	63.1	6
STM-cycle+GC	2020		55.3	62.0	58.6	6.9
CFBI	2020		71.1	78.5	74.8	5
KMN	2020		74.1	80.3	77.2	8.3
TAN-DTTM	2020		61.3	70.3	65.4	7.1

YouTube-VOS VAL

Method	year	Technique	Overall	FPS
OSVOS	2017	FT	58.8	0.1
SiamMask	2019		52.8	55
CapsuleVOS	2019		62.3	13.5
AGSS-VOS	2019		71.3	12.5
PMVOS	2020		68.6	54
TVOS	2020		67.8	37
HS2S	2020		68.9
e-OSVOS	2020	FT	71.4	0.29
GC	2020		73.2	25
FRTM-VOS	2020		72.1	21.9
STGNN	2020		73.0	6
AFB-URR	2020		79.6	4
STM-cycle+GC	2020		70.8	13.8
CFBI	2020		81.4	5
KMN	2020		81.4	8.3
GraphMemVOS	2020		80.2	5
LWLVOS	2020		81.5
DTMNet	2020		65.6
SAT	2020		63.6	39

Video instance segmentation

(DeepSORT) Simple online and realtime tracking with a deep association metric | [ICIP' 17] |[pdf] | [code]
(OSMN) Efficient video object segmentation via network modulation | [CVPR' 18] |[pdf] | [code]
(MaskTrack R-CNN) Video instance segmentation | [ICCV' 19] |[pdf] | [code]
(VIS2019 Winner) Video instance segmentation 2019: A winning approach for combined detection, segmentation, classification and tracking | [ICCV Workshops' 19] |[pdf]
(SipMask) SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation | [ECCV' 20] |[pdf] | [code]
(STEm-Seg) STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos | [ECCV' 20] |[pdf] | [code]
(MaskProp) Classifying, segmenting, and tracking object instances in video with mask propagation | [CVPR' 20] |[pdf]
(RGNN-VIS) Learning Video Instance Segmentation with Recurrent Graph Neural Networks | [arxiv' 20] |[pdf]
(CompFeat) CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation | [AAAI' 21] |[pdf]
(Transformer) End-to-End Video Instance Segmentation with Transformers | [arxiv' 20] |[pdf]

Performance

YouTube-VIS VAL

Method	year	Overall	FPS
DeepSORT	2017	26.1
OSMN	2018	27.5
MaskTrack R-CNN	2019	30.3	20
VIS2019 Winner	2019	44.8	<1
SipMask	2020	32.5	30
SipMask ms-train	2020	33.7	30
STEm-Seg	2020	34.6	7
MaskProp	2020	46.6	<2
RGNN-VIS	2020	37.7	25
CompFeat	2021	35.3
Transformer	2020	35.3	27.7/57.7

YouTube-VIS TEST

Method	year	Overall	FPS
DeepSORT	2017	27.2
OSMN	2018	27.3
MaskTrack R-CNN	2019	32.3	20

Video panoptic segmentation

(PFPN) Panoptic Feature Pyramid Networks | [CVPR' 19] |[pdf]
(AdaptIS) AdaptIS: Adaptive Instance Selection Network | [ICCV' 19] |[pdf] | [code]
(VPS) Video Panoptic Segmentation | [CVPR' 20] |[pdf] | [code]
(Axial-DeepLab) Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation | [ECCV' 20] |[pdf] | [code]
(EfficientPS) EfficientPS: Efficient Panoptic Segmentation | [arxiv' 20] |[pdf] | [code]
(Panoptic-DeepLab) Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation | [CVPR' 20] |[pdf] | [code]

Cityscapes VAL

Method	year	Cityscapes VAL/TEST PQ	COCO Panoptic PQ	KITTI Panoptic PQ
PFPN	2019	58.1/	43.0	39.3
AdaptIS	2019	62.0/
VPS	2020	62.2/
Axial-DeepLab	2020	68.5/66.6	43.9
EfficientPS	2020	67.5/67.1		43.7
Panoptic-DeepLab	2020	64.1/65.5

Contact & Feedback

If you have any suggestions about papers, feel free to mail me :)

bo-miao/awsome-video-object-segmentation
