
YoURVOS

Show Me When and Where: Towards Referring Video Object Segmentation in the Wild

Table of Contents

1. Overview
2. Benchmark
3. Baseline (OMFormer)

1. Overview

YoURVOS (YouTube Untrimmed Referring Video Object Segmentation) is a benchmark that closes the gap between Referring Video Object Segmentation (RVOS) research and realistic scenarios. Unlike previous RVOS benchmarks, whose videos are trimmed so that the text-referred objects are always present, YoURVOS is built from untrimmed videos: the targets can appear at any time and at any place in a video. This poses great challenges to RVOS methods, which must show not only when but also where the objects appear.
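The "when" side of the task is typically scored with temporal IoU (tIoU) between the predicted and ground-truth time spans of the target. A minimal sketch, assuming spans are `(start, end)` frame indices (the helper name `tiou` is ours; the official `tiou.py` evaluator may aggregate per-video scores differently):

```python
def tiou(pred_span, gt_span):
    """Temporal IoU between two (start, end) frame spans.

    Illustrative helper only; the benchmark's tiou.py is the
    authoritative implementation.
    """
    s1, e1 = pred_span
    s2, e2 = gt_span
    # Overlap length, clamped at zero for disjoint spans.
    inter = max(0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0
```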

2. Benchmark

| Method | Venue | Backbone | J&F | J | F | tIoU |
| --- | --- | --- | --- | --- | --- | --- |
| ReferFormer | CVPR 2022 | ResNet-50 | 12.0 | 12.1 | 11.9 | 32.2 |
| ReferFormer | CVPR 2022 | ResNet-101 | 22.4 | 22.4 | 22.4 | 33.4 |
| ReferFormer | CVPR 2022 | Swin-T | 22.6 | 22.7 | 22.6 | 34.1 |
| ReferFormer | CVPR 2022 | Swin-L | 24.9 | 24.6 | 25.2 | 34.4 |
| ReferFormer | CVPR 2022 | V-Swin-T | 23.0 | 22.8 | 23.1 | 33.7 |
| ReferFormer | CVPR 2022 | V-Swin-B | 24.6 | 24.3 | 24.8 | 34.5 |
| LBDT | CVPR 2022 | ResNet-50 | 14.6 | 14.6 | 14.5 | 32.6 |
| MTTR | CVPR 2022 | V-Swin-T | 21.4 | 21.3 | 21.6 | 33.6 |
| UNINEXT | CVPR 2023 | ResNet-50 | 23.1 | 22.9 | 23.3 | 32.6 |
| UNINEXT | CVPR 2023 | Conv-L | 24.2 | 23.9 | 24.5 | 32.6 |
| UNINEXT | CVPR 2023 | ViT-L | 24.8 | 24.4 | 25.2 | 32.6 |
| R2VOS | ICCV 2023 | ResNet-50 | 24.9 | 25.0 | 24.9 | 35.3 |
| LMPM | ICCV 2023 | Swin-T | 13.0 | 12.8 | 13.3 | 21.9 |
| DEVA | ICCV 2023 | Swin-L | 21.9 | 21.6 | 22.2 | 33.6 |
| OnlineRefer | ICCV 2023 | ResNet-50 | 22.5 | 22.4 | 22.5 | 33.8 |
| OnlineRefer | ICCV 2023 | Swin-L | 25.0 | 24.4 | 25.6 | 34.9 |
| SgMg | ICCV 2023 | V-Swin-T | 24.3 | 24.1 | 24.5 | 34.4 |
| SgMg | ICCV 2023 | V-Swin-B | 25.3 | 25.1 | 25.5 | 34.7 |
| SOC | NeurIPS 2023 | V-Swin-T | 23.5 | 23.2 | 23.8 | 34.4 |
| SOC | NeurIPS 2023 | V-Swin-B | 24.2 | 23.8 | 24.6 | 33.6 |
| MUTR | AAAI 2024 | ResNet-50 | 22.4 | 22.3 | 22.6 | 33.3 |
| MUTR | AAAI 2024 | ResNet-101 | 23.3 | 23.1 | 23.4 | 33.7 |
| MUTR | AAAI 2024 | Swin-L | 26.2 | 25.9 | 26.5 | 35.1 |
| MUTR | AAAI 2024 | V-Swin-T | 23.2 | 23.1 | 23.4 | 33.5 |
| MUTR | AAAI 2024 | V-Swin-B | 25.7 | 25.5 | 26.0 | 34.6 |
| OMFormer | Ours 2024 | ResNet-50 | 33.7 | 33.6 | 33.8 | 44.9 |
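In the table, J is the standard region similarity (mask IoU), F is the contour/boundary accuracy, and J&F is their mean. A minimal sketch of J, assuming binary NumPy masks (the helper name `region_similarity` is ours; the benchmark's `vos-benchmark` evaluator is the authoritative scorer):

```python
import numpy as np

def region_similarity(pred_mask, gt_mask):
    """Region similarity J: IoU between binary segmentation masks.

    Illustrative only; the official evaluator also computes the
    boundary F-measure and reports their mean as J&F.
    """
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Both masks empty counts as a perfect match.
    return inter / union if union else 1.0
```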

3. Baseline (OMFormer)

Object-level Multimodal transFormers (OMFormer) for RVOS.

Install and Run

We tested the code with Python 3.9, PyTorch 1.10.1, and CUDA 11.3.

```shell
cd baseline
pip install -r requirements.txt
cd models/ops
python setup.py build install
cd ../..
python inference_yourvos.py \
  --freeze_text_encoder \
  --output_dir [path to output] \
  --resume [path to checkpoint]/omformer_r50.pth \
  --ngpu [number of gpus] \
  --batch_size 1 \
  --backbone resnet50 \
  --yourvos_path [path to YoURVOS]
```

Checkpoint (on Hugging Face 🤗): omformer_r50.pth

YoURVOS test videos (on Hugging Face 🤗): YoURVOS

Evaluate

```shell
cd evaluation/vos-benchmark
# J&F
python benchmark.py -g [path to gt] -m [path to predicts] --do_not_skip_first_and_last_frame
# tIoU
python tiou.py [path to predicts] spans.txt
```

Framework