MISO-VFI

Official implementation of "A Multi-In-Single-Out Network for Video Frame Interpolation without Optical Flow"

MISO-VFI: A Multi-In-Single-Out Network for Video Frame Interpolation without Optical Flow

Abstract

In general, deep learning-based video frame interpolation (VFI) methods have predominantly focused on estimating motion vectors between two input frames and warping them to the target time. While this approach has shown impressive performance for linear motion between two input frames, it exhibits limitations when dealing with occlusions and nonlinear movements. Recently, generative models have been applied to VFI to address these issues. However, as VFI is not a task focused on generating plausible images, but rather on predicting accurate intermediate frames between two given frames, performance limitations still persist. In this paper, we propose a multi-in-single-out (MISO) based VFI method that does not rely on motion vector estimation, allowing it to effectively model occlusions and nonlinear motion. Additionally, we introduce a novel motion perceptual loss that enables MISO-VFI to better capture the spatio-temporal correlations within the video frames. Our MISO-VFI method achieves state-of-the-art results on VFI benchmarks Vimeo90K, Middlebury, and UCF101, with a significant performance gap compared to existing approaches.
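As a concrete (unofficial) illustration of the motion perceptual loss idea, the sketch below compares predicted and ground-truth clips in the feature space of a frozen, action-recognition-pretrained 3D CNN (such as the Kinetics-pretrained MARS backbone referenced under Training below). The feature tap, distance, and weighting here are placeholders, not the paper's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionPerceptualLoss(nn.Module):
    # Sketch: spatio-temporal perceptual loss under a frozen 3D CNN.
    # `backbone_3d` is assumed to map a clip (B, C, T, H, W) to a
    # feature tensor; MISO-VFI's actual feature taps may differ.
    def __init__(self, backbone_3d: nn.Module):
        super().__init__()
        self.backbone = backbone_3d.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False  # the loss network stays fixed

    def forward(self, pred_clip: torch.Tensor, gt_clip: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            gt_feat = self.backbone(gt_clip)  # target features, no grad
        pred_feat = self.backbone(pred_clip)  # gradients flow to the predictions
        return F.l1_loss(pred_feat, gt_feat)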

Citation

@article{lee2023multi,
  title={A Multi-In-Single-Out Network for Video Frame Interpolation without Optical Flow},
  author={Lee, Jaemin and Seo, Minseok and Lee, Sangwoo and Park, Hyobin and Choi, Dong-Geol},
  journal={arXiv preprint arXiv:2311.11602},
  year={2023}
}

Announcement 🎉

  • Nov. 2023: We achieved Rank 1 (state-of-the-art) on the Vimeo90K and UCF-101 benchmarks on Papers with Code.
  • Nov. 2023: Pretrained models are released.

How to use?

Requirements

  • torch==1.10.0
  • torchvision==0.11.0
  • einops==0.6.1
  • timm==0.9.7
  • imageio
  • scikit-image
  • numpy
  • opencv-python
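Assuming pip and a CUDA build matching your system (torch 1.10.0 may require an extra index URL for your CUDA version), the requirements can be installed in one line:

$ pip install torch==1.10.0 torchvision==0.11.0 einops==0.6.1 timm==0.9.7 imageio scikit-image numpy opencv-python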

Datasets

Vimeo90K dataset: We used Xue, Tianfan, et al.'s dataset for the Vimeo90K experiments.

$ mkdir /data
$ cd /data
$ wget http://data.csail.mit.edu/tofu/dataset/vimeo_septuplet.zip # (82G) For 2-3-2
$ unzip vimeo_septuplet.zip
# or
$ wget http://data.csail.mit.edu/tofu/dataset/vimeo_triplet.zip # (33G) For 1-1-1
$ unzip vimeo_triplet.zip
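For reference, here is how we read the 2-3-2 / 1-1-1 naming: the past and future frames are the multi-frame inputs, and the middle frame(s) are the interpolation targets. A minimal sketch (the repository's dataloader may index frames differently):

def split_context_targets(frames, n_past, n_target, n_future):
    # frames: time-ordered list, len == n_past + n_target + n_future.
    # 2-3-2: a septuplet gives 4 input frames and 3 targets;
    # 1-1-1: a triplet gives 2 input frames and 1 target.
    assert len(frames) == n_past + n_target + n_future
    inputs = frames[:n_past] + frames[-n_future:]   # surrounding context
    targets = frames[n_past:n_past + n_target]      # frames to interpolate
    return inputs, targets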

Training & Evaluation

Train

$ python main.py --dataname vimeo --epochs 100 --batch_size 8 --test_batch_size 8 --vgg_loss True --perceptual True --mars_weight ./weights/kinetics-pretrained.pth --num_workers 16 --ex_name vimeo_exp --lr 0.001 --gpu 0,1,2,3 --data_root /data/vimeo_septuplet

The MARS Kinetics pre-trained weights can be obtained from Google Drive (we referred to the model of MARS).
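A hedged sketch of how the flags above could combine into the training objective: a pixel reconstruction term plus the VGG perceptual term (--vgg_loss) and the MARS-based motion perceptual term (--perceptual). The lambda weights below are illustrative placeholders, not the repository's values.

import torch.nn.functional as F

def total_loss(pred, gt, vgg_loss_fn, motion_loss_fn,
               lambda_vgg=1.0, lambda_motion=1.0):
    # pred/gt: predicted and ground-truth target frames (or clips)
    loss = F.l1_loss(pred, gt)                               # pixel term
    loss = loss + lambda_vgg * vgg_loss_fn(pred, gt)         # --vgg_loss
    loss = loss + lambda_motion * motion_loss_fn(pred, gt)   # --perceptual
    return loss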

Evaluation

$ python test.py --dataname vimeo --test_batch_size 4 --num_workers 16 --weight ./weights/vimeo-pretrained.pth --gpu 0,1,2,3 --data_root /data/vimeo_septuplet # 2-3-2

# ---
# Results
# test ssim:0.9552, psnr:37.19

$ python test.py --dataname vimeo-triplet --test_batch_size 4 --num_workers 16 --weight ./weights/vimeo-pretrained.pth --gpu 0,1,2,3 --data_root /data/vimeo_triplet --out_frame 1 --in_shape 2 3 256 448 # 1-1-1
# ---
# Results
# test ssim:0.9714, psnr:38.457
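The reported SSIM/PSNR can be sanity-checked per frame with scikit-image (already in the requirements); note that test.py may average over sequences rather than individual frames.

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(pred: np.ndarray, gt: np.ndarray):
    # pred, gt: uint8 H x W x 3 frames in the same color space
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    return psnr, ssim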

If you want the Vimeo90K pre-trained weights (1-1-1), access this Google Drive.

Note

  • Empirically, you can achieve higher performance with larger batch sizes and more training epochs.

Thanks to