VideoINR

This repository contains the official implementation for VideoINR introduced in the following paper:

VideoINR: Learning Video Implicit Neural Representation for Continuous Space-Time Super-Resolution
Zeyuan Chen, Yinbo Chen, Jingwen Liu, Xingqian Xu, Vidit Goel, Zhangyang Wang, Humphrey Shi, Xiaolong Wang
CVPR 2022

You can find more visual results and a brief introduction to VideoINR on our project page.

Method Overview

Two consecutive input frames are concatenated and encoded as a discrete feature map. Based on the feature, the spatial and temporal implicit neural representations decode a 3D space-time coordinate to a motion flow vector. We then sample a new feature vector by warping according to the motion flow, and decode it as the RGB prediction of the query coordinate.
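
As a rough illustration of the warping step only, the toy sketch below looks up a feature at a position displaced by a motion flow. It uses a single-channel feature map stored as nested lists. The names `toy_flow`, `bilinear_sample`, and `query` are hypothetical and not part of this repository; in VideoINR the flow is predicted by the learned implicit representations and the sampled feature is decoded to RGB by a network.

```python
import math

def bilinear_sample(feat, x, y):
    """Bilinearly interpolate a 2D grid `feat[row][col]` at continuous (x, y).

    Coordinates are in pixel units and are clamped to the grid borders.
    """
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom

def toy_flow(x, y, t):
    """Hypothetical motion flow: a constant rightward drift scaled by time t.

    In the real model this vector is predicted by the implicit representations.
    """
    return (1.0 * t, 0.0)

def query(feat, x, y, t):
    """Look up the feature for the space-time coordinate (x, y, t)."""
    dx, dy = toy_flow(x, y, t)
    return bilinear_sample(feat, x + dx, y + dy)

# A 2x3 feature map whose value increases from left to right.
feature_map = [[0.0, 1.0, 2.0],
               [0.0, 1.0, 2.0]]
print(query(feature_map, 0.0, 0.0, 0.0))  # t = 0: no displacement
print(query(feature_map, 0.0, 0.0, 0.5))  # t = 0.5: sampled half a pixel to the right
```

Because the query coordinate is continuous in both space and time, the same lookup works for any intermediate time t and any sub-pixel position.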

Citation

If you find our work useful in your research, please cite:

@inproceedings{chen2022vinr,
  author    = {Chen, Zeyuan and Chen, Yinbo and Liu, Jingwen and Xu, Xingqian and Goel, Vidit and Wang, Zhangyang and Shi, Humphrey and Wang, Xiaolong},
  title     = {VideoINR: Learning Video Implicit Neural Representation for Continuous Space-Time Super-Resolution},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
}

Environmental Setup

The code has been tested with:

Python 3.6
PyTorch 1.4.0
torchvision 0.5.0
CUDA 10.1
Deformable Convolution v2. Following Zooming Slow-Mo, we adopt CharlesShang's implementation in the submodule.

If you are using Anaconda, the following commands can be used to build the environment:

conda create -n videoinr python=3.6
conda activate videoinr
conda install pytorch=1.4 torchvision -c pytorch

pip install opencv-python pillow tqdm pyyaml
cd models/modules/DCNv2/
python setup.py install
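
After installation, a quick sanity check along the following lines can confirm that the Python packages are visible in the active environment. This is a minimal sketch, not a script shipped with the repository: `check_environment` is a hypothetical helper, and it only checks that the modules can be found, not their versions or whether the DCNv2 extension built successfully.

```python
import importlib.util
import sys

# Import names for the packages installed above
# (opencv-python -> cv2, pillow -> PIL, pyyaml -> yaml).
REQUIRED_MODULES = ["torch", "torchvision", "cv2", "PIL", "tqdm", "yaml"]

def check_environment(modules=REQUIRED_MODULES):
    """Return {module_name: True/False} for whether each module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}

if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, found in check_environment().items():
        print(f"{name:12s} {'ok' if found else 'MISSING'}")
```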

Demo

Download the pre-trained model from Google Drive.
Convert your video of interest to a sequence of images. Many tools can do this, e.g. ffmpeg and Adobe Premiere Pro.

The folder that contains this image sequence should be structured as follows:

data_path
├── img_1.jpg
├── img_2.jpg
├── ...
├── img_n.jpg
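
One possible way to produce such a folder is to call ffmpeg from Python, as sketched below. The helper names `build_ffmpeg_command` and `extract_frames` are hypothetical and not part of this repository. The sketch assumes ffmpeg is on your PATH and that the img_1.jpg, img_2.jpg, ... naming shown above is what demo.py expects.

```python
import os
import subprocess

def build_ffmpeg_command(video_path, data_path):
    """Build an ffmpeg command that writes frames as img_1.jpg, img_2.jpg, ..."""
    pattern = os.path.join(data_path, "img_%d.jpg")
    return ["ffmpeg", "-i", video_path, "-start_number", "1", pattern]

def extract_frames(video_path, data_path):
    """Create data_path if needed and run ffmpeg (must be on PATH)."""
    os.makedirs(data_path, exist_ok=True)
    subprocess.run(build_ffmpeg_command(video_path, data_path), check=True)
```

For example, `extract_frames("my_video.mp4", "data_path")` would write every frame of my_video.mp4 into data_path.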

Use VideoINR to perform space-time super-resolution. You can adjust the up-sampling scales by setting different values for space_scale and time_scale.

python demo.py --space_scale 4 --time_scale 8 --data_path [YOUR_DATA_PATH]

The output consists of three folders: the low-resolution images, the bicubic up-sampled images, and the results of VideoINR.

Preparing Dataset

We use the Adobe240 dataset for training.

Download the zip file here, which contains the original high-FPS videos.
To extract the frames of each video into a separate folder, set videoFolder to where you saved the extracted videos and frameFolder to DATASET_PATH in generate_frames_from_adobe240fps.py, then run it. This automatically splits the data into train/test/val sets.

python generate_frames_from_adobe240fps.py

Training

Configure the training settings, which can be found in options/train. The default training setting is train_zsm.yml. You need to change a few lines in the config file for it to run on your machine:
- name & mode (Line 12 & 13): As mentioned in the paper, we adopt a two-stage training strategy, so there are two different modes for the training set: Adobe and Adobe_a (refer to Line 47 in data/__init__.py). Adobe fixes the down-sampling scale to 4, while Adobe_a randomly samples down-sampling scales in [2, 4]. For the first stage (0 - 450000 iterations), set name & mode to Adobe. For the second stage (450000 - 600000 iterations), set name & mode to Adobe_a.
- dataroot_GT & dataroot_LQ (Line 17 & 18): Path to the Adobe240 dataset. Set both to DATASET_PATH/train (dataroot_LQ is not used in the current implementation).
- models & training_state (Line 47 & 48): Paths where the model parameters and the training state (for resuming training) are saved.
Run the training code. The default setting requires four RTX 2080 Ti GPUs. Note that to apply the two-stage training strategy, you might have to run train.py twice, once per stage.

python train.py -opt options/train/train_zsm.yml
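
The two-stage schedule described above can be summarized in a few lines of Python. This is only an illustration of the schedule, not code from the repository: in practice the mode is switched by editing the config file and running train.py again. The function names are hypothetical, assigning iteration 450000 to the second stage is an assumption, and so is drawing the Adobe_a scale uniformly from [2, 4].

```python
import random

STAGE_1_END = 450000   # iterations trained with mode "Adobe"
TOTAL_ITERS = 600000   # stage 2 ("Adobe_a") ends here

def mode_for_iteration(iteration):
    """Return the dataset mode used at a given training iteration."""
    if not 0 <= iteration < TOTAL_ITERS:
        raise ValueError("iteration outside the training schedule")
    return "Adobe" if iteration < STAGE_1_END else "Adobe_a"

def sample_down_scale(mode, rng=random):
    """Return the spatial down-sampling scale for one training sample.

    Assumption: "Adobe_a" draws a continuous scale uniformly from [2, 4];
    the actual sampling in the repository may differ.
    """
    if mode == "Adobe":
        return 4.0
    if mode == "Adobe_a":
        return rng.uniform(2.0, 4.0)
    raise ValueError(f"unknown mode: {mode}")
```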

Additional Note

Throughout the training process, we calculate the loss by summing the per-pixel distances between the prediction and the ground truth. However, this can be unreasonable for stage 2 (450000 - 600000 iterations), since the ground-truth images have different resolutions, resulting in different loss scales. Using the mean distance as the loss value in stage 2 can help the final model performance.
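
The scale issue can be seen with a small, self-contained example. The sketch below uses a plain absolute difference as the per-pixel distance, which is an assumption made for illustration and may differ from the distance used in the training code. The helper names are hypothetical.

```python
def l1_sum(pred, target):
    """Sum of absolute differences over all pixels."""
    return sum(abs(p - t) for p, t in zip(pred, target))

def l1_mean(pred, target):
    """Mean absolute difference per pixel."""
    return l1_sum(pred, target) / len(pred)

def constant_error_image(height, width, error):
    """Flattened prediction/target pair where every pixel is off by `error`."""
    n = height * width
    return [error] * n, [0.0] * n

small = constant_error_image(64, 64, 0.5)     # ground truth with fewer pixels
large = constant_error_image(128, 128, 0.5)   # ground truth with 4x as many pixels

print(l1_sum(*small), l1_sum(*large))    # summed loss grows with the pixel count
print(l1_mean(*small), l1_mean(*large))  # mean loss is the same for both
```

With the same per-pixel error, the summed loss of the larger image is four times that of the smaller one, while the mean loss is identical, so samples with different ground-truth resolutions contribute on the same scale.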

Many thanks to @sichun233746 for his testing!

Acknowledgments

Our code is built on Zooming-Slow-Mo-CVPR-2020 and LIIF. We thank the authors for sharing their code!
