hzwer/ECCV2022-RIFE

Optimizing RIFE for denoising

Opened this issue · 3 comments

We are starting to use RIFE for temporal video denoising (Asd-g/AviSynthPlus-RIFE#2). But the quality of its motion compensation is not ideal, so it causes either blurring (with a simple averaging blending tool) or reduced denoising power (with a blending engine that protects against bad blends).

The idea of using RIFE for denoising: RIFE creates an interpolation (already a partial two-frame denoise) of the 'before' (current-N) and 'after' (current+N) frames with the time parameter set to 0.5 (i.e. interpolated to the 'current' time between -N and +N), and the result is blended (spatially only) with the current frame, with a weight of 1/3 for the current frame and 2/3 for the RIFE output frame. If everything goes ideally, with perfect RIFE motion interpolation, we get an average of the noise in 3 frames and therefore roughly a SQRT(3) reduction of photon shot noise in natural video. If we need more denoising, we combine many RIFE-interpolated frames (+-1, +-2 and so on).
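To make the blending step concrete, here is a minimal NumPy sketch of that weighting; `rife_interpolate` is a hypothetical wrapper around whatever RIFE inference call is available, not an actual function from this repository:

```python
import numpy as np

def denoise_3frame(prev, cur, nxt, rife_interpolate):
    """Blend the current frame with a RIFE interpolation of its neighbours.

    prev, cur, nxt: float32 arrays (H, W, C) in [0, 1].
    rife_interpolate: any callable returning the t=0.5 interpolation of two
    frames (hypothetical wrapper around the actual RIFE inference).
    """
    # The RIFE output already averages the noise of two frames, so it gets
    # weight 2/3 and the current frame 1/3 -> each source frame contributes
    # 1/3, giving roughly a sqrt(3) reduction of zero-mean shot noise.
    mid = rife_interpolate(prev, nxt, t=0.5)
    return (1.0 / 3.0) * cur + (2.0 / 3.0) * mid
```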

RIFE still does not produce ideal motion interpolation (and with its two-frame basis the temporal aliasing grows as +-N increases and with fast and/or repetitive motion), so we have some issues and some ideas on how to make this denoising work better.

The benefit of using RIFE over existing hardware or software solutions for motion search/compensation is that it runs mostly on the GPU, including a significant part of the blending work, so plenty of CPU resources are left for MPEG encoding. RIFE may also perform better on noisy content, and it does not produce block artifacts and so does not require additional deblocking processing (many other motion compensation engines are based on small blocks and need extra overlapped blending to hide blockiness).

The current two (or three) ideas for making RIFE better for denoising:

  1. Is it possible to use 3 input frames (current-N, current, current+N) and train RIFE to interpolate between current-N and current+N at time 0.5 so that the result matches the 'current' frame as precisely as possible (with the restriction that samples from the current frame are not used to create the output interpolated samples)? Maybe this is close to the existing training scripts, for example for 2x FPS increase, where the network learns to interpolate between the +1 and -1 frames and the result is compared with the real 0 (current) frame for quality? Or is supporting 3 input frames too complex a redesign of the current RIFE engine?

  2. If 1. is too complex, maybe it is possible to keep the current 2-input-frame design but change the algorithm so that we can provide the 'current' and 'current+N' frames and set time=0, asking RIFE to interpolate the current+N frame onto the 'current' frame (using the 'current' frame as the reference for object placement, but with a strict requirement that the output uses only samples from the current+N frame)? This would be a 'motion compensation' mode, where we ask for complete motion compensation of the 'source current+N' frame based on the 'reference current' frame, without allowing samples of the reference frame in the output. As I understand it, if we currently provide a very low time parameter like 0.00001, the current RIFE engine will use mostly the first frame's samples in the output? (A minimal sketch of this is shown after the list.)

  3. For a much better understanding of complex motion at different speeds it would be necessary to analyse not 2 frames but a longer sequence of frames. Are RIFE versions that analyse a sequence of frames planned? Or would that be much slower and use much more GPU memory?
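A rough sketch of what idea 2 looks like with the current engine, assuming a RIFE build that accepts an arbitrary timestep (the `model.inference(img0, img1, timestep)` call is illustrative; actual builds differ). As written, a near-zero timestep mostly reproduces the first input's samples, which is exactly the limitation a real 'motion compensation' mode would need to remove:

```python
import torch

def motion_compensate(model, cur, other, t_small=1e-5):
    """Ask RIFE for a frame at t ~= 0 between 'cur' and 'other'.

    cur, other: float tensors shaped (1, 3, H, W) in [0, 1] on the model's
    device. With a near-zero timestep the current engine will mostly reuse
    the samples of the first input (cur), so this is only an approximation
    of 'place objects like cur, but use only samples from other'.
    """
    with torch.no_grad():
        # Illustrative call: actual RIFE builds pass the timestep differently
        # (scalar argument, tensor, or a fixed t=0.5 with no argument at all).
        out = model.inference(cur, other, t_small)
    return out
```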

Can we expect some help from the RIFE developers with questions 1 and 2?

hzwer commented

Sorry for the late reply.
The algorithm for 3 is under development. Since my time in the past half year has mainly been occupied by ChatGPT-related research, I am not sure when such a feature will be available.
Recently we released a frame prediction algorithm, perhaps more suitable for noise reduction scenarios? https://github.com/megvii-research/CVPR2023-DMVFN
The most difficult problem for me is that I don't know the benchmarks related to denoising, and I don't know how to carry out the experiments.

The starting benchmarks could use clean content only: simply ask RIFE to perform motion compensation of the current+N frame onto the 'current' frame view and use any image similarity/dissimilarity metric to compare the resulting frame with the real 'current' frame. The target for the learning process would then be to maximize the similarity metric (SSIM/VIF or whatever is available, maybe even simple PSNR or SAD).
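For example, a small scikit-image based scoring helper along these lines could serve as that starting benchmark; `compensated` stands for whatever a modified RIFE produces when asked to warp current+N onto 'current':

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def score_compensation(compensated, reference):
    """Compare a motion-compensated frame against the real 'current' frame.

    Both arrays are float (H, W, C) in [0, 1]. Higher is better for both
    metrics, so a training/benchmark loop would simply try to maximize them.
    """
    psnr = peak_signal_noise_ratio(reference, compensated, data_range=1.0)
    ssim = structural_similarity(reference, compensated,
                                 channel_axis=-1, data_range=1.0)
    return psnr, ssim
```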

To test how well a new design handles real-world noisy sources, simple additive Gaussian-distributed noise can be added. We are not asking you to test the denoising quality itself - that is a very simple and already universally working process of averaging motion-compensated frames (using forward and backward motion compensation of the -N and +N frames onto 'current').
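A toy NumPy check of that averaging, assuming perfect motion compensation (i.e. three noisy observations of identical content): the residual noise of the 3-frame average should come out close to sigma/sqrt(3):

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.random((480, 640)).astype(np.float32)   # stand-in "current" frame
sigma = 0.05

# Three noisy observations of the same content, i.e. perfectly compensated frames.
noisy = [clean + rng.normal(0.0, sigma, clean.shape).astype(np.float32)
         for _ in range(3)]
avg = np.mean(noisy, axis=0)

print("single-frame noise std:", np.std(noisy[0] - clean))  # ~0.05
print("3-frame average std:   ", np.std(avg - clean))       # ~0.05 / sqrt(3) ~= 0.029
```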

The most important and very hard task is motion compensation on real-world shots of image sequences damaged by noise (where a motion search and compensation engine fails, and we get blurring or even shifted duplicates of objects and so on).

Also, as image processing engines progress, we expect the number of supported transforms to increase: not only translation but perhaps rotation, scaling, lighting, skew and other 'natural' transforms of textures and objects in real-life footage (moving picture sequences). The system could then analyse many supported transforms and perform transform compensation to make the +-N frames look as close as possible to the 'current' frame, so that simple sample averaging can reduce temporal noise. In the translation-only model this is equivalent to creating a set of 'virtual tracking video cameras', one following each moving object across the frame sequence, and continuing to accumulate image data over more than one inter-frame interval (longer than the global camera's accumulation interval, which cannot exceed the inter-frame time). Each such 'virtual video camera' can therefore have an effective physical accumulation time much longer than the original inter-frame interval and gains in signal-to-noise ratio (limited by natural photon shot noise or other noise sources with zero mean).
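As a toy illustration of transform compensation beyond pure translation, one could align a neighbour frame to the current frame with a global Euclidean (rotation + translation) model using OpenCV's ECC alignment and then average; this is only a sketch of the idea (assuming OpenCV 4.x), not a proposal for how RIFE should implement it, and real footage would need per-object rather than global transforms:

```python
import cv2
import numpy as np

def euclidean_compensate_and_blend(cur, other):
    """Align 'other' to 'cur' with a rotation+translation model, then average.

    cur, other: uint8 grayscale frames of the same size.
    """
    warp = np.eye(2, 3, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 100, 1e-6)
    # Estimate the global Euclidean transform that maps 'other' onto 'cur'.
    _, warp = cv2.findTransformECC(cur, other, warp, cv2.MOTION_EUCLIDEAN,
                                   criteria, None, 5)
    aligned = cv2.warpAffine(other, warp, (cur.shape[1], cur.shape[0]),
                             flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)
    # Simple two-frame average; with good alignment this halves the noise power.
    return ((cur.astype(np.float32) + aligned.astype(np.float32)) / 2.0).astype(np.uint8)
```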

So the main, general task for an updated RIFE engine is to provide the best possible motion-compensated 'other' frames relative to the 'current' real frame (which is itself damaged by the same noise as the other frames in the first example). As an extension of motion search on noisy sources, we have had good results with a multi-generation motion search approach: an initial 'blind' motion search engine works only on the fully noisy 'current' and 'ref' input frames, and the resulting (somewhat damaged, still error-prone) motion data is passed to a first-generation denoiser. Then a two-input motion search engine takes the original noisy source at its first input and the partially denoised source at its second input, and searches for motion between the fully noisy and partially denoised frames. The refined motion data is passed to the next-stage denoiser, and several generations can be chained in a pipeline. This already shows a good improvement in the quality of the refined motion vectors by the second generation.
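In pseudocode form, that multi-generation pipeline looks roughly like this; both `motion_search` and `denoise_mc` are hypothetical stand-ins for whatever motion search and motion-compensated denoising stages are actually used:

```python
def multigeneration_denoise(cur, ref, motion_search, denoise_mc, generations=2):
    """Iteratively refine motion data between a noisy pair of frames.

    motion_search(a, b) -> motion data between frames a and b (hypothetical).
    denoise_mc(cur, ref, motion) -> partially denoised 'cur' using that motion
    (hypothetical). Generation 1 searches on two fully noisy frames; later
    generations search between the original noisy frame and the previous
    generation's partially denoised result, which refines the motion data.
    """
    motion = motion_search(cur, ref)           # blind first-generation search
    denoised = denoise_mc(cur, ref, motion)
    for _ in range(generations - 1):
        motion = motion_search(cur, denoised)  # refine with a partially clean input
        denoised = denoise_mc(cur, ref, motion)
    return denoised, motion
```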

"Recently we provided a frame prediction algorithm, perhaps more suitable for noise reduction scenarios?https://github.com/megvii-research/CVPR2023-DMVFN"

From the document published in that repository I can see active development of neural-network image processing, with motion prediction and compensation still in progress. But in the list of possible applications in the article:
"Video prediction aims to predict future video frames
from the current ones. The task potentially benefits the
study on representation learning [40] and downstream fore-
casting tasks such as human motion prediction [39], au-
tonomous driving [6], and climate change [48], etc. Dur-
ing the last decade, video prediction has been increasingly
studied in both academia and industry community [5, 7]."

One more very important use case of motion estimation and compensation (or extrapolation in this case) is denoising natural shot video content to make it much more compressible by current MPEG codecs, and perhaps even the design of next-generation video codecs that separate textures from the detected transform data in a compact form, for a significant compression benefit over current MPEG codecs, which are very limited in this respect (few reference textures detected, and only the translation transform analysed and used).

So it would be good to keep this use case in the list of possible applications and put some design effort into supporting natural noisy and possibly not-very-sharp footage (like film transfers with film grain and limited sharpness).

In my tests RIFE performs generally well on clean, low-noise, sharp sources from a good quality Full HD 3-chip ENG/EFP-class video camera, but it fails almost totally on soft and grainy film transfers (it cannot track/detect motion). Maybe the model I used is not ideal, or maybe the model was not trained on soft and grainy sources - I do not know.

About the implementation of 2 frames in, 1 predicted next (or previous) frame out: yes, it can be tested in tr=2 temporal denoising (see the sketch below):
feed n-2 and n-1 as the t-1 and t frames for forward extrapolation from the 2 past frames;
feed n+2 and n+1 as the pair of 2 next frames for backward extrapolation into the past (the engine should be time-axis symmetric and not know the real direction of the time axis - a sort of TENET-movie idea);
take the 2 frames predicted from the 2 previous and 2 next frames, interleave them with the current n frame, and pass the result to a blending engine like vsTTempSmooth (sample-based) or mvtools (block-based).
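A minimal sketch of that tr=2 scheme, where `predict_next(f_prev, f_cur)` is a hypothetical wrapper around a 2-frames-in / next-frame-out prediction model such as DMVFN:

```python
def tr2_candidates(frames, n, predict_next):
    """Build the two predicted versions of frame n described above.

    frames: indexable sequence of frames; predict_next(f_prev, f_cur) is a
    hypothetical wrapper around a 2-in/1-out frame prediction model.
    Forward: predict n from (n-2, n-1). Backward: reverse the time axis and
    predict n from (n+2, n+1).
    """
    fwd = predict_next(frames[n - 2], frames[n - 1])   # extrapolate forward
    bwd = predict_next(frames[n + 2], frames[n + 1])   # extrapolate "backward"
    # These two candidates plus the real frame n can then be interleaved and
    # handed to a blending/averaging stage (e.g. vsTTempSmooth or mvtools).
    return fwd, bwd
```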

Unfortunately, I got a user note that no current implementation of this PyTorch-based algorithm can be used in our video processing environment, Avisynth:
"If someone makes a ncnn/vulkan compatible version then possibly avs version could materialize. None of the direct pytorch variants of any project can run directly in avs."

So we need to find and ask more developers of an intermediate API to get this implementation working in Avisynth for testing - the same way we currently have a Vulkan-based implementation of RIFE (and some other NN image processing) that can be converted and loaded into AVS via the AVS-MLRT plugin: https://github.com/Asd-g/avs-mlrt .