VSGAN-tensorrt-docker

Repository for running super resolution and video frame interpolation models and speeding them up with TensorRT. It aims to contain the fastest inference code you can find, or at least I am trying to achieve that. Not every model can use TensorRT for various reasons, but I add support whenever it works. Further model architectures are planned to be added later on.

I also created a Google Colab notebook.

Currently working networks:

Model | CUDA | TensorRT | ncnn
ESRGAN | - | yes (torch_tensorrt / C++ TRT) | yes, but compile yourself (realsr ncnn models)
SRVGGNetCompact | - | yes (onnx_tensorrt / C++ TRT) | yes, but compile yourself (2x)
Rife | yes (rife40, rife41) | yes (v2, v3) | yes
SwinIR | yes | - | -
Sepconv | yes | - | -
EGVSR | yes | - | -
BasicVSR++ | yes | - | -
Waifu2x | - | yes (C++ TRT) | -
RealBasicVSR | yes | - | -
RealCUGAN | yes | yes (C++ TRT) | -
FILM | yes | - | -
DPIR | - | yes (C++ TRT) | -
PAN | yes | - | -
IFRNet | yes | - | -
M2M | yes | - | -
IFUNet | yes | - | -
eisai | yes | - | -
SCUNet | yes | - | -
GMFupSS | yes | - | -
ST-MFNet | yes | - | -
VapSR | - | yes (C++ TRT) | -
GMFSS_union | yes (vanilla / wgan) | - | -
GMFSS_Fortuna / GMFSS_Fortuna_union | base / union | - | -

Some important things:

  • ncnn does not work in WSL, which means it currently does not work on Windows. ncnn only works if you use Docker on Linux.
  • If you are on Windows, install all the latest updates first, otherwise WSL won't work properly. Windows 21H2 is the minimum.
  • Avoid webm input; webm is often broken. It can work, but don't complain about broken output afterwards. I suggest re-encoding webm to mp4 or mkv first.
  • Only use ffmpeg to determine whether a video is variable framerate (vfr); other programs do not seem reliable.
  • Processing vfr video is dangerous, but you can try fpsnum and fpsden. Either use these parameters or re-encode the input video to constant framerate (cfr).
  • x264 can be faster than ffmpeg.
  • The C++ VS rife extension can be faster than CUDA.
  • Colab instances have a weak CPU, so you should try x264 with --opencl. (The A100 does not support NVENC and similar.)

Usage

PSA FOR WINDOWS USERS: Docker Desktop 4.17.1 is broken. Download either 4.16.3 or 4.17.0. Both worked on my Windows 10. I would recommend using 4.16.3, since another person confirmed it to work on Windows 11. 4.17.1 (which is currently the latest) results in Docker not starting, which is mentioned in this issue.

# install docker, command for arch
yay -S docker nvidia-docker nvidia-container-toolkit docker-compose docker-buildx

# Download the prebuilt image from Docker Hub (recommended)
docker pull styler00dollar/vsgan_tensorrt:latest

# Build the docker image manually
# This step is not needed if you already downloaded the image and is only needed if you
# want to build it from scratch. Keep in mind that you need to set env variables differently on Windows and
# this command will only work on Linux. Run it inside the repository directory.
DOCKER_BUILDKIT=1 docker build -t styler00dollar/vsgan_tensorrt:latest .
# If you want to rebuild from scratch or have errors, try to build without cache
DOCKER_BUILDKIT=1 docker build --no-cache -t styler00dollar/vsgan_tensorrt:latest . 

# run the docker with docker-compose
# go into the vsgan folder, inside that folder should be compose.yaml, run this command
# you can adjust folder mounts in the yaml file
# afterwards the vsgan folder will be mounted under `/workspace/tensorrt` and you can navigate 
# into it with `cd tensorrt`
docker-compose run --rm vsgan_tensorrt

# if you have `unauthorized: authentication required` problems, download the docker with 
git clone https://github.com/NotGlop/docker-drag
cd docker-drag
python docker_pull.py styler00dollar/vsgan_tensorrt:latest
docker load -i styler00dollar_vsgan_tensorrt.tar

# run docker with the sh startup script (linux)
sh start_docker.sh

# run docker manually
# the folderpath before ":" will be mounted in the path which follows afterwards
# contents of the vsgan folder should appear inside /workspace/tensorrt
docker run --privileged --gpus all -it --rm -v /home/vsgan_path/:/workspace/tensorrt styler00dollar/vsgan_tensorrt:latest

# you can use it in various ways, ffmpeg example
vspipe -c y4m inference.py - | ffmpeg -i pipe: example.mkv -y
# nvencc example
vspipe -c y4m inference.py - | nvencc -i pipe: --codec av1 -o example.mkv
# x264 example
vspipe -c y4m inference.py - | x264 - --demuxer y4m -o example.mkv -y
# x265 example
vspipe -c y4m inference.py - | x265 - --y4m -o example.mkv -y

# example without vspipe
ffmpeg -f vapoursynth -i inference.py example.mkv -y

# Models are kept outside of the docker image to minimize download size and are downloaded on demand when you run the code.
# If you want specific models you can look in https://github.com/styler00dollar/VSGAN-tensorrt-docker/releases/tag/models 
# or use the download scripts to get all of them. Models are expected to be placed under models/

If docker does not want to start, try this before you use docker:

# fixing docker errors
sudo systemctl start docker
sudo chmod 666 /var/run/docker.sock

Windows is mostly similar, but the path needs to be changed slightly:

Example for C:\path
docker run --privileged --gpus all -it --rm -v /mnt/c/path:/workspace/tensorrt vsgan_tensorrt:latest
docker run --privileged --gpus all -it --rm -v //c/path:/workspace/tensorrt vsgan_tensorrt:latest

Usage example

A small, minimalistic example of how to configure inference. If you only want to process one video, edit the video path in inference.py

video_path = "test.mkv"

and then edit inference_config.py. Small example:

import sys
sys.path.append("/workspace/tensorrt/")
import vapoursynth as vs
core = vs.core
vs_api_below4 = vs.__api_version__.api_major < 4
core.num_threads = 4
core.std.LoadPlugin(path="/usr/lib/x86_64-linux-gnu/libffms2.so")

from src.rife import RIFE
from src.vfi_inference import vfi_inference

def inference_clip(video_path):
    clip = core.ffms2.Source(source=video_path, cache=False)
    clip = vs.core.resize.Bicubic(clip, format=vs.RGBS, matrix_in_s="709")
    # apply one or multiple models, will be applied in order
    model_inference = RIFE(scale=1, fastmode=False, ensemble=True, model_version="rife46", fp16=True)
    clip = vfi_inference(model_inference=model_inference, clip=clip, multi=2)
    # return clip
    clip = vs.core.resize.Bicubic(clip, format=vs.YUV420P8, matrix_s="709")
    return clip

Then use the commands above to render. For example:

vspipe -c y4m inference.py - | ffmpeg -i pipe: example.mkv

The video will be rendered without audio and other attachments. You can add those manually to the ffmpeg command.

To process videos in batch and copy their properties like audio and subtitles to the output file, you need to use main.py. Edit the file paths and the file extension:

input_dir = "/workspace/tensorrt/input/"
output_dir = "/workspace/tensorrt/output/"
files = glob.glob(input_dir + "/**/*.webm", recursive=True)

and configure inference_config.py as desired. Afterwards just run

python main.py

Video guide (deprecated)

WARNING: I RECOMMEND READING THE README INSTEAD. THE VIDEO SHOULD GET RE-DONE AT SOME POINT.

If you are confused, here is a YouTube video showing how to use the Python-API-based TensorRT on Windows. That is the easiest way to get my code running, but I would recommend creating .engine files instead; I wrote instructions for that further down under vs-mlrt (C++ TRT). The difference in speed can be quite big. Look at the benchmarks for further details.

Tutorial

Deduplicated inference

Calculate similarity between frames with HomeOfVapourSynthEvolution/VapourSynth-VMAF.

# requires yuv, convert if it isn't
clip = vs.core.resize.Bicubic(clip, format=vs.YUV420P8, matrix_s="709")
# adding metric to clip property
# 0 = PSNR, 1 = PSNR-HVS, 2 = SSIM, 3 = MS-SSIM, 4 = CIEDE2000
offs1 = core.std.BlankClip(clip, length=1) + clip[:-1]
offs1 = core.std.CopyFrameProps(offs1, clip)
clip = core.vmaf.Metric(clip, offs1, 2)
# convert to rgbs if needed
clip = vs.core.resize.Bicubic(clip, format=vs.RGBS, matrix_in_s="709")

The properties in the clip will then be used to skip similar frames.
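
A minimal sketch of how such a frame property could drive conditional processing, assuming the SSIM score lands in a prop called float_ssim (the name may differ between plugin versions) and that both clips have the same length and format; the actual frame skipping in this repository happens inside its inference wrappers:

import functools
import vapoursynth as vs
core = vs.core

def _pick(n, f, src, processed, thresh):
    # if this frame is nearly identical to the previous one, keep the untouched source frame
    if f.props.get("float_ssim", 0.0) > thresh:
        return src
    return processed

def skip_similar(src, processed, thresh=0.999):
    # src carries the vmaf.Metric props, processed is a same-length filtered clip
    return core.std.FrameEval(
        processed,
        functools.partial(_pick, src=src, processed=processed, thresh=thresh),
        prop_src=src,
    )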

Scene change detection

Scene change detection is implemented in several different ways. To use traditional scene change detection without AI, you can do:

clip = core.misc.SCDetect(clip=clip, threshold=0.100)

The clip property will then be used in frame interpolation inference.
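
To check whether the threshold fits your content, a small sketch like the one below can overlay a marker on frames that SCDetect flagged (it writes the _SceneChangePrev/_SceneChangeNext frame props); how the props are consumed during interpolation is handled by the respective model wrappers:

import functools
import vapoursynth as vs
core = vs.core

# continuing from the snippet above: `clip` already went through misc.SCDetect
def _mark(n, f, base):
    # overlay a label on frames flagged as the first frame of a new scene
    if f.props.get("_SceneChangePrev", 0) == 1:
        return core.text.Text(base, "scene change")
    return base

preview = core.std.FrameEval(clip, functools.partial(_mark, base=clip), prop_src=clip)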

Recently I started experimenting with training my own scene change detection models. I used a dataset with 272,016 images (90,884 triplets) which includes everything from animation to real video (vimeo90k + animeinterp + custom data), so these models should work on any kind of video.

clip = scene_detect(clip, model_name="efficientnetv2_b0", thresh=0.98)

Warning: Keep in mind that different models may require a different thresh to be good.

I think efficientnetv2_b0 is a good balance between speed and results; overall it did quite well. The other included models are not listed in any particular order. They all looked OK, but you would need to test them yourself to form an opinion.

My personal favorites for video interpolation tasks are efficientnetv2_b0, efficientformerv2_s0, maxvit_small and swinv2_small. Even if they overdetect a little, the main point is to avoid bad interpolation frames, so detecting bigger differences and scene changes is what matters. Models have a hard time deciding whether a big difference is a scene change and each handles it in its own way; some trigger more, some less.

Sidenote: "overdetect" is a bit hard to define for animation. There is no objective way of saying which frames are similar in drawn animation compared to real-life video. In fast scenes, fight scenes, zooming scenes or scenes with particle effects covering a lot of the screen, big differences can occur without an actual scene change. What about partial transitions and only partially changing screens? These ratings are based on my opinion.

Model list:

  • efficientnetv2_b0: Good overall
  • efficientnetv2_b0+rife46
  • efficientformerv2_s0: good overall
  • efficientformerv2_s0+rife46
  • maxvit_small: good, but can overdetect at high movement
  • maxvit_small+rife46
  • regnetz_005: good overall
  • repvgg_b0: barely overdetects, but seems to miss a few frames
  • resnetrs50: a bit hit and miss, but does not overdetect
  • resnetv2_50: might miss a bit, needs lower thresh like 0.9
  • rexnet_100: not too much and not too little, not perfect tho
  • swinv2_small: detects more than efficientnetv2_b0, but detects a bit too much at high movement
  • swinv2_small+rife46
  • TimeSformer: it's alright, but might overdetect a little

Models that I trained but seemed to be bad:

  • hornet_tiny_7x7
  • resnet50
  • STAM
  • volo_d1
  • tf_efficientnetv2_xl_in21k
  • resnext50_32x4d
  • nfnet_f0
  • swsl_resnet18
  • poolformer_m36
  • densenet121

Interesting observations:

  • Applying means/stds seemingly worsened results, despite that being standard practice.
  • Applying image augmentation worsened results.
  • Training with a higher batch size made detections a little more stable, but maybe that was placebo and a result of more fine-tuning.

Comparison to traditional methods:

  • wwxd and scxvid suffer from overdetection (at least in drawn animation).
  • The json that master-of-zen/Av1an produces with --sc-only --sc-method standard --scenes test.json returns too few scene changes. Changing the method does not really influence that. Not reliable enough for VFI.
  • I couldn't be bothered to get Breakthrough/PySceneDetect working with VapourSynth via FrameEval, and by default it only works with a video or an image sequence as input. I may try in the future, but I don't understand why I can't just input two images.
  • misc.SCDetect seemed like the best traditional VapourSynth method currently available, but I thought I could try to improve on it. It struggles more with similar colors and tends to skip more changes compared to AI methods.

vs-mlrt (C++ TRT)

You need to convert ONNX models into engines, and you need to do that on the same system where you want to run inference. Download ONNX models from here or from my GitHub page. You can technically use any ONNX model you want, or convert a pth into ONNX with convert_esrgan_to_onnx.py or convert_compact_to_onnx.py (a minimal export sketch is shown at the end of this section). Inside the docker, run one of the following commands:

Good default choice:

trtexec --fp16 --onnx=model.onnx --minShapes=input:1x3x8x8 --optShapes=input:1x3x720x1280 --maxShapes=input:1x3x1080x1920 --saveEngine=model.engine --tacticSources=+CUDNN,-CUBLAS,-CUBLAS_LT --skipInference

With some arguments known to speed things up (assuming enough VRAM for 4-stream inference):

trtexec --fp16 --onnx=model.onnx --minShapes=input:1x3x8x8 --optShapes=input:1x3x720x1280 --maxShapes=input:1x3x1080x1920 --saveEngine=model.engine --tacticSources=+CUDNN,-CUBLAS,-CUBLAS_LT --skipInference --infStreams=4 --builderOptimizationLevel=4

Be aware that DPIR (color) needs 4 channels.

trtexec --fp16 --onnx=dpir_drunet_color.onnx --minShapes=input:1x4x8x8 --optShapes=input:1x4x720x1280 --maxShapes=input:1x4x1080x1920 --saveEngine=model.engine --tacticSources=+CUDNN,-CUBLAS,-CUBLAS_LT --skipInference

Rife needs 8 channels. Setting fasterDynamicShapes0805 since trtexec recommends it.

trtexec --fp16 --onnx=rife.onnx --minShapes=input:1x8x64x64 --optShapes=input:1x8x720x1280 --maxShapes=input:1x8x1080x1920 --saveEngine=model.engine --tacticSources=+CUDNN,-CUBLAS,-CUBLAS_LT --skipInference --preview=+fasterDynamicShapes0805

rvpV2 needs 6 channels, but does not support variable shapes.

trtexec --fp16 --onnx=rvp2.onnx --saveEngine=model.engine --tacticSources=+CUDNN,-CUBLAS,-CUBLAS_LT --skipInference

and put that engine path into inference_config.py. Only use FP16 if your GPU supports it.
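
Inside inference_clip in inference_config.py this roughly looks like the following. A minimal sketch: the engine path and num_streams value are examples, and the colorspace must match the engine (RGBH for FP16, RGBS for FP32, see the warnings below):

# minimal sketch of using an engine inside inference_clip()
clip = core.ffms2.Source(source=video_path, cache=False)
# FP16 engines expect RGBH input, FP32 engines expect RGBS
clip = core.resize.Bicubic(clip, format=vs.RGBH, matrix_in_s="709")
clip = core.trt.Model(
    clip,
    engine_path="/workspace/tensorrt/models/engines/model.engine",
    num_streams=4,  # pair this with the --infStreams value used when building the engine
)
clip = core.resize.Bicubic(clip, format=vs.YUV420P8, matrix_s="709")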

Recommended arguments:

--tacticSources=+CUDNN,-CUBLAS,-CUBLAS_LT 
--infStreams=4 (and then using num_streams=4 in mlrt)
--builderOptimizationLevel=4 (5 can result in a segfault, default is 3)

Arguments that are not recommended and also showed a reduction in speed:

--heuristic
--refit
--maxAuxStreams=4
--preview="+fasterDynamicShapes0805,+profileSharing0806"
--tacticSources=+CUDNN,+CUBLAS,+CUBLAS_LT,+EDGE_MASK_CONVOLUTIONS,+JIT_CONVOLUTIONS (turning all on)

Testing was done on a 4090 with shuffle cugan.

Warnings:

  • If you use an FP16 onnx you need to use the RGBH colorspace; if you use an FP32 onnx you need to use the RGBS colorspace in inference_config.py.
  • Engines are system specific, don't use them across multiple systems.
  • Don't reuse engines for different GPUs.
  • If you run out of memory, then you need to adjust the resolutions in that command. If your video is bigger than what you can input in the command, use tiling.
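
For reference, converting a checkpoint to ONNX (what the convert_*_to_onnx.py scripts mentioned above do for their respective architectures) essentially boils down to a torch.onnx.export call with dynamic spatial axes. A minimal, generic sketch with a placeholder module, not the repo's exact settings; load your own checkpoint instead:

import torch
import torch.nn as nn

# placeholder network standing in for a real ESRGAN/Compact checkpoint
model = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1)).eval()

dummy = torch.rand(1, 3, 64, 64)
torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    opset_version=16,
    input_names=["input"],
    output_names=["output"],
    # dynamic height/width so trtexec can build min/opt/max shape profiles
    dynamic_axes={"input": {2: "height", 3: "width"}, "output": {2: "height", 3: "width"}},
)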

multi-gpu

Thanks to tepete, who figured it out, there is also a way to do inference on multiple GPUs.

# round-robin: each GPU processes every third frame, then the three streams are interleaved back together
stream0 = core.std.SelectEvery(core.trt.Model(clip, engine_path="models/engines/model.engine", num_streams=2, device_id=0), cycle=3, offsets=0)
stream1 = core.std.SelectEvery(core.trt.Model(clip, engine_path="models/engines/model.engine", num_streams=2, device_id=1), cycle=3, offsets=1)
stream2 = core.std.SelectEvery(core.trt.Model(clip, engine_path="models/engines/model.engine", num_streams=2, device_id=2), cycle=3, offsets=2)
clip = core.std.Interleave([stream0, stream1, stream2])

ddfi

To quickly explain what ddfi is: the repository Mr-Z-2697/ddfi-rife deduplicates frames and then interpolates between them. Duplicated frames normally create a stuttering visual effect; to mitigate that, a higher interpolation factor is used on scenes that contain duplicated frames.

Visual examples (comp.mp4) can be found in that repository.

To use it, first edit ddfi.py to select your interpolator of choice and set the desired framerate. The official code uses 8x and I suggest you do the same. Small example:

clip = core.misc.SCDetect(clip=clip, threshold=0.100)
clip = core.rife.RIFE(clip, model=9, sc=True, skip=False, multiplier=8)

clip = core.vfrtocfr.VFRToCFR(
    clip, os.path.join(tmp_dir, "tsv2nX8.txt"), 192000, 1001, True
) # 23.976 * 8

Afterwards, use deduped_vfi.py similarly to how you used main.py. Adjust the paths and the file extension.

VFR

Warning: Using variable framerate (VFR) video input will result in desync errors. To check whether a video is VFR, run

ffmpeg -i video_Name.mp4 -vf vfrdet -f null -

and look at the final line. If the reported VFR value is not zero, the video is variable framerate. Example:

[Parsed_vfrdet_0 @ 0x56518fa3f380] VFR:0.400005 (15185/22777) min: 1801 max: 3604)

To work around this issue, specify fpsnum and fpsden in inference_config.py

clip = core.ffms2.Source(source='input.mkv', fpsnum = 24000, fpsden = 1001, cache=False)

or convert everything to constant framerate with ffmpeg.

ffmpeg -i video_input.mkv -vsync cfr -crf 10 -c:a copy video_out.mkv

or use my vfr_to_cfr.py to process a folder.

mpv

It is also possible to pipe the video directly into mpv, but you most likely won't be able to achieve realtime speed. With a very efficient model it may be possible on a very good GPU. Only tested on Manjaro.

yay -S pulseaudio

# start docker with docker-compose
# same instructions as above, but delete compose.yaml and rename compose_mpv.yaml to compose.yaml 
docker-compose run --rm vsgan_tensorrt

# start docker manually
docker run --rm -i -t \
    --network host \
    -e DISPLAY \
    -v /home/vsgan_path/:/workspace/tensorrt \
    --ipc=host \
    --privileged \
    --gpus all \
    -e PULSE_COOKIE=/run/pulse/cookie \
    -v ~/.config/pulse/cookie:/run/pulse/cookie \
    -e PULSE_SERVER=unix:${XDG_RUNTIME_DIR}/pulse/native \
    -v ${XDG_RUNTIME_DIR}/pulse/native:${XDG_RUNTIME_DIR}/pulse/native \
    vsgan_tensorrt:latest
    
# run mpv
vspipe --y4m inference.py - | mpv -
# with custom audio and subtitles
vspipe --y4m inference.py - | mpv - --audio-file=file.aac --sub-files=file.ass
# to increase the buffer cache, you can use
--demuxer-max-bytes=250MiB

Color transfer (experimental)

A small script for color transfer is available. Currently it can only be used outside of VapourSynth. Since it uses color-matcher as a dependency, you need to install it first. I only tested it on a single image for now, but it may be usable for video sequences.

pip install docutils
git clone https://github.com/hahnec/color-matcher
cd color-matcher
python setup.py install

You can choose between rgb, lab, ycbcr, lum, pdf, sot, hm, reinhard, mvgd, mkl, hm-mvgd-hm and hm-mkl-hm. Specify folders.

python color_transfer.py -s input -t target -o output -algo mkl -threads 8

Benchmarks

Warnings:

  • Keep in mind that these benchmarks can get outdated very fast due to rapid code development and configurations.
  • The default is ffmpeg.
  • ModifyFrame is deprecated. I am trying to use FrameEval everywhere; it is used by default.
  • ncnn received a lot of performance enhancements lately, so results may now be a bit better.
  • The TensorRT docker version and the ONNX opset seem to influence speed, but that wasn't known for quite some time and I have a hard time pinpointing which TensorRT and ONNX opset was used. Take the benchmarks as a rough indicator.
  • Colab may change hardware like CPU at any point.
  • Sometimes it takes a very long time to reach the final speed; some runs may not have been given enough time.
  • 3090¹ (+11900k) benchmarks were most likely affected by a lowered power limit.
  • 3090² (+5950x) system provided by Piotr Rencławowicz for benchmarking purposes.
  • int8 does not automatically mean usable model. It can differ from normal inference quite a lot without adjusting the model.
  • thread_queue_size means -thread_queue_size 2488320.
  • "*" indicates benchmarks which were done with vspipe file.py -p . instead of piping into ffmpeg and rendering to avoid cpu bottleneck.
  • 4090 data fluctuating due to teamviewer cpu load and uses 11900k.
  • 4090² uses 5950x.
  • 4090³ uses 13900k.

ⓘ means that the model is not public yet

Compact (2x) 480p 720p 1080p
rx470 vs+ncnn (np+no tile+tta off) 2.7 1.6 0.6
1070ti vs+ncnn (np+no tile+tta off) 4.2 2 0.9
1070ti (ONNX-TRT+FrameEval) 12 6.1 2.8
1070ti (C++ TRT+FrameEval+num_streams=6) 14 6.7 3
3060ti (ONNX-TRT+FrameEval) ? 7.1 3.2
3060ti (C++ TRT+FrameEval+num_streams=5) ? 15.97 7.83
3060ti VSGAN 2x ? 3.6 1.77
3060ti ncnn (Windows binary) 2x ? 4.2 1.2
3060ti Joey 2x ? 0.87 0.36
3070 (ONNX-TRT+FrameEval) 20 7.55 3.36
3090¹ (ONNX-TRT+FrameEval) ? ? 6.7
3090² (vs+TensorRT8.4+C++ TRT+vs_threads=20+num_streams=20+opset15) 105 47 21
2x3090² (vs+TensorRT8.4+C++ TRT+num_streams=22+opset15) 133 55 23
V100 (Colab) (vs+CUDA) 8.4 3.8 1.6
V100 (Colab) (vs+TensorRT8+ONNX-TRT+FrameEval) 8.3 3.8 1.7
V100 (Colab High RAM) (vs+CUDA+FrameEval) 29 13 6
V100 (Colab High RAM) (vs+TensorRT7+ONNX-TRT+FrameEval) 21 12 5.5
V100 (Colab High RAM) (vs+TensorRT8.2GA+ONNX-TRT+FrameEval) 21 12 5.5
V100 (Colab High RAM) (vs+TensorRT8.4+C++ TRT+num-streams=15) ? ? 6.6
A100 (Colab) (vs+CUDA+FrameEval) 40 19 8.5
A100 (Colab) (vs+TensorRT8.2GA+ONNX-TRT+FrameEval) 44 21 9.5
A100 (Colab) (vs+TensorRT8.2GA+C++ TRT+ffmpeg+FrameEval+num_streams=50) 52.72 24.37 11.84
A100 (Colab) (vs+TensorRT8.2GA) (C++ TRT+x264 (--opencl)+FrameEval+num_streams=50) 57.16 26.25 12.42
A100 (Colab) (vs+onnx+FrameEval) 26 12 4.9
A100 (Colab) (vs+quantized onnx+FrameEval) 26 12 5.7
A100 (Colab) (jpg+CUDA) 28.2 (9 Threads) 28.2 (7 Threads) 9.96 (4 Threads)
4090 (vs+TensorRT8.4GA+opset16+12 vs threads) 135 59 25
4090 (vs+TensorRT8.4GA+opset16+12 vs threads+ffv1) 155 72 35
4090 (vs+TensorRT8.4GA+opset16+12 vs threads+thread_queue_size) 200 91 X
6700xt (vs_threads=4+mlrt ncnn) ? / 7.7* ? / 3.25* ? / 1.45*
Compact (4x) 480p 720p 1080p
1070ti TensorRT8 docker (ONNX-TensorRT+FrameEval) 11 5.6 X
3060ti TensorRT8 docker (ONNX-TensorRT+FrameEval) ? 6.1 2.7
3060ti TensorRT8 docker 2x (C++ TRT+FrameEval+num_streams=5) ? 11 5.24
3060ti VSGAN 4x ? 3 1.3
3060ti ncnn (Windows binary) 4x ? 0.85 0.53
3060ti Joey 4x ? 0.25 0.11
A100 (Colab) (vs+CUDA+FrameEval) 12 5.6 2.9
A100 (Colab) (jpg+CUDA) ? ? 3 (4 Threads)
4090³ (TensorRT8.4GA+10 vs threads+fp16) ? ? / 56* (5 streams) ? / 19.4* (2 streams)
UltraCompact (2x) 480p 720p 1080p
4090²(2) (TensorRT8.4GA+vs_threads=4+num_streams=4+opset16+fp16) ? ? ? / 55.1*
4090²(2) (TensorRT8.4GA+vs_threads=4+num_streams=4+opset16+int8) ? ? ? / 57.7*
6700xt (vs_threads=4+mlrt ncnn) ? / 14.5* ? / 6.1* ? / 2.76*
cugan (2x) 480p 720p 1080p
1070ti (vs+TensorRT8.4+ffmpeg+C++ TRT+num_streams=2+no tiling+opset13) 6 2.7 OOM
V100 (Colab) (vs+CUDA+ffmpeg+FrameEval) 7 3.1 ?
V100 (Colab High RAM) (vs+CUDA+ffmpeg+FrameEval) 21 9.7 4
V100 (Colab High RAM) (vs+TensorRT8.4+ffmpeg+C++ TRT+num_streams=3+no tiling+opset13) 30 14 6
A100 (Colab High RAM) (vs+TensorRT8.4+x264 (--opencl)+C++ TRT+vs threads=8+num_streams=8+no tiling+opset13) 53.8 24.4 10.9
3090² (vs+TensorRT8.4+ffmpeg+C++ TRT+vs_threads=8+num_streams=5+no tiling+opset13) 79 35 15
2x3090² (vs+TensorRT8.4+ffmpeg+C++ TRT+vs_threads=12+num_streams=5+no tiling+opset13) 131 53 23
4090 (vs+TensorRT8.4GA+ffmpeg+C++ TRT+vs_threads=12+num_streams=6+no tiling+opset13) 117 53 24
4090 (vs+TensorRT8.4GA+ffmpeg+C++ TRT+vs_threads=12+num_streams=5+no tiling+opset13+int8) ? ? 17
4090 (vs+TensorRT8.4GA+ffmpeg+C++ TRT+vs_threads=12+num_streams=5+no tiling+opset13+int8+ffv1) 132 61 29
6700xt (vs_threads=4+mlrt ncnn) ? / 3.3* ? / 1.3* OOM (512px tiling ? / 0.39*)
ESRGAN 4x (64mb) (23b+64nf) 480p 720p 1080p
1070ti TensorRT8 docker (Torch-TensorRT+ffmpeg+FrameEval) 0.5 0.2 >0.1
3060ti TensorRT8 docker (Torch-TensorRT+ffmpeg+FrameEval) ? 0.7 0.29
3060ti Cupscale (Pytorch) ? 0.13 0.044
3060ti Cupscale (ncnn) ? 0.1 0.04
3060ti Joey ? 0.095 0.043
V100 (Colab) (Torch-TensorRT8.2GA+ffmpeg+FrameEval) 1.8 0.8 ?
V100 (Colab High VRAM) (C++ TensorRT8.2GA+x264 (--opencl)+FrameEval+no tiling) 2.46 OOM (OpenCL) OOM (OpenCL)
V100 (Colab High VRAM) (C++ TensorRT8.2GA+x264+FrameEval+no tiling) 2.49 1.14 0.47
A100 (Colab) (Torch-TensorRT8.2GA+ffmpeg+FrameEval) 5.6 2.6 1.1
3090² (C++ TRT+vs_threads=20+num_threads=2+no tiling+opset14) 3.4 1.5 0.7
2x3090² (C++ TRT+vs_threads=20+num_threads=2+no tiling+opset14) 7.0 3.2 1.5
ESRGAN 2x (64mb) (23b+64nf) 480p 720p 1080p
4090 (C++ TensorRT8.4GA+ffmpeg+int8+12 vs threads+4 num_streams+fp16) ? / 6.1* ? / ? ? / ?
4090 (C++ TensorRT8.4GA+ffmpeg+int8+12 vs threads+1 num_streams+int8) ? / 17.4* ? / 7.1* ? / 3.1*

Note: The official RealESRGAN repository uses 6b (6 blocks) for the anime model.

RealESRGAN (4x) (6b+64nf) 480p 720p 1080p
3060ti (vs+TensorRT8+ffmpeg+C++ TRT+num_streams=2) ? 1.7 0.75
V100 (Colab High RAM) (vs+TensorRT8.2GA+x264 (--opencl)+C++ TRT+num_streams=1+no tiling) 6.82 3.15 OOM (OpenCL)
V100 (Colab High RAM) (vs+TensorRT8.2GA+x264+C++ TRT+num_streams=1+no tiling) ? ? 1.39
A100 (vs+TensorRT8.2GA+x264 (--opencl)+C++ TRT+num_streams=3+no tiling) 14.65 6.74 2.76
3090² (C++ TRT+vs_threads=20+num_threads=2+no tiling+opset14) 11 4.8 2.3
2x3090² (C++ TRT+vs_threads=10+num_threads=2+no tiling+opset14) 22 9.5 4.2
4090 (C++ TensorRT8.4GA+ffmpeg+12 vs threads+1 num_streams+ffv1+opset16+fp16) 19 / 19* (2 streams) ? ?
4090 (C++ TensorRT8.4GA+ffmpeg+12 vs threads+1 num_streams+ffv1+opset16+int8) 34 (4 streams) / 50* (6 streams) ? / ? ? / 5.7* (1 stream)
4090³ (C++ TensorRT8.5+vs_threads=4+num_streams=1+fp16+(--heuristic) ? ? / 6.9* ? / 3.1*
4090³ (C++ TensorRT8.5+vs_threads=4+num_streams=1+fp16) ? ? / 6.9* ? / 3.1*
RealESRGAN (2x) (6b+64nf) 480p 720p 1080p
1070ti (vs+TensorRT8+ffmpeg+C++ TRT+num_streams=1+no tiling+opset15) 0.9 0.8 0.3
3060ti (vs+TensorRT8+ffmpeg+C++ TRT+num_streams=1) ? 3.12 1.4
V100 (Colab High RAM / 8CPU) (vs+TensorRT8.2GA+x264 (--opencl)+C++ TRT+num_streams=3+no tiling+opset15) 5.09 4.56 2.02
V100 (Colab High RAM / 8CPU) (vs+TensorRT8.2GA+ffmpeg+C++ TRT+num_streams=3+no tiling+opset15) 5.4 4.8 2.2
3090² (C++ TRT+vs_threads=20+num_threads=6+no tiling+opset16) (+dropout) 13 5.8 2.7
2x3090² (C++ TRT+vs_threads=20+num_threads=6+no tiling+opset16) (+dropout) 26 11 5.3
4090 (C++ TRT+TensorRT8.4GA+vs_threads=6+num_threads=6+no tiling+opset16+"--best") (+dropout) ? ? ? / 12*
RealESRGAN (2x) (3b+64nf+dropout)ⓘ 480p 720p 1080p
3060ti (vs+TensorRT8+ffmpeg+C++ TRT+num_streams=2) ? 5.69 2.64
V100 (Colab High RAM / 8CPU) (vs+TensorRT8.4GA+ffmpeg+C++ TRT+num_streams=4+no tiling+opset15) 10 9.4 4.2
3090² (C++ TRT+vs_threads=20+num_threads=6+no tiling+opset15) 24 11 5.2
2x3090 (C++ TRT+vs_threads=20+num_threads=6+no tiling+opset15) 51 23 10

Rife4.6 technically is fastmode=True, since contextnet/unet was removed.

Rife4+vs (fastmode False, ensemble False) 480p 720p 1080p
1070ti (vs+ffmpeg+ModifyFrame) 61 30 15
3060ti (vs+ffmpeg+ModifyFrame) ? 45 24
Rife4+vs (fastmode False, ensemble True) 480p 720p 1080p
1070ti Python (vs+ffmpeg+ModifyFrame) 27 13 9.6
1070ti C++ NCNN ? ? 10
3060ti (vs+ffmpeg+ModifyFrame) ? 36 20
3090² (CUDA+vs_threads=20) 70 52 27
3090² (C++ NCNN+vs_threads=20+ncnn_threads=8) 137 65 31
V100 (Colab) (vs+ffmpeg+ModifyFrame) 30 16 7.3
V100 (Colab High RAM) (vs+x264+ModifyFrame) 48.5 33 19.2
V100 (Colab High RAM) (vs+x264+FrameEval) 48.2 35.5 20.6
V100 (Colab High RAM) (vs+x265+FrameEval) 15.2 9.7 4.6
V100 (Colab High RAM / 8CPU) (vs+x264+C++ NCNN (7 threads)) 70 35 17
A100 (Colab) (vs+CUDA+ffmpeg+ModifyFrame) 54 39 23
A100 (Colab) (jpg+CUDA+ffmpeg+ModifyFrame) ? ? 19.92 (14 Threads)
4090 (vs+CUDA+ffmpeg+FrameEval+12 vs threads) (rife40) 61 61 36
4090 (ncnn+8 threads+12 vs threads) (rife4.0) 254 130 60
Rife4+vs (fastmode True, ensemble False) 480p 720p 1080p
1070ti Python (ffmpeg+ModifyFrame) 62 31 14
1070ti (C++ NCNN) (rife46) ? ? 30
1070ti (TensorRT8.5+num_streams=3) (rife46) ? ? 27
3060ti (CUDA+ffmpeg+ModifyFrame) ? 66 33
3090² (CUDA+ffmpeg+FrameEval+vs_threads=20) 121 80 38
3090² (C++ NCNN+vs_threads=20+ncnn_threads=8) 341 142 63
3090³ (TensorRT8.5+6 vs_threads) ? / 331.9* (9 streams) ? / 275.3* (7 streams) ? / 166.3* (7 streams)
4090 (ncnn+8 threads+12 vs threads) (rife4.0) 470 198 98
4090 (ncnn+8 threads+12 vs threads) (rife4.4) - - 98
4090 (ncnn+8 threads+12 vs threads+ffv1) (rife4.4) - - 129 / 128*
4090 (ncnn+8 threads+12 vs threads) (rife4.6) 455 215 100 / 136*
4090² (ncnn+2 threads+4 vs threads+ffmpeg (ultrafast)) (rife4.6) ? ? 164
4090 (TensorRT8.5+num_streams 8+num_threads=6+stacking method) (rife46) ? ? ? / 146*
4090 (TensorRT8.5+num_streams 8+num_threads=6+int8+ffv1+stacking method) (rife46) ? ? 123 / 156*
4090³ (TensorRT8.5+vs_threads=4+fp16) (rife46) ? ? / 541* (num_streams=14) ? / 288* (num_streams=10)
V100 (Colab) (ffmpeg+ModifyFrame) 34 17 7.6
V100 (Colab High RAM / 8CPU) (vs+x264+FrameEval) 64 43 25
V100 (Colab High RAM / 8CPU) (vs+x264+C++ NCNN (8 threads)) 136 65 29
A100 (Colab) (ffmpeg+ModifyFrame) 92 56 29
A100 (Colab/12CPU) (ncnn+8 threads+12 vs threads) (rife40) 208 103 46
A100 (Colab/12CPU) (ncnn+8 threads+12 vs threads+ffv1) (rife40) 87 97 48
6700xt (vs_threads=4, num_threads=2) ? / 258.5* ? / 122.4* ? / 55.8*
Rife4+vs (fastmode True, ensemble True) 480p 720p 1080p
1070ti (PyTorch+ffmpeg+ModifyFrame) 41 20 9.8
1070ti (C++ NCNN) (rife46) ? ? 16
1070ti (TensorRT8.5+num_streams=2) (rife46) ? ? 14
3060ti (ffmpeg+ModifyFrame) ? 49 24
3090¹ (ffmpeg+ModifyFrame) ? 90.3 45
4090 (vs+CUDA+ffmpeg+FrameEval) (rife46) 84 80 41
4090 (ncnn+8 threads+12 vs threads) (rife4.6) 280 165 76
4090 (ncnn+8 threads+12 vs threads) (rife4.6+ffv1) 222 162 80
4090³ (TensorRT8.5+vs_threads=4+fp16) (rife46) ? 320 / 401.6* (num_streams=14) 160 / 207* (num_streams=10)
A100 (Colab/12CPU) (ncnn+8 threads+12 vs threads) (rife46) 154 86 43
A100 (Colab/12CPU) (ncnn+8 threads+12 vs threads+ffv1) (rife46) 86 86 43
6700xt (vs_threads=4, num_threads=2) ? / 129.7* ? / 60.4* ? / 28*
  • Benchmarks were made with the HolyWu version with threading and partial TensorRT, without setting the tactics JIT_CONVOLUTIONS and EDGE_MASK_CONVOLUTIONS due to the performance penalty. I added a modified version as a plugin to VSGAN, but I still need to add those enhancements to my own repo.
GMFSS_union 480p 720p 1080p
4090 (num_threads=8, num_streams=3, RGBH, TRT8.6, matmul_precision=medium) ? ? / 44.6* ? / 15.5*
GMFSS_fortuna_union 480p 720p 1080p
4090 (num_threads=8, num_streams=3, RGBH, TRT8.6, matmul_precision=medium) ? ? / 47.7* ? / 16.5*
4090 (num_threads=8, num_streams=3, RGBH, TRT8.6, matmul_precision=medium, @torch.compile(mode="default", fullgraph=True)) ? ? / 48.4* ? / 16.7*
EGVSR (4x, interval=5) 480p 720p 1080p
1070ti 4.4 Ram OOM / 2.2* VRAM OOM
RealBasicVSR 480p 720p 1080p
1070ti 0.3 OOM OOM
A100 (Colab) 1.2 ? ?
Sepconv 480p 720p 1080p
V100 (Colab) 22 11 4.9
3090² (vs+CUDA) 30 14 6.2
CAIN (2 groups) 480p 720p 1080p
A100 (Colab) 76 47 25
3090² (vs+CUDA) 120 65 31
FILM 480p 720p 1080p
V100 (Colab High RAM) (vs+CUDA) 9.8 4.7 2.1
IFRNet (small model) 480p 720p 1080p
V100 (Colab High RAM / 8CPU) (vs+x264+FrameEval) 78 47 23
IFRNet (large model) 480p 720p 1080p
V100 (Colab High RAM / 8CPU) (vs+x264+FrameEval) ? ? 15
DPIR 480p 720p 1080p
3090¹ (TensorRT8+C++ TRT+ffmpeg+vs threads=7+num_streams=5) ? ? 16
4090 (num_streams=13+12 vs threads) 121 52 23
4090 (num_streams=13+12 vs threads+thread_queue_size) 121 54 23
4090 (num_streams=13+12 vs threads+ffv1+thread_queue_size) 121 55 25
4090 (num_streams=13+12 vs threads+ffv1+int8) ? ? 52
4090 (num_streams=13+12 vs threads+ffv1+int8+thread_queue_size) ? ? 44
SCUNet 480p 720p 1080p
4090 (12 vs threads) 10 ? ?
ST-MFNet 480p 720p 1080p
1070ti 1.6 OOM OOM

License

This project uses code from other repositories, but the code I wrote myself is licensed under BSD-3-Clause.