VSGAN-tensorrt-docker

Repository for running super resolution and video frame interpolation models and speeding them up with TensorRT. It aims to contain the fastest inference code you can find, or at least I am trying to achieve that. Not every model can use TensorRT for various reasons, but I add support whenever it works. Further model architectures are planned to be added later on.

I also created a Google Colab notebook.

Currently working networks:

Model | CUDA | TensorRT | ncnn
ESRGAN | - | yes (torch_tensorrt / C++ TRT) | yes, but compile yourself (realsr ncnn models)
SRVGGNetCompact | - | yes (onnx_tensorrt / C++ TRT) | yes, but compile yourself (2x)
Rife | yes (rife40, rife41) | yes (v2, v3) | yes
SwinIR | yes | - | -
Sepconv | yes | - | -
EGVSR | yes | - | -
BasicVSR++ | yes | - | -
Waifu2x | - | yes (C++ TRT) | -
RealBasicVSR | yes | - | -
RealCUGAN | yes | yes (C++ TRT) | -
FILM | yes | - | -
DPIR | - | yes (C++ TRT) | -
PAN | yes | - | -
IFRNet | yes | - | -
M2M | yes | - | -
IFUNet | yes | - | -
eisai | yes | - | -
SCUNet | yes | - | -
GMFupSS | yes | - | -
ST-MFNet | yes | - | -
VapSR | - | yes (C++ TRT) | -
GMFSS_union | yes (vanilla / wgan) | - | -
GMFSS_Fortuna / GMFSS_Fortuna_union | base / union | - | -

Some important things:

  • ncnn does not work in WSL, which means it currently does not work on Windows. ncnn only works if you use Docker on Linux.
  • If you are on Windows, install all the latest updates first, otherwise WSL won't work properly. Windows 21H2 is the minimum.
  • Avoid webm input; webm is often broken. It can work, but don't complain about broken output afterwards. I suggest re-encoding webm to mp4 or mkv first.
  • Only use ffmpeg to determine whether a video is variable framerate (vfr); other programs do not seem reliable.
  • Processing vfr video is dangerous, but you can try fpsnum and fpsden. Either use these parameters or re-encode the input video to constant framerate (cfr).
  • x264 can be faster than ffmpeg.
  • The C++ VS rife extension can be faster than CUDA.
  • Colab instances have a weak CPU, so you should try x264 with --opencl. (The A100 does not support NVENC and similar.)

Usage

PSA FOR WINDOWS USERS: Docker Desktop 4.17.1 is broken. Download either 4.16.3 or 4.17.0. Both worked on my Windows 10. I would recommend using 4.16.3, since another person confirmed it to work on Windows 11. 4.17.1 (which is currently the latest) results in Docker not starting, which is mentioned in this issue.

# install docker, command for arch
yay -S docker nvidia-docker nvidia-container-toolkit docker-compose docker-buildx

# Download the prebuilt image from Docker Hub (recommended)
docker pull styler00dollar/vsgan_tensorrt:latest

# Build the docker image manually
# This step is not needed if you already downloaded the image and is only needed if you
# want to build it from scratch. Keep in mind that you need to set env variables differently on Windows and
# this command will only work on Linux. Run it inside the repository directory.
DOCKER_BUILDKIT=1 docker build -t styler00dollar/vsgan_tensorrt:latest .
# If you want to rebuild from scratch or have errors, try to build without cache
DOCKER_BUILDKIT=1 docker build --no-cache -t styler00dollar/vsgan_tensorrt:latest . 

# run the docker with docker-compose
# go into the vsgan folder, inside that folder should be compose.yaml, run this command
# you can adjust folder mounts in the yaml file
# afterwards the vsgan folder will be mounted under `/workspace/tensorrt` and you can navigate 
# into it with `cd tensorrt`
docker-compose run --rm vsgan_tensorrt

# if you have `unauthorized: authentication required` problems, download the docker with 
git clone https://github.com/NotGlop/docker-drag
cd docker-drag
python docker_pull.py styler00dollar/vsgan_tensorrt:latest
docker load -i styler00dollar_vsgan_tensorrt.tar

# run docker with the sh startup script (linux)
sh start_docker.sh

# run docker manually
# the folderpath before ":" will be mounted in the path which follows afterwards
# contents of the vsgan folder should appear inside /workspace/tensorrt
docker run --privileged --gpus all -it --rm -v /home/vsgan_path/:/workspace/tensorrt styler00dollar/vsgan_tensorrt:latest

# you can use it in various ways, ffmpeg example
vspipe -c y4m inference.py - | ffmpeg -i pipe: example.mkv -y
# nvencc example
vspipe -c y4m inference.py - | nvencc -i pipe: --codec av1 -o example.mkv
# x264 example
vspipe -c y4m inference.py - | x264 - --demuxer y4m -o example.mkv -y
# x265 example
vspipe -c y4m inference.py - | x265 - --y4m -o example.mkv -y

# example without vspipe
ffmpeg -f vapoursynth -i inference.py example.mkv -y

# Models are kept outside of the docker image to minimize download size and are downloaded on demand when you run the code.
# If you want specific models you can look in https://github.com/styler00dollar/VSGAN-tensorrt-docker/releases/tag/models 
# or use the download scripts to get all of them. Models are expected to be placed under models/

If docker does not want to start, try this before you use docker:

# fixing docker errors
sudo systemctl start docker
sudo chmod 666 /var/run/docker.sock

Windows is mostly similar, but the path needs to be changed slightly:

Example for C:\path
docker run --privileged --gpus all -it --rm -v /mnt/c/path:/workspace/tensorrt vsgan_tensorrt:latest
docker run --privileged --gpus all -it --rm -v //c/path:/workspace/tensorrt vsgan_tensorrt:latest

Usage example

A small, minimalistic example of how to configure inference. If you only want to process one video, edit the video path in inference.py

video_path = "test.mkv"

and then edit inference_config.py. Small example:

import sys
sys.path.append("/workspace/tensorrt/")
import vapoursynth as vs
core = vs.core
vs_api_below4 = vs.__api_version__.api_major < 4
core.num_threads = 4
core.std.LoadPlugin(path="/usr/lib/x86_64-linux-gnu/libffms2.so")

from src.rife import RIFE
from src.vfi_inference import vfi_inference

def inference_clip(video_path):
    clip = core.ffms2.Source(source=video_path, cache=False)
    clip = vs.core.resize.Bicubic(clip, format=vs.RGBS, matrix_in_s="709")
    # apply one or multiple models, will be applied in order
    model_inference = RIFE(scale=1, fastmode=False, ensemble=True, model_version="rife46", fp16=True)
    clip = vfi_inference(model_inference=model_inference, clip=clip, multi=2)
    # return clip
    clip = vs.core.resize.Bicubic(clip, format=vs.YUV420P8, matrix_s="709")
    return clip

Then use the commands above to render. For example:

vspipe -c y4m inference.py - | ffmpeg -i pipe: example.mkv

The video will be rendered without audio and other attachments. You can add those manually to the ffmpeg command.

To process videos in batch and copy their properties like audio and subtitles to the output file, you need to use main.py. Edit the file paths and the file extension:

input_dir = "/workspace/tensorrt/input/"
output_dir = "/workspace/tensorrt/output/"
files = glob.glob(input_dir + "/**/*.webm", recursive=True)

and configure inference_config.py as desired. Afterwards just run

python main.py

Video guide (deprecated)

WARNING: I RECOMMEND READING THE README INSTEAD. THE VIDEO SHOULD GET RE-DONE AT SOME POINT.

If you are confused, here is a YouTube video showing how to use the Python-API-based TensorRT on Windows. That is the easiest way to get my code running, but I would recommend creating .engine files instead; I wrote instructions for that further down under vs-mlrt (C++ TRT). The difference in speed can be quite big. Look at the benchmarks for further details.

Tutorial

Deduplicated inference

Calculate similarity between frames with HomeOfVapourSynthEvolution/VapourSynth-VMAF.

# requires yuv, convert if it isn't
clip = vs.core.resize.Bicubic(clip, format=vs.YUV420P8, matrix_s="709")
# adding metric to clip property
# 0 = PSNR, 1 = PSNR-HVS, 2 = SSIM, 3 = MS-SSIM, 4 = CIEDE2000
offs1 = core.std.BlankClip(clip, length=1) + clip[:-1]
offs1 = core.std.CopyFrameProps(offs1, clip)
clip = core.vmaf.Metric(clip, offs1, 2)
# convert to rgbs if needed
clip = vs.core.resize.Bicubic(clip, format=vs.RGBS, matrix_in_s="709")

The properties in the clip will then be used to skip similar frames.
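
A minimal sketch of how such a frame property could drive conditional processing, assuming the SSIM score lands in a prop called float_ssim (the name may differ between plugin versions) and that both clips have the same length and format; the actual frame skipping in this repository happens inside its inference wrappers:

import functools
import vapoursynth as vs
core = vs.core

def _pick(n, f, src, processed, thresh):
    # if this frame is nearly identical to the previous one, keep the untouched source frame
    if f.props.get("float_ssim", 0.0) > thresh:
        return src
    return processed

def skip_similar(src, processed, thresh=0.999):
    # src carries the vmaf.Metric props, processed is a same-length filtered clip
    return core.std.FrameEval(
        processed,
        functools.partial(_pick, src=src, processed=processed, thresh=thresh),
        prop_src=src,
    )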

Scene change detection

Scene change detection is implemented in several different ways. To use traditional scene change detection without AI, you can do:

clip = core.misc.SCDetect(clip=clip, threshold=0.100)

The clip property will then be used in frame interpolation inference.
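
To check whether the threshold fits your content, a small sketch like the one below can overlay a marker on frames that SCDetect flagged (it writes the _SceneChangePrev/_SceneChangeNext frame props); how the props are consumed during interpolation is handled by the respective model wrappers:

import functools
import vapoursynth as vs
core = vs.core

# continuing from the snippet above: `clip` already went through misc.SCDetect
def _mark(n, f, base):
    # overlay a label on frames flagged as the first frame of a new scene
    if f.props.get("_SceneChangePrev", 0) == 1:
        return core.text.Text(base, "scene change")
    return base

preview = core.std.FrameEval(clip, functools.partial(_mark, base=clip), prop_src=clip)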

Recently I started experimenting with training my own scene change detection models. I used a dataset with 272,016 images (90,884 triplets) which includes everything from animation to real video (vimeo90k + animeinterp + custom data), so these models should work on any kind of video.

clip = scene_detect(clip, model_name="efficientnetv2_b0", thresh=0.98)

Warning: Keep in mind that different models may require a different thresh to be good.

I think efficientnetv2_b0 is a good balance between speed and results; overall it did quite well. The other included models are not listed in any particular order. They all looked OK, but you would need to test them yourself to form an opinion.

My personal favorites for video interpolation tasks are efficientnetv2_b0, efficientformerv2_s0, maxvit_small and swinv2_small. Even if they overdetect a little, the main point is to avoid bad interpolation frames, so detecting bigger differences and scene changes is what matters. Models have a hard time deciding whether a big difference is a scene change and each handles it in its own way; some trigger more, some less.

Sidenote: "overdetect" is a bit hard to define for animation. There is no objective way of saying which frames are similar in drawn animation compared to real-life video. In fast scenes, fight scenes, zooming scenes or scenes with particle effects covering a lot of the screen, big differences can occur without an actual scene change. What about partial transitions and only partially changing screens? These ratings are based on my opinion.

Model list:

  • efficientnetv2_b0: Good overall
  • efficientnetv2_b0+rife46
  • efficientformerv2_s0: good overall
  • efficientformerv2_s0+rife46
  • maxvit_small: good, but can overdetect at high movement
  • maxvit_small+rife46
  • regnetz_005: good overall
  • repvgg_b0: barely overdetects, but seems to miss a few frames
  • resnetrs50: a bit hit and miss, but does not overdetect
  • resnetv2_50: might miss a bit, needs lower thresh like 0.9
  • rexnet_100: not too much and not too little, not perfect tho
  • swinv2_small: detects more than efficientnetv2_b0, but detects a bit too much at high movement
  • swinv2_small+rife46
  • TimeSformer: it's alright, but might overdetect a little

Models that I trained but seemed to be bad:

  • hornet_tiny_7x7
  • resnet50
  • STAM
  • volo_d1
  • tf_efficientnetv2_xl_in21k
  • resnext50_32x4d
  • nfnet_f0
  • swsl_resnet18
  • poolformer_m36
  • densenet121

Interesting observations:

  • Applying means/stds seemingly worsened results, despite that being standard practice.
  • Applying image augmentation worsened results.
  • Training with a higher batch size made detections a little more stable, but maybe that was placebo and a result of more fine-tuning.

Comparison to traditional methods:

  • wwxd and scxvid suffer from overdetection (at least in drawn animation).
  • The json that master-of-zen/Av1an produces with --sc-only --sc-method standard --scenes test.json returns too few scene changes. Changing the method does not really influence that. Not reliable enough for VFI.
  • I couldn't be bothered to get Breakthrough/PySceneDetect working with VapourSynth via FrameEval, and by default it only works with a video or an image sequence as input. I may try in the future, but I don't understand why I can't just input two images.
  • misc.SCDetect seemed like the best traditional VapourSynth method currently available, but I thought I could try to improve on it. It struggles more with similar colors and tends to skip more changes compared to AI methods.

vs-mlrt (C++ TRT)

You need to convert ONNX models into engines, and you need to do that on the same system where you want to run inference. Download ONNX models from here or from my GitHub page. You can technically use any ONNX model you want, or convert a pth into ONNX with convert_esrgan_to_onnx.py or convert_compact_to_onnx.py (a minimal export sketch is shown at the end of this section). Inside the docker, run one of the following commands:

Good default choice:

trtexec --fp16 --onnx=model.onnx --minShapes=input:1x3x8x8 --optShapes=input:1x3x720x1280 --maxShapes=input:1x3x1080x1920 --saveEngine=model.engine --tacticSources=+CUDNN,-CUBLAS,-CUBLAS_LT --skipInference

With some arguments known to speed things up (assuming enough VRAM for 4-stream inference):

trtexec --fp16 --onnx=model.onnx --minShapes=input:1x3x8x8 --optShapes=input:1x3x720x1280 --maxShapes=input:1x3x1080x1920 --saveEngine=model.engine --tacticSources=+CUDNN,-CUBLAS,-CUBLAS_LT --skipInference --infStreams=4 --builderOptimizationLevel=4

Be aware that DPIR (color) needs 4 channels.

trtexec --fp16 --onnx=dpir_drunet_color.onnx --minShapes=input:1x4x8x8 --optShapes=input:1x4x720x1280 --maxShapes=input:1x4x1080x1920 --saveEngine=model.engine --tacticSources=+CUDNN,-CUBLAS,-CUBLAS_LT --skipInference

Rife needs 8 channels. Setting fasterDynamicShapes0805 since trtexec recommends it.

trtexec --fp16 --onnx=rife.onnx --minShapes=input:1x8x64x64 --optShapes=input:1x8x720x1280 --maxShapes=input:1x8x1080x1920 --saveEngine=model.engine --tacticSources=+CUDNN,-CUBLAS,-CUBLAS_LT --skipInference --preview=+fasterDynamicShapes0805

rvpV2 needs 6 channels, but does not support variable shapes.

trtexec --fp16 --onnx=rvp2.onnx --saveEngine=model.engine --tacticSources=+CUDNN,-CUBLAS,-CUBLAS_LT --skipInference

and put that engine path into inference_config.py. Only use FP16 if your GPU supports it.
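
Inside inference_clip in inference_config.py this roughly looks like the following. A minimal sketch: the engine path and num_streams value are examples, and the colorspace must match the engine (RGBH for FP16, RGBS for FP32, see the warnings below):

# minimal sketch of using an engine inside inference_clip()
clip = core.ffms2.Source(source=video_path, cache=False)
# FP16 engines expect RGBH input, FP32 engines expect RGBS
clip = core.resize.Bicubic(clip, format=vs.RGBH, matrix_in_s="709")
clip = core.trt.Model(
    clip,
    engine_path="/workspace/tensorrt/models/engines/model.engine",
    num_streams=4,  # pair this with the --infStreams value used when building the engine
)
clip = core.resize.Bicubic(clip, format=vs.YUV420P8, matrix_s="709")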

Recommended arguments:

--tacticSources=+CUDNN,-CUBLAS,-CUBLAS_LT 
--infStreams=4 (and then using num_streams=4 in mlrt)
--builderOptimizationLevel=4 (5 can result in a segfault, default is 3)

Arguments that are not recommended and also showed a reduction in speed:

--heuristic
--refit
--maxAuxStreams=4
--preview="+fasterDynamicShapes0805,+profileSharing0806"
--tacticSources=+CUDNN,+CUBLAS,+CUBLAS_LT,+EDGE_MASK_CONVOLUTIONS,+JIT_CONVOLUTIONS (turning all on)

Testing was done on a 4090 with shuffle cugan.

Warnings:

  • If you use an FP16 onnx you need to use the RGBH colorspace; if you use an FP32 onnx you need to use the RGBS colorspace in inference_config.py.
  • Engines are system specific, don't use them across multiple systems.
  • Don't reuse engines for different GPUs.
  • If you run out of memory, then you need to adjust the resolutions in that command. If your video is bigger than what you can input in the command, use tiling.
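
For reference, converting a checkpoint to ONNX (what the convert_*_to_onnx.py scripts mentioned above do for their respective architectures) essentially boils down to a torch.onnx.export call with dynamic spatial axes. A minimal, generic sketch with a placeholder module, not the repo's exact settings; load your own checkpoint instead:

import torch
import torch.nn as nn

# placeholder network standing in for a real ESRGAN/Compact checkpoint
model = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1)).eval()

dummy = torch.rand(1, 3, 64, 64)
torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    opset_version=16,
    input_names=["input"],
    output_names=["output"],
    # dynamic height/width so trtexec can build min/opt/max shape profiles
    dynamic_axes={"input": {2: "height", 3: "width"}, "output": {2: "height", 3: "width"}},
)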

multi-gpu

Thanks to tepete, who figured it out, there is also a way to do inference on multiple GPUs.

# round-robin: each GPU processes every third frame, then the three streams are interleaved back together
stream0 = core.std.SelectEvery(core.trt.Model(clip, engine_path="models/engines/model.engine", num_streams=2, device_id=0), cycle=3, offsets=0)
stream1 = core.std.SelectEvery(core.trt.Model(clip, engine_path="models/engines/model.engine", num_streams=2, device_id=1), cycle=3, offsets=1)
stream2 = core.std.SelectEvery(core.trt.Model(clip, engine_path="models/engines/model.engine", num_streams=2, device_id=2), cycle=3, offsets=2)
clip = core.std.Interleave([stream0, stream1, stream2])

ddfi

To quickly explain what ddfi is: the repository Mr-Z-2697/ddfi-rife deduplicates frames and then interpolates between them. Duplicated frames normally create a stuttering visual effect; to mitigate that, a higher interpolation factor is used on scenes that contain duplicated frames.

Visual examples (comp.mp4) can be found in that repository.

To use it, first edit ddfi.py to select your interpolator of choice and set the desired framerate. The official code uses 8x and I suggest you do the same. Small example:

clip = core.misc.SCDetect(clip=clip, threshold=0.100)
clip = core.rife.RIFE(clip, model=9, sc=True, skip=False, multiplier=8)

clip = core.vfrtocfr.VFRToCFR(
    clip, os.path.join(tmp_dir, "tsv2nX8.txt"), 192000, 1001, True
) # 23.976 * 8

Afterwards, use deduped_vfi.py similarly to how you used main.py. Adjust the paths and the file extension.

VFR

Warning: Using variable framerate (VFR) video input will result in desync errors. To check whether a video is VFR, run

ffmpeg -i video_Name.mp4 -vf vfrdet -f null -

and look at the final line. If the reported VFR value is not zero, the video is variable framerate. Example:

[Parsed_vfrdet_0 @ 0x56518fa3f380] VFR:0.400005 (15185/22777) min: 1801 max: 3604)

To work around this issue, specify fpsnum and fpsden in inference_config.py

clip = core.ffms2.Source(source='input.mkv', fpsnum = 24000, fpsden = 1001, cache=False)

or convert everything to constant framerate with ffmpeg.

ffmpeg -i video_input.mkv -vsync cfr -crf 10 -c:a copy video_out.mkv

or use my vfr_to_cfr.py to process a folder.

mpv

It is also possible to pipe the video directly into mpv, but you most likely won't be able to achieve realtime speed. With a very efficient model it may be possible on a very good GPU. Only tested on Manjaro.

yay -S pulseaudio

# start docker with docker-compose
# same instructions as above, but delete compose.yaml and rename compose_mpv.yaml to compose.yaml 
docker-compose run --rm vsgan_tensorrt

# start docker manually
docker run --rm -i -t \
    --network host \
    -e DISPLAY \
    -v /home/vsgan_path/:/workspace/tensorrt \
    --ipc=host \
    --privileged \
    --gpus all \
    -e PULSE_COOKIE=/run/pulse/cookie \
    -v ~/.config/pulse/cookie:/run/pulse/cookie \
    -e PULSE_SERVER=unix:${XDG_RUNTIME_DIR}/pulse/native \
    -v ${XDG_RUNTIME_DIR}/pulse/native:${XDG_RUNTIME_DIR}/pulse/native \
    vsgan_tensorrt:latest
    
# run mpv
vspipe --y4m inference.py - | mpv -
# with custom audio and subtitles
vspipe --y4m inference.py - | mpv - --audio-file=file.aac --sub-files=file.ass
# to increase the buffer cache, you can use
--demuxer-max-bytes=250MiB

Color transfer (experimental)

A small script for color transfer is available. Currently it can only be used outside of VapourSynth. Since it uses color-matcher as a dependency, you need to install it first. I only tested it on a single image for now, but it may be usable for video sequences.

pip install docutils
git clone https://github.com/hahnec/color-matcher
cd color-matcher
python setup.py install

You can choose between rgb, lab, ycbcr, lum, pdf, sot, hm, reinhard, mvgd, mkl, hm-mvgd-hm and hm-mkl-hm. Specify folders.

python color_transfer.py -s input -t target -o output -algo mkl -threads 8

Benchmarks

Warnings:

  • Keep in mind that these benchmarks can get outdated very fast due to rapid code development and configurations.
  • The default is ffmpeg.
  • ModifyFrame is deprecated. I am trying to use FrameEval everywhere; it is used by default.
  • ncnn received a lot of performance enhancements lately, so results may now be a bit better.
  • The TensorRT docker version and the ONNX opset seem to influence speed, but that wasn't known for quite some time and I have a hard time pinpointing which TensorRT and ONNX opset was used. Take the benchmarks as a rough indicator.
  • Colab may change hardware like CPU at any point.
  • Sometimes it takes a very long time to reach the final speed; some runs may not have been given enough time.
  • 3090¹ (+11900k) benchmarks were most likely affected by a lowered power limit.
  • 3090² (+5950x) system provided by Piotr Rencławowicz for benchmarking purposes.
  • int8 does not automatically mean usable model. It can differ from normal inference quite a lot without adjusting the model.
  • thread_queue_size means -thread_queue_size 2488320.
  • "*" indicates benchmarks which were done with vspipe file.py -p . instead of piping into ffmpeg and rendering to avoid cpu bottleneck.
  • 4090 data fluctuating due to teamviewer cpu load and uses 11900k.
  • 4090² uses 5950x.
  • 4090³ uses 13900k.

ⓘ means that the model is not public yet

Compact (2x) 480p 720p 1080p
rx470 vs+ncnn (np+no tile+tta off) 2.7 1.6 0.6
1070ti vs+ncnn (np+no tile+tta off) 4.2 2 0.9
1070ti (ONNX-TRT+FrameEval) 12 6.1 2.8
1070ti (C++ TRT+FrameEval+num_streams=6) 14 6.7 3
3060ti (ONNX-TRT+FrameEval) ? 7.1 3.2
3060ti (C++ TRT+FrameEval+num_streams=5) ? 15.97 7.83
3060ti VSGAN 2x ? 3.6 1.77
3060ti ncnn (Windows binary) 2x ? 4.2 1.2
3060ti Joey 2x ? 0.87 0.36
3070 (ONNX-TRT+FrameEval) 20 7.55 3.36
3090¹ (ONNX-TRT+FrameEval) ? ? 6.7
3090² (vs+TensorRT8.4+C++ TRT+vs_threads=20+num_streams=20+opset15) 105 47 21
2x3090² (vs+TensorRT8.4+C++ TRT+num_streams=22+opset15) 133 55 23
V100 (Colab) (vs+CUDA) 8.4 3.8 1.6
V100 (Colab) (vs+TensorRT8+ONNX-TRT+FrameEval) 8.3 3.8 1.7
V100 (Colab High RAM) (vs+CUDA+FrameEval) 29 13 6
V100 (Colab High RAM) (vs+TensorRT7+ONNX-TRT+FrameEval) 21 12 5.5
V100 (Colab High RAM) (vs+TensorRT8.2GA+ONNX-TRT+FrameEval) 21 12 5.5
V100 (Colab High RAM) (vs+TensorRT8.4+C++ TRT+num-streams=15) ? ? 6.6
A100 (Colab) (vs+CUDA+FrameEval) 40 19 8.5
A100 (Colab) (vs+TensorRT8.2GA+ONNX-TRT+FrameEval) 44 21 9.5
A100 (Colab) (vs+TensorRT8.2GA+C++ TRT+ffmpeg+FrameEval+num_streams=50) 52.72 24.37 11.84
A100 (Colab) (vs+TensorRT8.2GA) (C++ TRT+x264 (--opencl)+FrameEval+num_streams=50) 57.16 26.25 12.42
A100 (Colab) (vs+onnx+FrameEval) 26 12 4.9
A100 (Colab) (vs+quantized onnx+FrameEval) 26 12 5.7
A100 (Colab) (jpg+CUDA) 28.2 (9 Threads) 28.2 (7 Threads) 9.96 (4 Threads)
4090 (vs+TensorRT8.4GA+opset16+12 vs threads) 135 59 25
4090 (vs+TensorRT8.4GA+opset16+12 vs threads+ffv1) 155 72 35
4090 (vs+TensorRT8.4GA+opset16+12 vs threads+thread_queue_size) 200 91 X
6700xt (vs_threads=4+mlrt ncnn) ? / 7.7* ? / 3.25* ? / 1.45*
Compact (4x) 480p 720p 1080p
1070ti TensorRT8 docker (ONNX-TensorRT+FrameEval) 11 5.6 X
3060ti TensorRT8 docker (ONNX-TensorRT+FrameEval) ? 6.1 2.7
3060ti TensorRT8 docker 2x (C++ TRT+FrameEval+num_streams=5) ? 11 5.24
3060ti VSGAN 4x ? 3 1.3
3060ti ncnn (Windows binary) 4x ? 0.85 0.53
3060ti Joey 4x ? 0.25 0.11
A100 (Colab) (vs+CUDA+FrameEval) 12 5.6 2.9
A100 (Colab) (jpg+CUDA) ? ? 3 (4 Threads)
4090³ (TensorRT8.4GA+10 vs threads+fp16) ? ? / 56* (5 streams) ? / 19.4* (2 streams)
UltraCompact (2x) 480p 720p 1080p
4090²(2) (TensorRT8.4GA+vs_threads=4+num_streams=4+opset16+fp16) ? ? ? / 55.1*
4090²(2) (TensorRT8.4GA+vs_threads=4+num_streams=4+opset16+int8) ? ? ? / 57.7*
6700xt (vs_threads=4+mlrt ncnn) ? / 14.5* ? / 6.1* ? / 2.76*
cugan (2x) 480p 720p 1080p
1070ti (vs+TensorRT8.4+ffmpeg+C++ TRT+num_streams=2+no tiling+opset13) 6 2.7 OOM
V100 (Colab) (vs+CUDA+ffmpeg+FrameEval) 7 3.1 ?
V100 (Colab High RAM) (vs+CUDA+ffmpeg+FrameEval) 21 9.7 4
V100 (Colab High RAM) (vs+TensorRT8.4+ffmpeg+C++ TRT+num_streams=3+no tiling+opset13) 30 14 6
A100 (Colab High RAM) (vs+TensorRT8.4+x264 (--opencl)+C++ TRT+vs threads=8+num_streams=8+no tiling+opset13) 53.8 24.4 10.9
3090² (vs+TensorRT8.4+ffmpeg+C++ TRT+vs_threads=8+num_streams=5+no tiling+opset13) 79 35 15
2x3090² (vs+TensorRT8.4+ffmpeg+C++ TRT+vs_threads=12+num_streams=5+no tiling+opset13) 131 53 23
4090 (vs+TensorRT8.4GA+ffmpeg+C++ TRT+vs_threads=12+num_streams=6+no tiling+opset13) 117 53 24
4090 (vs+TensorRT8.4GA+ffmpeg+C++ TRT+vs_threads=12+num_streams=5+no tiling+opset13+int8) ? ? 17
4090 (vs+TensorRT8.4GA+ffmpeg+C++ TRT+vs_threads=12+num_streams=5+no tiling+opset13+int8+ffv1) 132 61 29
6700xt (vs_threads=4+mlrt ncnn) ? / 3.3* ? / 1.3* OOM (512px tiling ? / 0.39*)
ESRGAN 4x (64mb) (23b+64nf) 480p 720p 1080p
1070ti TensorRT8 docker (Torch-TensorRT+ffmpeg+FrameEval) 0.5 0.2 >0.1
3060ti TensorRT8 docker (Torch-TensorRT+ffmpeg+FrameEval) ? 0.7 0.29
3060ti Cupscale (Pytorch) ? 0.13 0.044
3060ti Cupscale (ncnn) ? 0.1 0.04
3060ti Joey ? 0.095 0.043
V100 (Colab) (Torch-TensorRT8.2GA+ffmpeg+FrameEval) 1.8 0.8 ?
V100 (Colab High VRAM) (C++ TensorRT8.2GA+x264 (--opencl)+FrameEval+no tiling) 2.46 OOM (OpenCL) OOM (OpenCL)
V100 (Colab High VRAM) (C++ TensorRT8.2GA+x264+FrameEval+no tiling) 2.49 1.14 0.47
A100 (Colab) (Torch-TensorRT8.2GA+ffmpeg+FrameEval) 5.6 2.6 1.1
3090² (C++ TRT+vs_threads=20+num_threads=2+no tiling+opset14) 3.4 1.5 0.7
2x3090² (C++ TRT+vs_threads=20+num_threads=2+no tiling+opset14) 7.0 3.2 1.5
ESRGAN 2x (64mb) (23b+64nf) 480p 720p 1080p
4090 (C++ TensorRT8.4GA+ffmpeg+int8+12 vs threads+4 num_streams+fp16) ? / 6.1* ? / ? ? / ?
4090 (C++ TensorRT8.4GA+ffmpeg+int8+12 vs threads+1 num_streams+int8) ? / 17.4* ? / 7.1* ? / 3.1*

Note: The official RealESRGAN repository uses 6b (6 blocks) for the anime model.

RealESRGAN (4x) (6b+64nf) 480p 720p 1080p
3060ti (vs+TensorRT8+ffmpeg+C++ TRT+num_streams=2) ? 1.7 0.75
V100 (Colab High RAM) (vs+TensorRT8.2GA+x264 (--opencl)+C++ TRT+num_streams=1+no tiling) 6.82 3.15 OOM (OpenCL)
V100 (Colab High RAM) (vs+TensorRT8.2GA+x264+C++ TRT+num_streams=1+no tiling) ? ? 1.39
A100 (vs+TensorRT8.2GA+x264 (--opencl)+C++ TRT+num_streams=3+no tiling) 14.65 6.74 2.76
3090² (C++ TRT+vs_threads=20+num_threads=2+no tiling+opset14) 11 4.8 2.3
2x3090² (C++ TRT+vs_threads=10+num_threads=2+no tiling+opset14) 22 9.5 4.2
4090 (C++ TensorRT8.4GA+ffmpeg+12 vs threads+1 num_streams+ffv1+opset16+fp16) 19 / 19* (2 streams) ? ?
4090 (C++ TensorRT8.4GA+ffmpeg+12 vs threads+1 num_streams+ffv1+opset16+int8) 34 (4 streams) / 50* (6 streams) ? / ? ? / 5.7* (1 stream)
4090³ (C++ TensorRT8.5+vs_threads=4+num_streams=1+fp16+(--heuristic) ? ? / 6.9* ? / 3.1*
4090³ (C++ TensorRT8.5+vs_threads=4+num_streams=1+fp16) ? ? / 6.9* ? / 3.1*
RealESRGAN (2x) (6b+64nf) 480p 720p 1080p
1070ti (vs+TensorRT8+ffmpeg+C++ TRT+num_streams=1+no tiling+opset15) 0.9 0.8 0.3
3060ti (vs+TensorRT8+ffmpeg+C++ TRT+num_streams=1) ? 3.12 1.4
V100 (Colab High RAM / 8CPU) (vs+TensorRT8.2GA+x264 (--opencl)+C++ TRT+num_streams=3+no tiling+opset15) 5.09 4.56 2.02
V100 (Colab High RAM / 8CPU) (vs+TensorRT8.2GA+ffmpeg+C++ TRT+num_streams=3+no tiling+opset15) 5.4 4.8 2.2
3090² (C++ TRT+vs_threads=20+num_threads=6+no tiling+opset16) (+dropout) 13 5.8 2.7
2x3090² (C++ TRT+vs_threads=20+num_threads=6+no tiling+opset16) (+dropout) 26 11 5.3
4090 (C++ TRT+TensorRT8.4GA+vs_threads=6+num_threads=6+no tiling+opset16+"--best") (+dropout) ? ? ? / 12*
RealESRGAN (2x) (3b+64nf+dropout)ⓘ 480p 720p 1080p
3060ti (vs+TensorRT8+ffmpeg+C++ TRT+num_streams=2) ? 5.69 2.64
V100 (Colab High RAM / 8CPU) (vs+TensorRT8.4GA+ffmpeg+C++ TRT+num_streams=4+no tiling+opset15) 10 9.4 4.2
3090² (C++ TRT+vs_threads=20+num_threads=6+no tiling+opset15) 24 11 5.2
2x3090 (C++ TRT+vs_threads=20+num_threads=6+no tiling+opset15) 51 23 10

Rife4.6 technically is fastmode=True, since contextnet/unet was removed.

Rife4+vs (fastmode False, ensemble False) 480p 720p 1080p
1070ti (vs+ffmpeg+ModifyFrame) 61 30 15
3060ti (vs+ffmpeg+ModifyFrame) ? 45 24
Rife4+vs (fastmode False, ensemble True) 480p 720p 1080p
1070ti Python (vs+ffmpeg+ModifyFrame) 27 13 9.6
1070ti C++ NCNN ? ? 10
3060ti (vs+ffmpeg+ModifyFrame) ? 36 20
3090² (CUDA+vs_threads=20) 70 52 27
3090² (C++ NCNN+vs_threads=20+ncnn_threads=8) 137 65 31
V100 (Colab) (vs+ffmpeg+ModifyFrame) 30 16 7.3
V100 (Colab High RAM) (vs+x264+ModifyFrame) 48.5 33 19.2
V100 (Colab High RAM) (vs+x264+FrameEval) 48.2 35.5 20.6
V100 (Colab High RAM) (vs+x265+FrameEval) 15.2 9.7 4.6
V100 (Colab High RAM / 8CPU) (vs+x264+C++ NCNN (7 threads)) 70 35 17
A100 (Colab) (vs+CUDA+ffmpeg+ModifyFrame) 54 39 23
A100 (Colab) (jpg+CUDA+ffmpeg+ModifyFrame) ? ? 19.92 (14 Threads)
4090 (vs+CUDA+ffmpeg+FrameEval+12 vs threads) (rife40) 61 61 36
4090 (ncnn+8 threads+12 vs threads) (rife4.0) 254 130 60
Rife4+vs (fastmode True, ensemble False) 480p 720p 1080p
1070ti Python (ffmpeg+ModifyFrame) 62 31 14
1070ti (C++ NCNN) (rife46) ? ? 30
1070ti (TensorRT8.5+num_streams=3) (rife46) ? ? 27
3060ti (CUDA+ffmpeg+ModifyFrame) ? 66 33
3090² (CUDA+ffmpeg+FrameEval+vs_threads=20) 121 80 38
3090² (C++ NCNN+vs_threads=20+ncnn_threads=8) 341 142 63
3090³ (TensorRT8.5+6 vs_threads) ? / 331.9* (9 streams) ? / 275.3* (7 streams) ? / 166.3* (7 streams)
4090 (ncnn+8 threads+12 vs threads) (rife4.0) 470 198 98
4090 (ncnn+8 threads+12 vs threads) (rife4.4) - - 98
4090 (ncnn+8 threads+12 vs threads+ffv1) (rife4.4) - - 129 / 128*
4090 (ncnn+8 threads+12 vs threads) (rife4.6) 455 215 100 / 136*
4090² (ncnn+2 threads+4 vs threads+ffmpeg (ultrafast)) (rife4.6) ? ? 164
4090 (TensorRT8.5+num_streams 8+num_threads=6+stacking method) (rife46) ? ? ? / 146*
4090 (TensorRT8.5+num_streams 8+num_threads=6+int8+ffv1+stacking method) (rife46) ? ? 123 / 156*
4090³ (TensorRT8.5+vs_threads=4+fp16) (rife46) ? ? / 541* (num_streams=14) ? / 288* (num_streams=10)
V100 (Colab) (ffmpeg+ModifyFrame) 34 17 7.6
V100 (Colab High RAM / 8CPU) (vs+x264+FrameEval) 64 43 25
V100 (Colab High RAM / 8CPU) (vs+x264+C++ NCNN (8 threads)) 136 65 29
A100 (Colab) (ffmpeg+ModifyFrame) 92 56 29
A100 (Colab/12CPU) (ncnn+8 threads+12 vs threads) (rife40) 208 103 46
A100 (Colab/12CPU) (ncnn+8 threads+12 vs threads+ffv1) (rife40) 87 97 48
6700xt (vs_threads=4, num_threads=2) ? / 258.5* ? / 122.4* ? / 55.8*
Rife4+vs (fastmode True, ensemble True) 480p 720p 1080p
1070ti (PyTorch+ffmpeg+ModifyFrame) 41 20 9.8
1070ti (C++ NCNN) (rife46) ? ? 16
1070ti (TensorRT8.5+num_streams=2) (rife46) ? ? 14
3060ti (ffmpeg+ModifyFrame) ? 49 24
3090¹ (ffmpeg+ModifyFrame) ? 90.3 45
4090 (vs+CUDA+ffmpeg+FrameEval) (rife46) 84 80 41
4090 (ncnn+8 threads+12 vs threads) (rife4.6) 280 165 76
4090 (ncnn+8 threads+12 vs threads) (rife4.6+ffv1) 222 162 80
4090³ (TensorRT8.5+vs_threads=4+fp16) (rife46) ? 320 / 401.6* (num_streams=14) 160 / 207* (num_streams=10)
A100 (Colab/12CPU) (ncnn+8 threads+12 vs threads) (rife46) 154 86 43
A100 (Colab/12CPU) (ncnn+8 threads+12 vs threads+ffv1) (rife46) 86 86 43
6700xt (vs_threads=4, num_threads=2) ? / 129.7* ? / 60.4* ? / 28*
  • Benchmarks were made with the HolyWu version with threading and partial TensorRT, without setting the tactics JIT_CONVOLUTIONS and EDGE_MASK_CONVOLUTIONS due to the performance penalty. I added a modified version as a plugin to VSGAN, but I still need to add those enhancements to my own repo.
GMFSS_union 480p 720p 1080p
4090 (num_threads=8, num_streams=3, RGBH, TRT8.6, matmul_precision=medium) ? ? / 44.6* ? / 15.5*
GMFSS_fortuna_union 480p 720p 1080p
4090 (num_threads=8, num_streams=3, RGBH, TRT8.6, matmul_precision=medium) ? ? / 47.7* ? / 16.5*
4090 (num_threads=8, num_streams=3, RGBH, TRT8.6, matmul_precision=medium, @torch.compile(mode="default", fullgraph=True)) ? ? / 48.4* ? / 16.7*
EGVSR (4x, interval=5) 480p 720p 1080p
1070ti 4.4 Ram OOM / 2.2* VRAM OOM
RealBasicVSR 480p 720p 1080p
1070ti 0.3 OOM OOM
A100 (Colab) 1.2 ? ?
Sepconv 480p 720p 1080p
V100 (Colab) 22 11 4.9
3090² (vs+CUDA) 30 14 6.2
CAIN (2 groups) 480p 720p 1080p
A100 (Colab) 76 47 25
3090² (vs+CUDA) 120 65 31
FILM 480p 720p 1080p
V100 (Colab High RAM) (vs+CUDA) 9.8 4.7 2.1
IFRNet (small model) 480p 720p 1080p
V100 (Colab High RAM / 8CPU) (vs+x264+FrameEval) 78 47 23
IFRNet (large model) 480p 720p 1080p
V100 (Colab High RAM / 8CPU) (vs+x264+FrameEval) ? ? 15
DPIR 480p 720p 1080p
3090¹ (TensorRT8+C++ TRT+ffmpeg+vs threads=7+num_streams=5) ? ? 16
4090 (num_streams=13+12 vs threads) 121 52 23
4090 (num_streams=13+12 vs threads+thread_queue_size) 121 54 23
4090 (num_streams=13+12 vs threads+ffv1+thread_queue_size) 121 55 25
4090 (num_streams=13+12 vs threads+ffv1+int8) ? ? 52
4090 (num_streams=13+12 vs threads+ffv1+int8+thread_queue_size) ? ? 44
SCUNet 480p 720p 1080p
4090 (12 vs threads) 10 ? ?
ST-MFNet 480p 720p 1080p
1070ti 1.6 OOM OOM

License

This project uses code from other repositories, but the code I wrote myself is licensed under BSD-3-Clause.