nagadomi/nunif

[unlimited:waifu2x] Multithreading is possible but not configured properly

LoganDark opened this issue · 13 comments

Problem

ONNX runtime supports multithreaded model execution, and it will automatically be enabled.

However, that can only happen when SharedArrayBuffer is available, which requires these HTTP headers to be set:

  • Cross-Origin-Embedder-Policy: require-corp
  • Cross-Origin-Opener-Policy: same-origin

https://unlimited.waifu2x.net does not send these headers, so ONNX runtime cannot use multiple threads. I will perform an experiment to show that this is a mistake.
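A page can check at runtime whether the preconditions described above are met. The following is a minimal sketch, not code from unlimited:waifu2x; `describeThreadSupport` is a hypothetical helper name. It only reads the standard `SharedArrayBuffer` constructor and the standard `crossOriginIsolated` flag.

```javascript
// Hypothetical helper: reports the two signals relevant to WASM threading.
// `scope` defaults to the global object so the function can be exercised
// with plain objects outside a browser.
function describeThreadSupport(scope = globalThis) {
  return {
    // Threaded WebAssembly needs SharedArrayBuffer to be exposed.
    sharedArrayBuffer: typeof scope.SharedArrayBuffer === "function",
    // True only when the page was served with the COEP and COOP headers.
    crossOriginIsolated: scope.crossOriginIsolated === true,
  };
}
```

Logging `describeThreadSupport()` once at startup makes it obvious whether the headers reached the browser.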

Experiment

I will add these headers for testing by using a Chrome extension.

These headers will make SharedArrayBuffer available, and ONNX runtime will automatically use multiple threads.

Parameters for the experiment

  • Model: swin_unet.art_scan
  • Denoise: 3 (highest)
  • Scale: 1 (1x)
  • Tile size: 256 (console: tile size = 256)
  • TTA level: 0 (disabled)
  • Detect alpha: false (no alpha channel)
  • Size of the image: 42 tiles

Performed using the version of unlimited:waifu2x that is currently live at https://unlimited.waifu2x.net.

Result of the experiment

Chromium

  • 1 main thread that performs the execution (no changes)

    388556.5769042969 ms (approx. 9251.347069149926 ms per tile)

  • 12 worker threads that perform the execution (with headers)

    143964.38818359375 ms (approx. 3427.723528180805 ms per tile)

Using 12 threads divides the time taken by 2.698977030408252, a 2.7x improvement.

Firefox

  • 1 main thread that performs the execution (no changes)

    Initial result (superseded): DNF (too slow); 109955 ms for 3 tiles; estimated 1539370 ms for 42 tiles (approx. 36651 ms per tile)
    Final result: 169983 ms (approx. 4047.214285714286 ms per tile)

  • 12 worker threads that perform the execution (enabled dom.postMessage.sharedArrayBuffer.bypassCOOP_COEP.insecure.enabled)

    Initial result (superseded): DNF (too slow); 147402 ms for 18 tiles; estimated 343938 ms for 42 tiles (approx. 8189 ms per tile)
    Final result: 58643 ms (approx. 1396.261904761905 ms per tile)

Using 12 threads divides the time taken by 2.898606824343911 (169983 ms / 58643 ms), a 2.9x improvement, even larger than Chromium.
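For reference, the per-tile figures and speedup factors can be reproduced from the completed 42-tile runs. This is only a worked calculation over the timings reported above; `summarize` is a name chosen for illustration.

```javascript
// Timings copied from the results above (milliseconds, 42 tiles per run).
const TILES = 42;
const results = {
  chromium: { single: 388556.5769042969, threaded: 143964.38818359375 },
  firefox: { single: 169983, threaded: 58643 },
};

// Derives per-tile time and the single-thread / 12-thread ratio.
function summarize({ single, threaded }) {
  return {
    perTileSingle: single / TILES,
    perTileThreaded: threaded / TILES,
    speedup: single / threaded,
  };
}

for (const [browser, timing] of Object.entries(results)) {
  const s = summarize(timing);
  console.log(`${browser}: ${s.speedup.toFixed(2)}x faster with 12 threads`);
}
```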

Implementation steps

  • Instruct the server to send the required HTTP headers
  • Define ort.env.wasm.numThreads = navigator.hardwareConcurrency before initialization, or else it will default to only 4 threads
  • Enjoy the free speedup
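
The first two steps might look roughly like the sketch below. It assumes a Node-style `(req, res)` handler with `res.setHeader()`, which is an assumption: this thread does not say what server software the site uses. `withIsolationHeaders` and `pickNumThreads` are hypothetical helper names; `ort.env.wasm.numThreads` is the setting named above.

```javascript
// Step 1 (server side): add both headers to every response.
// Assumes a Node-style handler signature (req, res) with res.setHeader().
function withIsolationHeaders(handler) {
  return (req, res) => {
    res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
    res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
    return handler(req, res);
  };
}

// Step 2 (client side): choose a thread count before creating any session.
// Falls back to 1 when the core count is missing or not a positive integer.
function pickNumThreads(nav) {
  const cores = nav ? nav.hardwareConcurrency : undefined;
  return Number.isInteger(cores) && cores > 0 ? cores : 1;
}

// In the page, before the first inference session is created:
//   ort.env.wasm.numThreads = pickNumThreads(navigator);
```

With any other server, the equivalent is a configuration rule that attaches the same two headers to the HTML document and the scripts it loads.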

It's astonishing how fast unlimited:waifu2x can get in Firefox with 12 threads. Seems Firefox really is the best at WebAssembly JIT.

It's possible to make it even faster by making the models compatible with ONNX runtime's WebGL or WebGPU backends, so that they can be executed on the GPU, just like with CUDA. In fact, the WebGPU backend might already be compatible (but I have not looked into this yet)

Compatibility mostly consists of removing operators that WebGL doesn't support, like ConstantOfShape. Some optimizers can already recognize and remove these. utils/pad.onnx pictured below, official on left, optimized on right:

[Screenshot: utils/pad.onnx graph, official (left) vs. optimized (right)]

But you also have to adjust int64 values so that they fit in 32 bits (utils/alpha_border_padding.onnx pictured below):

[Image: utils/alpha_border_padding.onnx graph]

This can be done manually in a Python debugger (as I have successfully done for some models). Not every int64 tensor has to be converted to int32 (although the int64 casts need to be removed); the values just need to fit in an int32.
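
To illustrate what "fit in an int32" means for int64 data, here is a small sketch over a `BigInt64Array`. It is not the procedure I used in the debugger; `fitsInInt32` and `clampToInt32Range` are hypothetical names, and clamping is only one possible way of reducing the magnitude. Whether clamping preserves a model's behaviour depends on how each value is used, so treat it as an assumption.

```javascript
const INT32_MIN = -(2n ** 31n);
const INT32_MAX = 2n ** 31n - 1n;

// Returns true when every int64 value is representable as an int32.
function fitsInInt32(values) {
  return values.every((v) => v >= INT32_MIN && v <= INT32_MAX);
}

// Returns a copy (still int64-typed) with out-of-range values clamped
// to the int32 range. Clamping is an illustrative assumption.
function clampToInt32Range(values) {
  return values.map((v) =>
    v > INT32_MAX ? INT32_MAX : v < INT32_MIN ? INT32_MIN : v
  );
}
```

For example, a tensor holding the largest int64 value fails `fitsInInt32`, while its clamped copy passes and keeps the int64 element type.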

I have successfully gotten some of the utility models to load in ONNX runtime's WebGL backend, but unfortunately, this isn't very useful because the utility models are mostly precalculations, and the most expensive part is the actual upscaling model, which uses operators that WebGL doesn't support, namely ConstantOfShape, Where, Expand.

I'm also looking into seeing if I can actually add support for ConstantOfShape into the WebGL backend myself, but of course this is not very easy since I cannot build ONNX runtime from source yet. Maybe I will modify the minified JS (hehehe....). My personal version of unlimited:waifu2x is based on a TypeScript translation/rewrite of reverse engineered minified code.

Thank you for sharing.

Multithreading
Cross-Origin-Embedder-Policy: require-corp
Cross-Origin-Opener-Policy: same-origin

I tried this before but did not apply it, as it was slower than the original code on Chrome.
(I first tried ort.env.wasm.numThreads=4 but it didn't seem to work, so I tried microsoft/onnxruntime#9681.)
I may need to try again.

WebGL

I gave up on using the WebGL backend because of the many unsupported operators.
(int32 conversion was possible with a slight modification of https://github.com/aadhithya/onnx-typecast )

WebGPU

I tried it recently, but it does not work yet: microsoft/onnxruntime#15796


As of now, I am hoping to get the WebGPU backend to work.
So I think that the WebGL backend does not have to work.
It would be nice to make the WebAssembly backend faster for users who don't have a GPU.

I tried this before but did not apply it, as it was slower than the original code on Chrome.

This is clearly not true anymore

(int32 conversion was possible with a slight modification of https://github.com/aadhithya/onnx-typecast )

I also tried modifying that script, but full int32 conversion is not required; only the magnitude of the values needs to be reduced. Full conversion causes the model to fail validation anyway, because some operators require int64 attributes.

I tried it recently, but it does not work yet: microsoft/onnxruntime#15796

Good to know~

As of now, I am hoping to get the WebGPU backend to work.
So I think that the WebGL backend does not have to work.

You're right, it doesn't. This issue itself is about WASM multithreading, not WebGL (that was just a slightly related comment).

It would be nice to make the WebAssembly backend faster for users who don't have a GPU.

Absolutely

Also, the PyTorch version (CLI/server and training) runs on 16-bit floats (half precision).
If 16-bit float can be used in some way, it can be faster without degradation. However, when I previously investigated it, it seemed difficult to use it in JavaScript.

If 16-bit float can be used in some way, it can be faster without degradation. However, when I previously investigated it, it seemed difficult to use it in JavaScript.

It should be sufficient to convert the input tensor to float16 and back each time you run the model. You can probably use these converter functions and use Uint16Array tensors as float16. Then use a model that expects float16. I will probably perform my own experiments once my codebase is functional
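
As a rough sketch of that round trip, the code below converts between float32 values and IEEE 754 half-precision bit patterns stored in a `Uint16Array`. It is a generic conversion with round-to-nearest-even, not necessarily the same as the converter functions mentioned above, and all function names are hypothetical. Feeding the result to a float16 model is left out because I have not tested that yet.

```javascript
const f32 = new Float32Array(1);
const u32 = new Uint32Array(f32.buffer);

// Converts one number to a float16 bit pattern (round to nearest even).
function float32ToFloat16Bits(value) {
  f32[0] = value;
  const x = u32[0];
  const sign = (x >>> 16) & 0x8000;
  const exp = (x >>> 23) & 0xff;
  const mant = x & 0x7fffff;
  if (exp === 0xff) return sign | 0x7c00 | (mant ? 0x200 : 0); // Inf / NaN
  const e = exp - 127 + 15; // re-bias the exponent for float16
  if (e >= 0x1f) return sign | 0x7c00; // too large: becomes Infinity
  if (e <= 0) {
    if (e < -10) return sign; // too small: becomes signed zero
    const m = mant | 0x800000; // restore the implicit leading 1
    const shift = 14 - e;
    let sub = m >>> shift;
    const rem = m & ((1 << shift) - 1);
    const halfway = 1 << (shift - 1);
    if (rem > halfway || (rem === halfway && (sub & 1))) sub++;
    return sign | sub; // subnormal float16
  }
  let half = (e << 10) | (mant >>> 13);
  const rem = mant & 0x1fff;
  if (rem > 0x1000 || (rem === 0x1000 && (half & 1))) half++;
  return sign | half;
}

// Converts a float16 bit pattern back to a JavaScript number.
function float16BitsToFloat32(h) {
  const sign = h & 0x8000 ? -1 : 1;
  const exp = (h >>> 10) & 0x1f;
  const mant = h & 0x3ff;
  if (exp === 0) return sign * mant * 2 ** -24; // zero or subnormal
  if (exp === 0x1f) return mant ? NaN : sign * Infinity;
  return sign * (1 + mant / 1024) * 2 ** (exp - 15);
}

// Whole-array helpers for tensor data.
function toFloat16Array(src) {
  return Uint16Array.from(src, float32ToFloat16Bits);
}
function fromFloat16Array(src) {
  return Float32Array.from(src, float16BitsToFloat32);
}
```

Pixel values in the 0 to 1 range lose some precision in float16 (about 11 significant bits), which is worth measuring against the float32 output before switching.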

OK, I have confirmed that it is faster with multithreading.

// google-chrome
// default (test on unlimited.waifu2x.net)
tile size = 256
script.js:38 render: 38275.5 ms
tile size = 256
script.js:38 render: 38714.466064453125 ms

// numThreads=16 (test on localhost)
tile size = 256
script.js:489 render: 12700.81005859375 ms
tile size = 256
script.js:489 render: 12487.656005859375 ms

// Firefox
// default
tile size = 256 script.js:28:27
render: 35854ms - timer ended
tile size = 256 script.js:28:27
render: 35756ms - timer ended

// numThreads=16
tile size = 256 script.js:362:17
render: 12039.94ms - timer ended
tile size = 256 script.js:362:17
render: 11316.12ms - timer ended

My earlier measurement may have been wrong, since execution gets very slow when DevTools is open.

One thing that is not great is that all JavaScript files must be hosted locally to enable SharedArrayBuffer.

all JavaScript files must be hosted locally

You mean vendored (on your server that serves the correct HTTP headers)? You should have been doing that anyway. You should not depend on CDNs for your website's main functionality. You host the models on your server so why not host the runtime to execute them?

Ahh, I remember. One of the reasons I didn't use it is that it would not work with Google Analytics or AdSense.

Ahh, I remember. One of the reasons I didn't use it is that it would not work with Google Analytics or AdSense.

Why not? Can't you vendor those scripts as well?

If you are OK with only being compatible with Chrome 96 and higher, setting Cross-Origin-Embedder-Policy: credentialless should keep Google ads functional.

https://chromestatus.com/feature/4918234241302528

The header works in my Chrome, and SharedArrayBuffer exists with it.

But this does not enable multithreading in Firefox (Firefox does not support it).

Also try adding the crossorigin attribute to the script tag. It probably won't work, but it may be worth a try.

For now, I have not been able to get AdSense to work in a cross-origin isolated environment.
I registered the website for the Chrome Origin Trials (SharedArrayBuffer), and it works in Chrome.

Is there any way to get Firefox support as well?