w3c/machine-learning-workshop

Memory copies

Opened this issue · 4 comments

aboba commented

In the presentation on "Machine Learning and Web Media", reference is made to the need for efficiency. Today, machine learning applications operating within the browser media pipeline trigger many additional memory copies compared with native applications, due to the following considerations:

  1. Although QUIC is implemented in user space and zero-copy implementations exist, current browser implementations copy memory when moving data between C++ and JavaScript. These copies are not eliminated by the use of BYOB readers/writers in WHATWG streams (e.g. data is not written to or read from the provided buffers directly).

  2. Handoffs between JS and WebAssembly also result in a memory copy (see the sketch after this list).

  3. Another memory copy may occur when using a Transferable Stream.

  4. More memory copies can occur when rendering video if zero-copy rasterizer flags are not enabled.
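To make points 1 and 2 concrete, here is a minimal sketch, assuming a readable byte stream (e.g. fed from a QUIC transport) and a WebAssembly module with illustrative `alloc` and `process` exports:

```js
// Minimal sketch of the copies in points 1 and 2 above. `alloc` and
// `process` are illustrative exports of a hypothetical Wasm module.
async function pumpToWasm(byteStream, wasm) {
  // Point 1: even with a BYOB reader, today's implementations may still
  // copy internally rather than fill `view` straight from the network.
  const reader = byteStream.getReader({ mode: 'byob' });
  let buffer = new ArrayBuffer(64 * 1024);
  while (true) {
    const { value, done } = await reader.read(new Uint8Array(buffer));
    if (done) break;
    // Point 2: handing the chunk to WebAssembly means copying it into
    // the module's linear memory: one more full copy of the payload.
    const ptr = wasm.exports.alloc(value.byteLength);
    new Uint8Array(wasm.exports.memory.buffer, ptr, value.byteLength).set(value);
    wasm.exports.process(ptr, value.byteLength);
    buffer = value.buffer; // reclaim the (transferred) backing buffer
  }
}
```

Even in this best case the payload is copied at least once into Wasm linear memory, and the BYOB read may add another internal copy on top.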

In the Interactive ML talk from Tero, he mentioned that in one of the high-framerate scenarios they're working on (generating music from the camera feed of video frames), the cost of moving the data around is as high as the cost of the ML inference itself. This is very true when many copies and CPU/GPU uploads/downloads occur, especially around webcam media and video frames, where the data must first be read into a canvas and then copied out and uploaded. And that is only the good case; the worst case could involve multiple round trips. It would be interesting to explore whether there could be a more direct way a video frame from the camera can be fed to ML without too many intermediaries.
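To make that round trip concrete, here is a minimal sketch of the copy-heavy path described above, assuming `video` is a `<video>` element playing the camera MediaStream (the final framework step is illustrative):

```js
const canvas = document.createElement('canvas');
const ctx = canvas.getContext('2d');

function frameToPixels(video) {
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  // 1. Draw the (typically GPU-resident) decoded frame into the canvas.
  ctx.drawImage(video, 0, 0);
  // 2. Read the pixels back to the CPU: a GPU-to-CPU download.
  const pixels = ctx.getImageData(0, 0, canvas.width, canvas.height);
  // 3. An ML framework then copies this buffer again and, on a GPU
  //    backend, uploads it back to the GPU before inference.
  return pixels; // ImageData; .data is a Uint8ClampedArray
}
```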

Cross-referencing #97, which discusses the overall Action-Response Cycle and its various bottlenecks as outlined in @teropa's talk. My understanding is that memory copies (unnecessary, if we had better APIs) cause bottlenecks in this cycle when pixels are drawn to a canvas from which the pixel tensor is then built. I think with better API integration we could get rid of that bottleneck and avoid crossing the CPU/GPU boundary unnecessarily.

"Media integration" e.g. fast streaming inputs from MediaStream is proposed as one possible solution by @teropa which I think is a similar idea to @wchao1115's "a more direct way a video frame from the camera can be fed more directly to ML without too many intermediaries".

In the past year, we did a proof of concept (POC) of real-time video processing based on WebRTC and machine learning, including video super-resolution.

As videos in most web engines are processed in hardware and video frames are stored as GPU textures, they can be processed with WebGL. The WebGL API gl.texSubImage2D(..., <video element>) reuses the video textures inside the browser with zero copy. But we found that most web ML frameworks, including tfjs, only receive inputs and provide outputs via ArrayBuffer, even though they have a WebGL backend.
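For reference, here is a sketch of that upload path, assuming `video` is a playing `<video>` element; the texImage2D/texSubImage2D overloads taking a video element are standard WebGL:

```js
// Uploading camera frames into a WebGL texture. When the decoded frame
// already lives in a GPU texture, the browser can service texSubImage2D
// from it internally without a CPU round trip.
const gl = document.createElement('canvas').getContext('webgl');
const tex = gl.createTexture();
gl.bindTexture(gl.TEXTURE_2D, tex);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.LINEAR);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);
// Allocate storage once from the first frame...
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, video);
// ...then refresh the same storage on every new frame.
function onFrame() {
  gl.bindTexture(gl.TEXTURE_2D, tex);
  gl.texSubImage2D(gl.TEXTURE_2D, 0, 0, 0, gl.RGBA, gl.UNSIGNED_BYTE, video);
  requestAnimationFrame(onFrame);
}
requestAnimationFrame(onFrame);
```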

To improve the performance, we forked tfjs and added a new input interface that creates a texture tensor directly from the <video> element, and another output interface that exposes the internal texture result to the web app, so that the web app can consume the predicted texture without downloading it from the GPU.

If web ML frameworks supported WebGL texture inputs directly and could output the texture prediction result to the app, the CPU <-> GPU memory copies in JS could be avoided and the GPU pipeline would not be interrupted. The performance of real-time video processing would be better.
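A rough sketch of what such a texture-in/texture-out interface could look like; `tf.fromTexture`, `tf.toTexture`, and `renderWithWebGL` are purely illustrative names, not actual tfjs APIs:

```js
// Hypothetical texture-in / texture-out flow (illustrative names only).
// The tensor data never leaves the GPU between capture and display.
const input = tf.fromTexture(videoTexture, [height, width, 4]); // no upload
const output = model.predict(input);      // WebGL backend, stays on the GPU
const outTexture = tf.toTexture(output);  // no download to ArrayBuffer
renderWithWebGL(outTexture);              // composite directly (hypothetical)
```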

As mentioned during live session #1, Dom and I propose to continue the discussion during a virtual breakout session at TPAC. I proposed a session on "Memory copies & zero-copy operations on the Web" on the dedicated Wiki page and will reach out to hopefully interested parties. The goals are to explore the need to copy memory in various Web technologies (JS, WebAssembly, WebGPU, Machine Learning, WebRTC, Media) and to identify possible architectural updates to the Web Platform that could help reduce unneeded memory copies.