immersive-web/proposals

How to enable Custom Computer Vision for AR in WebXR

blairmacintyre opened this issue · 18 comments

As AR-capable WebXR implementations come online, it will hopefully be possible to do custom computer vision in the browser. WebAssembly is plenty fast; the limitation is efficient access to the image/depth data, camera intrinsics, and camera extrinsics from any cameras on the device, expressed in a form compatible with the WebXR device data, along with easy synchronization with other sensors (e.g., accelerometers).

It seems there are two ways to approach this:

  • expose the data each device frame, synchronously, through functionality in WebXR (either the Device API or a new associated API)
  • expose the data asynchronously, using different APIs (e.g., extensions to the Sensor and WebRTC APIs?), but in a way that supports synchronization

As AR-capable devices (existing ones exposing APIs like Windows MR, ARKit, etc., or new devices with new APIs) become more pervasive, it should be reasonable to assume that an AR-capable device has the necessary data (camera and other sensor specs in relation to the display, an API that makes efficient access possible) to power such capabilities.

While it will not be necessary for all web-based AR applications to do custom CV, a wide variety of applications will need to. This is especially true if WebXR does not provide a standard set of capabilities that all platforms must implement.

I would like to "bump" this issue, and suggest we move it to a repo to work on. I have implemented a proof of concept in the Mozilla WebXR Viewer (plus our older webxr-polyfill that supports it), and it looks quite promising. The implementation is pretty clean and simple, but there are a number of ways this could evolve.

Essentially, my implementation does this (some of which should probably change):

  • the page provides a callback function or a Worker object for processing video. The thought here is that if the page is going to use a worker anyway, it might(?) be more efficient for the implementation to send frames there directly; handling both is possible.
  • I provide a requestVideoFrame analogous to requestAnimationFrame to control frame rate. I'm not sending the function in each time, since that wouldn't make sense with the Worker. If there is no advantage to providing the Worker directly, we could just use the same structure as rAF and let the page pass the frame data off to the worker itself.
  • video is (conceptually) delivered asynchronously.
  • video frames include metadata describing camera and frame info, including timestamps relative to the rendering time stamps (will be important on HMDs where video and rendering have different frame rates)
  • The video frame includes the pose of the camera relative to some coordinate frame; I defined a method getVideoFramePose(videoFrame, poseOut) that transforms it to the current coordinate system for rendering. The idea is that video frames may be old and thus no longer in the same coordinate frame, so we need a method to make them valid (platforms like ARKit, ARCore and HoloLens say you should ask, during the rendering frame, for the pose of the camera and all anchors, and these may not be valid from frame to frame). Internally, I create platform Anchors over time and express the camera relative to one of them, and this method just adjusts based on the current Anchor pose. (A usage sketch of this API shape follows this list.)
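
A minimal usage sketch of this shape, in TypeScript; the interface names and metadata fields here are illustrative assumptions, not a shipped WebXR surface:

```ts
// Illustrative only: approximates the API described above; the interface
// names and metadata fields are assumptions, not a shipped WebXR surface.
interface XRVideoFrameInfo {
  timestamp: number;        // camera timestamp, relatable to rendering timestamps
  width: number;
  height: number;
  buffer: ArrayBuffer;      // pixels in the camera's native format
  intrinsics: Float32Array; // 3x3 pinhole matrix for this frame
}

interface VideoCapableXRSession {
  requestVideoFrame(callback: (frame: XRVideoFrameInfo) => void): void;
  getVideoFramePose(frame: XRVideoFrameInfo, poseOut: Float32Array): void;
}

declare const session: VideoCapableXRSession;
declare function runComputerVision(frame: XRVideoFrameInfo, cameraPose: Float32Array): void;

const cameraPose = new Float32Array(16);

function onVideoFrame(frame: XRVideoFrameInfo): void {
  // The frame may be several rendering frames old, so re-express its camera
  // pose in the current rendering coordinate system before using it.
  session.getVideoFramePose(frame, cameraPose);
  runComputerVision(frame, cameraPose);
  session.requestVideoFrame(onVideoFrame); // schedule the next frame, rAF-style
}

session.requestVideoFrame(onVideoFrame);
```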

Open questions include

  • on video-mixed handhelds, where the camera frames have a 1:1 relationship to the display (i.e., when the camera video is the video background and we render on each new frame), we may want to guarantee that the render function will be called right after the video frame is delivered (and not until the callback is finished), to allow for synchronous processing/rendering
  • how to deal with camera orientation on video-mixed phones, where the screen changes orientation. In response, the video should either reorient to match the display (and thus change size each time the phone rotates between landscape and portrait), or it should stay the same. Both approaches require programmers to deal with orientation in different ways (e.g., by applying rotations to the output in different cases). There is no "right" answer; for me, though, it might be a question of which way the underlying platforms deal with this, so we don't do extra work if it's not needed (e.g., applying a 90-degree rotation to vision results is less work than rotating the video). In my implementation, I do NOT rotate the video, but provide properties describing the state (see the rotation sketch after this list).
  • when and how to deal with video frame formats. Extra conversions are wasteful, so ideally we want to leave video in its native format until conversion is needed. But we could also indicate what the preferred format is and let programmers request a different format if they want.
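
As a concrete example of the "rotate the results, not the video" option, a page could remap 2D vision results (e.g., detected feature points) using a per-frame rotation property. The `rotation` value and the function below are hypothetical, not part of any spec:

```ts
// Hypothetical: assume each video frame reports the clockwise rotation (in
// degrees) needed to align camera pixel coordinates with the current display
// orientation, instead of the browser rotating the pixels themselves.
type Rotation = 0 | 90 | 180 | 270;
interface Point2D { x: number; y: number; }

function cameraPointToDisplay(p: Point2D, rotation: Rotation,
                              camWidth: number, camHeight: number): Point2D {
  switch (rotation) {
    case 0:   return { x: p.x, y: p.y };
    case 90:  return { x: camHeight - 1 - p.y, y: p.x };
    case 180: return { x: camWidth - 1 - p.x, y: camHeight - 1 - p.y };
    case 270: return { x: p.y, y: camWidth - 1 - p.x };
  }
}
```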

@blairmacintyre I'd like to see more conversation and agreement in this Issue about specific approaches (e.g. your work in WebXR Viewer) before spinning out a repo. It might help to write up a summary of the API you used and then ask other members to weigh in.

Can I also suggest people read the blog post I did on implementing a CV API in the WebXR Viewer? https://blog.mozvr.com/experimenting-with-computer-vision-in-webxr/

Over the course of building the samples and writing the post, my thinking evolved. I think we need to embrace some of the proposed WebRTC extensions that I linked to in that post as part of the solution. It would be great if we could get the WebRTC folks involved.

/me waves hi 8)

We've done a lot of testing lately comparing the new gl.FRAMEBUFFER -> .readPixels() pipeline to the more traditional HTMLVideoElement -> canvas.ctx.drawImage() -> canvas.ctx.getImageData() pipeline and were surprised to find it generally seems slower.
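
For reference, the two pipelines being compared look roughly like this (simplified sketches; real code would reuse the canvas, texture, and framebuffer across frames rather than recreating them on each call):

```ts
// The "traditional" CPU path: draw the video element into a 2D canvas and
// read the pixels back.
function readPixelsVia2DCanvas(video: HTMLVideoElement): ImageData {
  const canvas = document.createElement('canvas');
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext('2d')!;
  ctx.drawImage(video, 0, 0);
  return ctx.getImageData(0, 0, canvas.width, canvas.height);
}

// The WebGL path: upload the video as a texture, attach it to a framebuffer,
// and call readPixels().
function readPixelsViaWebGL(gl: WebGLRenderingContext, video: HTMLVideoElement): Uint8Array {
  const tex = gl.createTexture()!;
  gl.bindTexture(gl.TEXTURE_2D, tex);
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, video);

  const fbo = gl.createFramebuffer()!;
  gl.bindFramebuffer(gl.FRAMEBUFFER, fbo);
  gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0, gl.TEXTURE_2D, tex, 0);

  const pixels = new Uint8Array(video.videoWidth * video.videoHeight * 4);
  gl.readPixels(0, 0, video.videoWidth, video.videoHeight, gl.RGBA, gl.UNSIGNED_BYTE, pixels);
  return pixels;
}
```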

Plus we've done quite a bit of research into performance of computer vision processing in workers and found there's a range of surprising performance penalties there too.

However, we've been able to deliver full-featured Natural Feature Tracking using WASM and the more traditional HTMLVideoElement -> canvas.ctx.drawImage() -> canvas.ctx.getImageData() pipeline that runs at about 20-30fps on mobile browsers and about 50-60fps on desktop browsers.

We've also released support for #WebXR in our SaaS content creation platform (https://awe.media) and are very interested in supporting any work that extends the API more into the computer vision space. We're also interested in a more "Extensible Web Manifesto" approach that makes the underlying Features & Points available in as efficient and raw a format as possible to foster experimentation. Plus anything that helps improve the efficiency of accessing Pixels.

It would definitely be great to get @anssiko, @huningxin & @astojilj involved - they're my other co-authors on the Media Capture Depth Stream Extensions spec. Plus they all work with Moh (at Intel) who has been doing a lot of work on OpenCV.js, etc. https://pdfs.semanticscholar.org/094d/01d9eff739dce54c73bba06e097029e6f47a.pdf

Incidentally, I'd make a strong, strong recommendation that use cases and scenarios be used to drive the explainers for new features/proposals, too. Before choosing an API shape, you need to narrow in on what problem you're trying to solve.

Work in this area should include specifying the Feature Policy controls associated with access to the camera data. The existing "camera" policy combined with other XR policies (immersive-web/webxr#308) may be sufficient.

From a vision perspective, the important things are low-level access to sensors and (on mobile phones) to drawing. The key is not to make these applications easy, but to make them possible.

Here are a few example use cases that cover several different requirements:

  • Translation app: user taps on a thing in space, computer vision detects the object, decides what it is, and tells the person the name of that object in a different language.
  • Costume app: computer vision detects a person and draws a costume over their body in 3d space in real time.
  • Animated movie poster app: trailers for movies play on top of a movie poster.
  • Holographic chess: chess pieces appear on top of a chessboard, and users move them with their hands.
  • Movie poster holograms app: digital characters appear on the floor in front of a movie poster.

Here are some things that need to be possible to satisfy these applications:

  • Camera feed
    -- Need pixel data in some format at some resolution. For us, ~480x640 single-channel luminance works, but others might want more resolution or UV data too.
    -- For efficient processing, this should be accessible as a WebGLTexture in a context where GPU compute processing is OK.
  • User-generated hit test gives a result that can be mapped to a point in the camera feed (a projection sketch follows this list).
    -- Need hit test result,
    -- Need extrinsic offset from viewer / virtual camera to physical camera
    -- Need physical camera pinhole model (resolution, field of view)
  • Delayed hit testing - once an object is detected (possibly multiple animation frames later) we need to know where it is in space.
    -- Need to be able to cast a ray into the map at a position from a few frames ago (not the current camera position).
    -- Need to ray cast from every subsequent delayed detection frame to update the person / chessboard / object position.
    -- Ray cast from arbitrary ray settable on each frame is a more general case of this.
  • Delayed drawing - on mobile phones an application might need to synchronize the vision with the camera feed.
    -- Need to be able to hold an XR frame for ~2-10 animation frames before it is presented to the user with other results from vision.
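
A sketch of the hit-test-to-camera-feed mapping mentioned above: once a point has been transformed into the physical camera's coordinate frame (using the extrinsic offset), a simple pinhole model maps it to camera-feed pixels. The types and conventions below are assumptions for illustration only:

```ts
// Sketch only: project a point already expressed in the physical camera's
// coordinate frame into camera-feed pixel coordinates via a pinhole model.
interface PinholeIntrinsics {
  fx: number; fy: number;  // focal lengths in pixels
  cx: number; cy: number;  // principal point in pixels
  width: number; height: number;
}

function projectToCameraFeed(
  pointInCamera: { x: number; y: number; z: number },
  k: PinholeIntrinsics
): { u: number; v: number } | null {
  // Assumes the camera looks down -Z (WebXR/OpenGL convention); points at or
  // behind the camera cannot be projected.
  if (pointInCamera.z >= 0) return null;
  const xn = pointInCamera.x / -pointInCamera.z;
  const yn = pointInCamera.y / -pointInCamera.z;
  const u = k.fx * xn + k.cx;
  const v = k.cy - k.fy * yn; // flip Y: image rows grow downward
  return (u >= 0 && u < k.width && v >= 0 && v < k.height) ? { u, v } : null;
}
```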

Hi @nbutko I demonstrated much of what you are describing in the WebXR Viewer (see https://blog.mozvr.com/experimenting-with-computer-vision-in-webxr/ ).

There are a few more things I would add to your discussion:

  • need to think about how to move beyond CPU-bound CV. Single-threaded CPU work will not be adequate on the web, especially on mobile. We need to find ways to integrate with the GPU (e.g., perhaps combine this with WebGPU and other next-gen GPU APIs for the web), and perhaps to leverage underlying platform tech (either platform processing, such as extracting feature maps, or efficient vector processing, etc.).
  • we probably want to resume discussions with WebRTC folks. There has been some movement on getting per-frame metadata in WebRTC, apparently. One discussion I've been having with WebRTC folks is that (perhaps) we could have the cameras on a WebXR device appear to be available via WebRTC when the device session is open. We'd need to ensure most of the metadata you mention is available, along with time sync information between WebRTC time stamps and WebXR time stamps.
  • should take into account that all of this must be asynchronous; only on mobile will there be a 1:1 relationship between video frames and rendering frames.

@blairmacintyre Do you have a link to the webrtc discussion you noted?

@nbutko alas, sorry, it was in email. I can fold you into a conversation there, if you like.

@nbutko @blairmacintyre, FYI, the WebRTC Next Version Use Cases include a computer-vision-based "funny hats" use case that requires new capabilities, including raw media access, processed-frame insertion, and off-main-thread processing.

Also, would like to note the nascent Shape Detection API here: https://wicg.github.io/shape-detection-api/.
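
For context, the Shape Detection API exposes whole detectors (barcodes, faces, text) that operate on still images. A sketch of its barcode interface follows, with type declarations added here because the API is still an incubation and not available everywhere; feature-detect before using it:

```ts
// Minimal declarations for the WICG Shape Detection API's BarcodeDetector.
declare class BarcodeDetector {
  constructor(options?: { formats?: string[] });
  detect(image: ImageBitmapSource): Promise<Array<{ rawValue: string; boundingBox: DOMRectReadOnly }>>;
}

async function findQRCodes(frame: ImageBitmap): Promise<string[]> {
  if (typeof BarcodeDetector === 'undefined') return []; // not supported in this browser
  const detector = new BarcodeDetector({ formats: ['qr_code'] });
  const barcodes = await detector.detect(frame);
  return barcodes.map(b => b.rawValue);
}
```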

Based on yesterday's CG call, I would propose that there are five potential use cases we should consider working on, intertwined but sufficiently separate that we can talk about them separately. We may want to make these one new repo, with five explainers/sections.

  1. Extending WebRTC to support access of cameras on XR device sessions, and streaming/recording of video from XR devices. The scenario here is "worker wants to allow expert to see what they are seeing, and overlay augmentations in their view." This scenario applies to enterprise scenarios as well as consumer apps. For consumers, this is "home owner wants to show Home Depot consultant something, get guidance and order parts for repair". This work may need to be done over in the WebRTC WG, but we should track it here in a document with pointers, to avoid these scenarios coming up over and over. Create webrtc-remote-video.md
  2. Synchronous access, in the GPU, to video frames from the camera on video-mixed-reality devices (e.g., phones running ARKit/ARCore). The scenario here is to do graphics effects by having access to the video in GPU memory, so the simple "overlay graphics on video" can be augmented with shadows, distortion and other effects. Potential to address privacy separately from (3) if we can arrange for it not to be possible to access video data in JS. Create video-effects.md. (A hypothetical sketch of this follows the list.)
  3. Asynchronous access to video frames in the CPU. Async, because most non-video-mixed devices do not run the camera and display at the same speed. The scenario here is to do real-time computer vision (e.g., SLAM like 6d.ai, 8thWall, etc. are working on; CV tracking algorithms like Vuforia) in a platform-independent way. Some of what will be done here might eventually make it into platforms (and already exists in some platforms), such as image detection. Others would include custom algorithms that need to work everywhere, for art, advertising, games, and so on. This is the scenario I talked about in the blog post mentioned above. Create cv-in-page.md
  4. Exposing some native, cross platform CV algorithms. The browser can expose entire algorithms running on the camera video, as suggested by the shape-detection api, which could start with very basic capabilities (like detecting barcodes in 3D, images, perhaps faces). Some of the specific algorithms could be optional, but it would be nice if there were very straightforward things (like barcodes, discussed in the shape-detection api) that could be implemented everywhere. Here we could talk about what it would be like to have some common capabilities, and how platform specific ones might be exposed (like ARKit/ARCore features, or perhaps for browsers like Argon4 that want to embed something like Vuforia). Create cv-in-browser.md
  5. There has been discussion of exposing some computer vision algorithm components, to allow native processing of video frames to be done before sending them into the app (either into the GPU or CPU, as in 2 and 3 above), perhaps leveraging a library like Khronos' OpenVX. Essentially, we can start by thinking about this as WebVX. Being able to leverage optimized platform capabilities for well-known basic algorithms (image pyramids, simple feature extraction, image conversion, blur, etc.) could speed up in-app CV, and also allow some of the effects that might be done in the synchronous GPU case to be done faster. Like the WebRTC discussion, if we wanted to pursue this, we would want to do it elsewhere, but the scenario has been brought up multiple times, so we should create webvx.md to summarize and record it, and point elsewhere if we pursue it.
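
To make topic 2 concrete, here is an entirely hypothetical sketch of what synchronous GPU access might look like: the UA hands the page the current camera image as an opaque WebGLTexture each XR frame, so effect shaders can sample it without the pixels ever being readable from JS. None of these names exist in any spec:

```ts
// Hypothetical throughout: getCameraTexture and the minimal XRFrameLike type
// are stand-ins used only to illustrate the idea of topic 2.
interface XRFrameLike {
  session: { requestAnimationFrame(cb: (t: number, f: XRFrameLike) => void): number };
}
declare function getCameraTexture(frame: XRFrameLike, gl: WebGLRenderingContext): WebGLTexture;
declare function drawBackgroundWithEffects(gl: WebGLRenderingContext): void; // app-defined shader pass
declare function drawVirtualScene(gl: WebGLRenderingContext): void;          // app-defined

function makeFrameCallback(gl: WebGLRenderingContext) {
  return function onXRFrame(time: number, frame: XRFrameLike): void {
    const cameraTex = getCameraTexture(frame, gl); // hypothetical per-frame camera texture

    // Sample the camera texture in an effect shader (shadows, distortion, ...)
    // while rendering the video background, then draw the virtual scene on top.
    gl.activeTexture(gl.TEXTURE0);
    gl.bindTexture(gl.TEXTURE_2D, cameraTex);
    drawBackgroundWithEffects(gl);
    drawVirtualScene(gl);

    frame.session.requestAnimationFrame(onXRFrame);
  };
}
```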

That's my proposal. (1) and (5) would not be worked on in depth, but would capture those use cases and point elsewhere. (2) and (3) are the most important, and share a common need to have the available cameras exposed and the developer request access to them. They can likely be done together (e.g., request cameras, direct data to the CPU and/or GPU, and guarantee that if the camera is in sync with rAF the data will be available before rAF, making this known if so), but they are separable if we only want to tackle one first. (4) is orthogonal and can be added if we build (2) and/or (3).

This is fantastic, @blairmacintyre! I totally support the creation of this repo and the authoring of the documents you've outlined. And I'm very much looking forward to reading the proposals!

I agree with @blairmacintyre's proposal and will now create a new feature repo with the goal of writing explainers for each topic in Blair's list and then working mainly on topics 2 and 3.

Thanks to everyone who helped us reach clarity about the topics and goals! It took a while, but we'll make better progress with this in mind.

The feature repo has been created: https://github.com/immersive-web/computer-vision

@blairmacintyre I've made you a repo admin with the assumption that you'll take the lead in putting together the initial structure and explainers as described above. If that's not a good assumption then let me know!

NOTE: Future conversations on the topic of CV for XR should happen in the computer-vision Issues and when helpful should refer to this Issue, which will remain in the proposals repo.