Ability to slightly delay camera feed display (ARKit/ARCore only)?
AndrewJDR opened this issue · 16 comments
Note, I had a brief email exchange with @blairmacintyre and he said it'd be best to open an issue here and in the computer-vision repo regarding this, along with a reference to the computer-vision issue:
immersive-web/computer-vision#1
I'll also paste the issue contents below:
We've developed a 2D/3D asset management and inspection tool that uses server-side rendering. Rendered frames are encoded to a video stream and sent back to client devices for display. This allows dense assets to be interactive even on lower-end mobile devices. In addition to standard interaction models (touch/mouse/keyboard controls), we also have an AR viewing mode. You can see the AR mode visualizing a 1 billion triangle mesh in our SIGGRAPH Real Time Live Demo at the 6:30 mark:
https://youtu.be/BQM9WyrXie4?t=389
This is running in Chrome Canary on Android. At the 6:56 mark you'll see that the mesh, while workable, lags slightly behind the real-world environment. The reason is that this approach (and essentially all such server-side rendering approaches) must send the view transform from client to server, render a frame, encode the frame into a video stream, send that back to the client, decode it, and overlay it on the AR camera display. While this turnaround can be made fairly fast (a few tens of milliseconds), the 3D content will always be slightly behind the video.
If we put aside HMDs and focus on ARKit/ARCore: if we had some means of controlling when frames from the camera are displayed on the user's screen, it would be possible to introduce a slight delay and synchronize the display of camera frames with the rendered results from the server. Is there anything planned within WebXR to allow for this? Thanks.
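To make the idea concrete, here is a minimal sketch of the client-side bookkeeping such a delay would need: camera frames are held in a small queue and released only when the server-rendered frame for the same view transform has been decoded. Everything here is hypothetical. `CameraFrameHandle`, `CameraDelayQueue`, and the frame `id` are names invented for illustration, and no current WebXR API exposes camera frames this way.

```typescript
// HYPOTHETICAL: an opaque reference to a camera frame. The page never reads
// pixel data; it only decides when the frame is shown.
interface CameraFrameHandle {
  readonly id: number;            // assumed per-frame identifier, also sent to the server
  readonly captureTimeMs: number; // assumed capture timestamp
}

// Holds camera frames until the matching server render arrives.
class CameraDelayQueue {
  private pending: CameraFrameHandle[] = [];
  private readonly maxFrames: number;

  constructor(maxFrames: number) {
    this.maxFrames = maxFrames;
  }

  get size(): number {
    return this.pending.length;
  }

  // Queue a newly captured frame. If the queue is full, the oldest frame is
  // evicted and returned so the caller can display it without a matching render.
  push(frame: CameraFrameHandle): CameraFrameHandle | undefined {
    this.pending.push(frame);
    if (this.pending.length > this.maxFrames) {
      return this.pending.shift();
    }
    return undefined;
  }

  // Called when the server's render for `frameId` has been decoded.
  // Returns the matching camera frame and discards any older queued frames.
  release(frameId: number): CameraFrameHandle | undefined {
    const index = this.pending.findIndex((f) => f.id === frameId);
    if (index === -1) {
      return undefined; // already evicted, or never queued
    }
    const removed = this.pending.splice(0, index + 1);
    return removed[index];
  }
}
```

The queue is bounded on purpose: if the server falls too far behind, the oldest camera frame is handed back for immediate display rather than letting the delay grow.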
Is there a way to predict the pose and/or timewarp the generated scene so it can be in sync with the real world?
@rcabanier I think the answer is "maybe yes?", and could involve something like MS's documented XCloud / Outatime approach (speculative rendering + timewarp when speculation fails and maybe some AI in the mix to get solid results). Simpler serviceable methods might exist, and I've pondered some, but I think they'd at least involve getting a depth buffer to the client device which is extra data I'd rather not be streaming if I can help it.
It all sounds potentially doable but is quite complex compared to introducing a small camera playout delay on the client device, so I'm thankful that this far simpler possibility exists on ARCore/ARKit.
With Hololens, Magic Leap (or Video mixed AR headsets where low latency is important as an anti-sickness measure), what you describe may be the only or best option...
If this is a common idiom for handheld devices, we should make this an option for handheld devices.
This could become an optional or required feature that an author could request during the requestSession call. I'm unsure if we should add it to the AR spec or if it should become its own mini-spec.
Also, this might be another indication that we need a handheld-ar session mode.
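As a rough sketch of the `requestSession` suggestion above: `requiredFeatures` and `optionalFeatures` are real members of the WebXR session init dictionary, but the `"camera-delay"` feature name is made up here for illustration and is not defined in any spec. The local `SessionInitSketch` type is also an assumption, used only so the snippet compiles without WebXR type packages.

```typescript
// Local stand-in for the WebXR session init dictionary.
interface SessionInitSketch {
  requiredFeatures?: string[];
  optionalFeatures?: string[];
}

// HYPOTHETICAL feature descriptor; no such feature exists in WebXR today.
const CAMERA_DELAY_FEATURE = "camera-delay";

// Build the init dictionary, asking for the feature as required or optional.
function buildSessionInit(mustHaveDelay: boolean): SessionInitSketch {
  return mustHaveDelay
    ? { requiredFeatures: [CAMERA_DELAY_FEATURE] }
    : { optionalFeatures: [CAMERA_DELAY_FEATURE] };
}

// Request an immersive AR session with the hypothetical feature attached.
async function startSession(mustHaveDelay: boolean): Promise<unknown> {
  const xr = (globalThis as any).navigator?.xr;
  if (!xr) {
    throw new Error("WebXR is not available in this environment");
  }
  return xr.requestSession("immersive-ar", buildSessionInit(mustHaveDelay));
}
```

Requesting the feature as optional would let an application fall back to undelayed display on browsers that do not support it, while requesting it as required would make session creation fail there.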
Another quick note: Programmatic access to the contents of their camera output is not required for this -- only the ability to schedule the output of the camera frames to the display is necessary (within reason... since the delay buffer can't be of unlimited size). Ideally then, this wouldn't require a "This app is requesting access to your camera data Yes/No" sort of prompt to the user.
Also, while we're relatively early to this party, I do think we'll be seeing more of this kind of thing, because it's one of the very few (only?) ways of visualizing billion+ primitive datasets using handheld AR without decimation or other LOD compromises.
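To illustrate the "within reason" limit on the delay buffer mentioned above, here is a small sketch of how many camera frames would need to be held for a given delay. The function name, the parameters, and the idea of a browser-imposed cap (`maxFrames`) are assumptions made for illustration only.

```typescript
// Number of camera frames that must be held to cover `delayMs` at `cameraFps`,
// clamped to `maxFrames` so the buffer cannot grow without bound.
function framesToBuffer(
  delayMs: number,
  cameraFps: number,
  maxFrames: number
): number {
  if (delayMs <= 0 || cameraFps <= 0) {
    return 0; // nothing to delay
  }
  const frames = Math.ceil((delayMs * cameraFps) / 1000);
  return Math.min(frames, maxFrames);
}
```

For example, a 50 ms round trip with a 60 fps camera feed works out to `framesToBuffer(50, 60, 8) === 3`, so a turnaround of a few tens of milliseconds needs only a handful of buffered frames.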
If you can write up an explainer with a proposed API surface, we can discuss it in the group.
I gave it a shot:
https://gist.github.com/AndrewJDR/e20ff4db3cd2c0f2409acf66da5c915a
I haven't been keeping super close tabs on the spec lately, but I did my best from memory. For example, it is my current understanding that baseLayer.framebuffer in some way contains the camera's framebuffer data, and that by calling glCtx.bindFramebuffer with that baseLayer.framebuffer, the camera's framebuffer content becomes your background, and you can then render on top of it. If this is a wrong assumption on my part, some aspects of the API I proposed won't make much sense. But it should still probably get the point across.
You should probably add it as a required/optional feature in the requestSession call.
Just wanted to check in on this one. Has there been any discussion around it? It's still something that's needed, from our perspective...
I wanted to circle back once more on this as we continue to get demand for this capability from users of our systems.
I'm going to tag the primary contacts on the WebXR AR module as shown here:
w3ctag/design-reviews#462
@Manishearth @NellWaliczek @toji @AdaRoseCannon @cwilso
For newcomers to the thread that want to catch up quickly, I've prepared an example of what this new API surface could look like here:
https://gist.github.com/AndrewJDR/e20ff4db3cd2c0f2409acf66da5c915a
Oh, misread which repo this was on, ignore the previous (now deleted) comment if you got an email.
I kinda feel like this is sufficiently complex to require a separate incubation (see the process here). This seems like a pretty major feature, and it's not clear to me whether there's implementor interest, which is crucial for adding something to an implemented spec.
The proposed API won't work: the WebXR framebuffer does not have control over the camera feed; that's composited later. You'd at least need access to camera frames from the CV incubation, and you'd need an additional API that gave control over the composited camera frame.
I know there's some interest from Google's side on exposing raw camera frames in AR, but I'm not sure if they'd be interested in adding control over what camera frames get composited.
Perhaps you should open an issue on the proposals repo and see if you can get interest there?
@Manishearth To be clear, there's no theoretical reason I need actual access to the camera feed for this to work -- an opaque handle to a frame of it could be fine, coupled with the ability to control the timing of playout for a given handle. Does that make anything easier?
CV seems to be more about access to the actual frame's contents, which typically also starts to involve user permissions and approvals, which I was really hoping to avoid with this. That is, ideally, there'd be no need to ask the user for permission to access their camera, since the application wouldn't gain access to the actual camera frame data.
@AndrewJDR right, but then you're talking about some new and strange GL capabilities, making it even less likely to belong in an existing spec
I mention CV because they're looking at similar capabilities, but yes, it's not the same thing
@Manishearth Okay, thanks. I think I'll try to formulate something for the proposals repo. I could probably be an implementor for Chromium if that would help bolster the case for it.
@AndrewJDR you misunderstand: when I say implementor, I don't mean a particular person willing to write the code, I mean a particular browser willing to ship it. It would be helpful (but not a prerequisite) if you can convince a browser to work with you and agree to ship it.
@Manishearth Regarding implementor: Ah understood. I think I'll wait until we have our non-browser based implementation of this finalized before filing an issue in the proposals repo, so folks have a concrete usecase they can see in action. Thanks again.