immersive-web/webxr-ar-module

Enable developers to distinguish between screen-based and head-worn sessions

NellWaliczek opened this issue · 26 comments

Developers will often wish to provide different interaction models for screen-based (aka hand-held) sessions and head-worn sessions. A screen-based session can comfortably accommodate UI on top of the 3D rendering's near plane, in other words, drawn in screen-space. Head-worn sessions cannot use this approach and must place interactive elements in world-space. This may end up having the same solution as immersive-web/webxr#400, but this issue is focused on ensuring developers know when to provide which type of UI.

Careful, I think it's not safe to treat "near plane" and "screen space" as equivalent.

Mathematically, the projection matrix is derived based on intersecting rays with the near plane, but this doesn't mean that this corresponds to the distance to an actual screen. For example, a zSpace-style 3D monitor is able to display objects that appear in front of the monitor surface, and this requires the near plane to be closer to the eye than the physical screen. If an app tries to draw UI at the near plane distance, this may look odd on such a system. As far as I know, the UI quality would be highest if it's drawn at the actual screen distance with no extra parallax.
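
To make the distinction concrete, here is a minimal sketch (not part of any proposal in this thread) that recovers the near and far clip distances from a projection matrix. It assumes a standard OpenGL-style perspective matrix in column-major order, which is how WebXR lays out its 16-element matrices; a UA is free to hand back a matrix that doesn't fit this shape. `clipPlanesFromProjection` and `perspective` are hypothetical helper names. The point is that the result is a clipping distance only and says nothing about where a physical screen sits.

```javascript
// Recover near/far clip distances from an OpenGL-style perspective matrix.
// Assumes column-major layout with
//   m[10] = -(far + near) / (far - near)
//   m[14] = -2 * far * near / (far - near)
function clipPlanesFromProjection(m) {
  const near = m[14] / (m[10] - 1);
  const far = m[14] / (m[10] + 1);
  return { near, far };
}

// Hypothetical helper that builds such a matrix, used here only to
// exercise the function above.
function perspective(fovY, aspect, near, far) {
  const f = 1 / Math.tan(fovY / 2);
  const m = new Array(16).fill(0);
  m[0] = f / aspect;
  m[5] = f;
  m[10] = -(far + near) / (far - near);
  m[11] = -1;
  m[14] = (-2 * far * near) / (far - near);
  return m;
}

const planes = clipPlanesFromProjection(perspective(Math.PI / 3, 16 / 9, 0.1, 100));
console.log(planes.near, planes.far);
```

An app that places UI at `planes.near` is therefore placing it at the clip distance, which on a zSpace-style display or an HMD may be well in front of any comfortable viewing distance.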

Also, even if an app primarily targets a hand-held use case, there seems to be a risk of unfortunate breakage on HMDs if the API makes bad assumptions due to this - drawing UI around the near plane on an HMD is likely to be uncomfortably close.

How about a per-frame property where the UA provides a suggested UI rectangle based on app-provided hints?

For example, the app could request a "largest readable area" quad which would be fullscreen on a hand-held session, a generously-sized floating quad for an HMD (while not extending into blurry peripheral vision), or the actual screen position for a zSpace-style 3D monitor or diorama mode.

Separately, it could express how important it is to keep this visible, with the UA choosing appropriate placement. The same hints could be used for a separate DOM layer as per issue immersive-web/webxr#400, but I think it would also be useful without layer support.

How about a per-frame property where the UA provides a suggested UI rectangle based on app-provided hints?

I like this option -- In addition to providing a safer abstraction of the hardware differences, it could further enable accessibility use cases by letting the UA take into consideration various constraints set by the user.

Potentially related to the work now happening in https://github.com/immersive-web/layers

I thought about this a bunch and discussed this with Nell.

The UI rectangle is a nice idea but we still need to define the space it is anchored to, so this can be added incrementally if we want. Furthermore, the concept seems a bit more nebulous for world-space AR devices, where you may want to have the UI somewhere on the near clip plane, but you might also attach UI to e.g. someone's wrist, or have it float elsewhere. This seems to be more of a case of application choice.

I think an enum like enum XRInteractionSpace {"screen-space", "world-space"} would be useful. I don't really have a good name for this enum; my reasoning for this one is "the space interactions happen in", but I hope others can come up with better names.

There's a potential for a device which supports both interaction modes: really, a handheld AR device technically supports world-space interactions (but it's uncomfortable). We should probably make this the preferred interaction space. There's a chance there are devices which equally support both and wish to leave the choice up to the application, however, so we might want a "both" enum for this case. I am leaning on the side of not doing this.

As for specifying a rect:

We could potentially have an InteractionRect on each frame. Perhaps it should be an XRSpace such that the interaction is the xy plane (in most cases this will end up being viewer-space + an offset, which is easy to compute), along with top/left/bottom/right bounds?

interface InteractionRect {
  readonly attribute XRSpace origin;
  readonly attribute float top;
  readonly attribute float right;
  readonly attribute float bottom;
  readonly attribute float left;
}

This would require introducing the concept of a screen-space XRSpace, which cannot be related to other XRSpaces, though (unless we're introducing this same kind of space for XRInputSources?). We could also signal this through a nullable space.

interface InteractionRect {
  readonly attribute XRSpace? origin;
  readonly attribute float top;
  readonly attribute float right;
  readonly attribute float bottom;
  readonly attribute float left;
}
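
For illustration only, here is how an app might turn a rect shaped like the InteractionRect sketches above into corner points it can draw a quad with. The IDL does not say what the four bounds mean, so this assumes they are signed offsets in meters from the origin along the space's x axis (left/right) and y axis (top/bottom), with +y pointing up. `interactionRectCorners` is a hypothetical helper, not a proposed API.

```javascript
// Compute quad corners in the rect's own space (z = 0 is the interaction plane).
// Assumes left/right are x offsets and top/bottom are y offsets, in meters.
function interactionRectCorners(rect) {
  const { top, right, bottom, left } = rect;
  return {
    topLeft: [left, top, 0],
    topRight: [right, top, 0],
    bottomRight: [right, bottom, 0],
    bottomLeft: [left, bottom, 0],
    width: right - left,
    height: top - bottom,
  };
}

// Example: a 40cm x 20cm rect centered on the origin.
const corners = interactionRectCorners({ top: 0.1, right: 0.2, bottom: -0.1, left: -0.2 });
console.log(corners.width, corners.height);
```

With the nullable-origin variant, an app would presumably skip any world-space transform when `origin` is null and treat these coordinates as already being in screen space.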

This is a good start! I'm spitballing here... but what about something like this? It follows the pattern established by XRReferenceSpace. Depending on what else is out there in existing specs, we could separate out left/right/top/bottom like you did as well. Either way, it occurs to me that using XRRigidTransforms might make certain math easier. But also including info about the screen space dimensions is super important.

Either way, this opens a few questions... Should the XRSpace be located at the center? Or maybe the upper left?

Let's make the next step in figuring out the design a sketch of sample code that would use the api. I tend to find that it really draws out structural issues in the api design :)

interface XRScreenSpace : XRSpace {
  readonly attribute XRRigidTransform upperLeft;
  readonly attribute XRRigidTransform lowerRight;
  readonly attribute float pixelWidth;
  readonly attribute float pixelHeight;
}

partial interface XRSession {
  readonly attribute XRScreenSpace? screenSpace;
}

Oh, also, we should consider that an XRInputSource of type 'screen' could maybe share the same XRSpace as an XRScreenSpace?

Oh, also, we should consider that an XRInputSource of type 'screen' could maybe share the same XRSpace as an XRScreenSpace?

Yeah, this was something I was thinking about too, but I felt this might be too major a change to add to the main spec right now. If we're open to it, though, I'm very much for this, doing touch input for inline sessions and handheld devices in world space is super weird.

I think the ideal situation is:

  • Sessions with a screen have a XRSession.screenSpace like above
  • Input events return spaces in screen space
  • screenSpace can only be related to screen-space input spaces and vice versa
  • it may be useful to allow viewerSpace to be related to screenSpace/screen-space inputs, but I'm not sure if there's a major benefit.

This doesn't quite solve the problem, though, everything in the spec is in meters, and when dealing with screens we really want stuff to be in pixels. I'm not really sure how to solve that, without making screen-space input events carry additional pixel information.

(We're stumbling on the issue where while we support changing the origin of a reference space, we don't quite support changing the basis of a reference space)

(It might be worth moving the screen space input stuff to the main webxr repo, and then focusing this on just adding an enum)

This doesn't quite solve the problem, though, everything in the spec is in meters, and when dealing with screens we really want stuff to be in pixels. I'm not really sure how to solve that, without making screen-space input events carry additional pixel information.

I guess providing screenspace bounds in meters solves this problem since the viewport is in pixels and developers can do math.
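
A rough sketch of the math in question, assuming the UA exposed screen-space bounds in meters (left/right/top/bottom on the screen's xy plane, +y up) alongside a viewport size in pixels. Nothing here is an existing API: `metersToPixels`, the `bounds` shape, and the choice of a top-left pixel origin with +y down are all assumptions made for the example.

```javascript
// Map a point given in meters on the screen plane to pixel coordinates.
// bounds:   { left, right, top, bottom } in meters, +y up
// viewport: { width, height } in pixels, origin top-left, +y down (assumed)
function metersToPixels(point, bounds, viewport) {
  const u = (point.x - bounds.left) / (bounds.right - bounds.left);
  const v = (bounds.top - point.y) / (bounds.top - bounds.bottom);
  return { x: u * viewport.width, y: v * viewport.height };
}

// Example: a 6cm x 12cm screen rendered at 1080x2160.
const bounds = { left: -0.03, right: 0.03, top: 0.06, bottom: -0.06 };
const viewport = { width: 1080, height: 2160 };
console.log(metersToPixels({ x: 0, y: 0 }, bounds, viewport));
```

The inverse mapping is the same arithmetic run backwards, which is why bounds in meters plus a pixel viewport would be enough for developers to convert in either direction.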

I think we may be talking past each other on this one. For now, can we start with some sample code on how developers will use this?

Yeah, this was something I was thinking about too, but I felt this might be too major a change to add to the main spec right now. If we're open to it, though, I'm very much for this, doing touch input for inline sessions and handheld devices in world space is super weird.

Would it really be that weird to report a gripSpace? I mean the name sucks for sure (which I'm entirely to blame for)

(It might be worth moving the screen space input stuff to the main webxr repo, and then focusing this on just adding an enum)

I guess providing screenspace bounds in meters solves this problem since the viewport is in pixels and developers can do math.

We can make additive changes from this spec. If it makes sense to extend the select event to have an optional x/y screenspace data, that is something we can explore.

Either way, let's switch gears to sample code to work through how developers are hoping to use this functionality and we can come back to idl proposals in a bit.

Would it really be that weird to report a gripSpace? I mean the name sucks for sure (which I'm entirely to blame for)

It ... feels weird? But it could make sense!

If it makes sense to extend the select event to have an optional x/y screenspace data

Yeah this would be nice.

I'd imagine individual developers might not use this enum as much, but frameworks might. I'm imagining that there would be an API where you create Button objects (with a size/position in pixels? meters? percentages?), and the framework places them accordingly and routes events if necessary.

// each frame
if (session.interactionSpace === "screen-space") {
  for (const button of buttons) {
    // draw each UI element directly in screen coordinates, likely with a different vertex shader
  }
} else {
  for (const button of buttons) {
    // draw each UI element in a rect close to the near clip plane
  }
}

// during setup
input.onselect = function (e) {
  if (session.interactionSpace === "screen-space" && input.targetRayMode === "screen") {
    for (const button of buttons) {
      if (hitTest(button, e)) {
        e.cancel();
        button.dispatchEvent(new Event("xr-clicked"));
      }
    }
  } else if (session.interactionSpace === "world-space" && input.targetRayMode !== "screen") {
    // roughly the same code
  }
};
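
The sample above calls hitTest(button, e) without defining it. One possible shape for the screen-space case is sketched below, assuming the tap position is available in pixels (which, per the discussion above, is not settled) and that each button is an axis-aligned rectangle described by hypothetical `x`, `y`, `width` and `height` fields in pixels. `findHitButton` is likewise a made-up helper name.

```javascript
// Axis-aligned point-in-rectangle test in pixel coordinates.
// button: { x, y, width, height } with (x, y) as its top-left corner.
// point:  { x, y } tap position in the same pixel space.
function hitTest(button, point) {
  return (
    point.x >= button.x &&
    point.x < button.x + button.width &&
    point.y >= button.y &&
    point.y < button.y + button.height
  );
}

// Walk the list back to front so that buttons drawn later (on top) win.
function findHitButton(buttons, point) {
  for (let i = buttons.length - 1; i >= 0; i--) {
    if (hitTest(buttons[i], point)) {
      return buttons[i];
    }
  }
  return null;
}

const buttons = [
  { id: "back", x: 0, y: 0, width: 100, height: 50 },
  { id: "menu", x: 50, y: 0, width: 100, height: 50 },
];
console.log(findHitButton(buttons, { x: 75, y: 25 }).id);
```

The world-space branch would need a ray-versus-quad intersection against the input's target ray instead, which is presumably what "roughly the same code" glosses over.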

Still thinking about how the proposed rect thing could be used in the drawing step.

@Manishearth:

  • Input events return spaces in screen space

I'm not sure I follow this. Any individual space that represents a target ray is not itself "in" screen-space or world-space - it's only when you locate that ray space relative to some other space that you'd obtain coordinates that feel like "screen space" or "world space" coordinates.

@Manishearth:

  • screenSpace can only be related to screen-space input spaces and vice versa

There are key scenarios for screen taps to cause both screen-space interactions and world-space interactions, even on a phone/tablet form factor:

  • Screen-space interactions allow for screen-space UI buttons
  • World-space interactions allow the user to tap on world objects to manipulate or place against them

If we did return screen tap input events as "screen-space"-only spaces, it would prevent apps from using targetRaySpace to raycast against the world. Hopefully, such a restriction shouldn't be necessary, since a key goal of spaces is giving apps the freedom to relate spaces as they see fit, obtaining coordinates of things like target rays relative to whichever base space is most meaningful to the app.

Yeah, I think I overcorrected in that comment. You're right, we need this for rays, and the viewport/projection information should be enough to allow the UA to relate between world and screen space anyway.
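
As a sketch of why viewport and projection information suffices, the following projects a world-space point to pixel coordinates. It assumes column-major 4x4 matrices, with the projection and view matrices corresponding to what an app would read from XRView.projectionMatrix and XRView.transform.inverse.matrix, and a viewport with a bottom-left origin as in WebGL. The helper names are hypothetical and this is app-side math, not anything a UA is specified to do.

```javascript
// Multiply two column-major 4x4 matrices: returns a * b.
function multiplyMat4(a, b) {
  const out = new Array(16).fill(0);
  for (let col = 0; col < 4; col++) {
    for (let row = 0; row < 4; row++) {
      for (let k = 0; k < 4; k++) {
        out[col * 4 + row] += a[k * 4 + row] * b[col * 4 + k];
      }
    }
  }
  return out;
}

// Multiply a column-major 4x4 matrix by a column vector [x, y, z, w].
function transformVec4(m, v) {
  const out = [0, 0, 0, 0];
  for (let row = 0; row < 4; row++) {
    out[row] = m[row] * v[0] + m[4 + row] * v[1] + m[8 + row] * v[2] + m[12 + row] * v[3];
  }
  return out;
}

// Project a world-space point [x, y, z] to pixel coordinates.
// viewport: { x, y, width, height } in pixels, bottom-left origin.
// Returns null when the point is behind the viewer.
function worldToPixels(worldPoint, projectionMatrix, viewMatrix, viewport) {
  const viewProjection = multiplyMat4(projectionMatrix, viewMatrix);
  const clip = transformVec4(viewProjection, [worldPoint[0], worldPoint[1], worldPoint[2], 1]);
  if (clip[3] <= 0) {
    return null;
  }
  const ndcX = clip[0] / clip[3];
  const ndcY = clip[1] / clip[3];
  return {
    x: viewport.x + (ndcX * 0.5 + 0.5) * viewport.width,
    y: viewport.y + (ndcY * 0.5 + 0.5) * viewport.height,
  };
}
```

Going the other way (pixel to world ray) uses the inverse of the same matrices, which is how a screen tap's target ray can still be raycast against world geometry.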

We discussed this a bunch in the meeting this week. Ultimately, we really want to expose "how should I do UI on this device?" It seems like there are two potential API designs moving forward.

The first is to expose an interactionSpace on XRSession:

enum InteractionSpace {
    "screen-space",
    "world-space",
}

This is a hint to the user/engine developer where to place UI buttons, and basically maps to handheld and headworn respectively.

The second is to make this contingent on DOM Overlay a bit early. This API better answers the question of "how should I do UI" by nudging developers towards DOM overlay when supported, creating a pit of success:

enum OverlayType {
    "screen-space",
    "emulated",
    "unsupported"
}

where "screen-space" means that the overlay will show up in screen space, "emulated" means that it will be displayed in world space in some kind of floaty window (this is the compatibility mode for DOM overlay on headworn devices), and "unsupported" means that DOM overlay isn't supported. We can require AR module implementors to expose this enum without having to support the rest of DOM overlay.

The latter proposal seems to be a bit more future proof, but it does seem a bit weird to have in a spec where DOM overlay doesn't otherwise exist.

It's worth noting that currently the DOM overlay proposal relies on requestFullscreen() and can't be feature-detected normally. You can require it via requiredFeatures, but there isn't really an option for code that wishes to do UI and has a fallback path for user agents that don't support DOM overlay (since you don't know if optional features were granted)

I admit I'm not a huge fan of solving problems by inferring something from another feature. So, I would be much more in favor of your "InteractionSpace" API: it gives the information we're discussing here in a direct way.

OverlayType "probably" implies the same thing, but what happens in this case:

  • I have a screen-based browser that supports WebXR but chooses (for whatever reason) to not support DOM overlay?

The best approach would be for the developer to use WebGL to create screen-space widgets. The former supports the developer detecting this situation; the latter does not.

@blairmacintyre the problem with the interactionspace API is that it creates a pit of failure for devices that support DOM overlay via emulation: devs might read it as "oh, world space, time to draw my own world space UI" instead of using dom overlay

the case you bring up does seem relevant, though.

perhaps the best option is two enums? but that also has the same pit of failure.

Ideally, this API nudges people towards DOM overlay whenever possible.

I wasn't clear: I didn't mean we should do one over the other, I meant we should do both. They tell different things, and they tell it directly, succinctly and clearly. Do both.

Ideally, this API nudges people towards DOM overlay whenever possible.

I don't agree with this. DOM overlay is there to make it possible for web devs to leverage the DOM, if they want. But there is no particular reason that's better than implementing your own screen-aligned UI in WebGL.

I would say, instead: "Ideally, these two APIs give developers the information they need to make the best choice for their application."

So I guess your proposal is that we do the following:

require InteractionSpace:

enum InteractionSpace {
    "screen-space",
    "world-space",
}

A future hypothetical overlay module can require the following:

enum OverlayType {
    "screen-space",
    "emulated",
    "unsupported"
}

(we may just replace the unsupported enum value with a null state)

That could work.
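
To see how the two enums might combine in practice, here is a minimal sketch of app-side decision logic. Neither enum exists in any shipped API; the values are taken from the proposals above, and `chooseUiStrategy`, its `preferOverlay` flag, and the returned strategy names are all made up for this example. Whether to prefer the overlay at all is left as an app choice, since that is exactly the point under debate.

```javascript
// Pick a UI strategy from the two proposed hints.
// interactionSpace: "screen-space" | "world-space"
// overlayType:      "screen-space" | "emulated" | "unsupported" | null
// preferOverlay:    app policy; true means use an overlay whenever one exists
function chooseUiStrategy(interactionSpace, overlayType, preferOverlay) {
  const overlayAvailable = overlayType === "screen-space" || overlayType === "emulated";
  if (preferOverlay && overlayAvailable) {
    return "overlay";
  }
  // No overlay (or the app opted out): draw the UI ourselves in WebGL.
  return interactionSpace === "screen-space" ? "webgl-screen-space" : "webgl-world-space";
}

// A screen-based browser that supports WebXR but not DOM overlay:
console.log(chooseUiStrategy("screen-space", "unsupported", true));
// A headset with an emulated overlay, where the app wants its own widgets:
console.log(chooseUiStrategy("world-space", "emulated", false));
```

The first call covers the case raised earlier where only the interaction space tells the app to build screen-space widgets; the second shows that having both hints does not force the overlay on apps that would rather draw their own UI.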


I think the pit of success is still important here, if there is a way to do buttons really nicely, especially with the emulated overlay UI (which will probably be better than what you might ad-hoc design), we'd like to nudge people towards it, but of course give them the option to build their own controls from the bottom up if they want.

I think the pit of success is still important here, if there is a way to do buttons really nicely, especially with the emulated overlay UI (which will probably be better than what you might ad-hoc design), we'd like to nudge people towards it, but of course give them the option to build their own controls from the bottom up if they want.

I don't think you should focus on "ad-hoc design" here, it's orthogonal. If WebXR becomes widely used, toolkits that implement beautiful widgets will eventually be created. Right now, it is definitely easier to create beautiful 2D widgets in the DOM, but I honestly see no reason to assume that will always be the case. (And FWIW, my personal experience is that it's also pretty darn easy to create awful UIs in the DOM).

The key point is that developers should have choice. Especially when we think about WebXR apps that are just one part of a larger web application that includes 2D content, it's pretty important for us to support mixing DOM and 3D. But it's not "just" about nice widgets, it's about leveraging development efforts across multiple modalities when appropriate.

A key clarification here is that overlay != DOM overlay. The GL-based screen-space UI that you mentioned does not need to be limited to DOM content (in fact if we were to get really pedantic you could use a canvas context in a DOM overlay, but I digress). By forcing rendering into the main buffer we prevent UAs from making the best choice for the device they are on.