A CaptureController object for getDisplayMedia()
Closed this issue · 42 comments
As promised in https://www.w3.org/2022/05/17-webrtc-minutes.html#t05 a dedicated controller object for getDisplayMedia.
An initial version would be aimed at solving #190 for now:
const controller = new CaptureController();
const stream = await navigator.mediaDevices.getDisplayMedia({controller});
// Default is no focus when a controller is associated.
if (focus) {
await controller.focus();
}
The controller is associated 1 ←→ 1 upon getDisplayMedia success, and cannot be re-associated after that.
- Presence of the controller suppresses the automatic bring-to-front with focus that happens today.
- Without a controller, getDisplayMedia would bring-to-front with focus like today for backwards compatibility.
- Apps may call focus() anytime ahead of getDisplayMedia or right after it (upto 1 second after seems fine, but let’s iterate on details). Rejects with InvalidStateError after a timeout or if capture has stopped, whichever is sooner.
- focus() resolves after the captured window or tab has been brought to front and focused, or asap for monitor display surfaces, but no earlier than gDM success.
The controller is associated 1 ←→ 1 upon getDisplayMedia success, and cannot be re-associated after that.
I do not see much value in reusing controller objects.
It would be simpler if the controller is associated upon getDisplayMedia being called (except in case of synchronous exception). That would cover the case of getDisplayMedia being called twice in a row for instance. First one would proceed, second one would reject synchronously.
1. Presence of the controller suppresses the automatic bring-to-front with focus that happens today.
Seems fine given above caveat.
2. Without a controller, getDisplayMedia would bring-to-front with focus like today for backwards compatibility.
This is UA territory, I am not sure the current spec says anything about this.
A note in the spec would probably be good here.
3. Apps may call focus() anytime ahead of getDisplayMedia or right after it (upto 1 second after seems fine, but let’s iterate on details)
I am not a big fan of a timer here.
First we might enter UA territory, with different timeout values.
Second, if you leave something like 1 second, people might want to go to network (say with Capture Handle identity) to decide what to do. This might lead to races between the timer and networking.
Maybe we should start with something restrictive and relax to a timer approach if we see good use cases for it.
Or we could require an explicit decision from the capturer and delay video frame delivery until the decision is made.
focus() would then take a boolean parameter to express the decision of the capturer.
We would delay delivery of video frames until the focus promise is settled.
4. focus() resolves after the captured window or tab has been brought to front and focused, or asap for monitor display surfaces, but no earlier than gDM success.
This seems fine. That means getDisplayMedia promise resolves first, then focus promise.
In case focus() has been called before calling getDisplayMedia and getDisplayMedia fails, focus() should probably reject too with some kind of error. This aligns with the idea to associate controller and getDisplayMedia request at getDisplayMedia call time.
Some additional thoughts:
- controller.focus() is a bit unclear what is being focused, capturee or capturer. Any idea how to make the name clearer?
- It might make sense to allow UA to override the page asking not to focus. This is probably important to state something like this given this is observable to JS. focus could reject with an error of type NotAllowedError for instance.
- In case UA implements a dynamic "change the surface" picker, should there be a possibility for the web application to decide again whether focus happens?
It is also worth pointing out that not focusing increases the risk of mis-selection by the user, so particular care should be brought there by UA to limit this risk.
To clarify the tone of this message - I think we're on a similar page overall. I am supportive of the controller approach as a good compromise, I want to use it to reach consensus on Conditional Focus, and I'd like us to make rapid progress over the shape. My comments below are to that effect.
const stream = await navigator.mediaDevices.getDisplayMedia({controller});
Would this be getDisplayMedia({video: ..., controller: controller})
?
await controller.focus();
Do we really want to inform the capturer that the capturee has gained focus?
The capturing application can observe that it has, itself, lost focus. But it can't tell if the user has alt-tab
-ed to another native app, switched to another tab, etc. And I think that's enough.
Maybe you intention here is to convey an error message back to the application when focus() fails? Would this be actionable and useful enough to merit the complexity and privacy-concerns entailed?
Without a controller, getDisplayMedia would bring-to-front with focus like today for backwards compatibility.
On the one hand, I really like how this requires an explicit decision from new code while still retaining the established UA behavior old code. Well done there!
On the other hand, I am concerned about specifying defaults.
We have two cases to discuss defaults for:
- getDisplayMedia() is called without a controller.
- getDisplayMedia() is called with a controller, but the application either fails to call focus(), or does so unreasonably late.
I think we should start out leaving the defaults here up to the UA, and circle back to this if we manage to get consensus. This is reasonable because:
- Without a controller - the status quo is maintained. No new commitments, but in practice we all do the same (Chromium and Firefox both default to focusing; Safari has not yet implemented).
- With a controller, but failing a timely call to focus() - I think an MVP can reasonably neglect to specify it.
a. Applications that forget to callfocus()
are buggy and should be fixed.
b. Applications that callfocus()
, but CPU hiccups render the call too late - this will be very uncommon, and inconsistent cross-browser behavior here is an insignificant concern to developers and users.
This is UA territory, I am not sure the current spec says anything about this.
A note in the spec would probably be good here.
If I am reading @youennf correctly here, then we're in agreement; see above.
I am not a big fan of a timer here.
The timer is a necessary evil, to remove the sting from this hypothetical malicious code:
await ...getDisplayMedia({video: ..., controller: controller});
busyWait(/*seconds=*/5);
controler.setFocusBehavior("focus-captured-surface");
Note that it doesn't have to be malicious. If a CPU hiccup or buggy code happens to cause the shared surface to be focused at an arbitrary, unexpected point in the future, it will surprise and annoy the user. It might also produce an unintended interaction with the focused surface, that might have been intended for another page.
Speaking of which - we should also require that focus() be rendered no-op if the capturing page is not the focused one by the time focus() is evaluated. I believe my current draft of Conditional Focus does that, but otherwise, we could change it.
controller.focus() is a bit unclear what is being focused, capturee or capturer. Any idea how to make the name clearer?
I agree that there is unclarity here. Chrome has gone with the following, and I think it's a good place to start:
// Previously on the track itself, now on the controller:
undefined focus(CaptureStartFocusBehavior focus_behavior);
enum CaptureStartFocusBehavior {
"focus-captured-surface",
"no-focus-change"
};
I'd just change focus()
to setFocusBehavior()
and I think we're set. We'll also have a credible path, post-MVP, to allowing focus-behavior to be set before getDisplayMedia(), which Jan-Ivar was previously interested in, so that should be good too.
It might make sense to allow UA to override the page asking not to focus. This is probably important to state something like this given this is observable to JS. focus could reject with an error of type NotAllowedError for instance.
I'm OK with that.
It is also worth pointing out that not focusing increases the risk of mis-selection by the user, so particular care should be brought there by UA to limit this risk.
There are two risks here:
- The risk that mis-selection will occur.
- The risk once mis-selection has occurred.
Chrome's approach to this problem was to reduce the first risk. We did that by introducing a preview for tabs (we'd already had them for windows and screens). With that in place, the second risk has become far less relevant.
Glad to hear support for the controller model! Totally agree that in absence of the controller a spec note seems fine.
It would be simpler if the controller is associated upon getDisplayMedia being called (except in case of synchronous exception).
In case focus() has been called before calling getDisplayMedia and getDisplayMedia fails, focus() should probably reject too with some kind of error.
I like it. And synchronous exceptions aren't possible from async API methods, so we don't have to worry about those.
I am not a big fan of a timer here. ... This might lead to races between the timer and networking.
Maybe we should start with something restrictive and relax to a timer approach if we see good use cases for it.
I feel we rabbit-holed in #190 and failed to find anything more restrictive that didn't trade one problem for another.
But from some of the comments I worry I've undersold the model shift when a controller is present:
The focus decision is untied from and no longer orchestrated by getDisplayMedia resolution. This gets us away from the microsecond timescale and questions of defaults.
Instead, focus is explicitly executed by a separate focus()
method where, much like a transient activation-consuming API, a relevant timescale of half-seconds seems fine, because what matters is the time between end-user action (hitting "Share" in the prompt) and when a window becoming focused will still be perceived as its reaction without it catching end-users off guard. Such decisions have traditionally been user agent ones.
Defaults, settable states, and media output blocking seem a poor fit in this model. If we're to go with a method model then we should lean into its advantages of separation. Otherwise we should probably go with a different model.
controller.focus() is a bit unclear what is being focused
In my mind, a captureController controls the capture so captureController.focus() focuses the capture, which seems clear to me, and I like short names, but we can always bikeshed.
It might make sense to allow UA to override the page asking not to focus.
I think this would defeat use cases such as screen recording. My working assumption as presented has been that screen-capture isn't solely a tool for video conference presentations, and that the current auto-focusing done by browsers amounts to a layer violation. Any perceived privacy guarantees from existing browser behavior here probably deserves to be challenged. I'm happy to beef up privacy indicator requirements, but we should probably discuss that in #211 to avoid this getting long.
Do we really want to inform the capturer that the capturee has gained focus? ... privacy-concerns entailed?
Privacy concerns seem no different to me from today where an app knows that if the resulting displaySurface
is "window"
or “browser”
then the captured surface has been focused (or will be focused in ~1 second) 99% of the time in all browsers that support that surface type. I'm not sure what exploit is thwarted by the lack of accuracy that a promise would provide here. At the same time, I don't have a specific use case in mind, and agree it is not a requirement, so I'm open to discuss, but am curious about reasons to resolve the promise prematurely.
The method model does offer us a way to convey failure to focus for whatever reason. Might that be of interest to video conferencing sites?
I also agree that previews are the solution to mis-selection.
const stream = await navigator.mediaDevices.getDisplayMedia({controller});Would this be getDisplayMedia({video: ..., controller: controller})?
Yes, both work since video defaults to true. We're missing a WPT test for that though...
It seems we have two paths here.
-
focus() is a one time thing.
This option is more capturer saying "I made a decision to focus or not to focus".
In that case, I would prefer that web site be explicit about either 'I focus' or 'I do not focus'. -
focus() is a way to focus capturee, not just after getDisplayMedia but at any point in time.
It does not seem bad for capturer to provide UI to quickly go to capturee.
In that case, we could gate it by standard user activation, and we could say that getDisplayMedia prompt selection (if successful) is a user activation.
AIUI, current proposal is close to 2 but with the constraint that focus() is called only once around the time getDisplayMedia is called/resolved.
I think this would defeat use cases such as screen recording. My working assumption as presented has been that screen-capture isn't solely a tool for video conference presentations
My understanding was that it is difficult to predict what VC apps want.
For screen recording applications, isn't it the case that they will always want one or the other, irrespective of the selected surface, something like:
- Always focus if starting to record right away
- Do not focus at getDisplayMediaTime. Try to focus when user presses the record button (which will probably happen a few seconds after getDisplayMedia resolution time).
I also agree that previews are the solution to mis-selection.
This is not sufficient. An attacker can make two pages looking exactly the same (previews will not help there) and get to a point where capturer is capturing a surface that it can navigate while continuing to capture. Unfocusing makes this attack a little bit easier (attacker will no longer need to trick user in unfocusing the capturee).
- focus() is a one time thing.
This option is more capturer saying "I made a decision to focus or not to focus".
In that case, I would prefer that web site be explicit about either 'I focus' or 'I do not focus'.
That's the version I'm after, and for that reason, I want an explicit decision - setFocusBehavior(FocusBehaviorEnum)
.
It does not seem bad for capturer to provide UI to quickly go to capturee.
In that case, we could gate it by standard user activation, and we could say that getDisplayMedia prompt selection (if successful) is a user activation.
Vendors of video conferencing software have requested this functionality (this = focus-capturee at an arbitrary time). At the moment, I don't think we can accommodate them, because the risks outweighing the rewards. (Note that user activation is insufficient - tricking the user to double-click at arbitrary coordinates on the capturer then becomes equivalent to tricking them into single-clicking at the respective coordinates of the captured surface.)
Focus at an arbitrary moment is out of scope for me at this time.
At the same time, I don't have a specific use case in mind
That's my situation too. There's significant overhead to adding this functionality (this = Promises that resolves when focus-change happens).
- Spec authors need to (1) evaluate the risk and (2) specify.
- Implementers (3) have more work when implementing.
- Someone has to (4) document this additional functionality for Web-developers.
- Web-developers (5) need to read longer documentation and (6) produce more involved code.
A use case has to motivate all of this extra work. I have thought about it and could not find it.
The method model does offer us a way to convey failure to focus for whatever reason. Might that be of interest to video conferencing sites?
I don't think it's actionable enough to justify the costs. To hedge our bet, if this changes in the future, we can fix our mistake by exposing a getLastFocusResult()
that returns a Promise. It's not pretty, but I believe that it's a plan B that we'll never have to implement, so it's existence is enough to calm my nerves over this design decision.
This is not sufficient.
I don't understand why. Your example attack consists of capturing the wrong tab, but one which looks identical to the right one. How is focusing it making it less identical? How does it defang the attack? Focus seems orthogonal here.
I propose the following consolidation:
- No controller provided - legacy behavior retained (focus not specified).
- getDisplayMedia() with a reused CaptureController object -> rejected Promise before showing the user a dialog.
- If CaptureController.setFocusBehavior() is not called, focus behavior is not specified. It's a non-issue, and user agents will continue to implement their own behaviors.
- If getDisplayMedia() fails, all methods on the CaptureController will raise an exception.
- CaptureController.setFocusBehavior() raises an exception if called from a microtask other than the one getDisplayMedia() resolved on.
- Second call to CaptureController.setFocusBehavior() within the same aforementioned task raises an exception.
- CaptureController.setFocusBehavior() is no-op if called from the aforementioned microtask, but >= N milliseconds after that microtask has started executing. (I can go into details about why no-op is important for Chrome here, but I'm trying to keep the discussion focused for now, no pun intended.)
- Spec TODO to decide whether setFocusBehavior() before getDisplayMedia() is allowed, and if allowed, exactly how it works. (Chrome will implement this as no-op so as to ensure we can easily align with specified behavior once we reach consensus.)
If we can agree on this, I can prepare the PR.
because the risks outweighing the rewards.
Can you elaborate on the risks?
Could these risks also apply in case we delay focus by 1 second as proposed here?
Can you elaborate on the risks?
Please see the text beginning with "tricking the user to double-click..." and ending with "...of the captured surface" in this message.
Could these risks also apply in case we delay focus by 1 second as proposed here?
- Not credibly, due to the time limit. And we can tighten the time limit if necessary.
- The existing behavior of all user agents that support capturing windows or tabs, is to focus the shared surface. Effectively, Conditional Focus is an opt-out of this behavior, and into the less risky behavior.
Please see the text beginning with "tricking the user to double-click..." and ending with "...of the captured surface" in this message.
Your past message says that tricking a user to click will allow the web page to focus to the capturee page.
My question is what are the scenarios where you think this behaviour will be exploited for bad reasons.
2. Effectively, Conditional Focus is an opt-out of this behavior, and into the less risky behavior.
Why is not focusing the less risky behaviour?
Isn't it more risky to leave the user without any preview of what is being shared?
Also, it seems wrong if the more risky behaviour is the default behaviour.
Your past message says that tricking a user to click will allow the web page to focus to the capturee page.
My question is what are the scenarios where you think this behaviour will be exploited for bad reasons.
I am not sure where the unclarity lies. Two options have been discussed:
- Focus yes/no can only be decided right after getDisplayMedia() resolves.
- Focus yes/no can be decided at an arbitrary point.
I have explained the dangers inherent to 2, and explained that therefore, only 1 is in scope for me.
Namely, if we were to implement 2, an attacker could cause the user to click inside the captured page, at a location of the attacker's choice. This risk does NOT exist with 1 to any credible extent.
I hope this answers the question?
Why is not focusing the less risky behaviour?
Because the user keeps seeing the same page they had manually focused before, and with which they had interacted, and which would still be the active page if not for the getDisplayMedia() call.
Isn't it more risky to leave the user without any preview of what is being shared?
The user DID see a preview, as previously discussed. If you want to mandate a preview in the spec, we can do that.
Also, it seems wrong if the more risky behaviour is the default behaviour.
It's the existing behavior for all supporting browsers, but I am willing to specify differently if that helps drive us reach consensus.
Let's see if we can nail down the controller pattern first. Folks who have spoken up seem in agreement that:
- Any controller is associated 1 ←→ 1 synchronously in getDisplayMedia, and cannot be re-associated
- Without a controller, any focusing is implementation defined with a note to recommend web compat behavior (focus)
- Controller is dead if getDisplayMedia rejects or once capture dies (e.g. UA halts or all tracks have been stopped)
Great! Maybe we can nail down this one next, since it seems unrelated to timing:
- Presence of the controller suppresses the automatic bring-to-front with focus that happens today.
This removes implementation defined layer-violating behavior which is bad for tests, web developers, users, and screen recording:
It might make sense to allow UA to override the page asking not to focus.
I think this would defeat use cases such as screen recording. ...
For screen recording applications, isn't it the case that they will always want one or the other, irrespective of the selected surface, something like: ... Do not focus at getDisplayMediaTime."
Yes, and allowing user agents to override a page not wanting focus would defeat that, wouldn't it?
I think a user agent override here would require a strong end-user benefiting reason. Regarding "mis-selection":
I also agree that previews are the solution to mis-selection.
This is not sufficient. ...
I should have said: I don't think focus is the solution to mis-selection. It's the solution to VC presentation, and comes with its own risks like clickjacking. Mitigating one risk with another seems like an unnecessary bargain and a poor strategy.
Do we agree that supporting screen recording by letting users pick a displaySurface that isn't brought to front, is a requirement? If we don't have our requirements down, it's probably premature to discuss API.
I do agree with the assessment that "not focusing increases the risk of mis-selection" and that UA's should "limit this risk", but I think user agents need to solve this without focus in #211.
Do we agree that supporting screen recording by letting users pick a displaySurface that isn't brought to front, is a requirement? If we don't have our requirements down, it's probably premature to discuss API.
I definitely agree, and it seems to me that you agree too, @jan-ivar. We have discussed this API at #190 over multiple months and several interim meetings. We have all been discussing shape in-depth for a long time now, so if anyone were to call the entire thing into question at this stage, I'd be very curious to hear what had changed their mind.
I see points 1-8 in my message as sufficient for an MVP. Could everyone involved please mention:
- Which items they agree with.
- Which items they disagree with.
- Which items they believe are missing for a viable MVP.
I definitely agree, ... We have discussed this API at #190 over multiple months and several interim meetings. We have all been discussing shape in-depth for a long time now, so if anyone were to call the entire thing into question at this stage
I agree, but with my chair hat on: we have a process where urgency is for the chairs to manage outside of github so we can focus on receiving input on the topic here. chair hat off.
@jan-ivar, I'm assuming your last comment pertains only to the lines it quoted. So wrt to this part:
I see points 1-8 in my message as sufficient for an MVP. Could everyone involved please mention:
- Which items they agree with.
- Which items they disagree with.
- Which items they believe are missing for a viable MVP.
What is your answer?
Or are you saying that this is not the way forward towards consensus? If so - what's next, then?
(Btw, I am not sure what urgency is discussed in this comment. The entire premise of Conditional Focus is that the captured surface may be focused or not focused depending on the application's preference. This appeared to be acceptable to all concerned, as we were already diving deep into the particulars. It is reasonable to ask if that has changed, and if so, why.)
AIUI, current proposal is close to 2 but with the constraint that focus() is called only once around the time getDisplayMedia is called/resolved.
FWIW I disagree with this characterization, as it associates the OP proposal with features that aren't being proposed (straw man). My comparison to user activation was merely meant to suggest our concern over sites missing a timeout seems overblown, because the task of focusing a window is inherently racy with end users anyway who may be focusing windows themselves if they become impatient. E.g. as a user if my CPU is so busy it takes 2 seconds for a site to request focus, then chances are I'll have clicked somewhere else already which might cause focus to never happen. I question letting such edge cases impact API design.
- If CaptureController.setFocusBehavior() is not called, focus behavior is not specified.
Would this unspecified behavior be identical to point 1? An advantage of leaving the default focus behavior unchanged perhaps is that if the controller grows more orthogonal features in the future (e.g. send capture actions), people might add the controller to use those features only and be surprised by the inadvertent change in focus behavior. So maybe that part is better? OTOH in the send capture handle case users probably DON'T want to focus, so it's hard to say which is better.
- CaptureController.setFocusBehavior() raises an exception if called from a microtask other than the one getDisplayMedia() resolved on.
This seems like a rehash of #190 with all its microtask problems I raised there. Do we really want to go back there?
A microtask-sensitive promise-based API sounds terrible and wouldn't even work with Promise.race()
. Wouldn't that also preclude use of Capture Handle to answer the focus decision?
If we go that way, then a callback like @youennf suggested might work. But even that wouldn't stop JS from using XHR to do networking...
Would this unspecified behavior be identical to point 1?
Unspecified is unspecified, but at least in Chrome's implementation, yes, that's what I intend.
people might add the controller to use those features only and be surprised by the inadvertent change in focus behavior.
Exactly.
This seems like a rehash of #190 with all its microtask problems I raised there. Do we really want to go back there?
I don't see an alternative. Or did I misunderstand your mention of user-activation? Do you suggest we only use a timer? If so, my problems with that include:
- That seems like it's going to encourage Web-devs to produce buggy code that induces flaky behavior. Namely, they'll call
setFocusBehavior()
from a new task, see that it works on their machine, and fail to realize it flakes out 0.1% of the time on users' machines. But if we encourage them to call setFocusBehavior("no-focus-change") from the same microtask, their code end up flaking due to the timer far less often. - When the application fails to call
setFocusBehavior()
at all, the browser will be forced to wait N milliseconds before it performs its default behavior - which is hopefully identical to the existing behavior, which for both Chromium and Firefox happens to be focusing. Users that notice that screenshare ALWAYS produces a 1s delay before focus on a given application, are not going to be happy. (Admittedly, maybe the Web-devs should just do better. If this were my only argument, maybe I'd have relented on this topic.)
A microtask-sensitive promise-based API sounds terrible
I hope we can agree that setFocusBehavior()
itself will not return a Promise.
Or may I have misunderstood you on this particular point? If so, please clarify.
Wouldn't that also preclude use of Capture Handle to answer the focus decision?
Why? getCaptureHandle()
is synchronous. Or do you mean it won't be possible to talk to a remote server to help drive the decision? I don't think that's a problem, I expect the decision will always be made locally.
If we go that way, then a callback like @youennf suggested might work.
I'd have to refresh my memory on why we'd ended up backing away from callbacks. But it seems unnecessary. Why is it not simple to specify that setFocusBehavior()
be called from the very microtask on which gDM's Promise resolves?
After filing #231 I fear that focusing after capture has begun adds some risk. This perhaps limits the attractiveness of a timer unless it's quite short. cc @martinthomson
Would this unspecified behavior be identical to point 1?
Unspecified is unspecified, but at least in Chrome's implementation, yes, that's what I intend.
Good, we could still specify the invariant that it needs to be the same behavior.
Do you suggest we only use a timer?
Yes, that is all I proposed in the OP.
If so, my problems with that include:
That seems like it's going to encourage Web-devs to produce buggy code that induces flaky behavior. Namely, they'll call setFocusBehavior() from a new task, see that it works on their machine, and fail to realize it flakes out 0.1% of the time on users' machines.
As I mentioned I question whether this is real. If my machine takes more than one second to run a task then it's too late to focus reliably anyway. I'd argue it becomes unsafe to focus then #231.
When the application fails to call setFocusBehavior() at all, the browser will be forced to wait N milliseconds before it performs its default behavior
Not in my OP proposal.
A microtask-sensitive promise-based API sounds terrible
I hope we can agree that setFocusBehavior() itself will not return a Promise.
Or may I have misunderstood you on this particular point? If so, please clarify.
getDisplayMedia returns a promise. See #190 (comment). Microtasks are fickle:
const p = Promise.resolve();
function foo() {
return p;
}
async function bar() {
return p;
}
console.log(foo() === p); // true
console.log(bar() === p); // false
E.g. simple stubs like this would FAIL, which seems unattractive:
async function gDM(constraints) {
return navigator.mediaDevices.getDisplayMedia(constraints);
}
(async () => {
const controller = new CaptureController();
const stream = await gDM({controller});
controller.setFocusBehavior("focus-captured-surface"); // throws
})();
getCaptureHandle() is synchronous.
Ok I misunderstood the requirements then. A callback approach avoids this. E.g.:
const controller = new CaptureController();
let stream;
controller.pleaseFocus = () => stream.getVideoTracks()[0].getCaptureHandle().origin != window.origin;
stream = await navigator.mediaDevices.getDisplayMedia({controller});
Callbacks are ugly, and tasks and microtasks are equally vulnerable to busy waiting. So I'd prefer e.g.:
function queueTask(f) {
window.addEventListener("message", f, {once: true});
window.postMessage("hi", window.origin);
}
const controller = new CaptureController();
const stream = await navigator.mediaDevices.getDisplayMedia({controller});
controller.focus(); // works
await undefined;
controller.focus(); // works
await new Promise(queueTask);
controller.focus(); // throws
If the next task takes over a second to run then something's seriously broken and focus fails, which seems good. Thoughts?
The use cases I heard so far:
- app always wants focus
- app never wants focus
- app might want focus for some display surfaces, not others.
Those use cases would be covered by boolean focusTab and focusWindow preferences to DisplayMediaStreamOptions.
Or maybe a single focus preference that could take a vector of strings (tab, window, same-origin-tab...).
This should be quick and easy to spec/implement and avoids some of the issues we are trying to deal with.
Would it solve enough of the urgent use cases we are trying to cover?
This does not prevent from pursuing focus() in parallel for more advanced use cases.
This would be especially good if we can extend the duration where it could be called, with a proper security model.
The use cases I heard so far:
Youenn, the very first post of the original thread mentions the need for conditionally focusing depending on what application is being captured. Please read the Sample Use Case section here.
So the urgent use case you have in mind is web app focus decision based on information provided by capture handle identity?
So the urgent use case you have in mind is web app focus decision based on information provided by capture handle identity?
Correct. This has been presented in the WebRTC WG interim meetings over the last year. Speaking of which - happy anniversary to the original Conditional Focus thread! 🎂
And here is a deck from the WG meetings that explains this. See slides 12-14. The minutes reflect your contributions to the discussion.
@jan-ivar, if, as you propose, we specify that both getDisplayMedia() and getDisplayMedia({controller}) must have the same default unspecified behavior in the absence of a call to setFocusBehavior(), then it will effectively mean that the latter will focus by default. This means that for applications that don't call setFocusBehavior(), focus always changes after N milliseconds. Repeat - not just when the CPU is unusually busy, but always. That's not fun for the user, so I'd call this decision mutually-exclusive with the decision to rely solely on a timer.
Proposal
I think we can avoid the shimming issues associated with a microtask if we go with a task instead. (I think @alvestrand and @youennf have both suggested that before.)
Precisely formulated, when getDisplayMedia({controller})
is called, it returns a promise p
as per the getDisplayMedia algorithm. The normal steps execute, with the following additions:
- Define as WOO the "window of opportunity" for calling
setFocusBehavior
. - At the point where getDisplayMedia's algorithm resolves
p
, instead:- Schedule a microtask to resolve
p
. - Schedule a task (not microtask) to close the WOO.
- Declare the WOO open.
- Schedule a microtask to resolve
- Calling
setFocusBehavior()
from inside the WOO has the expected effects discussed elsewhere. - Calling outside of the WOO raise a synchronous exception.
- The WOO closes when the earlist of these happens:
- Application calls
setFocusBehavior()
. - The task to close the WOO executes.
- All tracks of the MediaSream are closed. (Or maybe the video track is enough?)
- The document that called getDisplayMedia loses activation/focus. (See open questions below, btw.)
- Application calls
- To defend against timing issues described earlier on both #190 and #230, we also specify a timer (or just allow it - see below). Calls to
setFocusBehavior()
when the WOO is still open, but after the timer elapses, are no-op (no exception). Rationale:- Timer + exception becomes a bad mix once we introduce IPC. And we have to account for the possibility of IPC. E.g. in Chrome, the timer will be on the trusted browser process, not in the content process, so as to avoid compromised content processes from being able to focus at an arbitrary time.
- This uncommon edge-case is mostly inactionable for the application.
Open questions
- What happens when the controller and/or tracks are transferred? Should we still tie WOO to the activation/focus of the document which originally called getDisplayMedia?
- Possibly the simplest solution is to close WOO as soon as either track or controller is transfered?
- Should the timer be specified, or should it be left up to the UA?
- I would prefer specifying.
At the point where getDisplayMedia's algorithm resolves
p
, instead:
- Schedule a microtask to resolve
p
.
s/microtask/task.
- Schedule a task (not microtask) to close the WOO.
This sounds good enough to me.
It is stricter than what @jan-ivar proposes so seems like a good start as it is always feasible to relax in the future.
- Calls to
setFocusBehavior()
when the WOO is still open, but after the timer elapses, are no-op (no exception).
Rejecting the promise is probably better.
There is no harm in doing so AFAIUI and the promise might reject for other reasons (say user sets a preference to always focus no matter what web page is saying).
- What happens when the controller and/or tracks are transferred?
Controller is not transferrable so far.
Transferring a track seems orthogonal, ditto for track being stopped.
- Should the timer be specified, or should it be left up to the UA?
It should probably be left to UA as this should be an edge case.
My understanding is that:
- The algorithm would allow to focus for the current task only.
- The security section would detail the threats and potential mitigations like the timer approach you are proposing.
Rejecting the promise is probably better.
We can do the following if you'd like:
- Timer in the content process elapses -> reject the promise.
- Timer in the trusted process elapses -> silent failure as no-op.
Note:
WOO-start(browser) <= WOO-start(content)
andWOO-end(browser) <= WOO-end(content)
, and obviousstart < end
for both, but nothing else can be guaranteed.- It could be that setFocusBehavior() is called after 0.4999 seconds from one process's view, but after 0.5001 seconds from another's. This will lead to no-op.
Controller is not transferrable so far.
IIRC, when presenting to the WG, or possibly during our editor meetings, @jan-ivar spoke of the controller being transferable. We can start with it as untransferable and iterate, though. That's more than fine with me.
ditto for track being stopped
I'd still want focus-change to stop then, but it's not terribly important for me given the short timer, and how the user would not really notice the ordering of track-stopped focus-changed anyway.
It should probably be left to UA as this should be an edge case.
Let's go with that then. It would be hard for me to commit to a value now. I'd rather set it experimentally and be free to tweak it if the need arises.
Sounds like we're nearly there? If @jan-ivar agrees, I can send the PR.
- Timer in the trusted process elapses -> silent failure as no-op.
Why is it important to silent fail this case? It seems better to report the error.
With regards to setFocusBehavior vs. focus, I think we should have a discussion about whether we think there is a path forward for supporting focus() at an arbitrary time (as long as capture happens).
Why is it important to silent fail this case? It seems better to report the error.
Because we want setFocusBehavior()
to be sync. Or do you see this case as different from that of CropTarget creation? Back then, Apple and Mozilla felt very strongly that anything that could be sync, should be sync. Has that changed?
With regards to setFocusBehavior vs. focus, I think we should have a discussion about whether we think there is a path forward for supporting focus() at an arbitrary time (as long as capture happens).
Focus at arbitrary time - not at the moment, and I suspect not at all from Chrome's perspective. But suppose a path is found later - we can use it to deprecate setFocusBehavior()
. Still:
- I'd rather that, for now, we go with the most explicit shape.
- Recall that it's unspecified whether not calling
setFocusBehavior()
should focus or not.
In the CropTarget case, the value returned by the function is the CropTarget itself.
Here, the returned value is whether the UA complied with the desire of the web page to focus or not.
It makes sense to return this value as a promise since UA decision is async.
The question you might raise is whether there is enough value for exposing this information or not.
I tend to think this has some limited benefits and is more future proof.
Here, the returned value is whether the UA complied with the desire of the web page to focus or not.
I don't think we should report that. Please see earlier comments on this thread.
The question you might raise is whether there is enough value for exposing this information or not.
I tend to think this has some limited benefits and is more future proof.
There are no immediate benefits to justify the complexity. If benefits are later discovered, changing from undefined func()
to Promise<> func()
is a non-breaking change. (Excepting exceptions, perhaps.)
Great to see progress! Quick comments:
- controller should be non-transferable
- I'm ok with sync for now
- I still prefer
focus()
oversetFocusBehavior("focus-captured-surface")
The last point is not for future proofing, but because it is simple and elegant.
Browsers' behavior today assumes a specific task. It is taking automatic action toward that task. I think if you attach a controller then it's like turning off autopilot and taking over control, and the assumption flips. You are more likely to remote control the capture in some way, so not losing focus seems like a logical starting point.
So far, the features planned for this controller (sendActions, maybe move identity?) fit this assumption.
Regarding timer, with focus()
it's simpler: no concern over delaying a default action. The window simply won't be focused past some time.
Do we have TPAC time for this? If so it is probably more urgent to make slides than a PR. We could present both options even.
- controller should be non-transferable
I'm happy to start with that. (I imagine you might wish to revise that in the future for Capture Actions. I think we'll be able to.)
- I'm ok with sync for now
Awesome, agreement there too.
- I still prefer
focus()
oversetFocusBehavior("focus-captured-surface")
Browsers' behavior today assumes a specific task. It is taking automatic action toward that task. I think if you attach a controller then it's like turning off autopilot and taking over control, and the assumption flips. You are more likely to remote control the capture in some way, so not losing focus seems like a logical starting point.
Please see my point in the first paragraph of this message. Namely, your requirement of same unspecified default behavior with/without controller
is mutually exclusive with our new func() can only focus, not prevent focus
. Please choose one. My own preference is to retain same unspecified default behavior
.
Do we have TPAC time for this? If so it is probably more urgent to make slides than a PR. We could present both options even.
Yes, I intend to use the time set for me to present Conditional Focus. Naturally this will touch on the controller. Do you want to pull our slots together or reorder them so that you may present your wider vision for the controller closer to the presentation of its first use of Conditional Focus?
Regarding timer, with
focus()
it's simpler: no concern over delaying a default action. The window simply won't be focused past some time.
This appears to be the main remaining open issue. Let's tackle it. I'll explain my position again.
- Timers always produce flakiness. That's a necessary evil and we'll live with it. But it's still evil, so we should minimize it.
- Code that calls
setFocusBehavior()
sooner will be less flaky than code that callssetFocusBehavior()
later. We should strongly encourage developers to callsetFocusBehavior()
ASAP. - The task-based window-of-opportunity presented here does that. Consider some flakiness-prone patterns like doing communication with a remote server to determine whether to focus, or posting a task to do one more thing and then call
setFocusBehavior()
.- With only a timer, these anti-patterns will work 99% of the time, and developers might come to employ them. ("Works 100% of the time on my setup.")
- With a timer and the task-closing-WOO, they will work 0% of the time, forcing developers to rely on better patterns instead.
- Users will benefit from developers never taking the flakier route.
Summary of conclusions from the TPAC session of 2022-09-13 (slides):
- We'll not specify the default behavior (focus/no-focus) for either case (controller/no-controller).
- The UA is not compelled to abide by the application's request to focus/not-focus. (Accommodating Safari's implementation, where the user clicks on the window itself and brings it to the forefront as part of the flow to start capture.)
- The API will apply to both tabs and windows. (But the UA may disregard the application's preference in both/either case.)
- The shape will be setFocusBehavior(enum), citing the principle that code is more often read than written, and an explicit reference to focus-behavior produces clearer code.
- The selected approach is task-to-close-window-of-opportunity.
Regarding window of opportunity, I think we should put a note that the goal is to allow for the current task only and that the current algorithm is an approximation.
Regarding window of opportunity, I think we should put a note that the goal is to allow for the current task only and that the current algorithm is an approximation.
Works for me.
Thanks for the PR!