w3c/mediacapture-screen-share

Avoid Hall-of-Mirrors

eladalon1983 opened this issue · 21 comments

Screen Shot 2022-02-22 at 20 23 23

Accidental self-capture is a common problem for video conferencing software. When users accidentally choose the tab in which the VC app is running, a Hall-of-Mirrors effect is produced, confusing users and derailing discussions with remote users. Eliminating this would improve the lives of users and Web-developers alike.

When getDisplayMedia() is called, the user agent is supposed to show an unconstrained list of all possible surfaces, and that includes the tab from which getDisplayMedia was called. However, members of this Working Group have in the past expressed a desire to exclude the current tab from this list, with the rationale that it's an attack-vector for bypassing origin separation. (That is to say - an application capturing itself can embed arbitrary content and see it.) I intend to propose a mechanism in this vein. The mechanism I am about to propose does not purport to address the security concern. But I argue that it offers no degradation in security.

The first option to consider is that browsers could simply eliminate the option without a change in spec. However, this solution is unlikely to be universally adopted, as some legitimate applications currently use self-capture, and some existing browser would be apprehensive of breaking them (e.g. Chrome).

We could (and should) step in and help bridge the gap, for the betterment of the Web platform. I propose the following:

Recall that getDisplayMedia's argument is of type DisplayMediaStreamConstraints. That is:

partial interface MediaDevices {
  Promise<MediaStream> getDisplayMedia(optional DisplayMediaStreamConstraints constraints = {});
};

I propose that we extend DisplayMediaStreamConstraints with excludeCurrentTab:

dictionary DisplayMediaStreamConstraints {
  (boolean or MediaTrackConstraints) video = true;
  (boolean or MediaTrackConstraints) audio = false;
  boolean excludeCurrentTab = ?;  // Default value - let's discuss.
};

If excludeCurrentTab is set to true, the user agent MAY/SHOULD/MUST exclude the current tab from the list of tabs.
If excludeCurrentTab is set to false, the user agent does as it wishes. (Chrome will include the current tab, as it does today, but other browsers are free to exclude the current tab either way.)

I'm OK with this idea. Consider excludeSelf as the name, though that leads to questions about the effect on tab, window, and display in that it might prevent display capture entirely for those people who only have one display (i.e., most).

I'm OK with this idea. Consider excludeSelf as the name, though that leads to questions about the effect on tab, window, and display in that it might prevent display capture entirely for those people who only have one display (i.e., most).

Interesting idea with excludeSelf. Thinking out loud:

  • If we make it refer specifically to excluding the current tab, it would be misnamed to the point of confusing developers.
  • If we make it exclude all "self" surfaces - current-tab, current-window and current-screen - then I'd be concerned that many video conferencing software would simply not use it. (Current users often share the whole screen, and VC software would not wish to lose users.)

I think it might be interesting to also have distinct controls for excludeCurrentTab, excludeCurrentWindow and excludeCurrentMonitor. Either that, or a single cumulative enum value, e.g. exclude= and then currentTab, currentTabAndWindow, currentTabWindowAndScreen.

Devil's advocate against my own idea: Would a proliferation of selection-limiting controls go too far and allow an application to tip the user's hand towards a specific surface?

Answer: With any type of self-capture, a malicious application can directly embed what it wishes to capture. So self-capture is equivalent there to forced-capture. With one exception - native apps. But that seems like an unlikely attack for me, as a malicious application would likely much prefer self-capture through the browser window, than capture of an arbitrary native app.

Your thoughts?

jesup commented

If the point of the security risk is that a site may use capture of itself to export arbitrary content under it's control, breaking cross-origin capture restrictions, this does nothing at all to address it. A malicious app would of course not exclude the current tab, and would in fact give the user a reason for wanting to capture the current tab.

If the point of the security risk is that a site may use capture of itself to export arbitrary content under it's control, breaking cross-origin capture restrictions, this does nothing at all to address it. A malicious app would of course not exclude the current tab, and would in fact give the user a reason for wanting to capture the current tab.

I urge you to reread my original message more carefully. Specifically, this part: "The mechanism I am about to propose does not purport to address the security concern. But I argue that it offers no degradation in security." But you might wish to read the entire message again so that you may understand what I do argue.

The discussion seems mostly focused on optimising pickers.
The core issue being Hall-of-mirrors, it seems more adequate to allow the User Agent to obfuscate the region of the captured stream that corresponds to the web page doing the capture.
Exposing a property for this might be useful and would have no security downsides.
This does not prohibit User Agents to optimise their pickers based on that property.

This does not prohibit User Agents to optimise their pickers based on that property.

Your suggestion makes sense to me up to this point. :-)
I see two issues:

  1. Obfuscating some region and changing pickers - these are mostly orthogonal issues. Why should the UA infer one from the other? That's a compat issue. What will an application that expects anything-but-current-tab (on Chrome) do when it gets current-tab-but-obfuscated (on Safari)?
  2. Implementing obfuscation will take time, and AFAICT is not on anyone's agenda. (Not on Chrome's agenda at any rate. Let me know if it is on yours.) What will we do until obfuscation is implemented? Not expose its control, and therefore not infer anything in the picker? Expose it and make it no-op for its stated-purpose of obfuscation, but use it for inference?

Obfuscation is great. But it won't eliminate the need for excludeCurrentTab. So let's discuss these as separate proposals.

Obfuscation is great. But it won't eliminate the need for excludeCurrentTab

Obfuscation is solving Hall-Of-Mirrors, which the issue is about.
I do not think we should do excludeCurrentTab with Hall-Of-Mirrors as the main motivation.
If excludeCurrentTab has other benefits, that is fine, let's file a dedicated issue about this idea and detail all its potential benefits and drawbacks.

  1. What will an application that expects anything-but-current-tab (on Chrome) do when it gets current-tab-but-obfuscated (on Safari)?

Applications will need to deal with this even with excludeCurrentTab, say if selecting a window or screen.
Or are you saying excludeCurrentTab would forbid user selection of some surfaces? That would be a big change of doctrine.

With regards to security analysis, I think the current tab is a somewhat safer surface than other tabs controlled by the attacker: other tabs can be navigated while being captured, the current tab can only navigate iframes.

Or are you saying excludeCurrentTab would forbid user selection of some surfaces? That would be a big change of doctrine.

That is exactly what I am saying. :-) My original message in this thread acknowledged the prior discussions in this vein, and explained why I believe I have new things to say on this matter, and why excludeCurrentTab is potentially different. Could you please re-read that message? (My second message in this thread further elaborates on this topic, btw.)

Obfuscation is solving Hall-Of-Mirrors

I disagree. When doing whole-screen-capture, obfuscation helps with Hall of Mirrors. But it's not at all useful for tab-capture, because in that case, the entire track would be obfuscated.

Applications will need to deal with this even with excludeCurrentTab, say if selecting a window or screen.

Sorry, but I don't think that is relevant to my response to your suggestion. The hypothetical application I was discussing there, would have been ready to handle {any-tab-other-than-the-current-one, any-window, any-screen}. If you think that is unlikely, simply consider an application that stops() the track if displaySurface != 'browser'. The compat issue arises when Chrome follows your suggestion and returns anything-but-current-tab, while Safari can still return the current tab.

With regards to security analysis, I think the current tab is a somewhat safer surface than other tabs controlled by the attacker: other tabs can be navigated while being captured, the current tab can only navigate iframes.

The current tab is almost always under the attacker's control.
Other tabs - not as often.

To echo @martinthomson, I'm OK with this idea.

My long-view is getViewportMedia is for self-capture, and getDisplayMedia is for everything else. So anything that weans users off of expecting the current tab in the picker early seems good.

@jesup I agree there's no security improvement here. While I'm slightly concerned this change might lull users into not expecting the current tab to be in the picker ever, turning this into a marginal social engineering advantage for a malicious site (e.g. "See, the browser even has special rules for us asking you this"), I don't think it outweighs the benefits.

  boolean excludeCurrentTab = ?;  // Default value - let's discuss.

@eladalon1983 I'd like = true, based on where we'd like to end up.

But to follow the TAG's "see also" design principle: "APIs that have boolean arguments defaulting to true" it'd have to be:

  boolean includeCurrentTab = false;

... some legitimate applications currently use self-capture, and some existing browser would be apprehensive of breaking them (e.g. Chrome).

Do you have a list of these apps? If it's from here, they should be unaffected.

Anecdotally based on every video meeting I've been in, capture of other tabs dwarfs self-capture. So without further data, I'd guess the few apps that rely on socially engineering their users to select a specific choice in the picker, are industrious enough to add includeCurrentTab: true to keep their apps working.

When doing whole-screen-capture, obfuscation helps with Hall of Mirrors. But it's not at all useful for tab-capture, because in that case, the entire track would be obfuscated.

For self-tab-capture, Hall of Mirrors can be mitigated by the web page itself: region cropping, not previewing the captured stream...

As I said before, the goal of this property is not to solve Hall of Mirrors, but to solve the case of a web app wanting any surface but current tab. This might be a fine goal.
As long as we start with having this as a hint, this seems fine to me. We can always add stronger wording, say when getViewPortMedia becomes a thing.

The current tab is almost always under the attacker's control.
Other tabs - not as often.

Yes, but the current tab is only able to embed iframes, which are much much less frightening. These days, cookies get partitioned, sites are protecting themselves from being embedded.

Through social engineering, it might be easy to train user to select self tab for instance. Then the attacker could do something like:

  1. Attacker opens a new tab, attacker knows the new tab is next to the current tab and will be next to the current tab in screen picker. When user is back to the current tab, attacker navigates the tab to a same origin page similar to current tab.
  2. call getDisplayMedia with prefer tab surface + excludeCurrentTab. User thinks it is selecting the current tab (the only one with the attacker's origin, and preview is ok) but is in fact selecting the new tab. Of course attacker makes sure to disable focus so that user remains on the current tab, which removes an additional defense.
  3. Attacker can now navigate the new tab (which was never visible to the user from the start of the capture) to more interesting websites and harvest plenty of first-party data.

Given the potential increase of APIs to control UX (picker hints, conditional focus), I think we should strengthen the privacy/security considerations with regards to tab capture, in particular when capture continues across cross-origin navigations. Something like:

  • More clearly describe the specific risks of tab capture. Include the new APIs in the tab capture risk analysis
  • Describe existing or potential mitigations: make it clear which tab is being captured, make it clear tab capture has changed origin, mute/disable in case of (silent) navigations until user can properly notice it...
    Doing this effort will allow a collective assessment from the WG that we are not slowly weakening our privacy/security story.

As long as we start with having this as a hint, this seems fine to me.

Good to hear, @youennf!

To echo @martinthomson, I'm OK with this idea.

Good to hear, @jan-ivar!

I'd like = true, based on where we'd like to end up.

I'd suggest we start with default-include-current-tab, which is basically no-op compared to current behavior[1], and avoid breaking applications. When getViewportMedia is specified, implemented and embraced by the Web, we can change this. (When and if.) It's premature at the moment.

Do you have a list of these apps? If it's from here, they should be unaffected.

Any app that wants to allow the user to record an arbitrary surface and save it to disk/cloud, and does not display the recorded stream back to the user, is not in danger of HoM, and might not wish for the current tab to be excluded from the list of surfaces offered to the user.

As a concrete real-life example:

loom

The above image is an illustration using a Web-application called Loom[2]. This application allows users to record any surface. If the user captures the current tab, then the user can record a demo of using the application itself. (Loom appears to have posted to YouTube ~45 videos of how one uses their app.) In the future, maybe such applications could embed slides directly inside their own app, making self-capture even more of an interesting use-case, while still keeping other surfaces relevant choices for the user.

--
[1] I can only speak about Chromium because it's the only one with tab-sharing at the moment. As we all work in groups where decision-making is distributed, I'd like to ask: Is it Mozilla's opinion as a group, that if Firefox comes to support tab-sharing before we specify getViewportMedia and excludeCurrentTab, then gDM will not allow the user to share the current tab? Will this be the behavior Firefox UX and PMs choose to ship?
[2] I am not affiliated with Loom and they have not approved this message.

When getViewportMedia is specified, implemented and embraced by the Web, we can change this.

We can't because of the design principle I mentioned above, so we need to decide on the default first.

Loom ... If the user captures the current tab, then the user can record a demo of using the application itself. (Loom appears to have posted to YouTube ~45 videos of how one uses their app.)

Thanks, but I don't accept this as a use case, because it seems trivial (better even) to open the Loom page you want to record in a separate tab (which lets you demo any page, not just ones with recording controls embedded in them).

Is it Mozilla's opinion as a group, that if Firefox comes to support tab-sharing before we specify getViewportMedia and excludeCurrentTab, then gDM will not allow the user to share the current tab? Will this be the behavior Firefox UX and PMs choose to ship?

PMs don't drive specs at Mozilla. Also, "which choices are available to choose from is up to the user agent", so no answer owed. I think the spec question isolates to: allow apps to express that A) they want self-capture, or B) they don't want self-capture. Whether they get it or not we already decided was irrelevant (a hint). I prefer A, because most sites are in B.

We can't because of the design principle I mentioned above, so we need to decide on the default first.

Clarification: I don't mind if it's excludeCurrentTab or includeCurrentTab. Given that you can cite a an established principle for making it includeCurrentTab, I am 100% OK with that, and everything else in my comment assumes this change. To avoid distraction, I try to use "default-include-current-tab", as it would otherwise be confusing if I mean include/exclude when I say "default-false" or "default-true".

Could you please re-read my comment in light of this clarification?

Thanks, but I don't accept this as a use case, because it seems trivial (better even) to open the Loom page you want to record in a separate tab (which lets you demo any page, not just ones with recording controls embedded in them).

It's also trivial to install a native app when Web applications can't do what the end-user wants, and end up contorting themselves in Web-y ways that the end-user assumes is due to lack of polish.

Quick reminder: We're discussing whether the default value should be default-include or default-exclude. (That is, includeCurrentTab defaults to true/false, respectively.) I think that we don't want to break legitimate applications, so let's keep the default value equal to the current behavior. Since all tab-capture-supporting browsers have the same behavior here (2022-03-14), the non-breaking value is clear.

allow apps to express that A) they want self-capture, or B) they don't want self-capture. Whether they get it or not we already decided was irrelevant (a hint). I prefer A, because most sites are in B.

I think this refers to the solved question of include vs. exclude.

@eladalon1983 No, naming isn't orthogonal to defaults. The only two options are:

boolean includeCurrentTab = false; // default-exclude-current-tab (A)

and

boolean excludeCurrentTab = false; // default-include-current-tab (B)

...because we cannot use = true.

Please reread my comment in light of this clarification.

But to follow the TAG's "see also" design principle: "APIs that have boolean arguments defaulting to true" it'd have to be:

Having reread this, I see the source of misunderstanding. Please be advised that when consecutive multi-word links, readers might miss that the the first N words link in one direction, while the remaining K words link elsewhere. Such readers would only end up reading one of your multiple linked materials. In our case, "Prefer dictionary arguments over primitive arguments" was all I saw.

Now that this is cleared up - how about leaving out the default value altogether?

Based on the interim from 2022-03-15, how about:

enum CurrentTabPreferenceEnum {
  "include",
  "exclude"
};

dictionary DisplayMediaStreamConstraints {
  (boolean or MediaTrackConstraints) video = true;
  (boolean or MediaTrackConstraints) audio = false;
  CurrentTabPreferenceEnum currentTabPreference;
};

And then:

If a value is specified for currentTabPreference, it is a hint. 
The user agent MAY regard this hint when deciding whether
to include the current tab in the list of surfaces it offers to the user.

@jan-ivar and @youennf, please comment on this general direction, and I will gladly produce a PR for our next Thursday meeting.

Hi @jan-ivar and @youennf. So that I may produce a PR, could you please comment as to whether you agree with this interpretation of the conclusion of the meeting from 2022-03-15?

Since we can't agree on a default, this seems like a reasonable compromise. Let's bikeshed in the PR.

Since we can't agree on a default, this seems like a reasonable compromise. Let's bikeshed in the PR.

Go for it.
Note: Default value of "none" seemed like a clearer way of communicating "no default value", given that this value communicates a preference. (That is, preference=none means "up to the user agent".) Let me know what you think.

Adding my support for this addition to the spec. At tella.tv we have a web recorder, and we'd never want to let the user record the current tab (current recording session). This will result in the hall of mirrors effect and won't ever result in a good recording:
image
So if this is added, we would always turn it on to exclude the current tab from the options to choose.
We do sometimes want to demo our own app, but in those cases we just open a second tab to do the demo; one for recording, the other to demo.