Add more recent HTML elements to rules for identifying an unknown MIME type

Question

unreleased opened this issue 2 years ago · 6 comments

https://mimesniff.spec.whatwg.org/commit-snapshots/bfcd9341ca095bc39fc2a8f0e38de73e172785af/#ref-for-byte①⑤

Specifically: https://mimesniff.spec.whatwg.org/#identifying-a-resource-with-an-unknown-mime-type

Hi, I'm doing some research on MIME sniffing and was wondering whether this standard should be updated to better reflect HTML5. Most of the key tags are there, but several more could be added to match the version of HTML most people are using today.

Answer 1 · 2022-08-24T08:57:07.000Z

Could you perhaps explain what you mean in a bit more detail or with an example?

Answer 2 · 2022-09-02T19:16:09.000Z

The HTML tags defined in the MIME sniffing standard are mostly outdated HTML 4 tags; no modern HTML5 tags are included.

Tags like <b> and <font> are outdated. Although these should be kept for fallback support, no modern HTML5 tags are listed (to name a few: <section>, <main>, <footer>, and <header>).

Side question: Do you have any more information on how these specific tags were selected to be used versus other tags?

Answer 3 · 2022-09-04T12:53:14.000Z

I see, that's not a list we necessarily want to update. That's how browsers sniffed for HTML at some point and updating the sniffing rules has all kinds of implications, including for security. It's better for it to remain a static set of rules and have new content use MIME types correctly.

Answer 4 · 2022-09-04T13:11:07.000Z

Indeed, this is an algorithm designed for processing legacy Web content. It should not be updated to include newer HTML elements without significant justification.

Answer 5 · 2022-09-04T14:19:16.000Z

Hi, @GPHemsley,

I'm unsure what you mean when you say this algorithm is just used for processing legacy web content. There is nothing legacy about this algorithm. Every major browser runs it on every response that arrives without a Content-Type header, to determine what type the response is and whether to display it as HTML, plain text, or something else.

@annevk How can you just say "have new content use MIME types correctly"? Isn't that the whole purpose of this spec? That feels like the equivalent of saying "just do it right". Am I missing something? Is this not the primary specification for MIME sniffing?

Introducing additional valid tags has no additional negative security implications, just as removing or changing a tag would not. The base spec and the detection algorithm are already designed; this would only introduce additional tags to support modern content. Firefox has actually decided to ignore the spec and add more modern HTML5 tags.

Refs:
https://github.com/chromium/chromium/blob/main/net/base/mime_sniffer.cc
https://github.com/WebKit/WebKit/blob/14dfa22fca058e560506ff7898d6272fe6a74a32/Source/WebCore/platform/network/MIMESniffing.cpp
https://searchfox.org/mozilla-central/source/netwerk/streamconv/converters/nsUnknownDecoder.cpp

Answer 6 · 2022-09-06T14:32:22.000Z

@unreleased

Modern servers are expected to conform to modern standards and best practices (such as those described in HTML and Fetch). A server that fails to do so would be expected to see the same wrong behavior across modern browsers, so the problem would be quickly corrected.

The purpose of this spec is to standardize how to process legacy Web content that predates those modern standards and best practices. It lays out for modern browser developers how to strike the balance between being safe and being functional so that they don't have to guess for themselves.

From the Introduction:

This document describes a content sniffing algorithm that carefully balances the compatibility needs of user agent with the security constraints imposed by existing web content. The algorithm originated from research conducted by Adam Barth, Juan Caballero, and Dawn Song, based on content sniffing algorithms present in popular user agents, an extensive database of existing web content, and metrics collected from implementations deployed to a sizable number of users. [SECCONTSNIFF]

Adding more elements to the rules for identifying an unknown MIME type would change what was detected as HTML. (If it didn't, why would you need to add them?) And detecting something as HTML that is not HTML (or which other mechanisms don't see as HTML) has security implications.
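To make that point concrete, here is a minimal Python sketch of prefix-based HTML sniffing. It is a simplification, not the spec's byte-mask algorithm: the function name `looks_like_html`, the `LEGACY_TAGS` tuple (an illustrative subset, not the spec's full table), and the sample input are all assumptions made for the example. It only models the general shape of the rule: skip leading whitespace, match a tag name case-insensitively, and require a space or ">" right after it.

```python
# Simplified sketch of prefix-based HTML sniffing.
# NOT the spec's byte-mask algorithm; all names here are hypothetical.

WHITESPACE = b"\t\n\x0c\r "   # bytes skipped before the first tag
TAG_TERMINATORS = b" >"       # one of these must follow the tag name

# Illustrative subset of legacy patterns; consult the spec for the full table.
LEGACY_TAGS = (b"<!doctype html", b"<html", b"<head", b"<body", b"<font", b"<b")


def looks_like_html(resource: bytes, tags=LEGACY_TAGS) -> bool:
    """Return True if the resource begins with one of the given tags."""
    longest = max(len(tag) for tag in tags) + 1
    # Only the first few bytes matter for a prefix match.
    head = resource.lstrip(WHITESPACE)[:longest].lower()
    for tag in tags:
        following = head[len(tag):len(tag) + 1]
        # The guard on `following` rejects input that ends at the tag name.
        if head.startswith(tag) and following and following in TAG_TERMINATORS:
            return True
    return False


# A plain-text resource that happens to start with an HTML5-looking token.
sample = b"<section 4> Minutes of the planning meeting"

before = looks_like_html(sample)                            # legacy list only
after = looks_like_html(sample, LEGACY_TAGS + (b"<section",))  # extended list
print(before, after)
```

With the legacy subset the sample is not treated as HTML; once `<section` is added to the list, the same bytes are. Any resource whose classification flips this way is exactly the kind of change in detection described above.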

There is no reason to add post–HTML 4 elements to the algorithm because legacy HTML content does not contain those elements.

You are correct to point out that Firefox includes more elements than are listed in the spec, but it is inaccurate to describe that as "a decision to ignore the spec", as the code doing so predates the spec itself, and is likely part of the reason the research underlying the spec was conducted in the first place. (That section of the code dates back to 2003.) Additionally, the extra elements that it includes are not new HTML elements but rather more legacy elements.

It may be the case that the Firefox team has justification for using these other elements, in which case they should present that as a distinct proposal to update the spec. Or it may be that they have simply not gotten around to bringing that section of code into conformance with the spec, in which case filing an issue with them may be the best course of action.

Either way, though, there remains no justification for adding more recent HTML elements to the rules for identifying an unknown MIME type.