Basic URL parse requires stripping tabs before host state is entered, allowing bad hosts
What is the issue with the URL Standard?
In this document:
https://url.spec.whatwg.org/#concept-basic-url-parser
Item 3 says:
Remove all ASCII tab or newline from input.
After this it proceeds to describe how different parsing states should be processed, and in the host state/hostname state it states that a bad host should result in a parsing termination error (points 3 and 4):
Let host be the result of host parsing buffer with url is not special.
If host is failure, then return failure.
In host parsing, it says that a forbidden code point should terminate parsing:
If asciiDomain contains a forbidden domain code point, domain-invalid-code-point validation error, return failure.
Finally, forbidden host code point includes the tab as an invalid character, so a tab should fail URL parsing; otherwise a manufactured host name will be produced.
This ordering of stripping all tabs from a URL and then not allowing tabs in host names prevents host names from being validated properly (i.e. invalid characters are removed before they can be evaluated).
This has an immediate effect on some of the current libraries. For example, Python's urlsplit will take abc<tab>xyz.test and will manufacture the host name abcxyz.test, which happens because it removes tabs from the URL before having a chance to validate the host name.
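To make this concrete, here is a minimal sketch of what a recent Python urlsplit does; the URL is made up, and a scheme is added only so the string parses as an absolute URL:

from urllib.parse import urlsplit

# Hypothetical URL with an ASCII tab inside the host.
url = "https://abc\txyz.test/path"

# Recent Python versions strip ASCII tab and newline from the input first,
# so the host is never seen with the tab and a merged host name comes out.
print(urlsplit(url).hostname)  # -> 'abcxyz.test'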
I'm not sure I understand how abc<tab>xyz.test is fed into the URL parser. As-is that would fail due to the lack of a base URL.
But generally what you describe is how it should work.
I'm not sure I understand how abc<tab>xyz.test is fed into the URL parser
If the parser would only see valid URLs, there would be no need for erroneous states, which is not the case in the real world.
Spammers and hackers are always looking for ways to inject bad stuff, and this one fits the purpose well: a preliminary scanner may not recognize a known spam URL because of a tab character, while the Python parser will manufacture a URL that may potentially bypass that scanner.
The point of the whole sequence described in the spec is to reject bad URLs, which in this case is circumvented by the order of operations in which tabs are stripped before they can be validated. In other words, why would the spec even identify the tab as an invalid character when a tab can never reach the host parser?
It can reach the host parser when that is directly invoked. E.g., that is why document.domain = "\texample.com" throws on https://example.com.
It's the same domain, whether it is within the URL or set separately, and this logic treats that same domain as two different ones, depending on how the domain parser is reached.
That is, if one sets the URL as https://abc<tab>xyz.test/path, this logic yields a bogus abcxyz.test domain, but if someone splits it based on URL nomenclature and feeds abc<tab>xyz to the domain parser, from the same URL, mind you, then it's flagged as invalid.
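Here is a rough Python sketch of the asymmetry I mean, assuming a recent urlsplit (which strips tabs the way this spec describes); the naive host extraction below is only for illustration:

from urllib.parse import urlsplit

url = "https://abc\txyz.test/path"   # made-up URL with a tab in the host

# Whole-URL parsing: the tab is stripped before the host is examined,
# so a bogus host comes out and nothing is flagged.
print(urlsplit(url).hostname)        # -> 'abcxyz.test'

# The host component of the same URL, fed to a host/domain parser directly:
# the tab is still present and would hit the forbidden-code-point check.
host = url.split("//", 1)[1].split("/", 1)[0]
print(repr(host), "\t" in host)      # -> 'abc\txyz.test' True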
This logic is an open invitation for security issues, where whitespace sprinkled across URLs, including domains, will be silently removed, yielding a different URL and different individual components. I'm honestly surprised this is not considered a potential security issue or, at least, inconsistent behavior.
The behavior is indeed not ideal, but we are rather constrained with respect to the changes we are able to make. This does not seem like something we would be able to remove without breaking a portion of the web.
Glad you see the inconsistency I'm describing (or at least see this dual interpretation of the URL components as not ideal).
I can also see where this requirement originally came from - RFC 3986 does describe that URLs embedded in plain text may contain whitespace (e.g. RFC 3986, Appendix C):
In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may
have to be added to break a long URI across lines. The whitespace
should be ignored when the URI is extracted.
However, it does qualify later that this is intended for user-typed URLs:
For robustness, software that accepts user-typed URI should attempt
to recognize and strip both delimiters and embedded whitespace.
Perhaps a non-normative note can be added to this spec to describe that whitespace stripping is intended for contexts where user input is expected and otherwise it should not be stripped, which would conform to RFC 3986 in terms of individual URI component grammar.
Without this clarification even today browsers are not consistent in how this rule is interpreted. Take this example:
<p>Link with tabs: <a target="_blank" href="https://www.	github	.com/	?abc	xyz=12	34">ABC</a></p>
<p>Link with spaces: <a target="_blank" href="https://www. github .com/ ?abc xyz=12 34">ABC</a></p>
<p>Link without spaces: <a target="_blank" href="https://www.github.com/?abcxyz=1234">ABC</a></p>
Firefox and Chrome will strip the tabs but will leave the spaces; Firefox won't do anything with the link containing spaces, while Chrome will replace the spaces with %20 and will open that bogus URL.
HTML anchors are not user-typed input and a non-normative note like this would allow implementers to take URL context into consideration when parsing URLs.
This requirement is not about user-typed URLs. Browser address bar behavior is not codified in standards, that's largely the realm of UI.
The inconsistency in the handling of spaces also seems unrelated to the issue with stripping newlines and tabs. There's test coverage already for that and hopefully Chrome will eventually fix their bug as part of the WPT Interop effort or in some other way.
This requirement is not about user-typed URLs. Browser address bar behavior is not codified in standards, that's largely the realm of UI.
If you look at their example, it doesn't have much to do with browsers at all - it's about human-typed URLs in human contexts, such as emails, text messages, etc. These are the only contexts where this type of whitespace removal makes any sense. In machine-consumed contexts, such as href, whitespace in domains and URLs has no meaningful use and will only serve as a source of complexity and vulnerabilities.
This requirement did not come from that RFC. This is how browsers parsed URLs long before that RFC existed. The web relies on it.
The fact that all browsers (not just Chrome - Firefox, Opera, and obviously Edge) interpret this parsing step differently indicates that developers have to cope with this requirement everywhere.
All I'm asking is for the spec to provide guidance on when whitespace should be stripped, which should be accompanied by a note with a use case (can you think of one? I cannot), and when it is better to refrain from it, which would cover all usual programmatic uses of URLs, like the broken URL parser in Python.
I'm not sure what you mean by differently? Which browsers do not strip a tab or newline from URL parser input?
All I'm asking is for the spec to provide guidance in when whitespace should be stripped
It already does this.
which should be accompanied by a note with a use case (can you think of one? I cannot)
I'm not sure why. The fact that all implementations already do this means it cannot be removed.
I think it would also be unwise for Python to deviate from this as it would mean it ends up with different results from web browsers.
I'm not sure what you mean by differently?
I was referring to how browsers handle spaces differently from tabs - Firefox won't follow the link with spaces while other browsers encode them as %20 - but re-reading the spec, I see that I missed the fact that space is not included in this step, so it's irrelevant. My apologies for misinterpreting this step. Please disregard my point about browsers.
Returning to the original point about tabs in domains, let me ask you this question. Given items 2 and 3 in the basic URL parser:
If input contains any ASCII tab or newline, invalid-URL-unit validation error.
Remove all ASCII tab or newline from input.
I read these two steps as saying that conforming parsers should (or are encouraged to) provide an indicator of a validation error if a tab or a newline (or another non-URL code point) is encountered.
Would this in turn suggest that parsers, like the one in Python, should provide some way to request a validation failure when any of those characters are encountered, or at least provide some feedback that the returned parsed URL components may have been modified because of this validation error?
See https://url.spec.whatwg.org/#validation-error for more information on validation errors.
I did read this, and I'm asking for your guidance on whether you would agree that a modern conforming parser should provide an indicator that a validation error was encountered during parsing.
It depends. If it also wants to act as a conformance checker that might be useful. If it's meant to parse URLs encountered on the web, from Location headers to <a href="...">, then probably not. You'd want the best performance possible and not have to do this additional bookkeeping.
Performance would actually improve if parsers provided an optional parameter that allowed callers to request an early failure, so bogus URLs wouldn't have to be parsed fully when callers know that, in their context, tabs and newlines are not valid characters.
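For illustration, here is a rough sketch of the kind of opt-in check I have in mind; the helper is hypothetical, not an existing urllib API:

from urllib.parse import urlsplit, SplitResult

def strict_urlsplit(url: str) -> SplitResult:
    # Hypothetical wrapper: fail early on ASCII tab or newline (the spec's
    # invalid-URL-unit case) instead of silently stripping them.
    if any(ch in url for ch in "\t\n\r"):
        raise ValueError("invalid-URL-unit: ASCII tab or newline in input")
    return urlsplit(url)

strict_urlsplit("https://example.test/path")     # parses normally
# strict_urlsplit("https://abc\txyz.test/path")  # would raise ValueError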
Thanks for your insights and guidance. It is much appreciated. Again, apologies for going down the rabbit hole with browsers.
I think it would also be unwise for Python to deviate from this as it would mean it ends up with different results from web browsers.
Note that other software is already more strict than web browsers, such as curl and wget. And even browsers are not internally consistent, e.g. Firefox does not follow a Location: https://exa mple.com/ redirect to https://example.com.
I'm not sure curl is necessarily stricter? It doesn't really follow any standard as far as I'm aware.
I can't follow your Firefox example link. That would be interesting to delve into further. Probably best in a new issue?
For posterity:
$ curl -I https://run.mocky.io/v3/abe4108f-192b-46a4-a6aa-7902802a7c1d
HTTP/1.1 302 Found
Location: https://exa mple.com/
Content-Type: text/plain; charset=UTF-8
Date: Fri, 16 Aug 2024 18:45:27 GMT
Content-Length: 0
Sozu-Id: 01J5E84HM63HESNCPC8E4M9STR
The Location header field value (which is specified by RFC 9110 to be an RFC 3986 URI-reference) contains a tab character in the middle of the "example" label of host "example.com", making it invalid (i.e., not a URI-reference). Firefox and Safari therefore reject the response as invalid and display a local error page, while Chrome interprets it as a redirect to https://example.com (presumably applying the tab-stripping https://url.spec.whatwg.org/#concept-basic-url-parser algorithm of this spec).
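For comparison, a parser that strips tabs first, as the basic URL parser does and as recent Python versions of urlsplit also do, ends up with the plain host (assuming the header value really contains U+0009):

from urllib.parse import urlsplit

print(urlsplit("https://exa\tmple.com/").hostname)  # -> 'example.com'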
@gibson042 interesting. For Safari that is a known bug that we plan on fixing in due course. Curious how Firefox ended up in that state though. It seems Firefox does some non-sanctioned escaping before parsing the URL: https://searchfox.org/mozilla-central/source/netwerk/protocol/http/nsHttpChannel.cpp#5750-5755. That might well be how it ends up in this state. This overall issue is tracked in whatwg/fetch#883.