Correctly handle utf8 BOM at beginning of str content
mgautierfr opened this issue · 4 comments
tinycss2.parse_stylesheet fails to correctly parse content starting with a U+FEFF (ZERO WIDTH NO-BREAK SPACE).
See for example the css file at https://donorbox.org/assets/application_embed-47da8f7456acb6aa58b61f2e5c664fccbf3cae5b0ad587f129dcd2d93caa65e8.css
>>> import requests
>>> import tinycss2
>>> css = requests.get("https://donorbox.org/assets/application_embed-47da8f7456acb6aa58b61f2e5c664fccbf3cae5b0ad587f129dcd2d93caa65e8.css").content
>>> parsed_from_bytes, encoding = tinycss2.parse_stylesheet_bytes(css)
>>> encoding.name
'utf-8'
>>> parsed_from_bytes[0].serialize()
'@import url(//fonts.googleapis.com/icon?family=Material+Icons);'
>>> parsed_from_utf8 = tinycss2.parse_stylesheet(css.decode('utf-8'))
>>> parsed_from_utf8[0].serialize()
'\ufeff@import url(//fonts.googleapis.com/icon?family=Material+Icons);@import url(//code.getmdl.io/1.1.1/material.indigo-pink.min.css);h3{--font-semibold: 600\n}'
>>> parsed_from_detected = tinycss2.parse_stylesheet(encoding.codec_info.decode(css)[0])
>>> parsed_from_detected[0].serialize()
'\ufeff@import url(//fonts.googleapis.com/icon?family=Material+Icons);@import url(//code.getmdl.io/1.1.1/material.indigo-pink.min.css);h3{--font-semibold: 600\n}'
>>> parsed_from_utf8_sig = tinycss2.parse_stylesheet(css.decode('utf-8-sig'))
>>> parsed_from_utf8_sig[0].serialize()
'@import url(//fonts.googleapis.com/icon?family=Material+Icons);'
>>> decoded = css.decode('utf8')
>>> decoded[0]
'\ufeff'  # (ZERO WIDTH NO-BREAK SPACE)
I'm not sure this is really a bug, as https://drafts.csswg.org/css-syntax-3 does not specify anything about U+FEFF (so I assume it should not be handled as whitespace).
However, even if a BOM is discouraged, it is a fairly "normal" situation. It may be good to discard a ZERO WIDTH NO-BREAK SPACE as if it were a space, at least at the beginning of the content.
Hi @mgautierfr,
Glad to have some news!
I'm not sure this is really a bug, as https://drafts.csswg.org/css-syntax-3 does not specify anything about U+FEFF (so I assume it should not be handled as whitespace).
That’s … complicated.
CSS Syntax explains how to handle BOMs when reading bytes, using this algorithm. You then get a stream of Unicode code points, that are used to parse CSS. The algorithm removes the BOM, so the Unicode stream doesn’t include it.
As far as I can tell, the BOM isn’t allowed at the beginning of the Unicode stream: whitespace, as defined by CSS, only includes the common space character, newline tokens (\n, \r\n, \r, \f) and the tabulation character.
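To make the whitespace point concrete, here is a minimal sketch using only the Python standard library. The names CSS_WHITESPACE and is_css_whitespace are made up for illustration and are not part of tinycss2; the set simply lists the characters mentioned above.

```python
# Characters treated as whitespace in the paragraph above (illustrative set).
CSS_WHITESPACE = frozenset(" \t\n\r\f")


def is_css_whitespace(ch):
    """Hypothetical helper: True if `ch` is one of the CSS whitespace characters."""
    return ch in CSS_WHITESPACE


text = "\ufeff@import url(a.css);"

# U+FEFF is not in the set, so a tokenizer following this definition
# cannot skip it as leading whitespace.
bom_is_whitespace = is_css_whitespace(text[0])
stripped = text.lstrip("".join(CSS_WHITESPACE))

print(bom_is_whitespace)
print(repr(stripped[0]))
```

Stripping CSS whitespace leaves the U+FEFF in place, which matches the '\ufeff' prefix seen in the serialized output earlier in this thread.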
TinyCSS2 provides 2 interfaces: parse_stylesheet_bytes and parse_stylesheet. parse_stylesheet works with a Unicode str, which is the equivalent of the stream of Unicode code points described by the CSS specification, and so shouldn’t allow BOMs. parse_stylesheet_bytes handles bytes and implements the decoding algorithm correctly for you, by removing the BOM if it’s there.
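As a rough illustration of what "removing the BOM while decoding" means, here is a simplified standard-library sketch. decode_css_bytes is a hypothetical helper, not tinycss2's actual implementation, and it assumes a leading BOM should override whatever fallback encoding the caller passes.

```python
import codecs

# BOMs recognised by this sketch, mapped to the codec used for the remaining bytes.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def decode_css_bytes(data, fallback="utf-8"):
    """Hypothetical helper: decode bytes, letting a leading BOM override `fallback`."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            # Drop the BOM bytes so U+FEFF never reaches the parser.
            return data[len(bom):].decode(codec, errors="replace")
    return data.decode(fallback, errors="replace")


with_bom = codecs.BOM_UTF8 + b"@import url(a.css);"
decoded = decode_css_bytes(with_bom)   # BOM removed during decoding
naive = with_bom.decode("utf-8")       # BOM survives as a leading '\ufeff'

print(repr(decoded))
print(repr(naive))
```

In real code there is no need for such a helper: passing the raw bytes to parse_stylesheet_bytes does the equivalent work.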
So, this parsing problem seems to be expected … at least in my opinion. 😄 To avoid BOM and encoding problems, you should use parse_stylesheet_bytes as explained in the documentation.
If you really want to pass Unicode strings, you can for example use Requests’ decoding feature through requests.get(…).text, which decodes the bytes according to the HTTP headers and specification. Unfortunately, in your case, there’s a bug in the HTTP server configuration:
response = requests.get("https://donorbox.org/assets/application_embed-47da8f7456acb6aa58b61f2e5c664fccbf3cae5b0ad587f129dcd2d93caa65e8.css")
print(response.encoding) # 'ISO-8859-1'
Obviously, your file is actually UTF-8. But with a text/* content type header and no explicit encoding provided by the server, the default encoding is ISO-8859-1 according to the specification. So you won’t get the right Unicode string from Requests: you get both the BOM (which Requests would remove if the encoding in the HTTP header were UTF-8) and the wrong encoding.
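To see what the wrong default does to such a file, here is a small standard-library sketch. The CSS content is invented for the example; only the codec names come from the discussion above.

```python
# A made-up UTF-8 stylesheet that starts with a BOM and contains one non-ASCII character.
css_bytes = "\ufeffh3::before{content:'\u00e9'}".encode("utf-8")

# Decoding with the ISO-8859-1 default: every byte becomes one character,
# so the three BOM bytes and the two bytes of U+00E9 turn into mojibake.
as_latin1 = css_bytes.decode("iso-8859-1")

# Decoding as UTF-8 with BOM removal gives back the intended text.
as_utf8_sig = css_bytes.decode("utf-8-sig")

print(repr(as_latin1[:3]))
print(repr(as_utf8_sig))
```

The first three characters of the ISO-8859-1 result are the BOM bytes 0xEF 0xBB 0xBF read as three separate Latin-1 characters, which is the kind of string .text would hand to parse_stylesheet here.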
So… 😄
My conclusions are:
- If you decode the .content value manually, you’re on your own to use the right encoding and to remove the BOM (☠️ you don’t want to do that!)
- There’s a problem on the server, which should return an HTTP header saying that the content is UTF-8.
- This problem prevents Requests from setting the right value in .text, which you could otherwise use with parse_stylesheet.
- Requests won’t "fix" this "problem" (see psf/requests#654.)
- There’s no perfect workaround for this kind of problem, which is unfortunately quite common. If you don’t want to rely on the server’s encoding information, you can give .content to parse_stylesheet_bytes and hope that the encoding will be automatically detected from the content. If you want to rely on the server’s encoding information, then give .text to parse_stylesheet and hope that the server is correctly configured.
What do you think? (We could even talk about this, there’s a meetup next Thursday 😁!)
I was expecting this conclusion. I think the safest option is to use parse_stylesheet_bytes.
If we pass protocol_encoding=response.encoding to parse_stylesheet_bytes, will it be OK, or will it mislead tinycss?
(We could even talk about this, there’s a meetup next Thursday 😁!)
Will be there :)
If we pass protocol_encoding=response.encoding to parse_stylesheet_bytes, will it be OK, or will it mislead tinycss?
That’s a bad idea, because it will use the default 'ISO-8859-1' HTTP encoding for text/* on web servers where an explicit encoding is missing (which is often the case, and is exactly your use case.) Letting TinyCSS2 find the correct encoding is probably a better choice.
If you want to do something really smart (and probably useless, as many really smart things), in case your CSS is usually linked by a website, you can use the website’s HTML encoding as environment encoding. Something like:
html_response = requests.get("https://donorbox.org/") # encoding is utf-8
css_response = requests.get("https://donorbox.org/assets/application_embed-47da8f7456acb6aa58b61f2e5c664fccbf3cae5b0ad587f129dcd2d93caa65e8.css")
parsed_from_bytes, encoding = tinycss2.parse_stylesheet_bytes(css_response.content, environment_encoding=html_response.encoding)
That’s more or less what’s in the HTML specification and what browsers do.
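For reference, the priority order between protocol encoding, @charset and environment encoding can be sketched roughly as below. This is a simplified illustration and not tinycss2's code: pick_fallback_encoding is a hypothetical name, codecs.lookup stands in for the real encoding-label lookup (an assumption, since the two do not accept exactly the same labels), and details such as special-casing of some labels are left out. A BOM, when present, would still take precedence over whatever this returns.

```python
import codecs


def _known(label):
    """True if Python's codec registry recognises `label` (stand-in for a real label lookup)."""
    try:
        codecs.lookup(label)
        return True
    except (LookupError, ValueError):
        return False


def pick_fallback_encoding(css_bytes, protocol_encoding=None, environment_encoding=None):
    """Hypothetical helper: choose a fallback encoding label, highest priority first."""
    # 1. Encoding announced by the protocol (e.g. a charset in the HTTP Content-Type).
    if protocol_encoding and _known(protocol_encoding):
        return protocol_encoding
    # 2. An @charset rule at the very start of the byte stream.
    prefix = b'@charset "'
    if css_bytes.startswith(prefix):
        end = css_bytes.find(b'"', len(prefix), 1024)
        if end != -1 and css_bytes.startswith(b'";', end):
            label = css_bytes[len(prefix):end].decode("latin-1")
            if _known(label):
                return label
    # 3. Encoding of the referring document (the "environment encoding").
    if environment_encoding and _known(environment_encoding):
        return environment_encoding
    # 4. Nothing else is known: assume UTF-8.
    return "utf-8"


print(pick_fallback_encoding(b"a{}", environment_encoding="utf-8"))
print(pick_fallback_encoding(b'@charset "iso-8859-1";a{}', environment_encoding="utf-8"))
```

This also shows why passing a wrong protocol encoding is risky: it sits at the top of the list, above everything the stylesheet or the referring page says.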
Will be there :)
\o/
If you want to do something really smart (and probably useless, as many really smart things), in case your CSS is usually linked by a website, you can use the website’s HTML encoding as environment encoding.
Well, we are already doing too many (un)smart things with our content (we can indeed speak about this on Thursday). Let's not be too smart. If tinycss2 can handle the encoding, I will let it do so :)
I will move to parse_stylesheet_bytes. Thanks!