aws/aws-sdk-cpp

utf8, default locale, HTTP header values and user-defined metadata

crusader-mike opened this issue · 4 comments

A non-US-ASCII symbol slipped into user-defined metadata (see PutObjectRequest::SetMetadata, etc.) and caused uploads to fail with Windows error code 87, ERROR_INVALID_PARAMETER (returned by the WinHttpAddRequestHeaders() call). I spent some time digging and found that:

  • the debug build simply crashes (asserts in isspace() and similar CRT functions)

  • user-provided values are simply forwarded to WinHttpAddRequestHeaders() without any validation, even though MSDN mentions that:

... is intended for use by sophisticated clients that require detailed control over the exact request sent to the HTTP server.
The name and value of request headers added with this function are validated. Headers must be well formed. For more information about valid HTTP headers, see RFC 2616.

(WinHTTP validation doesn't seem to do much more than checking that every character code is <= 127. I certainly hope they check for a doubled end-of-line, or this could be a vector for header-injection attacks. Surprisingly, libcurl doesn't care about chars with codes > 127.)

  • so, I got curious -- what can be used in an HTTP header value? According to my reading of the HTTP/1.1 standard -- any char can be used (see this for details). And yet the S3 service I was using returned 400 for most non-printable US-ASCII chars and failed with 'InvalidSignature' for chars > 127.

(one cute exception though -- code 127 went through just fine, but no .NET client could work with that S3 service afterwards -- any response carrying this metadata caused the .NET Framework to have a stroke :) )

  • now, besides HTTP header limitations, S3 is supposed to impose additional ones. The official page on that is pretty vague (see the section at the bottom). It mentions that metadata can be non-ASCII and what will happen if such metadata is retrieved via REST, but then it goes on to mention US-ASCII and UTF-8. Whatever those requirements are -- this SDK does not enforce or check them. I would appreciate it if someone could clarify this portion for me, please...

Anyway, here are my notes after spending some time reading related portions of SDK code:

  1. it seems that the SDK expects user-provided strings to be UTF-8 encoded (for example, the char* -> LPWSTR conversion assumes UTF-8 encoding for incoming data)

  2. ... and yet the code everywhere uses CRT functions that depend on the default locale (e.g. atoll, isspace, tolower) -- this normally works on Linux (because it uses UTF-8 locales by default), but does not work on Windows (or other Unix flavors), which explains the asserts in the debug build. You may counter that these strings are expected to be pure ASCII, but you don't validate that (and you do perform a UTF-8 -> UCS-2 conversion). And in general -- if you expect a string to be UTF-8 encoded, you can't just blindly use locale-sensitive functions, because the current locale could be something crazy like French_Canada.Big5 -- you have no control over it.

  3. you can't use locale-sensitive functions to prepare data for HTTP headers. The HTTP standard defines very narrowly what a space (and other special chars) is, while locale classification is flexible, and the SDK doesn't set the locale as far as I know. The related operations will probably have to be hand-coded (see the sketch after this list)

  4. by extension -- everything std::facet-related shouldn't be used either. In particular those tricks with std::stringstream and getline() -- they end up using widen('\n'), which depends on the locale

  5. user-provided metadata needs to be validated against both the HTTP standard's and the S3 API's requirements

  6. if my reading of the HTTP standard is correct -- WinHTTP presents a problem: it doesn't let any char > 127 be passed in a header value; libcurl is fine

  7. this one is esoteric: (if my understanding is correct) even a char with code 0 can be used in a header value -- this means you can't use strlen() in the related conversion functions, as you'll end up truncating values (the conversion in the sketch below takes an explicit length for this reason).
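To make points 2-4 and 7 concrete, here is a rough sketch of what I mean. This is not SDK code; every name here is mine and purely illustrative, and the Windows-only branch just shows how an explicit byte count keeps embedded NULs alive.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// HTTP/1.1 allows only SP (0x20) and HTAB (0x09) as whitespace inside a header
// field value; locale-sensitive isspace() may classify more characters and
// asserts on negative char values in MSVC debug builds.
inline bool IsHttpWhitespace(unsigned char c) noexcept
{
    return c == ' ' || c == '\t';
}

// ASCII-only lowercasing that never consults the global C locale.
inline char AsciiToLower(char c) noexcept
{
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

#ifdef _WIN32
// UTF-8 -> UTF-16 with an explicit byte count, so an embedded NUL does not
// truncate the value (point 7). Returns an empty string on conversion failure.
inline std::wstring Utf8ToWide(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();
    const int needed = ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                             utf8.data(), static_cast<int>(utf8.size()),
                                             nullptr, 0);
    if (needed <= 0)
        return std::wstring();
    std::wstring wide(static_cast<size_t>(needed), L'\0');
    ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                          utf8.data(), static_cast<int>(utf8.size()),
                          &wide[0], needed);
    return wide;
}
#endif
```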

Disclaimer: I haven't dealt with locale-related stuff for a long time -- I could be wrong...

Forgot to mention this one (though it should probably be reported elsewhere):

  8. the AWS Signature v4 generation process was obviously designed around the requirements that the HTTP standard (see page 31) puts on servers -- namely the fact that a server/proxy can replace consecutive LWS (linear white space) in a header value with one SP (space, code 32) and the server's behavior should not change. From the related instructions (see step 4):

The Trimall function removes excess white space before and after values, and converts sequential spaces to a single space.

The problem is that the HTTP standard doesn't allow replacing just any space -- it allows replacing only LWS, and only where it occurs between field-content entries:

Any LWS that occurs between field-content MAY be replaced with a single SP before interpreting the field value or forwarding the message downstream.

and LWS is not just any whitespace; it is defined as [CRLF] 1*( SP | HT ) -- i.e. a sequence of space (32) or tab (9) codes optionally prefixed by one CRLF.

This means two things:

  • the AWS algorithm is overly eager and can remove too much whitespace (including whitespace that is not LWS) -- thus giving me more flexibility to change a header without affecting the signature

  • the AWS algorithm will remove whitespace inside field-content entries that can be used in HTTP headers -- this permits material changes to those entries without changing the signature. Example:

my-header: "aaa" , "ccc", "ddd"

HTTP stipulates that LWS between these quoted strings can be replaced, but not LWS inside quoted strings. AWS stipulates that any space, including those inside quoted strings, will be removed. And it doesn't even specify what "space" is! (The result of isspace() depends on the current locale.)
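To illustrate the difference, here is a rough sketch -- not the SDK's actual SigV4 code, function names are mine, and quoted-pair escapes are ignored -- comparing a Trimall-style collapse with folding that leaves quoted-string contents alone:

```cpp
#include <iostream>
#include <string>

// Trimall-style: collapse every run of SP/HT into one space, even inside
// quoted strings (this mirrors my reading of the SigV4 instructions).
std::string TrimallStyle(const std::string& in)
{
    std::string out;
    bool inRun = false;
    for (char c : in)
    {
        if (c == ' ' || c == '\t')
        {
            if (!inRun)
                out.push_back(' ');
            inRun = true;
        }
        else
        {
            out.push_back(c);
            inRun = false;
        }
    }
    return out;
}

// What the HTTP standard actually permits: fold LWS only *between*
// field-content, leaving quoted-string contents untouched.
std::string FoldOutsideQuotes(const std::string& in)
{
    std::string out;
    bool inQuotes = false;
    bool inRun = false;
    for (char c : in)
    {
        if (c == '"')
            inQuotes = !inQuotes;
        if (!inQuotes && (c == ' ' || c == '\t'))
        {
            if (!inRun)
                out.push_back(' ');
            inRun = true;
            continue;
        }
        out.push_back(c);
        inRun = false;
    }
    return out;
}

int main()
{
    const std::string value = "\"aaa   bbb\"   ,   \"ccc\", \"ddd\"";
    std::cout << TrimallStyle(value)      << '\n'; // "aaa bbb" , "ccc", "ddd"   <- quoted content changed
    std::cout << FoldOutsideQuotes(value) << '\n'; // "aaa   bbb" , "ccc", "ddd" <- quoted content preserved
}
```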

FYI -- I ended up simply URL-encoding every metadata value -- this lets me dodge (almost) all of the problems mentioned above... with the exception of the 'very exotic locale' case (i.e. when the locale is not US-ASCII-compatible). And, luckily for me, our product isn't used with such locales.
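For reference, my workaround looks roughly like this. The percent-encoder is my own hand-rolled helper (not an SDK API); PutObjectRequest::SetMetadata is the real SDK call, and the reader obviously has to decode the values again after GetObject:

```cpp
#include <aws/core/utils/memory/stl/AWSMap.h>
#include <aws/s3/model/PutObjectRequest.h>
#include <cstdio>
#include <string>

// Emit only unreserved US-ASCII octets, so the header value survives both
// WinHTTP and signature generation untouched.
std::string PercentEncode(const std::string& in)
{
    static const std::string unreserved =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~";
    std::string out;
    char buf[4];
    for (unsigned char c : in)
    {
        if (unreserved.find(static_cast<char>(c)) != std::string::npos)
        {
            out.push_back(static_cast<char>(c));
        }
        else
        {
            std::snprintf(buf, sizeof(buf), "%%%02X", c);
            out += buf;
        }
    }
    return out;
}

// Values are encoded before they ever reach the header layer.
void AttachMetadata(Aws::S3::Model::PutObjectRequest& request,
                    const std::string& key, const std::string& value)
{
    Aws::Map<Aws::String, Aws::String> metadata;
    metadata[Aws::String(key.c_str())] = Aws::String(PercentEncode(value).c_str());
    request.SetMetadata(metadata);
}
```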

You should still consider addressing the issues raised above, though... If you ever move to C++17 -- consider using locale-independent functions (e.g. std::from_chars) where they are available. In other cases -- either (carefully!) configure the locale or hand-code this stuff.
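For example, something like this (a sketch, assuming C++17's &lt;charconv&gt; is available on your toolchains) parses numeric header values without ever touching the locale, unlike atoll():

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>

// Locale-independent replacement for atoll() when parsing numeric header
// values such as Content-Length; std::from_chars never consults the locale.
std::optional<std::int64_t> ParseDecimal(std::string_view text)
{
    std::int64_t value = 0;
    const auto result = std::from_chars(text.data(), text.data() + text.size(), value);
    if (result.ec != std::errc{} || result.ptr != text.data() + text.size())
        return std::nullopt; // reject garbage and partial parses
    return value;
}
```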

From the RFC (emphasis mine)

Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
**Newly defined header fields SHOULD limit their field values to
US-ASCII octets.** A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.

With that in mind, I don't think we should attempt to handle UTF-8 characters that happen to be outside the ASCII charset.

Even if we try to detect non-ASCII characters in header values, especially ones that were input by users, and URL-encode them (which is the right approach), we will hit a snag if those values are then read by a different SDK/client.

I think this is one of those situations in which it's best for the end user to decide what scheme (if any) to use to encode/decode their values, and for the SDK to have minimal involvement in massaging the input.

Newly defined header fields SHOULD limit their field values to US-ASCII octets.

With that in mind, I don't think we should attempt to handle UTF-8 characters that happen to be outside the ASCII charset.

DEL (code 127) is a US-ASCII octet, and yet I see problems with existing HTTP server/client implementations. So are NUL (code 0) and other control chars. Now, HTTP server/client implementations are somewhat off-topic here, but the problem is that HTTP header limitations matter for metadata because the SDK (a) does not check or enforce the limitations imposed by the S3 API (for user-defined metadata) and (b) does not check or enforce the limitations imposed by the HTTP standard (for headers). It simply passes through whatever values the user gives it. Unfortunately, the existing documentation is unclear on what valid metadata is. In my opinion, the SDK should (ideally) enforce something (i.e. reject data that is invalid from the S3 or HTTP perspective).
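Even a check along these lines (hypothetical, not SDK code) would already reject the values that broke my uploads:

```cpp
#include <string>

// Accept only visible US-ASCII (VCHAR) plus SP and HTAB -- roughly
// field-content without obs-text. This rejects NUL, CR, LF, DEL (0x7F)
// and everything above 0x7F.
bool IsValidAsciiHeaderValue(const std::string& value)
{
    for (unsigned char c : value)
    {
        const bool visible = (c >= 0x21 && c <= 0x7E);
        const bool blank   = (c == ' ' || c == '\t');
        if (!visible && !blank)
            return false;
    }
    return true;
}
```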

I assert that the locale-related points raised in the original post (points 2, 3, 4 and 7) also still stand and need to be addressed before you can close this "question".

Point 8 (the signature-generation algorithm) needs to be reported to the team responsible for S3 (I guess), and either the related procedure changed or the necessary clarifications made to the page describing the process. And, if the procedure is to be changed -- this SDK's code will have to be updated too.

On the other hand, I found a solution that lets me dodge most of these issues, so I don't care anymore; it is up to you what to do about these findings. :-)