mtrudel/bandit

truncated responses with H2C (via AWS Application Load Balancer)


I recently tried running a Bandit-based Phoenix application with H2C, behind an AWS Application Load Balancer. In this mode, I would almost never receive a full response, even for static files served by Plug.Static.

Some alternatives that did work, making me think that the issue is with Bandit's H2C implementation:

  • Bandit via HTTP/1.1
  • Cowboy 2.12 via H2C

I don't have much in the way of logs at the moment, but if there's anything that would be helpful, I'm happy to give it a shot!

I'd love a repro case here, not of the application itself, but of the configuration it took within AWS to get here! I'm willing to bet it rhymes with #306 and/or #338; likely a bug in who sets content-length headers and how.

For us, it's:

  • Phoenix 1.7 app (see the endpoint config sketch below)
  • Debian bookworm Docker container
  • AWS ECS Fargate runner
  • AWS Application Load Balancer
  • AWS Target group configured with the HTTP2 target version to talk to the ECS Fargate tasks
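
For completeness, the endpoint uses the stock Bandit adapter; roughly this (a sketch — app/module names and the port are placeholders; the ALB's HTTP2 target protocol version then speaks h2c to the plain listener):

```elixir
# config/config.exs — placeholder names; the only Bandit-specific piece is the adapter.
import Config

config :my_app, MyAppWeb.Endpoint,
  adapter: Bandit.PhoenixAdapter,
  # Plain (non-TLS) listener; the ALB target group talks h2c to this port.
  http: [ip: {0, 0, 0, 0}, port: 4000]
```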

Good response: (screenshot)

Bad response: (screenshot)

Server logs:

023cd54e6718 22:47:13.750 [info] state=sent method=GET path=/assets/app-b49524acc1698e2440c53a73d41aec25.js params=[UNFETCHED] status=200 headers={"user-agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:129.0) Gecko/20100101 Firefox/129.0"} duration=0.248
023cd54e6718 22:47:14.649 [info] state=sent method=GET path=/assets/app-b49524acc1698e2440c53a73d41aec25.js params=[UNFETCHED] status=200 headers={"user-agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:129.0) Gecko/20100101 Firefox/129.0"} duration=0.234

From cURL:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Host XXX:443 was resolved.
* IPv6: (none)
* IPv4: XXX, XXX, XXX
*   Trying XXX:443...
* Connected to XXX (XXX) port 443
* ALPN: curl offers h2,http/1.1
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [104 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [4959 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [333 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [70 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256 / secp256r1 / rsaEncryption
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=XXX
*  start date: Mar 27 00:00:00 2024 GMT
*  expire date: Apr 25 23:59:59 2025 GMT
*  subjectAltName: host "XXX" matched cert's "XXX"
*  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M03
*  SSL certificate verify ok.
*   Certificate level 0: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
*   Certificate level 1: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
*   Certificate level 2: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
{ [5 bytes data]
* using HTTP/2
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* [HTTP/2] [1] OPENED stream for XXX/assets/app-b49524acc1698e2440c53a73d41aec25.js?vsn=d
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: XXX]
* [HTTP/2] [1] [:path: /assets/app-b49524acc1698e2440c53a73d41aec25.js?vsn=d]
* [HTTP/2] [1] [user-agent: curl/8.9.1]
* [HTTP/2] [1] [accept: */*]
} [5 bytes data]
> GET /assets/app-b49524acc1698e2440c53a73d41aec25.js?vsn=d HTTP/2
> Host: XXX
> User-Agent: curl/8.9.1
> Accept: */*
>
* Request completely sent off
{ [5 bytes data]
< HTTP/2 200
< server: awselb/2.0
< date: Tue, 20 Aug 2024 22:59:26 GMT
< content-type: text/javascript
< content-length: 270919
< vary: Accept-Encoding
< cache-control: public, max-age=31536000, immutable
< accept-ranges: bytes
<
{ [8192 bytes data]
* HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)
 90  264k   90  238k    0     0   870k      0 --:--:-- --:--:-- --:--:--  870k
* Connection #0 to host XXX left intact
curl: (92) HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)

Thanks! That's super helpful. Any notable logs on the bandit side?

Nope, nothing from Bandit other than the two request logs. If it's helpful, they're from Logster.Plug and run as a Plug.Conn.register_before_send/2 callback.
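
For reference, that hook is just Plug.Conn.register_before_send/2; a minimal sketch of the mechanism (not Logster's actual code):

```elixir
# Sketch of how a before_send request logger works (not Logster's real code).
defmodule RequestLogger do
  import Plug.Conn
  require Logger

  def init(opts), do: opts

  def call(conn, _opts) do
    start = System.monotonic_time(:millisecond)

    # The callback runs right before Bandit sends the response, which is why
    # the log lines above show state=sent and a 200 even when the client
    # ultimately sees a truncated body.
    register_before_send(conn, fn conn ->
      duration = System.monotonic_time(:millisecond) - start

      Logger.info(
        "state=sent method=#{conn.method} path=#{conn.request_path} " <>
          "status=#{conn.status} duration=#{duration}ms"
      )

      conn
    end)
  end
end
```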

I've been trying most of the afternoon to reproduce this and can't manage to. I can reproduce a somewhat related issue, which I'm noting in a separate comment here, but it manifests as 502s downstream, not truncated responses.

It would also be useful to know:

  • Whether you're absolutely sure that you're accessing Cowboy via h2c
  • What the content of the shorter response looks like. Is it the first n bytes of the expected response, a compressed version of it, or something else entirely?
  • Whether there's any chance you can provide access / connection logs for the failing responses

In trying to repro this issue I managed to find a possibly related issue, wherein an ALB target group configured to use HTTP/2 fails to make multiple requests on a single upstream connection. Some details:

  • Set up an EC2 instance and run any old Phoenix app on it
  • Set up an ALB that serves HTTPS (an HTTPS listener is needed to be able to send h2c upstream). The certificate can be self-signed
  • Configure the target group to connect via HTTP/2 to the instance. Be careful about AZ configuration to avoid mysterious 502 errors
  • A sizeable chunk of requests will be returned to the client as 502s. Investigation shows that these all correlate to stream IDs greater than 1 (i.e. to subsequent upstream requests on a given connection). Bandit sees these responses as being normal. There is nothing useful in ALB's connection or access logs
  • Response size doesn't seem to affect this; even tiny responses induce the error

Unsure how to investigate this further. Without visibility into ALB's upstream client code, or into the errors it's seeing, we're kind of flying blind here. If I had to hazard a guess based on the symptoms at hand, it may be related to response header compression contexts, since that's one of the only things shared between streams within a connection that would affect subsequent streams and not the original one, and it also doesn't vary with response size as windows do.
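
FWIW, a quick way to double-check the "Bandit sees these responses as being normal" point without the ALB in the path is to make two requests on one direct h2c connection. A minimal sketch, assuming mint as the client and a local app on port 4000 (passing only :http2 in :protocols over a plain :http connection should get prior-knowledge h2c):

```elixir
# Sketch only: two GETs on a single prior-knowledge h2c connection, straight
# to Bandit (no ALB). Host/port are placeholders; :passive mode keeps the
# example free of a receive loop.
{:ok, conn} =
  Mint.HTTP.connect(:http, "localhost", 4000,
    protocols: [:http2],
    mode: :passive
  )

{:ok, conn, _ref1} = Mint.HTTP.request(conn, "GET", "/", [], nil)
{:ok, conn, responses1} = Mint.HTTP.recv(conn, 0, 5_000)

# The second request reuses the same connection, so it lands on the next stream.
{:ok, conn, _ref2} = Mint.HTTP.request(conn, "GET", "/", [], nil)
{:ok, conn, responses2} = Mint.HTTP.recv(conn, 0, 5_000)

# A single recv may not drain a whole response; loop until a {:done, ref}
# shows up if you care about complete bodies.
IO.inspect({responses1, responses2}, label: "first / second stream responses")

Mint.HTTP.close(conn)
```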

Got it. It's a header compression bug of some sort, likely upstream in HPAX (or possibly in AWS' HTTP/2 client implementation). Working to characterize it more precisely but it's 100% reproducible.

Here's a tcpdump capture of a hacked up version of bandit that doesn't return any headers (other than the :status: 200 pseudo header which is statically defined in HPACK). Note that the client keeps the connection open, as indicated by its continuing to make subsequent requests on streams within the same connection. So obviously, the ALB client has no issues with this.

Now, here's a tcpdump capture of an otherwise identical version of bandit that adds two copies of the header foo: bar to the response for the same request. You can see that the ALB client resets the connection immediately, suggesting that it sees something wrong here that doesn't come up in the previous case.

(FWIW, I've also run tests with another statically defined header (accept-encoding: gzip, deflate) and it also succeeded. So it's almost certainly something with either the way we encode literal headers, or the way ALB decodes them).
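
(If anyone wants to poke at this without patching Bandit internals, a throwaway plug that piles duplicate literal headers onto every response should exercise the same encoding path. Sketch only; the module name and where you mount it are made up:)

```elixir
# Throwaway plug for testing: add two copies of a non-static-table header so
# Bandit has to encode them as literal headers. Mount it early in the
# endpoint (e.g. before Plug.Static) in a scratch app.
defmodule DuplicateHeaders do
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    # put_resp_header/3 replaces existing values, so use prepend_resp_headers/2
    # to actually end up with two copies of foo: bar on the wire.
    prepend_resp_headers(conn, [{"foo", "bar"}, {"foo", "bar"}])
  end
end
```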

Of note, Wireshark only shows one copy of the foo: bar header in its decode (and in the case of subsequent stream requests with similar headers, it omits them there as well). This suggests that Wireshark may have issues with our encoding too.

Fixed it. We were resizing the HPAX compression table contents but not its max size when we received client-side settings frames (this manifests here because the ALB client sets its max header table size to zero as part of its initial settings sync).
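
To make the failure mode concrete, here's roughly what has to happen on the encoding side when that SETTINGS value arrives (a sketch, not the code on the branch; treat the exact HPAX calls as approximate):

```elixir
# Illustration of the problem, not the fix itself.

# The server-side encoding table starts at the HTTP/2 default of 4096 bytes.
send_table = HPAX.new(4096)

# The ALB's initial SETTINGS frame announces SETTINGS_HEADER_TABLE_SIZE = 0,
# so the encoder must evict its dynamic table *and* record 0 as the new
# maximum so nothing gets stored for later streams. Doing only the first
# half is the bug: stream 1 encodes fine, but subsequent streams emit
# dynamic-table references the ALB's decoder no longer accepts, and it
# resets the connection.
send_table = HPAX.resize(send_table, 0)
# ...plus the max-size bookkeeping that was missing — whether that belongs in
# Bandit or in HPAX itself is the question raised below.

# Any header not in the HPACK static table is encoded as a literal and (with
# :store) added to the dynamic table — which a zero-sized table must not do.
{_header_block, _send_table} = HPAX.encode([{:store, "foo", "bar"}], send_table)
```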

A hacky fix is up at https://github.com/mtrudel/bandit/tree/aws_investigation if you're so inclined to test. I'm gonna bring this up with @whatyouhide to see if this makes more sense to have as a function on HPAX itself before landing a proper fix on Bandit.

To that end: @whatyouhide, I think it makes more sense to have an HPAX.change_max_size/2 function that both evicts from the table and also updates the max table size as indicated. It could also make sense as a flag to HPAX.resize/2. Happy to raise an issue over there / do the work for this. Created elixir-mint/hpax#18 to better capture the discussion.

I've also failed to get the ALB to talk h2 to Bandit in a very similar setup (ALB to ECS containers), so I might be able to test this.

Can confirm it works in my setup: AWS ALB with HTTP/2 enabled for the backend, pointing to a target group of ECS containers.

Great to hear a second opinion! I'm going to keep working it over on HPAX.

My app is using Phoenix channels / LiveView, and upgrading to wss over h2c returns a 464. I guess Bandit does not support WebSockets over h2c?

Correct. RFC 8441 support is coming 'soon' (in fact, most of the refactoring work I've been doing for the past several months has been with the express purpose of supporting it). No promises, but likely some time this winter.

Sorry, I guess the wording was a bit off; it was not a feature request 🙂

It's a missing piece for supporting Phoenix apps behind a proxy doing TLS termination.