aws/amazon-vpc-cni-k8s

IPv6 containers experience connectivity issues with large simultaneous file downloads

chen-anders opened this issue · 6 comments

What happened:

Large simultaneous downloads stall and eventually fail with a "connection reset by peer" error. Sometimes we also see TLS connection errors and DNS resolution errors, which cause some downloads to fail immediately.

These errors only affect downloads from IPv6 servers/endpoints. IPv4 works perfectly fine.

Example error output

Sometimes we see errors when establishing HTTPS connections or resolving hostnames:

test9 | Connecting to embed-ssl.wistia.com (embed-ssl.wistia.com)|2600:9000:244d:7800:1e:c86:4140:93a1|:443... connected.
test9 | Unable to establish SSL connection.
test9 | exit status 4
test3 | Resolving embed-ssl.wistia.com (embed-ssl.wistia.com)... failed: Try again.
test3 | wget: unable to resolve host address 'embed-ssl.wistia.com'
test3 | exit status 4

We host-mounted the CNI logs on the hosts where we performed the testing, but didn't see any associated log entries during our tests.

What you expected to happen:

Downloads complete without connection errors

How to reproduce it (as minimally and precisely as possible):

We have a Procfile that runs 9 downloads of a 700MB file in parallel.

Debian Slim Container

Launch a container: kubectl run -it --rm ipv6-reset-test-debian --image public.ecr.aws/debian/debian:bullseye-slim --command -- bash

apt-get update && apt-get install -y wget
ARCH="$(arch | sed s/aarch64/arm64/ | sed s/x86_64/amd64/)"
wget https://github.com/wistia/hivemind/releases/download/v1.1.1/hivemind-v1.1.1-wistia-linux-$ARCH.gz
gunzip hivemind-v1.1.1-wistia-linux-$ARCH.gz
mv hivemind-v1.1.1-wistia-linux-$ARCH hivemind
chmod +x hivemind
wget https://raw.githubusercontent.com/wistia/eks-ipv6-reset-example/main/Procfile
./hivemind -W Procfile

Alpine Container

Launch a container: kubectl run -it --rm ipv6-reset-test-debian --image public.ecr.aws/docker/library/alpine:3.19.1 --command -- ash
apk add wget # use non-busybox wget
ARCH="$(arch | sed s/aarch64/arm64/ | sed s/x86_64/amd64/)"
wget https://github.com/wistia/hivemind/releases/download/v1.1.1/hivemind-v1.1.1-wistia-linux-$ARCH.gz
gunzip hivemind-v1.1.1-wistia-linux-$ARCH.gz
mv hivemind-v1.1.1-wistia-linux-$ARCH hivemind
chmod +x hivemind
wget https://raw.githubusercontent.com/wistia/eks-ipv6-reset-example/main/Procfile
./hivemind -W Procfile
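
For anyone who wants to approximate the parallel run without hivemind, here is a minimal sketch in POSIX shell. It is not the Procfile from the report: `run_parallel` and `download_one` are hypothetical helper names, and the default URL is a placeholder that should be replaced with a large file served over IPv6. The only wget options used are `-6` (IPv6 only), `-q`, and `-O`.

```shell
# Run a command COUNT times in parallel, passing the job index as its last
# argument. Sets FAILED to the number of jobs that exited non-zero.
run_parallel() {
  count="$1"; shift
  pids=""
  FAILED=0
  i=1
  while [ "$i" -le "$count" ]; do
    "$@" "$i" &            # start job i in the background
    pids="$pids $!"        # remember its PID
    i=$((i + 1))
  done
  for pid in $pids; do
    wait "$pid" || FAILED=$((FAILED + 1))
  done
}

# Hypothetical download job. URL is a placeholder (assumption), not the
# 700MB file used in the original test. -6 forces IPv6.
download_one() {
  wget -6 -q -O /dev/null "${URL:-https://example.com/large-file.bin}" \
    || { echo "job $1 failed" >&2; return 1; }
}
```

Usage would look like `URL=<your IPv6 endpoint> run_parallel 9 download_one; echo "failed: $FAILED"`. Swapping `-6` for `-4` gives the IPv4 comparison run.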

Anything else we need to know?:

The environment is a dual-stack IPv4/IPv6 VPC. We've been able to reproduce this on nodes in both public and private subnets.

Environment:
Kubernetes Versions:

  • 1.28.5 (eks.7) w/ kube-proxy v1.28.2-eksbuild.2
  • 1.29.0 (eks.1) w/ kube-proxy v1.29.0-eksbuild.2

Reproduced on AL2, Ubuntu, and Bottlerocket nodes in EKS managed node groups, with the following kernel/OS versions:

  • AL2: 5.10.209-198.858.amzn2.aarch64 / 5.10.209-198.858.amzn2.x86_64
  • Ubuntu 22: 6.2.0-1017-aws #17~22.04.1-Ubuntu SMP
  • Ubuntu 20: 5.15.0-1048-aws #53~20.04.1-Ubuntu SMP
  • Bottlerocket: 1.18.0-7452c37e, 1.19.2-29cc92cc

Reproduced on AWS VPC CNI versions:

  • v1.16.3-eksbuild.2
  • v1.15.1-eksbuild.1

Instance types used:

  • m6g.xlarge
  • c6g.xlarge
  • m7a.8xlarge
  • m6a.8xlarge

@chen-anders I suggest filing an AWS support case here, as the complexity of this issue will likely require debug sessions and cluster access.

In the meantime, I recommend collecting the node logs from the AL2 reproduction by executing the following bash script: https://github.com/awslabs/amazon-eks-ami/blob/main/log-collector-script/linux/eks-log-collector.sh

Hi @jdn5126,

We don't have an AWS support plan that would allow us to file a technical support case. In the meantime, I'm going to be working with the team to try to get the requisite logs you're asking for. The production workload currently runs on Bottlerocket OS - is there a similar log collector script we can use there?

I see that Bottlerocket has a section on logs: https://github.com/bottlerocket-os/bottlerocket#logs, but it does not look like it collects everything that we would need. I wonder if we can use the same strategy laid out there to execute the EKS AMI bash script.

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

acj commented

Sorry for the delay on our end. We're still planning to collect and share logs.

acj commented

We've repeated our tests over the past few days and are not able to repro the download stall anymore. We haven't made any related changes to our infrastructure and are still puzzled by the behavior.

A few notes for anyone who might run into the same problem:

  • Downloads seemed to stall more frequently on Bottlerocket- and Ubuntu-based EKS worker nodes than on AL2-based ones
  • We think we were able to repro the issue (it was very similar, at least) in March on bare EC2 instances running Ubuntu, so it's unclear whether this was a VPC CNI issue at all
  • The stalls seemed vaguely correlated with network load, happening somewhat more frequently when load was heavy

Hopefully this is resolved. Thanks for your help!