bottlerocket-os/bottlerocket

Cilium-agent does not start after upgrading to Bottlerocket OS 1.20.0

Image I'm using:

  • Bottlerocket OS 1.20.0 (aws-k8s-1.29)
    • ami-04a970ca64c4cc6a8
    • bottlerocket-aws-k8s-1.29-aarch64-v1.20.0-fcf71a47

What I expected to happen:

Cilium-agent starts successfully on the node running Bottlerocket OS.

What actually happened:

Cilium-agent fails to start with the following error:

level=info msg="Started gops server" address="127.0.0.1:9890" subsys=gops
level=info msg="Start hook executed" duration="448.745µs" function="gops.registerGopsHooks.func1 (cell.go:44)" subsys=hive
level=info msg="Start hook executed" duration="2.486µs" function="metrics.NewRegistry.func1 (registry.go:86)" subsys=hive
level=info msg="Establishing connection to apiserver" host="https://[fd4c:a5d3:71e::1]:443" subsys=k8s-client
level=info msg="Serving prometheus metrics on :9962" subsys=metrics
level=info msg="Connected to apiserver" subsys=k8s-client
level=info msg="Start hook executed" duration=17.289632ms function="client.(*compositeClientset).onStart" subsys=hive
level=info msg="Start hook executed" duration="16.206µs" 
function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.Node].Start" subsys=hive
level=info msg="Start hook executed" duration=3.05529ms function="node.NewLocalNodeStore.func1 (local_node_store.go:77)" subsys=hive
level=info msg="Start hook executed" duration="141.581µs" function="authmap.newAuthMap.func1 (cell.go:28)" subsys=hive
level=info msg="Start hook executed" duration="64.264µs" function="configmap.newMap.func1 (cell.go:24)" subsys=hive
level=info msg="Start hook executed" duration="73.125µs" function="signalmap.newMap.func1 (cell.go:45)" subsys=hive
level=info msg="Start hook executed" duration="14.146µs" function="nodemap.newNodeMap.func1 (cell.go:24)" subsys=hive
level=info msg="Start hook executed" duration="100.498µs" function="eventsmap.newEventsMap.func1 (cell.go:36)" subsys=hive
level=warning msg="iptables modules could not be initialized. It probably means that iptables is not available on this system" error="could not load module iptable_raw: exit status

How to reproduce the problem:

We are using VPC CNI and Cilium in chaining mode on an IPv6-only EKS cluster; a sketch of the helm configuration follows the version list below.

The versions currently in use:

  • cilium: 1.14.6 (installed via helm)
  • VPC CNI: v1.18.0-eksbuild.1 (installed as EKS addon)
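
For reference, a minimal sketch of the helm install for chaining mode, assuming the documented AWS chaining values for the Cilium 1.14 chart (our exact values may differ):

# Install Cilium 1.14.6 in chaining mode behind the VPC CNI on an
# IPv6-only cluster (value names from the upstream chart docs).
helm install cilium cilium/cilium --version 1.14.6 \
  --namespace kube-system \
  --set cni.chainingMode=aws-cni \
  --set cni.exclusive=false \
  --set routingMode=native \
  --set endpointRoutes.enabled=true \
  --set ipv6.enabled=true \
  --set enableIPv6Masquerade=false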

The problem occurred when we upgraded from AMI version bottlerocket-aws-k8s-1.29-aarch64-v1.19.5-64049ba8 to bottlerocket-aws-k8s-1.29-aarch64-v1.20.0-fcf71a47 (the same issue affects both arm64 and amd64 instances).

We have done some investigation and debugging on the affected nodes and found that the issue is related to the ip6table kernel modules.
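
The shells below are in the host namespace, reached from the Bottlerocket admin container (sudo sheltie there drops you into a root shell on the host):

# From the admin container: enter a root shell in the host namespace,
# then inspect the loaded netfilter modules.
sudo sheltie
lsmod | grep ip6table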

A comparison of the loaded modules on the two OS versions:

1.19.5:

bash-5.1# lsmod |grep ip6table
ip6table_raw           16384  1
ip6table_filter        16384  1
ip6table_nat           16384  10
nf_nat                 57344  5 ip6table_nat,xt_nat,iptable_nat,xt_MASQUERADE,xt_REDIRECT
ip6table_mangle        16384  1

1.20.0:

bash-5.1# lsmod |grep ip6table
ip6table_filter        16384  1
ip6table_nat           16384  1
nf_nat                 53248  4 ip6table_nat,xt_nat,iptable_nat,xt_MASQUERADE
ip6table_mangle        16384  1

Note the missing ip6table_raw module in the new version. Only after loading it manually with modprobe does it appear in the list:

bash-5.1# modprobe ip6table_raw
bash-5.1# lsmod |grep ip6table
ip6table_raw           16384  0
ip6table_filter        16384  1
ip6table_nat           16384  1
nf_nat                 53248  4 ip6table_nat,xt_nat,iptable_nat,xt_MASQUERADE
ip6table_mangle        16384  1
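
Restarting the agent just means deleting the cilium pod on the affected node so the DaemonSet recreates it; a sketch, assuming the default k8s-app=cilium label:

# Delete the cilium-agent pod on this node; the DaemonSet recreates
# it, and the new pod picks up the now-loaded module.
kubectl -n kube-system delete pod -l k8s-app=cilium \
  --field-selector spec.nodeName=<node-name>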

After restarting cilium-agent, it runs successfully, and the ip6table_raw module now shows a nonzero use count:

bash-5.1# lsmod |grep ip6table_raw
ip6table_raw           16384  1
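
Until the image is fixed, the workaround can be persisted across reboots with the kernel module autoload setting, assuming it is available in your Bottlerocket release; a sketch, run from the admin container:

# Ask Bottlerocket to load the module at every boot (assumes the
# settings.kernel.modules autoload setting is supported).
apiclient set kernel.modules.ip6table_raw.autoload=true

# Equivalent instance user data (TOML):
#   [settings.kernel.modules.ip6table_raw]
#   autoload = true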

Just noticed that this is actually a duplicate of #3968.