egressgateway
Abstract
The gateway provides network egress capabilities for Kubernetes clusters.
- Provide IPv4/IPv6 dual-stack connectivity.
- Provide high availability for egress nodes.
- Allow filtering by Pod egress policy (destination CIDR).
- Allow filtering of egress applications (Pods).
- Can be used with older kernel versions.
Background
Starting in 2021, we received feedback such as the following.
There are two clusters, A and B. Cluster A is VMware-based and mainly runs database workloads, and Cluster B is a Kubernetes cluster. Some applications in Cluster B need to access the database in Cluster A, and the network administrator wants the egress traffic of the Cluster B Pods to be managed through an egress gateway.
Proposal
CRDs
The egress gateway model abstracts three Custom Resource Definitions (CRDs): EgressGateway, EgressNode and EgressGatewayPolicy. They are cluster-scoped CRDs.
EgressGateway
apiVersion: egressgateway.spidernet.io/v1
kind: EgressGateway
metadata:
  name: "egressgateway"
spec:
  nodeSelector:
    matchLabels:
      egress: "true"
status:
  forwardMethod: "active-passive"
  nodeList:
    - node1:
      status: "ready"
      active: true
      interfaces:
        - eth0:
          ipv4: ["10.6.0.10/16"]
          ipv6: ["fd::10/64"]
- spec
  - nodeSelector: matched against node labels.
- status
  - forwardMethod: synced from the ConfigMap configuration.
  - nodeList: the list of nodes matched by nodeSelector.
    - status: the node status, which may be Ready, NotReady or Unknown. Only nodes in the Ready state can participate in the election of egress gateway nodes.
    - active: indicates that the non-egress gateway nodes are reconciling, or have finished reconciling, so that they reach the destination CIDR (e.g. the Cluster A CIDR in picture 1) through this node.
    - interfaces: the list of physical network interfaces. It is updated by the Agent.
      - ipv4: address list.
      - ipv6: address list.
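The election rule above (only Ready nodes are eligible) can be sketched in Go as follows. This is a minimal illustration, not the project's implementation: the NodeStatus type, the electActive function and the tie-breaking choices (keep the current active node if it is still Ready, otherwise take the first Ready node by name) are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// NodeStatus is a hypothetical Go view of one status.nodeList entry.
type NodeStatus struct {
	Name   string
	Status string // "Ready", "NotReady" or "Unknown"
	Active bool
}

// electActive returns the node that should be active.
// Assumed policy: a Ready node that is already active keeps its role;
// otherwise the first Ready node in name order is chosen.
// It returns "" when no node is Ready.
func electActive(nodes []NodeStatus) string {
	ready := make([]string, 0, len(nodes))
	for _, n := range nodes {
		// The sample manifest spells the state in lower case, so compare
		// case-insensitively.
		if !strings.EqualFold(n.Status, "Ready") {
			continue
		}
		if n.Active {
			return n.Name
		}
		ready = append(ready, n.Name)
	}
	if len(ready) == 0 {
		return ""
	}
	sort.Strings(ready)
	return ready[0]
}

func main() {
	nodes := []NodeStatus{
		{Name: "node2", Status: "Ready"},
		{Name: "node1", Status: "NotReady", Active: true},
		{Name: "node3", Status: "Ready"},
	}
	// node1 is active but not Ready, so a new node is elected.
	fmt.Println(electActive(nodes))
}
```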
EgressNode
apiVersion: egressgateway.spidernet.io/v1
kind: EgressNode
metadata:
  name: "node1"
spec:
status:
  phase: "Succeeded"
  vxlanIPv4IP: "172.31.0.10/16"
  vxlanIPv6IP: "fe80::/64"
  tunnelMac: "xx:xx:xx:xx:xx"
  physicalInterface: "eth1"
  physicalInterfaceIPv4: ""
  physicalInterfaceIPv6: ""
The EgressNode CRD stores VXLAN tunnel information, which the Controller generates from the Node CR.
- status
  - phase: the status of the EgressNode. Pending means the node is waiting for IP assignment, Init means the tunnel IP address has been assigned, Succeeded means the address has been assigned and the tunnel has been built, and Failed means the tunnel IP address could not be assigned.
  - vxlanIPv4IP: the IPv4 address of the VXLAN tunnel.
  - vxlanIPv6IP: the IPv6 address of the VXLAN tunnel.
  - tunnelMac: the MAC address of the IPv4 VXLAN tunnel interface.
  - physicalInterface: the name of the parent interface of the VXLAN tunnel interface.
  - physicalInterfaceIPv4: the IPv4 address of the parent interface of the VXLAN tunnel interface.
  - physicalInterfaceIPv6: the IPv6 address of the parent interface of the VXLAN tunnel interface.
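The four phase values can be read as a small state derivation. The Go sketch below shows one possible mapping from observed facts to a phase; the Phase type, the derivePhase function and its three boolean inputs are hypothetical names chosen for illustration, and the order in which the conditions are checked is an assumption.

```go
package main

import "fmt"

// Phase is a hypothetical Go type for status.phase.
type Phase string

const (
	PhasePending   Phase = "Pending"   // waiting for IP assignment
	PhaseInit      Phase = "Init"      // tunnel IP assigned
	PhaseSucceeded Phase = "Succeeded" // IP assigned and tunnel built
	PhaseFailed    Phase = "Failed"    // tunnel IP assignment failed
)

// derivePhase maps the observed facts to a phase.
// Assumption: an allocation failure takes precedence over everything else.
func derivePhase(ipAssigned, tunnelBuilt, allocFailed bool) Phase {
	switch {
	case allocFailed:
		return PhaseFailed
	case !ipAssigned:
		return PhasePending
	case !tunnelBuilt:
		return PhaseInit
	default:
		return PhaseSucceeded
	}
}

func main() {
	fmt.Println(derivePhase(false, false, false))
	fmt.Println(derivePhase(true, false, false))
	fmt.Println(derivePhase(true, true, false))
	fmt.Println(derivePhase(false, false, true))
}
```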
EgressGatewayPolicy
apiVersion: egressgateway.spidernet.io/v1
kind: EgressGatewayPolicy
metadata:
  name: "policy"
spec:
  appliedTo:
    podSelector:
      matchLabels:
        app: "shopping"
    ipv6PodSubnet: "10.0.0.0/16"
    ipv4PodSubnet: "10.0.0.0/16"
  destCIDR:
    - "10.6.1.0/24"
- spec
  - podSelector: selects the group of Pods to which the policy applies.
  - podSubnet (ipv4PodSubnet, ipv6PodSubnet): specifies the Pod CIDR affected by the egress policy. It conflicts with the podSelector field.
  - destCIDR: the list of destination CIDR blocks.
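The constraints on these fields lend themselves to a validation step, for example in a webhook. The following Go sketch is only an illustration under stated assumptions: PolicySpec is a hypothetical flattened struct, not the project's API type, and validate checks just the rules stated above (podSelector conflicts with podSubnet, and every CIDR must parse), plus an address-family check on the two subnet fields.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
)

// PolicySpec is a hypothetical, flattened view of the policy spec.
type PolicySpec struct {
	PodSelector   map[string]string // podSelector.matchLabels
	IPv4PodSubnet string
	IPv6PodSubnet string
	DestCIDR      []string
}

// checkPrefix parses value as a CIDR and verifies its address family.
func checkPrefix(field, value string, wantV4 bool) error {
	pfx, err := netip.ParsePrefix(value)
	if err != nil {
		return fmt.Errorf("%s %q: %w", field, value, err)
	}
	if pfx.Addr().Is4() != wantV4 {
		return fmt.Errorf("%s %q: wrong address family", field, value)
	}
	return nil
}

// validate enforces the podSelector/podSubnet conflict and CIDR syntax.
func validate(p PolicySpec) error {
	hasSubnet := p.IPv4PodSubnet != "" || p.IPv6PodSubnet != ""
	if len(p.PodSelector) > 0 && hasSubnet {
		return errors.New("podSelector and podSubnet cannot be set together")
	}
	if p.IPv4PodSubnet != "" {
		if err := checkPrefix("ipv4PodSubnet", p.IPv4PodSubnet, true); err != nil {
			return err
		}
	}
	if p.IPv6PodSubnet != "" {
		if err := checkPrefix("ipv6PodSubnet", p.IPv6PodSubnet, false); err != nil {
			return err
		}
	}
	for _, c := range p.DestCIDR {
		if _, err := netip.ParsePrefix(c); err != nil {
			return fmt.Errorf("destCIDR %q: %w", c, err)
		}
	}
	return nil
}

func main() {
	bySelector := PolicySpec{
		PodSelector: map[string]string{"app": "shopping"},
		DestCIDR:    []string{"10.6.1.0/24"},
	}
	fmt.Println(validate(bySelector))

	// An IPv4 CIDR in the IPv6 field is rejected.
	wrongFamily := PolicySpec{IPv6PodSubnet: "10.0.0.0/16"}
	fmt.Println(validate(wrongFamily))
}
```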
Datapath
Policy control requires a combination of VXLAN tunnels, ipset, iptables and routing rules.
Non Egress Node
VXLAN
Build a VXLAN tunnel on the cluster nodes. There are two tunnel NICs, named egress-vxlan-v4 and egress-vxlan-v6.
IPSet
sudo ipset create egress-dst-policy-name hash:net
sudo ipset add egress-dst-policy-name 172.16.1.1/32
IPTables
iptables -t mangle -F EGRESSGATEWAY-MARK-REQUEST-POLICY-NAME
iptables -t mangle -X EGRESSGATEWAY-MARK-REQUEST-POLICY-NAME
iptables -t mangle -N EGRESSGATEWAY-MARK-REQUEST-POLICY-NAME
iptables -A EGRESSGATEWAY-MARK-REQUEST-POLICY-NAME \
-t mangle \
-m conntrack --ctdir ORIGINAL \
-m set --match-set egress-dst-policy-name dst \
-m set --match-set egress-src-policy-name src \
-j MARK --set-mark 0x11000000 \
-m comment --comment "rule uuid: mark request packet"
Route
Normal.
ip rule add fwmark 0x11000000 table 100
ip route flush table 100
ip route add default via 20.0.0.85 dev egress-vxlan-v4 onlink table 100
Equal-cost multi-path routing.
sysctl -w net.ipv4.fib_multipath_hash_policy=1
ip rule add fwmark 0x11000000 table 100
ip route flush table 100
ip route add table 100 default \
nexthop via 20.0.0.85 dev egress-vxlan-v4 onlink \
nexthop via 20.0.0.90 dev egress-vxlan-v4 onlink
Egress Node
iptables -t mangle -I FORWARD 1 -m mark --mark 0x11000000 -j MARK --set-mark 0x12000000 -m comment --comment "egress gateway: change mark"
iptables -t filter -I FORWARD 1 -m mark --mark 0x12000000 -j ACCEPT -m comment --comment "egress gateway: keep mark"
iptables -t filter -I OUTPUT 1 -m mark --mark 0x12000000 -j ACCEPT -m comment --comment "egress gateway: keep mark"
iptables -t mangle -I POSTROUTING 1 -m mark --mark 0x12000000 -j ACCEPT -m comment --comment "egress gateway: keep mark"
iptables -t nat -I POSTROUTING 1 -m mark --mark 0x12000000 -j ACCEPT -m comment --comment "egress gateway: no snat"
CNI Compatibility
Calico
Calico requires chainInsertMode to be set to Append, as in the following example. For more details, refer to the Calico docs:
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  ipv6Support: false
  ipipMTU: 1400
  chainInsertMode: Append
Implementation
Controller
The Controller consists of a Webhook Validator and a Reconcile Flow.
The Controller has two control processes. The first watches the cluster nodes, generates a tunnel IP address and MAC address for each Node, and then creates or updates the EgressNode CR status. The second watches EgressNode and EgressGateway, syncs the list of nodes matched by the nodeSelector, and elects the egress gateway node.
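One way the first control process could generate a stable tunnel MAC address is to derive it from the node name. The Go sketch below does this with a SHA-256 hash. It is an assumption for illustration only: the tunnelMAC function is hypothetical, and the source does not say how the Controller actually generates the address.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"net"
)

// tunnelMAC derives a deterministic MAC address from a node name.
// The first octet is adjusted so that the address is locally administered
// (bit 0x02 set) and unicast (bit 0x01 cleared).
func tunnelMAC(nodeName string) net.HardwareAddr {
	sum := sha256.Sum256([]byte(nodeName))
	mac := net.HardwareAddr(sum[:6])
	mac[0] = (mac[0] | 0x02) & 0xfe
	return mac
}

func main() {
	// The same node name always yields the same address.
	fmt.Println(tunnelMAC("node1"))
	fmt.Println(tunnelMAC("node2"))
}
```

Deriving the address from the name keeps it stable across Controller restarts without extra state, at the cost of a small collision probability that a real allocator would need to handle.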
Agent
The Agent has two control processes. The first watches the EgressNode CR and manages the node tunnel; the node tunnel is a pluggable interface, so VXLAN can be replaced by Geneve. The second manages the datapath policy: it watches EgressNode, EgressGateway and EgressGatewayPolicy, and applies them to the host through the policy interface. It is currently implemented by a combination of ipset, iptables and route, and it can be replaced by eBPF.
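The pluggable tunnel described above can be pictured as a small Go interface with interchangeable backends. The sketch below is not the project's API: the Tunnel interface, the Peer type, the syncPeers function and the in-memory fakeVXLAN backend are hypothetical names used to show the shape of the design.

```go
package main

import (
	"errors"
	"fmt"
)

// Peer holds the tunnel data of a remote node (hypothetical type).
type Peer struct {
	Node     string
	TunnelIP string
	MAC      string
}

// Tunnel is a hypothetical pluggable tunnel backend; a VXLAN or Geneve
// implementation would satisfy the same interface.
type Tunnel interface {
	Kind() string
	EnsurePeer(p Peer) error
}

// fakeVXLAN records peers in memory instead of touching the host.
type fakeVXLAN struct {
	peers map[string]Peer
}

func (v *fakeVXLAN) Kind() string { return "vxlan" }

func (v *fakeVXLAN) EnsurePeer(p Peer) error {
	if p.Node == "" {
		return errors.New("peer has no node name")
	}
	v.peers[p.Node] = p
	return nil
}

// syncPeers pushes every peer to the backend and stops at the first error.
func syncPeers(t Tunnel, peers []Peer) error {
	for _, p := range peers {
		if err := t.EnsurePeer(p); err != nil {
			return fmt.Errorf("%s: %w", t.Kind(), err)
		}
	}
	return nil
}

func main() {
	backend := &fakeVXLAN{peers: map[string]Peer{}}
	err := syncPeers(backend, []Peer{
		{Node: "node1", TunnelIP: "172.31.0.10"},
		{Node: "node2", TunnelIP: "172.31.0.11"},
	})
	fmt.Println(err, len(backend.peers))
}
```

Because the control loop depends only on the interface, swapping the backend does not change the reconcile code.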
Go Package (Structure) Design
├── api
│   └── v1
├── charts
├── cmd
│   ├── agent
│   │   ├── cmd
│   │   │   └── root.go
│   │   └── main.go
│   └── controller
│       ├── cmd
│       │   └── root.go
│       └── main.go
├── docs
├── images
├── output
├── pkg
│   ├── config
│   │   └── config.go
│   ├── agent
│   │   ├── agent.go
│   │   ├── egress_gateway_node.go
│   │   ├── egress_node.go
│   │   ├── egress_police.go
│   │   ├── iptables
│   │   │   └── iptables.go
│   │   ├── route
│   │   │   └── route.go
│   │   └── vxlan
│   │       └── vxlan.go
│   ├── controller
│   │   ├── allocator
│   │   │   └── interface.go
│   │   ├── controller.go
│   │   ├── controller_test.go
│   │   ├── egress_gateway_node.go
│   │   ├── node.go
│   │   └── webhook
│   │       ├── mutating.go
│   │       └── validate.go
│   ├── ipset
│   │   ├── ipset.go
│   │   └── types.go
│   ├── k8s
│   ├── lock
│   ├── logger
│   ├── metrics
│   ├── profiling
│   ├── schema
│   └── types
├── test
├── tools
└── vendor
Develop
Refer to develop.