Incorrect FloatingIP workflow
serge-name opened this issue · 7 comments
/kind bug
What steps did you take and what happened:
I tried a CAPO build at commit 1d5d2d5e45462dab056e37a6c948361e81875ea9. Some key details follow:
- Created an OpenStackFloatingIPPool (non-relevant fields removed):
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: OpenStackFloatingIPPool
metadata:
  name: osfipp
spec:
  floatingIPNetwork:
    id: c7c8509d-7083-41c9-b799-e30e855e9bc0
  reclaimPolicy: Delete
  # …
- Created a MachineDeployment and an OpenStackMachineTemplate:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: OpenStackMachineTemplate
metadata:
  name: some
spec:
  template:
    spec:
      ports:
        - network:
            id: f16855bf-8ba1-4f75-ad8c-763e80134571
      floatingIPPoolRef:
        apiGroup: infrastructure.cluster.x-k8s.io/v1beta1
        kind: OpenStackFloatingIPPool
        name: osfipp
      # …
✅ The floating IP was created successfully. Here we get the correct data (fip.FloatingIP == "185.***.**.**", fip.FloatingNetworkID == "c7c8509d-7083-41c9-b799-e30e855e9bc0"):
cluster-api-provider-openstack/controllers/openstackmachine_controller.go
Lines 440 to 443 in 1d5d2d5
❌ Here we get port == nil and the error "Failed while associating ip from pool: port for floating IP "185.***.**.**" on network c7c8509d-7083-41c9-b799-e30e855e9bc0 does not exist":
cluster-api-provider-openstack/controllers/openstackmachine_controller.go
Lines 450 to 458 in 1d5d2d5
More details follow.
Here:
Openstack API returns the following (non-relevant fields skipped):
{
  "ports": [
    {
      "device_id": "d1b99e45-991c-4143-93a3-9a8d3eddb416",
      "device_owner": "compute:nova",
      "fixed_ips": [
        {
          "ip_address": "10.21.10.29",
          "subnet_id": "616388c0-519f-418e-80b4-3687a546a65e"
        }
      ],
      "id": "0d1fe3bd-55f6-41d0-b879-a4071a15b5c0",
      "network_id": "f16855bf-8ba1-4f75-ad8c-763e80134571"
      // …
    }
  ]
}
Please note that there is no port associated with the FIP network c7c8509d-7083-41c9-b799-e30e855e9bc0. Neither the FIP network ID nor the FIP itself will appear in the ports info, because in our OpenStack cloud floating IPs are not added to ports directly. Instead, a NAT rule 185.***.**.** → 10.21.10.29 is set up.
If the new k8s node got a FIP, it can be found here:
https://compute-api:8774/v2.1/TENANT_ID/servers/d1b99e45-991c-4143-93a3-9a8d3eddb416
The reply looks something like this (non-relevant fields skipped):
{
  "server": {
    "id": "d1b99e45-991c-4143-93a3-9a8d3eddb416",
    "hci_info": {
      "network": [
        {
          "ips": [
            "10.21.10.29"
          ],
          "network": {
            "id": "f16855bf-8ba1-4f75-ad8c-763e80134571",
            "subnets": [
              {
                "ips": [
                  {
                    "address": "10.21.10.29",
                    "type": "fixed",
                    "version": 4,
                    "floating_ips": [
                      {
                        "address": "185.***.**.**",
                        "type": "floating",
                        "version": 4
                      }
                    ]
                  }
                ]
              }
            ]
          }
        }
      ]
    }
  }
}
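For anyone who wants to check the fixed-to-floating mapping by hand, here is a minimal sketch that walks a server reply of the shape shown above. The hci_info layout is taken only from this one reply and looks vendor-specific, so treat the field names as assumptions; floating_ip_mappings is a hypothetical helper, not part of CAPO or any OpenStack client.

```python
def floating_ip_mappings(server):
    """Yield (fixed_address, floating_address) pairs from a server reply.

    Assumes the vendor-specific "hci_info" layout quoted in this issue.
    """
    for attachment in server.get("hci_info", {}).get("network", []):
        for subnet in attachment.get("network", {}).get("subnets", []):
            for ip in subnet.get("ips", []):
                for fip in ip.get("floating_ips", []):
                    yield ip["address"], fip["address"]


# Trimmed copy of the reply above (the floating address is masked as in the issue).
reply = {
    "server": {
        "id": "d1b99e45-991c-4143-93a3-9a8d3eddb416",
        "hci_info": {
            "network": [{
                "ips": ["10.21.10.29"],
                "network": {
                    "id": "f16855bf-8ba1-4f75-ad8c-763e80134571",
                    "subnets": [{
                        "ips": [{
                            "address": "10.21.10.29",
                            "type": "fixed",
                            "version": 4,
                            "floating_ips": [{
                                "address": "185.***.**.**",
                                "type": "floating",
                                "version": 4,
                            }],
                        }],
                    }],
                },
            }],
        },
    }
}

pairs = list(floating_ip_mappings(reply["server"]))
print(pairs)
```

A server without a FIP would simply produce an empty list, since every level falls back to an empty container.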
Here it tries to find a fixed IP in the FIP network, but in our OpenStack cloud all FIPs have device_owner == "network:floatingip", so it gets just an empty list:
cluster-api-provider-openstack/pkg/cloud/services/networking/port.go
Lines 71 to 76 in 1d5d2d5
What did you expect to happen:
Successfully deployed k8s node with FIP attached.
Anything else you would like to add:
None so far, but please ask for any details. The issue is reproducible and I can add more details if you want.
Environment:
- Cluster API Provider OpenStack version (Or git rev-parse HEAD if manually built): 1d5d2d5e45462dab056e37a6c948361e81875ea9
- Cluster-API version: 1.6.3
- OpenStack version: Virtuozzo (https://virtuozzo.com), based on Openstack Xena
- Minikube/KIND version: N/A
- Kubernetes version (use kubectl version): 1.29.3
- OS (e.g. from /etc/os-release): Talos (https://talos.dev) 1.6.7
What does f16855bf-8ba1-4f75-ad8c-763e80134571 look like? Does it have a router?
It's not really documented, but we don't create any new ports for the FIPs. We just look for an existing port that the FIP can be attached to, by checking whether there is a port on a subnet that has a router attached to the floating IP network.
I've mostly tested it with spec.ports omitted, using the default setup, but I can test with something closer to your setup if I know more about how that network is set up.
Yes, I meant that the new port is created by OpenStack itself, but not in our cloud. I'm not very familiar with OpenStack internals and have no access to configurations other than our particular cloud.
GET https://compute-api:9696/v2.0/networks/f16855bf-8ba1-4f75-ad8c-763e80134571
{ "network": { "id": "f16855bf-8ba1-4f75-ad8c-763e80134571", "name": "internal", "tenant_id": "278fda03174b4fee9358559baffca010", "admin_state_up": true, "mtu": 8913, "default_vnic_type": null, "status": "ACTIVE", "subnets": [ "616388c0-519f-418e-80b4-3687a546a65e" ], "shared": false, "availability_zone_hints": [], "availability_zones": [ "nova" ], "ipv4_address_scope": null, "ipv6_address_scope": null, "router:external": false, "description": "", "port_security_enabled": true, "rbac_policies": [ { "id": "c869c7ef-3c51-4fb6-88f5-c591989fe3ef", "action": "access_as_shared", "target_tenant": "d278dea8631e47ffba5a908265968fbb" } ], "qos_policy_id": null, "tags": [], "created_at": "2024-02-06T12:43:10Z", "updated_at": "2024-03-20T20:39:09Z", "revision_number": 5, "project_id": "278fda03174b4fee9358559baffca010", "provider:network_type": "vxlan" } }
GET https://compute-api:9696/v2.0/routers/7142d8f1-2b11-4ae2-a343-eacd77a2ceee
{ "router": { "id": "7142d8f1-2b11-4ae2-a343-eacd77a2ceee", "name": "DefaultRouter", "tenant_id": "278fda03174b4fee9358559baffca010", "admin_state_up": true, "status": "ACTIVE", "external_gateway_info": { "network_id": "c7c8509d-7083-41c9-b799-e30e855e9bc0", "external_fixed_ips": [ { "subnet_id": "aa2bc8f7-fa02-4851-ba13-93e57d4c69e1", "ip_address": "69.**.**.**" } ], "enable_snat": true }, "description": "", "availability_zones": [ "nova" ], "availability_zone_hints": [], "routes": [ ], "flavor_id": null, "tags": [], "created_at": "2024-02-06T11:49:58Z", "updated_at": "2024-03-29T14:41:39Z", "revision_number": 17, "project_id": "278fda03174b4fee9358559baffca010" } }
That router's external_fixed_ips entry is automatically pre-created by OpenStack.
If a VM has a FIP attached, outgoing connections are SNAT'ed from that FIP.
If a VM has no FIP, connections are SNAT'ed from the router's external IP.
GET https://compute-api:9696/v2.0/ports?device_id=7142d8f1-2b11-4ae2-a343-eacd77a2ceee
{ "ports": [ { "id": "0411af2f-d447-4f3c-88a7-1e8a57e70015", "name": "", "network_id": "f16855bf-8ba1-4f75-ad8c-763e80134571", "tenant_id": "", "mac_address": "fa:16:3e:44:38:7e", "admin_state_up": true, "status": "ACTIVE", "device_id": "7142d8f1-2b11-4ae2-a343-eacd77a2ceee", "device_owner": "network:router_centralized_snat", "fixed_ips": [ { "subnet_id": "616388c0-519f-418e-80b4-3687a546a65e", "ip_address": "10.21.11.1" } ], "allowed_address_pairs": [], "extra_dhcp_opts": [], "security_groups": [], "description": "", "binding:vnic_type": "normal", "port_security_enabled": false, "qos_policy_id": null, "qos_network_policy_id": null, "tags": [], "created_at": "2024-02-06T14:02:02Z", "updated_at": "2024-03-23T18:11:57Z", "revision_number": 40, "project_id": "" }, { "id": "ded9eafe-3ee0-4f29-9f7f-953470f3a3ae", "name": "", "network_id": "f16855bf-8ba1-4f75-ad8c-763e80134571", "tenant_id": "278fda03174b4fee9358559baffca010", "mac_address": "fa:16:3e:48:d2:da", "admin_state_up": true, "status": "ACTIVE", "device_id": "7142d8f1-2b11-4ae2-a343-eacd77a2ceee", "device_owner": "network:router_interface_distributed", "fixed_ips": [ { "subnet_id": "616388c0-519f-418e-80b4-3687a546a65e", "ip_address": "10.21.10.1" } ], "allowed_address_pairs": [], "extra_dhcp_opts": [], "security_groups": [], "description": "", "binding:vnic_type": "normal", "port_security_enabled": false, "qos_policy_id": null, "qos_network_policy_id": null, "tags": [], "created_at": "2024-02-06T14:02:02Z", "updated_at": "2024-04-02T10:33:28Z", "revision_number": 68, "project_id": "278fda03174b4fee9358559baffca010" } ] }
I've already come up with a quick fix: https://github.com/serge-name/cluster-api-provider-openstack/commit/bb19917957b82959f8406ed9778eebf82ebd7855. It works fine so far, but right now I'm too short on time to create a decent PR.
Does it work for you if you replace network:router_interface with network:router_interface_distributed?
Yes, network:router_interface_distributed works absolutely fine, as in the commit https://github.com/serge-name/cluster-api-provider-openstack/commit/a1bf5b88e40b9bc6c5d5f5208628a3e0193e70fe
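To make the device_owner point concrete, here is a rough sketch of the lookup as described earlier in this thread: find an instance port on a subnet that has a router attached to the FIP network. It is not the CAPO implementation; find_port_for_fip and ROUTER_INTERFACE_OWNERS are hypothetical names, and the data is trimmed from the API replies quoted above.

```python
# Device owners treated as router interfaces. The second value is the one
# this cloud reports (distributed router), per the replies in this issue.
ROUTER_INTERFACE_OWNERS = frozenset({
    "network:router_interface",
    "network:router_interface_distributed",
})


def find_port_for_fip(instance_ports, router_ports, routers, fip_network_id,
                      owners=ROUTER_INTERFACE_OWNERS):
    """Return the first instance port whose subnet has a router to fip_network_id."""
    # Map subnet ID -> IDs of routers with an accepted interface on that subnet.
    routers_by_subnet = {}
    for port in router_ports:
        if port["device_owner"] not in owners:
            continue
        for fixed_ip in port["fixed_ips"]:
            routers_by_subnet.setdefault(fixed_ip["subnet_id"], set()).add(port["device_id"])

    for port in instance_ports:
        for fixed_ip in port["fixed_ips"]:
            for router_id in routers_by_subnet.get(fixed_ip["subnet_id"], ()):
                gateway = routers[router_id].get("external_gateway_info") or {}
                if gateway.get("network_id") == fip_network_id:
                    return port
    return None


SUBNET = "616388c0-519f-418e-80b4-3687a546a65e"
ROUTER = "7142d8f1-2b11-4ae2-a343-eacd77a2ceee"
FIP_NETWORK = "c7c8509d-7083-41c9-b799-e30e855e9bc0"

instance_ports = [{
    "id": "0d1fe3bd-55f6-41d0-b879-a4071a15b5c0",
    "fixed_ips": [{"ip_address": "10.21.10.29", "subnet_id": SUBNET}],
}]
router_ports = [
    {"device_id": ROUTER, "device_owner": "network:router_centralized_snat",
     "fixed_ips": [{"subnet_id": SUBNET, "ip_address": "10.21.11.1"}]},
    {"device_id": ROUTER, "device_owner": "network:router_interface_distributed",
     "fixed_ips": [{"subnet_id": SUBNET, "ip_address": "10.21.10.1"}]},
]
routers = {ROUTER: {"external_gateway_info": {"network_id": FIP_NETWORK}}}

# Accepting only the plain owner finds nothing on this cloud.
strict = find_port_for_fip(instance_ports, router_ports, routers, FIP_NETWORK,
                           owners={"network:router_interface"})
# Also accepting the distributed owner finds the instance port.
relaxed = find_port_for_fip(instance_ports, router_ports, routers, FIP_NETWORK)
print(strict, relaxed["id"])
```

With only network:router_interface accepted, neither router port qualifies and the lookup returns None, which matches the port == nil symptom reported above. Once the distributed owner is accepted, the instance port 0d1fe3bd-55f6-41d0-b879-a4071a15b5c0 is found.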
@bilbobrovall thanks a lot! Your commit elastx@ce38e8b works fine for me and fixes the issue.
There are several minor errors caused by premature and frequent (8 API requests in 2 seconds) checks for the FIP. Not a problem for me, just something that can be improved later. Logs follow:
👍 It's probably just Neutron taking some time. I think the retries should be fine for now, since there's an exponential backoff when a reconciler returns the same error, but the initial retries feel a bit tight in this case.
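To illustrate why the first retries arrive so close together, here is a small sketch of a per-item exponential backoff. The 5 ms base delay and 1000 s cap are assumptions picked to resemble a typical controller work queue; they were not read from the CAPO configuration, and backoff_delays is a hypothetical helper.

```python
from itertools import accumulate


def backoff_delays(base_seconds, max_seconds, attempts):
    """Delay before each retry: base * 2**n, capped at max_seconds."""
    return [min(base_seconds * 2 ** n, max_seconds) for n in range(attempts)]


# Assumed parameters (not taken from CAPO): 5 ms base, 1000 s cap.
delays = backoff_delays(base_seconds=0.005, max_seconds=1000.0, attempts=10)
elapsed = list(accumulate(delays))

for n, (delay, total) in enumerate(zip(delays, elapsed), start=1):
    print(f"retry {n}: wait {delay:.3f}s, elapsed {total:.3f}s")

# Number of retries that fire within the first two seconds.
retries_within_2s = sum(1 for total in elapsed if total <= 2.0)
print(retries_within_2s)
```

Under these assumed parameters, eight retries fall within the first two seconds, which is roughly in line with the request rate reported above. After that the gaps grow quickly, so the burst is short-lived.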