openshift/sdn

Pod networking is broken after openvswitch is restarted

Closed this issue · 9 comments

bbl commented

Description

After the ovs pod is restarted, all pods on the corresponding node come up with broken networking. The gateway is not reachable, so no egress connections are possible.
If ovs-ofctl -O OpenFlow13 dump-ports-desc br0 is run inside the ovs pod, the output no longer shows the old vethXXX interfaces, even though they are still present on the host.
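The mismatch described above can be checked with a sketch like this (assuming `oc` access to the cluster; the `openshift-sdn` namespace and the pod name are assumptions from a typical 3.11 install):

```shell
# Hypothetical ovs pod name; find yours with:
#   oc -n openshift-sdn get pods -o wide
OVS_POD=ovs-abcde

# Ports that OVS currently knows about on br0
oc -n openshift-sdn exec "$OVS_POD" -- \
  ovs-ofctl -O OpenFlow13 dump-ports-desc br0

# veth interfaces still present on the host (run this on the node itself);
# after the ovs restart, these no longer appear in the dump above
ip -o link show type veth
```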

Version
  • The output of git describe of openshift-ansible
openshift-ansible-3.11.146-1-22-g37e13e5
  • ovs image version:
docker.io/openshift/origin-node:v3.11
Steps To Reproduce
  1. Delete/restart the ovs pod on the compute node.
  2. Run ovs-ofctl -O OpenFlow13 dump-ports-desc br0, verify that veth interfaces are missing.
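The two steps above can be sketched as follows (a hedged sketch, not a verified procedure; the node name is hypothetical, and the `openshift-sdn` namespace and `app=ovs` label are assumptions from a typical 3.11 install):

```shell
NODE=compute-1   # hypothetical node name

# 1. Delete the ovs pod on the compute node; its DaemonSet recreates it
oc -n openshift-sdn delete pod -l app=ovs \
  --field-selector "spec.nodeName=$NODE"

# 2. Once the new ovs pod is Running, dump br0's ports: the old vethXXX
#    interfaces of the node's pods are missing from the output
NEW_OVS_POD=$(oc -n openshift-sdn get pods -l app=ovs -o name \
  --field-selector "spec.nodeName=$NODE")
oc -n openshift-sdn exec "${NEW_OVS_POD#pod/}" -- \
  ovs-ofctl -O OpenFlow13 dump-ports-desc br0
```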
Expected Results

Pod networking is not broken after ovs is restarted; the old vethXXX interfaces are picked up again by ovs after the restart.

Additional Information
  • Operating system and version: CentOS 7

Restarting OVS should cause the SDN pod to restart, and it should reattach the pods then. Is that not happening?

bbl commented

Restarting OVS should cause the SDN pod to restart, and it should reattach the pods then. Is that not happening?

The sdn pod is restarted along with ovs, but the pods' networking is not reattached.

bbl commented

@danwinship we've encountered this issue one more time today. Is there any possible fix?

This may be fixed by #58, but it's not clear if/when that's going to be backported to 3.11. If that is the problem, then you could work around it by stopping the SDN pod before you restart the OVS pod, and then restarting the SDN pod afterward.
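The suggested workaround can be sketched as below. This is only a sketch under assumptions: the node name is hypothetical, the `openshift-sdn` namespace and `app=sdn`/`app=ovs` labels are taken from a typical 3.11 install, and since both pods belong to DaemonSets, "stopping" here means deleting the pod and letting the DaemonSet recreate it. The point is the ordering: the SDN pod must come up after the new OVS pod so it reattaches the node's pods.

```shell
NODE=compute-1   # hypothetical node name
SEL="--field-selector spec.nodeName=$NODE"

# Stop the SDN pod first (its DaemonSet will recreate it)
oc -n openshift-sdn delete pod -l app=sdn $SEL

# Restart the OVS pod
oc -n openshift-sdn delete pod -l app=ovs $SEL

# Restart the SDN pod again afterward, so the recreated SDN pod starts
# against the fresh OVS instance and reattaches the existing pods
oc -n openshift-sdn delete pod -l app=sdn $SEL
```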

@danwinship There's a backport PR proposed: openshift/origin#24318

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.