openshift/sdn

Pod networking is broken after openvswitch is restarted

Closed this issue · 9 comments

bbl commented

Description

After the ovs pod is restarted, all pods on the corresponding node come up with broken networking. The gateway is not reachable, so no egress connections are possible.
If ovs-ofctl -O OpenFlow13 dump-ports-desc br0 is run inside the ovs pod, the output no longer shows the old vethXXX interfaces, even though they are still present on the host.
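The mismatch described above can be checked with a sketch like this (assuming `oc` access to the cluster; the `openshift-sdn` namespace and the pod name are assumptions from a typical 3.11 install):

```shell
# Hypothetical ovs pod name; find yours with:
#   oc -n openshift-sdn get pods -o wide
OVS_POD=ovs-abcde

# Ports that OVS currently knows about on br0
oc -n openshift-sdn exec "$OVS_POD" -- \
  ovs-ofctl -O OpenFlow13 dump-ports-desc br0

# veth interfaces still present on the host (run this on the node itself);
# after the ovs restart, these no longer appear in the dump above
ip -o link show type veth
```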

Version
  • The output of git describe of openshift-ansible
openshift-ansible-3.11.146-1-22-g37e13e5
  • ovs image version:
docker.io/openshift/origin-node:v3.11
Steps To Reproduce
  1. Delete/restart the ovs pod on the compute node.
  2. Run ovs-ofctl -O OpenFlow13 dump-ports-desc br0, verify that veth interfaces are missing.
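The two steps above can be sketched as follows (a hedged sketch, not a verified procedure; the node name is hypothetical, and the `openshift-sdn` namespace and `app=ovs` label are assumptions from a typical 3.11 install):

```shell
NODE=compute-1   # hypothetical node name

# 1. Delete the ovs pod on the compute node; its DaemonSet recreates it
oc -n openshift-sdn delete pod -l app=ovs \
  --field-selector "spec.nodeName=$NODE"

# 2. Once the new ovs pod is Running, dump br0's ports: the old vethXXX
#    interfaces of the node's pods are missing from the output
NEW_OVS_POD=$(oc -n openshift-sdn get pods -l app=ovs -o name \
  --field-selector "spec.nodeName=$NODE")
oc -n openshift-sdn exec "${NEW_OVS_POD#pod/}" -- \
  ovs-ofctl -O OpenFlow13 dump-ports-desc br0
```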
Expected Results

Pod networking is not broken after ovs is restarted; the old vethXXX interfaces are picked up again by ovs after the restart.

Additional Information
  • Operating system and version: CentOS 7

Restarting OVS should cause the SDN pod to restart, and it should reattach the pods then. Is that not happening?

bbl commented

Restarting OVS should cause the SDN pod to restart, and it should reattach the pods then. Is that not happening?

The sdn pod is restarted along with ovs, but the pods' networking is not reattached.

bbl commented

@danwinship we've encountered this issue one more time today. Is there any possible fix?

This may be fixed by #58, but it's not clear if/when that's going to be backported to 3.11. If that is the problem, then you could work around it by stopping the SDN pod before you restart the OVS pod, and then restarting the SDN pod afterward.
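The suggested workaround can be sketched as below. This is only a sketch under assumptions: the node name is hypothetical, the `openshift-sdn` namespace and `app=sdn`/`app=ovs` labels are taken from a typical 3.11 install, and since both pods belong to DaemonSets, "stopping" here means deleting the pod and letting the DaemonSet recreate it. The point is the ordering: the SDN pod must come up after the new OVS pod so it reattaches the node's pods.

```shell
NODE=compute-1   # hypothetical node name
SEL="--field-selector spec.nodeName=$NODE"

# Stop the SDN pod first (its DaemonSet will recreate it)
oc -n openshift-sdn delete pod -l app=sdn $SEL

# Restart the OVS pod
oc -n openshift-sdn delete pod -l app=ovs $SEL

# Restart the SDN pod again afterward, so the recreated SDN pod starts
# against the fresh OVS instance and reattaches the existing pods
oc -n openshift-sdn delete pod -l app=sdn $SEL
```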

@danwinship There's a backport PR proposed: openshift/origin#24318

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.