kubernetes-sigs/gcp-compute-persistent-disk-csi-driver

Race condition between csi-driver and GCP

dguendisch opened this issue · 12 comments

We frequently run into situations where a pod's volume cannot be attached to some node Y because, on GCP, it is still attached to a node X where the pod was previously located. In Kubernetes, however, there are no traces of the volume being attached to node X: there is no VolumeAttachment resource mapping the volume to node X, and node X's .status.volumesAttached/.status.volumesInUse show no sign of that volume, which indicates that it was (at some point in time) successfully detached from X.

After a lot of digging (in the gcp-csi-driver and the GCP audit logs) I found the following race condition, presumably because sequential operations are neither ordered nor locked against ongoing operations. This is the ordered sequence of events (see the sketch after the list for an illustration of the missing serialization):

  • csi-driver attaches the disk to node X; the csi-driver call times out, but GCP has tracked the request (gcp-operation-ID: 1)
  • csi-driver attaches the disk to node X again; this time it succeeds (gcp-operation-ID: 2)
  • the pod gets rescheduled to another node Y about 2 minutes later, so the volume must move from node X to node Y
  • csi-driver detaches the disk from node X and succeeds (gcp-operation-ID: 3)
  • now gcp-operation-ID: 1 (resurrected from the dead) finally succeeds; the disk is attached to node X again
  • csi-driver tries to attach the disk to node Y (because of the pod reschedule) and never succeeds
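
To make the missing serialization concrete, here is a minimal, hypothetical Go sketch of per-disk operation tracking (opTracker, begin, and finish are made-up names, not the driver's actual code) that would refuse to issue a detach while an earlier attach request, even one whose client-side call timed out, might still be pending in GCP:

```go
// Hypothetical sketch only: serialize cloud operations per disk so a detach
// is never issued while an earlier attach request (including one whose RPC
// timed out on our side) may still be pending server-side.
package main

import (
	"fmt"
	"sync"
)

type opTracker struct {
	mu      sync.Mutex
	pending map[string]string // disk name -> in-flight operation ("attach"/"detach")
}

func newOpTracker() *opTracker {
	return &opTracker{pending: map[string]string{}}
}

// begin records an operation for a disk, refusing to start if another
// operation for the same disk is still outstanding.
func (t *opTracker) begin(disk, op string) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if prev, ok := t.pending[disk]; ok {
		return fmt.Errorf("operation %q still pending for disk %s; refusing %q", prev, disk, op)
	}
	t.pending[disk] = op
	return nil
}

// finish clears the record only once the cloud operation is known to have
// reached a terminal state (done or permanently failed), not on RPC timeout.
func (t *opTracker) finish(disk string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.pending, disk)
}

func main() {
	t := newOpTracker()
	_ = t.begin("disk-1", "attach") // attach issued; RPC times out but the GCP op may still complete
	if err := t.begin("disk-1", "detach"); err != nil {
		fmt.Println(err) // detach is held back until the earlier attach is resolved
	}
}
```

With something like this, op 3 in the sequence above would not be sent to GCP until op 1 is known to have reached a terminal state.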

Hmm, arcus LIFO queuing is the ultimate problem here. We'd fixed a bunch of these races with the error backoff (I think it was), but it seems there are still a few out there.

I'm not sure what the right fix is TBH. Since the volume is never marked on the old node, the attacher won't know that it needs to be detached.

A fix in arcus (the GCE/PD control plane) is actually in the process of rolling out. This fix will enforce FIFO ordering of operations and will merge things like op 1 and op 2 in your example. The rollout should be complete in about a month.
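
For illustration only, here is a rough Go sketch of what FIFO ordering with merging could look like for the example above; this is an assumption about the described behavior, not the actual arcus implementation:

```go
// Conceptual sketch: a per-disk FIFO queue that merges identical pending
// requests, roughly the behavior described for the arcus fix. Names and
// structure are assumptions, not the real control plane.
package main

import "fmt"

type request struct {
	op   string // "attach" or "detach"
	node string
}

type diskQueue struct {
	queue []request
}

// enqueue appends a request in FIFO order, merging it with an identical
// request that is already pending so duplicates (like op 1 and op 2 in the
// example above) collapse into a single operation.
func (q *diskQueue) enqueue(r request) {
	for _, existing := range q.queue {
		if existing == r {
			return // already queued; merge instead of adding a second copy
		}
	}
	q.queue = append(q.queue, r)
}

func main() {
	var q diskQueue
	q.enqueue(request{"attach", "node-X"}) // op 1 (timed out client-side)
	q.enqueue(request{"attach", "node-X"}) // op 2, merged with op 1
	q.enqueue(request{"detach", "node-X"}) // op 3, runs strictly after the attach
	fmt.Println(q.queue)                   // [{attach node-X} {detach node-X}]
}
```

Under such a scheme the stale op 1 can no longer resurface after the detach, because it is merged into op 2 and the detach is queued strictly behind it.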

The workarounds we've discussed for this at the CSI layer all have various levels of hackery and danger around them, so I think it's best to just wait for the arcus fix.

Thank you for this follow-up! Glad to hear about arcus enforcing FIFO soon 👍

qq: is the above fix meanwhile rolled out?


@mattcary @msau42 any news about the above question?


ping

The fix is currently rolling out. Should be complete within the next few weeks.

Thanks @msau42 ! Please let us know in this issue when the fix rollout is complete. We continue to see the above-described issue in our GCP clusters.