moby/swarmkit

One bad CSI volume can stop all volumes from being scheduled

s4ke opened this issue · 11 comments

s4ke commented

While implementing CSI support for Hetzner Cloud we ran into some strange behaviour around cases where one "broken" volume can cause all other volumes to not schedule properly.

In the case of the Hetzner plugin, the cause was that volumes created in hel1 are generally not allowed to be scheduled in nbg1, which is expected. However, an attempt was also made to publish this volume on the node in nbg1 (How would we prevent this? Do/should cluster volumes have support for placement constraints?), and that can never work on these nodes. Another volume created in nbg1, which should have succeeded, was then blocked by the earlier failure.

Only after force removing all volumes and services and starting from scratch (and this time making sure to only schedule stuff in nbg1), we were able to properly create a volume.

see hetznercloud/csi-driver#376 (comment) for further details

s4ke commented

@neersighted should we create another issue for cluster volumes not having support for placement constraints?

s4ke commented

@dperny Can you take a look here? The linked comment on the hetznercloud csi-driver explains the issue quite well. I would love to help in any way I can here.

I'm taking a look at this issue now.

@s4ke I believe there are several issues afoot here:

  1. There is some error causing volumes to attempt to be scheduled to invalid nodes.
  2. There is some error resulting in other errors locking up the volume management component.
  3. There is some other error that may result in nodes incorrectly reporting volume status to the managers.

It's all quite nasty.

s4ke commented

@dperny okay. Thanks for taking a look. Were you able to reproduce this issue? Do you think that this is something on the driver level or in swarmkit?

So, the open questions I have right now about the linked issue:

The Volume is getting scheduled to a node which is outside of its availability constraint. This is odd. However, the CSIInfo field shows only PluginName, NodeID, and MaxVolumesPerNode. It does not seem to show AccessibleTopology. CSI volumes are scheduled based on AccessibleTopology as reported by the Node (which it gets from the plugin), and not by Labels on the Node (which are used for regular placement constraints).
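To make the expected behaviour concrete, here is a minimal sketch of a topology check of the kind described above. It is not the swarmkit scheduler code: the `Topology` type, the function names, and the `example.com/location` segment key are all hypothetical, and the choice to reject a node that reports no segments when the volume has requirements is an assumption made to match the expectation stated in this comment.

```go
package main

import "fmt"

// Topology is a hypothetical stand-in for a CSI topology message:
// a set of key/value segments such as a location or zone.
type Topology struct {
	Segments map[string]string
}

// nodeSatisfies reports whether every segment required by the volume
// topology is present on the node with the same value.
func nodeSatisfies(node, required Topology) bool {
	for key, want := range required.Segments {
		if got, ok := node.Segments[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// volumeAccessibleFrom reports whether a node may use a volume. A volume
// with no topology requirements is treated as accessible everywhere
// (an assumption of this sketch); otherwise the node must satisfy at
// least one of the volume's accessible topologies.
func volumeAccessibleFrom(node Topology, accessible []Topology) bool {
	if len(accessible) == 0 {
		return true
	}
	for _, required := range accessible {
		if nodeSatisfies(node, required) {
			return true
		}
	}
	return false
}

func main() {
	// "example.com/location" is an illustrative key, not the real plugin key.
	nbg1Node := Topology{Segments: map[string]string{"example.com/location": "nbg1"}}
	hel1Volume := []Topology{{Segments: map[string]string{"example.com/location": "hel1"}}}
	nbg1Volume := []Topology{{Segments: map[string]string{"example.com/location": "nbg1"}}}

	fmt.Println(volumeAccessibleFrom(nbg1Node, hel1Volume))   // false
	fmt.Println(volumeAccessibleFrom(nbg1Node, nbg1Volume))   // true
	fmt.Println(volumeAccessibleFrom(Topology{}, hel1Volume)) // false: blank node topology
}
```

Under these assumptions, a hel1 volume is never a candidate for an nbg1 node, and a node with a blank topology is not a candidate for any volume that carries requirements.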

But that said, even a blank AccessibleTopology should not result in a decision to schedule the Volume to the Node. I've checked the function myself, even written a test. It should not be scheduled. So the question remains, why?

Further, I know of another issue, which I believe is related, in which a Volume object loses the PublishStatus for all nodes and ends up back in Created status. So I know there is an issue, somewhere, with the Volume PublishStatus being incorrectly set. The question for that is also why?

Next, why is the Volume ID not being included in the ControllerPublishVolume request, as the logs seem to indicate? That shouldn't be possible. The Volume is not considered Created until it has its ID.

There's a rat's nest of issues that I suspect are all related to a small set of root causes. Whoever wrote this CSI code is a doofus.

s4ke commented

I will see what I can do to help answer your questions and when.

s4ke commented

Could this be related? moby/moby#45547

That's exactly the issue I had in mind.

OK, for starters, I have figured out one problem.

This is where we convert the gRPC response into Docker API objects:

https://github.com/moby/moby/blob/b3843992fc12536908fea2fea3ece05725b1e613/daemon/cluster/convert/node.go#L59-L70

And this is the Docker API object in question:

https://github.com/moby/moby/blob/b3843992fc12536908fea2fea3ece05725b1e613/api/types/swarm/node.go#L72-L85

It seems I am forgetting something critical in the conversion. AccessibleTopology is being ignored in the conversion, which makes debugging this issue difficult. The Scheduler takes into account the AccessibleTopology of the Node as reported by the CSI plugin to make its scheduling decisions.

Without knowing what AccessibleTopology is being reported, I cannot know if the Scheduler is making an error, or is suffering from garbage-in-garbage-out.
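For illustration, a conversion that carries the topology through might look roughly like the sketch below. The struct and function names are hypothetical stand-ins, not the actual swarmkit or Docker API types from the links above; only the field names PluginName, NodeID, MaxVolumesPerNode, and AccessibleTopology come from this discussion.

```go
package main

import "fmt"

// Topology is a hypothetical stand-in for the topology reported by a plugin.
type Topology struct {
	Segments map[string]string
}

// grpcCSIInfo is a hypothetical stand-in for the gRPC-side node CSI info.
type grpcCSIInfo struct {
	PluginName         string
	NodeID             string
	MaxVolumesPerNode  int64
	AccessibleTopology *Topology
}

// apiCSIInfo is a hypothetical stand-in for the Docker API-side object.
type apiCSIInfo struct {
	PluginName         string
	NodeID             string
	MaxVolumesPerNode  int64
	AccessibleTopology *Topology
}

// convertCSIInfo copies every field, including AccessibleTopology, so the
// API output shows what the scheduler actually sees.
func convertCSIInfo(in []*grpcCSIInfo) []apiCSIInfo {
	out := make([]apiCSIInfo, 0, len(in))
	for _, info := range in {
		if info == nil {
			continue
		}
		converted := apiCSIInfo{
			PluginName:        info.PluginName,
			NodeID:            info.NodeID,
			MaxVolumesPerNode: info.MaxVolumesPerNode,
		}
		if info.AccessibleTopology != nil {
			// Copy the segments so the result does not alias the input map.
			segments := make(map[string]string, len(info.AccessibleTopology.Segments))
			for key, value := range info.AccessibleTopology.Segments {
				segments[key] = value
			}
			converted.AccessibleTopology = &Topology{Segments: segments}
		}
		out = append(out, converted)
	}
	return out
}

func main() {
	// The plugin name, node ID, and segment key are made-up example values.
	in := []*grpcCSIInfo{{
		PluginName:         "example-plugin",
		NodeID:             "node-1",
		MaxVolumesPerNode:  16,
		AccessibleTopology: &Topology{Segments: map[string]string{"example.com/location": "nbg1"}},
	}}
	for _, info := range convertCSIInfo(in) {
		fmt.Println(info.PluginName, info.NodeID, info.AccessibleTopology.Segments)
	}
}
```

With the topology present in the converted object, inspecting a node would show which segments the plugin reported, which is the information needed to tell a scheduler error apart from bad input.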

s4ke commented

It's been a while since I was in this discussion. I honestly lost track of where we are with this. How can I help with this? Are the questions still open?