One bad CSI volume can stop all volumes from being scheduled
s4ke opened this issue · 11 comments
While implementing CSI support for Hetzner Cloud we ran into some strange behaviour around cases where one "broken" volume can cause all other volumes to not schedule properly.
In the case of the Hetzner plugin, this was caused by volumes created in hel1 not being allowed to be scheduled in nbg1, which is expected. However, because there was also an attempt to publish this volume on a node in nbg1 (How would we prevent this? Do/should cluster volumes have support for placement constraints?), publishing will never succeed on these nodes. Another volume created in nbg1, which should have succeeded, was then blocked by the earlier failure.
Only after force-removing all volumes and services and starting from scratch (this time making sure to schedule everything in nbg1 only) were we able to properly create a volume.
See hetznercloud/csi-driver#376 (comment) for further details.
@neersighted should we create another issue for cluster volumes not having support for placement constraints?
@dperny Can you take a look here? The linked comment on the hetznercloud CSI driver explains the issue quite well. I would love to help in any way I can.
I'm taking a look at this issue now.
@s4ke I believe there are several issues afoot here:
- There is some error causing volumes to attempt to be scheduled to invalid nodes.
- There is some error resulting in other errors locking up the volume management component.
- There is some other error that may result in nodes incorrectly reporting volume status to the managers.
It's all quite nasty.
@dperny okay. Thanks for taking a look. Were you able to reproduce this issue? Do you think that this is something on the driver level or in swarmkit?
So, the open questions I have right now about the linked issue:
The Volume is getting scheduled to a node which is outside of its availability constraint. This is odd. However, the `CSIInfo` field shows only `PluginName`, `NodeID`, and `MaxVolumesPerNode`. It does not seem to show `AccessibleTopology`. CSI volumes are scheduled based on `AccessibleTopology` as reported by the Node (which it gets from the plugin), and not by Labels on the Node (which are used for regular placement constraints).
But that said, even a blank `AccessibleTopology` should not result in a decision to schedule the Volume to the Node. I've checked the function myself and even written a test. It should not be scheduled. So the question remains: why?
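To make the expected behaviour concrete, here is a minimal sketch of a topology check of the kind described above. It is not the swarmkit scheduler code: the `Topology` type, the `fitsTopology` function, and the `locationKey` segment name are all hypothetical stand-ins, and the matching rule (every segment the volume requires must be present on the node with the same value) is an assumption about how such a check could work.

```go
package main

import "fmt"

// locationKey is a hypothetical segment key used only for this example.
const locationKey = "example/location"

// Topology is a hypothetical stand-in for a CSI topology: a set of
// key/value segments such as a location or zone.
type Topology struct {
	Segments map[string]string
}

// matchesAll reports whether the node carries every segment that `want`
// requires, with identical values. A `want` with no segments matches any node.
func matchesAll(node, want Topology) bool {
	for key, value := range want.Segments {
		if got, ok := node.Segments[key]; !ok || got != value {
			return false
		}
	}
	return true
}

// fitsTopology reports whether a node can access a volume. A volume with no
// accessible topologies is treated as unconstrained (an assumption); otherwise
// the node must match at least one of the listed topologies.
func fitsTopology(node Topology, accessible []Topology) bool {
	if len(accessible) == 0 {
		return true
	}
	for _, want := range accessible {
		if matchesAll(node, want) {
			return true
		}
	}
	return false
}

func main() {
	hel1Volume := []Topology{{Segments: map[string]string{locationKey: "hel1"}}}
	nbg1Node := Topology{Segments: map[string]string{locationKey: "nbg1"}}
	hel1Node := Topology{Segments: map[string]string{locationKey: "hel1"}}
	blankNode := Topology{} // node that reports no topology at all

	fmt.Println(fitsTopology(nbg1Node, hel1Volume))  // false
	fmt.Println(fitsTopology(hel1Node, hel1Volume))  // true
	fmt.Println(fitsTopology(blankNode, hel1Volume)) // false
}
```

Under these assumptions, a node that reports a blank topology never matches a volume restricted to hel1, which is the behaviour the comment above expects.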
Further, there is a related issue that I know of, in which a Volume object loses the `PublishStatus` for all nodes and ends up back in `Created` status. So I know there is an issue, somewhere, with the Volume `PublishStatus` being incorrectly set. The question there is also: why?
Next, why is the Volume ID not being included in the `ControllerPublishVolume` request, as the logs seem to indicate? That shouldn't be possible. The Volume is not considered Created until it has its ID.
There's a rat's nest of issues that I suspect are all related to a small set of root causes. Whoever wrote this CSI code is a doofus.
I will see what I can do to help answer your questions and when.
Could this be related? moby/moby#45547
That's exactly the issue I had in mind.
OK, for starters, I have figured out one problem.
This is where we convert the gRPC response into Docker API objects:
And this is the Docker API object in question:
It seems I am forgetting something critical in the conversion. AccessibleTopology is being ignored in the conversion, which makes debugging this issue difficult. The Scheduler takes into account the AccessibleTopology of the Node as reported by the CSI plugin to make its scheduling decisions.
Without knowing what AccessibleTopology is being reported, I cannot know if the Scheduler is making an error, or is suffering from garbage-in-garbage-out.
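As an illustration of the omission described above, the following sketch shows a conversion that carries the topology through instead of dropping it. The types `grpcCSIInfo`, `apiCSIInfo`, and `topology` and the function `convertCSIInfo` are hypothetical stand-ins written for this example; they are not the actual swarmkit or Docker API definitions, and only the four field names mentioned in this thread are taken from the discussion.

```go
package main

import "fmt"

// topology is a hypothetical stand-in for a reported CSI topology.
type topology struct {
	Segments map[string]string
}

// grpcCSIInfo is a hypothetical stand-in for the per-plugin node info
// received over gRPC.
type grpcCSIInfo struct {
	PluginName         string
	NodeID             string
	MaxVolumesPerNode  int64
	AccessibleTopology *topology
}

// apiCSIInfo is a hypothetical stand-in for the object exposed by the API.
type apiCSIInfo struct {
	PluginName         string
	NodeID             string
	MaxVolumesPerNode  int64
	AccessibleTopology *topology
}

// convertCSIInfo copies every field, including AccessibleTopology, so the
// reported topology stays visible when inspecting a node.
func convertCSIInfo(in []*grpcCSIInfo) []apiCSIInfo {
	out := make([]apiCSIInfo, 0, len(in))
	for _, info := range in {
		if info == nil {
			continue // skip missing entries rather than dereferencing nil
		}
		converted := apiCSIInfo{
			PluginName:        info.PluginName,
			NodeID:            info.NodeID,
			MaxVolumesPerNode: info.MaxVolumesPerNode,
		}
		if info.AccessibleTopology != nil {
			// Copy the segments so the result does not alias the input map.
			segments := make(map[string]string, len(info.AccessibleTopology.Segments))
			for k, v := range info.AccessibleTopology.Segments {
				segments[k] = v
			}
			converted.AccessibleTopology = &topology{Segments: segments}
		}
		out = append(out, converted)
	}
	return out
}

func main() {
	in := []*grpcCSIInfo{{
		PluginName:         "example-plugin",
		NodeID:             "node-1",
		MaxVolumesPerNode:  16,
		AccessibleTopology: &topology{Segments: map[string]string{"example/location": "nbg1"}},
	}}
	for _, info := range convertCSIInfo(in) {
		fmt.Println(info.PluginName, info.NodeID, info.AccessibleTopology.Segments)
	}
}
```

With the topology present in the converted object, it becomes possible to tell from the node's reported state whether the scheduler received a wrong topology or made a wrong decision on a correct one.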
It's been a while since I was in this discussion. I honestly lost track of where we are with this. How can I help with this? Are the questions still open?