kubernetes-csi/external-provisioner

CSIStorageCapacity: Topology segment not updated

samuelluohaoen1 opened this issue · 20 comments

What happened:
After new node plugins join the cluster and report new AccessibleTopologies.Segments, the known segment information is not updated and no new CSIStorageCapacity objects are created.

What you expected to happen:
New node plugins reporting new values for existing topology segment keys should, in a sense, "expand" the value sets of those segments, which in turn should result in CSIStorageCapacity objects being created for the newly accessible segments.

How to reproduce it:

  1. Suppose the CSIDriver is named com.foo.bar. Check that STORAGECAPACITY is true (e.g. with kubectl get csidriver).
  2. Deploy the controller plugin but not the node plugin. Wait for external-provisioner to print "Initial number of topology segments 0, storage classes 0, potential CSIStorageCapacity objects 0" (to see this log, run external-provisioner with log level 5).
  3. Now CSINode should have DRIVERS: 0.
  4. Deploy the node plugin. Wait for the NodeGetInfo RPC to be called. The RPC should return something like
{
    "NodeId": "some-node",
    "AccessibleTopologies": {
        "Segments": {
            "kubernetes.io/hostname": "some-node"
        }
    }
}
  5. Now CSINode should show DRIVERS: 1, with the driver named com.foo.bar, Node ID: some-node, and Topology Keys: [kubernetes.io/hostname].
  6. Deploy a StorageClass with volumeBindingMode: WaitForFirstConsumer and provisioner: com.foo.bar (a sketch of such a StorageClass follows this list).
  7. Observe that no new CSIStorageCapacity object is created.
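For reference, a StorageClass along the lines of step 6 might look like the sketch below (the driver name com.foo.bar comes from the steps above; the object name and the heredoc are purely illustrative):

# Hypothetical StorageClass for step 6; only "provisioner" and
# "volumeBindingMode" matter for this issue.
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: foo-bar-local        # placeholder name
provisioner: com.foo.bar
volumeBindingMode: WaitForFirstConsumer
EOF

# With STORAGECAPACITY=true and one segment per node, one CSIStorageCapacity
# object per node and storage class would be expected afterwards:
kubectl get csistoragecapacities --all-namespaces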

Anything else we need to know?:
I am using the "kubernetes.io/hostname" label as the only topology key because we want capacity to be constrained per node: each PV is provisioned locally on some node. I am also assuming that "kubernetes.io/hostname" is unique across nodes and exists on every node by default (I hope this is a reasonable assumption).
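A quick way to sanity-check that assumption with plain kubectl (nothing here is specific to the driver):

# Every node should carry the kubernetes.io/hostname label:
kubectl get nodes -L kubernetes.io/hostname

# Once the node plugin has registered, CSINode should list it as a topology key:
kubectl get csinode -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.drivers[*].topologyKeys}{"\n"}{end}'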

Environment:

  • Driver version: v3.0.0
  • Kubernetes version (use kubectl version): 1.25+
  • OS (e.g. from /etc/os-release): Our in-house OS which is very similar to CentOS
  • Kernel (e.g. uname -a): Linux 4.18.0
  • Install tools: kubeadm
  • Others:

@pohly

pohly commented

No new CSIStorageCapacity object is created.

How do you check for this? With kubectl get csistoragecapacities or kubectl get --all-namespaces csistoragecapacities?

CSIStorageCapacity objects are namespaced, so the second command has to be used.

I tried to reproduce the issue with csi-driver-host-path v1.10.0, but there I get new CSIStorageCapacity objects after creating a storage class.
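To make the namespace scoping visible in one step, something like this can be used (the custom columns are standard CSIStorageCapacity fields, nothing driver-specific):

# Lists only the current namespace, which may not be where external-provisioner
# creates the objects:
kubectl get csistoragecapacities

# Lists them cluster-wide, together with their topology segment:
kubectl get csistoragecapacities --all-namespaces \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.storageClassName,CAPACITY:.capacity,TOPOLOGY:.nodeTopology.matchLabels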

pohly commented

My commands:

/deploy/kubernetes-distributed/deploy.sh
kubectl delete storageclass.storage.k8s.io/csi-hostpath-slow
kubectl delete storageclass.storage.k8s.io/csi-hostpath-fast
kubectl get --all-namespaces csistoragecapacity
kubectl create -f deploy/kubernetes-distributed/hostpath/csi-hostpath-storageclass-fast.yaml
kubectl get --all-namespaces csistoragecapacity

pohly commented

csi-provisioner:v3.3.0

No new CSIStorageCapacity object is created.

How do you check for this? With kubectl get csistoragecapacities or kubectl get --all-namespaces csistoragecapacities?

CSIStorageCapacity objects are namespaced, so the second command has to be used.

I tried to reproduce the issue with csi-driver-host-path v1.10.0, but there I get new CSIStorageCapacity objects after creating a storage class.

Yes, it is indeed namespaced. My kubectl default namespace is set to the namespace where the CSI plugins are deployed.

My commands:

/deploy/kubernetes-distributed/deploy.sh
kubectl delete storageclass.storage.k8s.io/csi-hostpath-slow
kubectl delete storageclass.storage.k8s.io/csi-hostpath-fast
kubectl get --all-namespaces csistoragecapacity
kubectl create -f deploy/kubernetes-distributed/hostpath/csi-hostpath-storageclass-fast.yaml
kubectl get --all-namespaces csistoragecapacity

From the sequence of your commands I do not see how the controller plugin is deployed before the node plugins. I think the order of deployment may be crucial to reproducing this issue. Could you make sure that step 2 happens before the node plugins are deployed? Thank you for taking the trouble.
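For completeness, the ordering I have in mind is roughly the following (all manifest names are placeholders for whatever the driver deployment uses):

# 1. Deploy the controller plugin with external-provisioner only, no node plugin yet.
kubectl apply -f controller-plugin.yaml          # placeholder manifest
# Wait until the provisioner logs:
#   "Initial number of topology segments 0, storage classes 0, potential CSIStorageCapacity objects 0"

# 2. Only now deploy the node plugin, so its topology segment appears after startup.
kubectl apply -f node-plugin-daemonset.yaml      # placeholder manifest

# 3. Finally create the StorageClass and check for capacity objects.
kubectl apply -f storageclass.yaml               # placeholder manifest
kubectl get csistoragecapacities --all-namespaces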

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pohly commented

/reopen
/assign

@pohly: Reopened this issue.

In response to this:

/reopen
/assign

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pohly commented

@samuelluohaoen1: it looks like you are using a central controller for your CSI driver. Is that correct?

Can you perhaps share the external-provisioner log at level >= 5? There is code which should react to changes in the node and CSIDriver objects when the node plugin gets registered after the controller has started.

We don't have a CSI driver deployment readily available to test this scenario. I tried reproducing it through unit tests (see #942) but the code worked as expected.
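In case it helps, a sketch of how such a log could be collected (the namespace, deployment, and container names are assumptions about your deployment; the only external-provisioner-specific part is the klog --v flag):

# Raise verbosity in the external-provisioner container spec, e.g.:
#   args: ["--csi-address=$(ADDRESS)", "--v=5", ...]
# Then pull the controller-side provisioner log:
kubectl -n csi-driver logs deploy/csi-controller -c csi-provisioner > provisioner.log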

/remove-lifecycle rotten