kubernetes-csi/external-provisioner

distributed provisioning: optimize PV informer

pohly opened this issue · 10 comments

pohly commented

While running clusterloader2 to evaluate the performance of volume provisioning with storage capacity tracking (see https://github.com/kubernetes-csi/csi-driver-host-path/blob/master/docs/storage-capacity-tracking.md), I noticed that volume deletion was slow, much slower than volume creation. This was particularly noticeable at high node counts (100 in that case).

The initial suspect was the overhead of the PV watches (every PV is sent to every external-provisioner instance), but further experimentation didn't confirm that. Instead, it turned out that Pod deletion was slow.

Nonetheless, enhancing the PV watch seems doable:

  • filter PVs in the apiserver (the ideal solution, because filtering in the apiserver is cheaper than sending updates to clients which don't need them)
  • filter out foreign PVs before adding them to the workqueue (tried, didn't help much; see the sketch below)
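
To make the second option concrete, here is a minimal sketch of such a filter. It assumes the instance knows its own topology key and value (for csi-driver-host-path that would be topology.hostpath.csi/node and the node name); the function name and parameters are illustrative, not the actual external-provisioner code:

package main

import (
	v1 "k8s.io/api/core/v1"
)

// provisionedOnThisNode reports whether the PV's required node affinity
// selects this instance's node. PVs without node affinity are kept, because
// their origin cannot be determined from the object alone.
func provisionedOnThisNode(pv *v1.PersistentVolume, topologyKey, topologyValue string) bool {
	if pv.Spec.NodeAffinity == nil || pv.Spec.NodeAffinity.Required == nil {
		return true
	}
	for _, term := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
		for _, expr := range term.MatchExpressions {
			if expr.Key != topologyKey || expr.Operator != v1.NodeSelectorOpIn {
				continue
			}
			for _, value := range expr.Values {
				if value == topologyValue {
					// Provisioned on this node: keep it.
					return true
				}
			}
		}
	}
	// Node affinity points at some other node: drop before enqueueing.
	return false
}

A predicate like this could be wrapped in a cache.FilteringResourceEventHandler from client-go so that foreign PVs never reach the workqueue.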

/assign

pohly commented

Example: 10 nodes on Azure cloud, csi-driver-host-path master, with --v=3 and WaitForFirstConsumer disabled in deploy/kubernetes-distributed/hostpath/csi-hostpath-storageclass-fast.yaml.

go run cmd/clusterloader.go -v=2 --report-dir=/tmp/clusterloader2-with-volume-10-nodes-capacity --kubeconfig=/home/pohly/.kube/config --provider=local --nodes=10 --testconfig=testing/experimental/storage/pod-startup/config.yaml --testoverrides=testing/experimental/storage/pod-startup/volume-types/persistentvolume/override.yaml --testoverrides=with-volume.yaml

with-volume.yaml:

PODS_PER_NODE: 20
NODES_PER_NAMESPACE: 10
VOLUMES_PER_POD: 1
VOL_SIZE: 5Gi
STORAGE_CLASS: csi-hostpath-fast
GATHER_METRICS: false
POD_STARTUP_TIMEOUT: 45m

Result:

<?xml version="1.0" encoding="UTF-8"?>
  <testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="240.468">
      <testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="240.463977815"></testcase>
      <testcase name="storage: [step: 01] Provisioning volumes" classname="ClusterLoaderV2" time="20.126110888"></testcase>
      <testcase name="storage: [step: 02] Waiting for PVs to be bound" classname="ClusterLoaderV2" time="90.064074266"></testcase>
      <testcase name="storage: [step: 03] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.200939786"></testcase>
      <testcase name="storage: [step: 04] Creating deployments" classname="ClusterLoaderV2" time="20.122103454"></testcase>
      <testcase name="storage: [step: 05] Waiting for deployments to be running" classname="ClusterLoaderV2" time="34.388961952"></testcase>
      <testcase name="storage: [step: 06] Deleting deployments" classname="ClusterLoaderV2" time="20.12327411"></testcase>
      <testcase name="storage: [step: 07] Deleting volumes" classname="ClusterLoaderV2" time="20.125988425"></testcase>
      <testcase name="storage: [step: 08] Waiting for volume deletion" classname="ClusterLoaderV2" time="10.103055025"></testcase>
pohly commented

Logs show:

I0309 08:54:27.365969       1 controller.go:1477] delete "pvc-59d26170-b93c-417b-ae65-4d03c0f495ac": volume deletion ignored: ignored because PV was not provisioned on this node

This is similar to issue #669, but not quite the same.

pohly commented

Server-side filtering is not possible at the moment. Here's what a PV looks like:

- apiVersion: v1
  kind: PersistentVolume
  metadata:
    annotations:
      pv.kubernetes.io/provisioned-by: hostpath.csi.k8s.io
    creationTimestamp: "2022-03-09T08:23:03Z"
    finalizers:
    - kubernetes.io/pv-protection
    managedFields:
      ...
    name: pvc-02b4a075-80b9-47c8-b646-e2ca6ac5386c
    resourceVersion: "14239"
    uid: 5f8572c7-19c7-4de1-a39b-1c8a05feabab
  spec:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 5Gi
    claimRef:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: vol-138
      namespace: test-9j4yhq-1
      resourceVersion: "12755"
      uid: 02b4a075-80b9-47c8-b646-e2ca6ac5386c
    csi:
      driver: hostpath.csi.k8s.io
      volumeAttributes:
        kind: fast
        storage.kubernetes.io/csiProvisionerIdentity: 1646814048800-8081-hostpath.csi.k8s.io-aks-nodepool1-35404716-vmss000001
      volumeHandle: 233a60c8-9f82-11ec-8b2c-129d02cf3a9c
    nodeAffinity:
      required:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.hostpath.csi/node
            operator: In
            values:
            - aks-nodepool1-35404716-vmss000001
    persistentVolumeReclaimPolicy: Delete
    storageClassName: csi-hostpath-fast
    volumeMode: Filesystem
  status:
    phase: Bound

We would need to filter by nodeAffinity, but that is not supported by https://github.com/kubernetes/kubernetes/blob/9946b5364e8199ac832161e17155bd005c391fc5/pkg/registry/core/persistentvolume/strategy.go#L162-L179

external-provisioner could set a label. But because there may be existing PVs which don't have it yet, we would have to extend external-provisioner (or some other component) so that it can be deployed in the control plane without a CSI driver and add the label to those PVs which are missing it. That may also be necessary when users (accidentally?) remove the label. This feels complicated, perhaps too complicated to be worth it.
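
For reference, this is roughly what the label-based variant could look like on the consumer side: each per-node external-provisioner instance restricts its PV informer with a label selector, so the apiserver only sends it the PVs of its own node. The label name here is hypothetical and would have to be added by external-provisioner when it creates the PV:

package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newNodeLocalPVInformerFactory is a sketch of server-side filtering via a
// label selector. nodeLabel is a hypothetical label (for example something
// like "hostpath.csi.k8s.io/node") that each created PV would have to carry.
func newNodeLocalPVInformerFactory(client kubernetes.Interface, nodeLabel, nodeName string) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactoryWithOptions(client, 10*time.Minute,
		informers.WithTweakListOptions(func(options *metav1.ListOptions) {
			// The apiserver filters PVs before sending them out, instead of
			// every instance receiving every PV.
			options.LabelSelector = nodeLabel + "=" + nodeName
		}),
	)
}

The factory's Core().V1().PersistentVolumes() informer would then only see node-local PVs, but the caveat above still applies: PVs that never got the label would be invisible to their provisioner.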

pohly commented

Implementing the pre-workqueue filtering is possible. For 10 nodes it doesn't make a big difference:

<?xml version="1.0" encoding="UTF-8"?>
  <testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="237.878">
      <testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="237.874646435"></testcase>
      <testcase name="storage: [step: 01] Provisioning volumes" classname="ClusterLoaderV2" time="20.122253052"></testcase>
      <testcase name="storage: [step: 02] Waiting for PVs to be bound" classname="ClusterLoaderV2" time="85.067009928"></testcase>
      <testcase name="storage: [step: 03] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.100423552"></testcase>
      <testcase name="storage: [step: 04] Creating deployments" classname="ClusterLoaderV2" time="20.127212681"></testcase>
      <testcase name="storage: [step: 05] Waiting for deployments to be running" classname="ClusterLoaderV2" time="36.891739215"></testcase>
      <testcase name="storage: [step: 06] Deleting deployments" classname="ClusterLoaderV2" time="20.119977666"></testcase>
      <testcase name="storage: [step: 07] Deleting volumes" classname="ClusterLoaderV2" time="20.116018531"></testcase>
      <testcase name="storage: [step: 08] Waiting for volume deletion" classname="ClusterLoaderV2" time="10.1019071"></testcase>
pohly commented

No big difference for 10 volumes per node and 100 nodes either:

$ cat /tmp/clusterloader2-with-volume-100-nodes-capacity/junit.xml
<?xml version="1.0" encoding="UTF-8"?>
  <testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="662.946">
      <testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="662.941836976"></testcase>
      <testcase name="storage: [step: 01] Provisioning volumes" classname="ClusterLoaderV2" time="100.612544663"></testcase>
      <testcase name="storage: [step: 02] Waiting for PVs to be bound" classname="ClusterLoaderV2" time="50.166796847"></testcase>
      <testcase name="storage: [step: 03] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.101135165"></testcase>
      <testcase name="storage: [step: 04] Creating deployments" classname="ClusterLoaderV2" time="100.613237416"></testcase>
      <testcase name="storage: [step: 05] Waiting for deployments to be running" classname="ClusterLoaderV2" time="119.493390436"></testcase>
      <testcase name="storage: [step: 06] Deleting deployments" classname="ClusterLoaderV2" time="100.638563713"></testcase>
      <testcase name="storage: [step: 07] Deleting volumes" classname="ClusterLoaderV2" time="100.610427034"></testcase>
      <testcase name="storage: [step: 08] Waiting for volume deletion" classname="ClusterLoaderV2" time="75.117069732"></testcase>
  </testsuite>

$ cat /tmp/clusterloader2-with-volume-100-nodes-capacity-filtering/junit.xml
<?xml version="1.0" encoding="UTF-8"?>
  <testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="663.363">
      <testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="663.358644866"></testcase>
      <testcase name="storage: [step: 01] Provisioning volumes" classname="ClusterLoaderV2" time="100.611435332"></testcase>
      <testcase name="storage: [step: 02] Waiting for PVs to be bound" classname="ClusterLoaderV2" time="50.167335721"></testcase>
      <testcase name="storage: [step: 03] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.101097687"></testcase>
      <testcase name="storage: [step: 04] Creating deployments" classname="ClusterLoaderV2" time="100.593417364"></testcase>
      <testcase name="storage: [step: 05] Waiting for deployments to be running" classname="ClusterLoaderV2" time="115.017417672"></testcase>
      <testcase name="storage: [step: 06] Deleting deployments" classname="ClusterLoaderV2" time="100.602359492"></testcase>
      <testcase name="storage: [step: 07] Deleting volumes" classname="ClusterLoaderV2" time="100.621163905"></testcase>
      <testcase name="storage: [step: 08] Waiting for volume deletion" classname="ClusterLoaderV2" time="75.117128046"></testcase>
pohly commented

20 volumes per node, no big difference either:

$ cat /tmp/clusterloader2-with-volume-100-nodes-capacity/junit.xml
<?xml version="1.0" encoding="UTF-8"?>
  <testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="1343.606">
      <testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="1343.602389857"></testcase>
      <testcase name="storage: [step: 01] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.1010827"></testcase>
      <testcase name="storage: [step: 02] Creating deployments" classname="ClusterLoaderV2" time="201.216153546"></testcase>
      <testcase name="storage: [step: 03] Waiting for deployments to be running" classname="ClusterLoaderV2" time="234.918855189"></testcase>
      <testcase name="storage: [step: 04] Deleting deployments" classname="ClusterLoaderV2" time="201.203148744"></testcase>
      <testcase name="storage: [step: 05] Waiting for volume deletion" classname="ClusterLoaderV2" time="680.596455696"></testcase>
  </testsuite>
  
$ cat /tmp/clusterloader2-with-volume-100-nodes-capacity-filtering/junit.xml
<?xml version="1.0" encoding="UTF-8"?>
  <testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="1353.62">
      <testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="1353.616891482"></testcase>
      <testcase name="storage: [step: 01] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.101016257"></testcase>
      <testcase name="storage: [step: 02] Creating deployments" classname="ClusterLoaderV2" time="201.222271449"></testcase>
      <testcase name="storage: [step: 03] Waiting for deployments to be running" classname="ClusterLoaderV2" time="234.712012496"></testcase>
      <testcase name="storage: [step: 04] Deleting deployments" classname="ClusterLoaderV2" time="201.198080417"></testcase>
      <testcase name="storage: [step: 05] Waiting for volume deletion" classname="ClusterLoaderV2" time="680.621806619"></testcase>
pohly commented

Perhaps the problem is elsewhere: while testing with 50 pods per node, the test got all of them running and then deleted all the deployments owning those pods, but the pods kept running. I'm not sure whether that is supposed to happen.

pohly commented

It looks like deleting Pods was slow. The Deployments and ReplicaSets were gone, but the Pods lingered for an extended period of time.

pohly commented

There's no clear need to optimize this. Let's close for now.

/close

@pohly: Closing this issue.

In response to this:

There's no clear need to optimize this. Let's close for now.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.