distributed provisioning: optimize PV informer
pohly opened this issue · 10 comments
While running clusterloader2 to evaluate performance of volume provisioning with storage capacity tracking (see https://github.com/kubernetes-csi/csi-driver-host-path/blob/master/docs/storage-capacity-tracking.md) I noticed that volume deletion was slow - much slower than creation of volumes. This occurred in particular for high node counts (100 in that case).
The initial suspect was overhead for the PV watches (every PV sent to every external-provisioner instance), but further experimentation didn't confirm that. Instead, it turned out that Pod deletion was slow.
Nonetheless, enhancing the PV watch seems doable:
- filter PV in the apiserver (ideal solution, because filtering in the apiserver is cheaper than sending out updates to clients which don't need them)
- filter out foreign PVs before adding them to the workqueue (tried, didn't help much)
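The pre-workqueue filter boils down to comparing a PV's spec.nodeAffinity against the topology of the node this external-provisioner instance runs on. A minimal, self-contained sketch of that check (simplified stand-in structs instead of the real k8s.io/api types; the topology key and node name are made up for illustration):

```go
package main

import "fmt"

// Simplified stand-ins for the corresponding k8s.io/api/core/v1 types.
type NodeSelectorRequirement struct {
	Key      string
	Operator string
	Values   []string
}

type NodeSelectorTerm struct {
	MatchExpressions []NodeSelectorRequirement
}

type PV struct {
	Name              string
	NodeSelectorTerms []NodeSelectorTerm // from spec.nodeAffinity.required
}

// isLocalPV reports whether the PV's nodeAffinity selects the given
// topology segment, i.e. whether this provisioner instance is
// responsible for it. PVs without nodeAffinity are treated as foreign.
func isLocalPV(pv PV, topology map[string]string) bool {
	// Terms are ORed; a single matching term is enough.
	for _, term := range pv.NodeSelectorTerms {
		matches := true
		// Requirements within one term are ANDed.
		for _, req := range term.MatchExpressions {
			if req.Operator != "In" {
				matches = false
				break
			}
			found := false
			for _, v := range req.Values {
				if topology[req.Key] == v {
					found = true
					break
				}
			}
			if !found {
				matches = false
				break
			}
		}
		if matches {
			return true
		}
	}
	return false
}

func main() {
	local := map[string]string{"topology.hostpath.csi/node": "node-1"}
	pv := PV{
		Name: "pvc-123",
		NodeSelectorTerms: []NodeSelectorTerm{{
			MatchExpressions: []NodeSelectorRequirement{{
				Key:      "topology.hostpath.csi/node",
				Operator: "In",
				Values:   []string{"node-1"},
			}},
		}},
	}
	fmt.Println(isLocalPV(pv, local)) // only matching PVs would be enqueued
}
```

In the real informer this check would run in the event handler, before calling workqueue.Add, so foreign PVs never reach the sync loop.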
/assign
Example: 10 nodes on Azure cloud, csi-driver-host-path master, with --v=3 and WaitForFirstConsumer disabled in deploy/kubernetes-distributed/hostpath/csi-hostpath-storageclass-fast.yaml.
go run cmd/clusterloader.go -v=2 --report-dir=/tmp/clusterloader2-with-volume-10-nodes-capacity --kubeconfig=/home/pohly/.kube/config --provider=local --nodes=10 --testconfig=testing/experimental/storage/pod-startup/config.yaml --testoverrides=testing/experimental/storage/pod-startup/volume-types/persistentvolume/override.yaml --testoverrides=with-volume.yaml
with-volume.yaml:
PODS_PER_NODE: 20
NODES_PER_NAMESPACE: 10
VOLUMES_PER_POD: 1
VOL_SIZE: 5Gi
STORAGE_CLASS: csi-hostpath-fast
GATHER_METRICS: false
POD_STARTUP_TIMEOUT: 45m
Result:
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="240.468">
<testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="240.463977815"></testcase>
<testcase name="storage: [step: 01] Provisioning volumes" classname="ClusterLoaderV2" time="20.126110888"></testcase>
<testcase name="storage: [step: 02] Waiting for PVs to be bound" classname="ClusterLoaderV2" time="90.064074266"></testcase>
<testcase name="storage: [step: 03] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.200939786"></testcase>
<testcase name="storage: [step: 04] Creating deployments" classname="ClusterLoaderV2" time="20.122103454"></testcase>
<testcase name="storage: [step: 05] Waiting for deployments to be running" classname="ClusterLoaderV2" time="34.388961952"></testcase>
<testcase name="storage: [step: 06] Deleting deployments" classname="ClusterLoaderV2" time="20.12327411"></testcase>
<testcase name="storage: [step: 07] Deleting volumes" classname="ClusterLoaderV2" time="20.125988425"></testcase>
<testcase name="storage: [step: 08] Waiting for volume deletion" classname="ClusterLoaderV2" time="10.103055025"></testcase>
</testsuite>
Logs show:
I0309 08:54:27.365969 1 controller.go:1477] delete "pvc-59d26170-b93c-417b-ae65-4d03c0f495ac": volume deletion ignored: ignored because PV was not provisioned on this node
This is similar to issue #669, but not quite the same.
Server-side filtering is not possible at the moment. Here's what a PV looks like:
- apiVersion: v1
  kind: PersistentVolume
  metadata:
    annotations:
      pv.kubernetes.io/provisioned-by: hostpath.csi.k8s.io
    creationTimestamp: "2022-03-09T08:23:03Z"
    finalizers:
    - kubernetes.io/pv-protection
    managedFields:
    ...
    name: pvc-02b4a075-80b9-47c8-b646-e2ca6ac5386c
    resourceVersion: "14239"
    uid: 5f8572c7-19c7-4de1-a39b-1c8a05feabab
  spec:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 5Gi
    claimRef:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: vol-138
      namespace: test-9j4yhq-1
      resourceVersion: "12755"
      uid: 02b4a075-80b9-47c8-b646-e2ca6ac5386c
    csi:
      driver: hostpath.csi.k8s.io
      volumeAttributes:
        kind: fast
        storage.kubernetes.io/csiProvisionerIdentity: 1646814048800-8081-hostpath.csi.k8s.io-aks-nodepool1-35404716-vmss000001
      volumeHandle: 233a60c8-9f82-11ec-8b2c-129d02cf3a9c
    nodeAffinity:
      required:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.hostpath.csi/node
            operator: In
            values:
            - aks-nodepool1-35404716-vmss000001
    persistentVolumeReclaimPolicy: Delete
    storageClassName: csi-hostpath-fast
    volumeMode: Filesystem
  status:
    phase: Bound
We would need to filter by nodeAffinity, but that is not supported by https://github.com/kubernetes/kubernetes/blob/9946b5364e8199ac832161e17155bd005c391fc5/pkg/registry/core/persistentvolume/strategy.go#L162-L179
external-provisioner could set a label. But as there may be existing PVs which don't have it yet, we would have to extend external-provisioner (or some other component) so that it can be deployed in the control plane without a CSI driver and add the label to those PVs which lack it. That might also be necessary when users (accidentally?) remove the label. This feels complicated, perhaps too complicated to be worth it.
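If such a label existed, the watch could be narrowed server-side with an ordinary label selector and no apiserver changes. A hypothetical sketch of the request a per-node instance would issue (the label key `hostpath.csi.k8s.io/node` is invented for illustration; external-provisioner does not set it today; stdlib only, whereas with client-go this would be a tweakListOptions hook on the informer factory):

```go
package main

import (
	"fmt"
	"net/url"
)

// buildPVWatchURL returns the apiserver watch request a provisioner
// instance could use if PVs carried a per-node label. The apiserver
// would then filter PVs before sending events, instead of every
// instance receiving every PV.
func buildPVWatchURL(apiserver, labelKey, nodeName string) string {
	q := url.Values{}
	q.Set("watch", "true")
	// Server-side filtering: only PVs labeled for this node are sent.
	q.Set("labelSelector", fmt.Sprintf("%s=%s", labelKey, nodeName))
	return apiserver + "/api/v1/persistentvolumes?" + q.Encode()
}

func main() {
	// Label key is hypothetical; shown only to illustrate the idea.
	fmt.Println(buildPVWatchURL("https://10.0.0.1:6443",
		"hostpath.csi.k8s.io/node", "aks-nodepool1-35404716-vmss000001"))
}
```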
Implementing the pre-workqueue filtering is possible. For 10 nodes it doesn't make a big difference:
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="237.878">
<testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="237.874646435"></testcase>
<testcase name="storage: [step: 01] Provisioning volumes" classname="ClusterLoaderV2" time="20.122253052"></testcase>
<testcase name="storage: [step: 02] Waiting for PVs to be bound" classname="ClusterLoaderV2" time="85.067009928"></testcase>
<testcase name="storage: [step: 03] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.100423552"></testcase>
<testcase name="storage: [step: 04] Creating deployments" classname="ClusterLoaderV2" time="20.127212681"></testcase>
<testcase name="storage: [step: 05] Waiting for deployments to be running" classname="ClusterLoaderV2" time="36.891739215"></testcase>
<testcase name="storage: [step: 06] Deleting deployments" classname="ClusterLoaderV2" time="20.119977666"></testcase>
<testcase name="storage: [step: 07] Deleting volumes" classname="ClusterLoaderV2" time="20.116018531"></testcase>
<testcase name="storage: [step: 08] Waiting for volume deletion" classname="ClusterLoaderV2" time="10.1019071"></testcase>
</testsuite>
No big difference for 10 volumes per node and 100 nodes either:
$ cat /tmp/clusterloader2-with-volume-100-nodes-capacity/junit.xml
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="662.946">
<testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="662.941836976"></testcase>
<testcase name="storage: [step: 01] Provisioning volumes" classname="ClusterLoaderV2" time="100.612544663"></testcase>
<testcase name="storage: [step: 02] Waiting for PVs to be bound" classname="ClusterLoaderV2" time="50.166796847"></testcase>
<testcase name="storage: [step: 03] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.101135165"></testcase>
<testcase name="storage: [step: 04] Creating deployments" classname="ClusterLoaderV2" time="100.613237416"></testcase>
<testcase name="storage: [step: 05] Waiting for deployments to be running" classname="ClusterLoaderV2" time="119.493390436"></testcase>
<testcase name="storage: [step: 06] Deleting deployments" classname="ClusterLoaderV2" time="100.638563713"></testcase>
<testcase name="storage: [step: 07] Deleting volumes" classname="ClusterLoaderV2" time="100.610427034"></testcase>
<testcase name="storage: [step: 08] Waiting for volume deletion" classname="ClusterLoaderV2" time="75.117069732"></testcase>
</testsuite>
$ cat /tmp/clusterloader2-with-volume-100-nodes-capacity-filtering/junit.xml
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="663.363">
<testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="663.358644866"></testcase>
<testcase name="storage: [step: 01] Provisioning volumes" classname="ClusterLoaderV2" time="100.611435332"></testcase>
<testcase name="storage: [step: 02] Waiting for PVs to be bound" classname="ClusterLoaderV2" time="50.167335721"></testcase>
<testcase name="storage: [step: 03] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.101097687"></testcase>
<testcase name="storage: [step: 04] Creating deployments" classname="ClusterLoaderV2" time="100.593417364"></testcase>
<testcase name="storage: [step: 05] Waiting for deployments to be running" classname="ClusterLoaderV2" time="115.017417672"></testcase>
<testcase name="storage: [step: 06] Deleting deployments" classname="ClusterLoaderV2" time="100.602359492"></testcase>
<testcase name="storage: [step: 07] Deleting volumes" classname="ClusterLoaderV2" time="100.621163905"></testcase>
<testcase name="storage: [step: 08] Waiting for volume deletion" classname="ClusterLoaderV2" time="75.117128046"></testcase>
</testsuite>
With 20 volumes per node, there is no big difference either:
$ cat /tmp/clusterloader2-with-volume-100-nodes-capacity/junit.xml
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="1343.606">
<testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="1343.602389857"></testcase>
<testcase name="storage: [step: 01] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.1010827"></testcase>
<testcase name="storage: [step: 02] Creating deployments" classname="ClusterLoaderV2" time="201.216153546"></testcase>
<testcase name="storage: [step: 03] Waiting for deployments to be running" classname="ClusterLoaderV2" time="234.918855189"></testcase>
<testcase name="storage: [step: 04] Deleting deployments" classname="ClusterLoaderV2" time="201.203148744"></testcase>
<testcase name="storage: [step: 05] Waiting for volume deletion" classname="ClusterLoaderV2" time="680.596455696"></testcase>
</testsuite>
$ cat /tmp/clusterloader2-with-volume-100-nodes-capacity-filtering/junit.xml
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="1353.62">
<testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="1353.616891482"></testcase>
<testcase name="storage: [step: 01] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.101016257"></testcase>
<testcase name="storage: [step: 02] Creating deployments" classname="ClusterLoaderV2" time="201.222271449"></testcase>
<testcase name="storage: [step: 03] Waiting for deployments to be running" classname="ClusterLoaderV2" time="234.712012496"></testcase>
<testcase name="storage: [step: 04] Deleting deployments" classname="ClusterLoaderV2" time="201.198080417"></testcase>
<testcase name="storage: [step: 05] Waiting for volume deletion" classname="ClusterLoaderV2" time="680.621806619"></testcase>
</testsuite>
Perhaps the problem is elsewhere: while testing with 50 pods per node, the test got all of them running and then deleted all deployments that own these pods, but the pods kept running. Not sure whether that is supposed to happen.
It looks like deleting Pods was slow. The Deployments and Replicasets were gone, but Pods lingered for an extended period of time.
There's no clear need to optimize this. Let's close for now.
/close
@pohly: Closing this issue.
In response to this:
There's no clear need to optimize this. Let's close for now.
/close