Failed to create snapshot
znaive opened this issue · 17 comments
version information
k8s: 1.23.0
csi-driver-nfs: v4.4.0
volumesnapshotclass
$ kubectl get volumesnapshotclass
NAME DRIVER DELETIONPOLICY AGE
csi-nfs-snapclass nfs.csi.k8s.io Delete 3h25m
$ cat snapshotclass-nfs.yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-nfs-snapclass
driver: nfs.csi.k8s.io
deletionPolicy: Delete
volumesnapshot
$ cat snapshot-nfs-dynamic.yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-nfs-snapshot
spec:
  volumeSnapshotClassName: csi-nfs-snapclass
  source:
    persistentVolumeClaimName: win2012-snapshot
storageclass
cat storageclass-nfs.yaml
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
provisioner: nfs.csi.k8s.io
parameters:
  server: 172.28.100.37
  share: /data
  # csi.storage.k8s.io/provisioner-secret is only needed for providing mountOptions in DeleteVolume
  # csi.storage.k8s.io/provisioner-secret-name: "mount-options"
  # csi.storage.k8s.io/provisioner-secret-namespace: "default"
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
volumesnapshot
$ kubectl get vs
NAME READYTOUSE SOURCEPVC SOURCESNAPSHOTCONTENT RESTORESIZE SNAPSHOTCLASS SNAPSHOTCONTENT CREATIONTIME AGE
test-nfs-snapshot false win2012-snapshot csi-nfs-snapclass snapcontent-e08b4ce1-2e9b-4d44-9128-e04839434e23 19m
pvc
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  finalizers:
    - kubernetes.io/pvc-protection
    - snapshot.storage.kubernetes.io/pvc-as-source-protection
  name: win2012-snapshot
  namespace: default
  resourceVersion: "22827160"
  uid: ef9fb0fa-5d30-4db7-a347-ba1a0cb96a74
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: nfs-csi
  volumeMode: Filesystem
  volumeName: pvc-ef9fb0fa-5d30-4db7-a347-ba1a0cb96a74
status:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 20Gi
  phase: Bound
log
snapshot-controller
kubectl logs snapshot-controller-7d8977cd65-dmhdm -n kube-system -f
csi-nfs-controller
kubectl logs csi-nfs-controller-5cc4566d6-skhxb -n kube-system -c csi-snapshotter
It keeps looping with this error
looks like snapshot creation timed out. The default timeout is 1 minute, and it's very likely that if you have a good amount of data in that PV, it may take longer than that. It is implemented pretty naively as a tarball, and creating a tar.gz for a couple of gigabytes may take longer than one minute.
you may configure the timeout to be higher in the external-snapshotter parameters:
https://github.com/kubernetes-csi/external-snapshotter#important-optional-arguments-that-are-highly-recommended-to-be-used-1
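To confirm it is the timeout, the events on the VolumeSnapshot and its VolumeSnapshotContent should show the deadline error, e.g. (names taken from your output above):
kubectl describe volumesnapshot test-nfs-snapshot
kubectl describe volumesnapshotcontent snapcontent-e08b4ce1-2e9b-4d44-9128-e04839434e23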
When I add the timeout in the YAML, I get an error saying there is no such flag
yaml
# This YAML file shows how to deploy the snapshot controller
# The snapshot controller implements the control loop for CSI snapshot functionality.
# It should be installed as part of the base Kubernetes distribution in an appropriate
# namespace for components implementing base system functionality. For installing with
# Vanilla Kubernetes, kube-system makes sense for the namespace.
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: snapshot-controller
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: snapshot-controller
  # the snapshot controller won't be marked as ready if the v1 CRDs are unavailable
  # in #504 the snapshot-controller will exit after around 7.5 seconds if it
  # can't find the v1 CRDs so this value should be greater than that
  minReadySeconds: 15
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: snapshot-controller
    spec:
      serviceAccountName: snapshot-controller
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-cluster-critical
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      tolerations:
        - key: "node-role.kubernetes.io/master"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
        - key: "node-role.kubernetes.io/controlplane"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
        - key: "node-role.kubernetes.io/control-plane"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
        - name: snapshot-controller
          image: registry.k8s.io/sig-storage/snapshot-controller:v6.2.2
          args:
            - "--v=2"
            - "--leader-election=true"
            - "--leader-election-namespace=kube-system"
            - "--timeout=30m"
          resources:
            limits:
              memory: 300Mi
            requests:
              cpu: 10m
              memory: 20Mi
log
$ kubectl logs snapshot-controller-6fd7d5d77f-dfb8g -n kube-system
flag provided but not defined: -timeout
Usage of /snapshot-controller:
-add_dir_header
If true, adds the file directory to the header of the log messages
-alsologtostderr
log to standard error as well as files (no effect when -logtostderr=true)
-enable-distributed-snapshotting
Enables each node to handle snapshotting for the local volumes created on that node
-http-endpoint string
The TCP network address where the HTTP server for diagnostics, including metrics, will listen (example: :8080). The default is empty string, which means the server is disabled.
-kube-api-burst int
Burst to use while communicating with the kubernetes apiserver. Defaults to 10. (default 10)
-kube-api-qps float
QPS to use while communicating with the kubernetes apiserver. Defaults to 5.0. (default 5)
-kubeconfig string
Absolute path to the kubeconfig file. Required only when running out of cluster.
-leader-election
Enables leader election.
-leader-election-lease-duration duration
Duration, in seconds, that non-leader candidates will wait to force acquire leadership. Defaults to 15 seconds. (default 15s)
-leader-election-namespace string
The namespace where the leader election resource exists. Defaults to the pod namespace if not set.
-leader-election-renew-deadline duration
Duration, in seconds, that the acting leader will retry refreshing leadership before giving up. Defaults to 10 seconds. (default 10s)
-leader-election-retry-period duration
Duration, in seconds, the LeaderElector clients should wait between tries of actions. Defaults to 5 seconds. (default 5s)
-log_backtrace_at value
when logging hits line file:N, emit a stack trace
-log_dir string
If non-empty, write log files in this directory (no effect when -logtostderr=true)
-log_file string
If non-empty, use this log file (no effect when -logtostderr=true)
-log_file_max_size uint
Defines the maximum size a log file can grow to (no effect when -logtostderr=true). Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
-logtostderr
log to standard error instead of files (default true)
-metrics-path /metrics
The HTTP path where prometheus metrics will be exposed. Default is /metrics. (default "/metrics")
-one_output
If true, only write logs to their native severity level (vs also writing to each lower severity level; no effect when -logtostderr=true)
-prevent-volume-mode-conversion
Prevents an unauthorised user from modifying the volume mode when creating a PVC from an existing VolumeSnapshot.
-resync-period duration
Resync interval of the controller. (default 15m0s)
-retry-crd-interval-max duration
Maximum retry interval to wait for CRDs to appear. The default is 5 seconds. (default 5s)
-retry-interval-max duration
Maximum retry interval of failed volume snapshot creation or deletion. Default is 5 minutes. (default 5m0s)
-retry-interval-start duration
Initial retry interval of failed volume snapshot creation or deletion. It doubles with each failure, up to retry-interval-max. Default is 1 second. (default 1s)
-skip_headers
If true, avoid header prefixes in the log messages
-skip_log_headers
If true, avoid headers when opening log files (no effect when -logtostderr=true)
-stderrthreshold value
logs at or above this threshold go to stderr when writing to files and stderr (no effect when -logtostderr=true or -alsologtostderr=false) (default 2)
-v value
number for the log level verbosity
-version
Show version.
-vmodule value
comma-separated list of pattern=N settings for file-filtered logging
-worker-threads int
Number of worker threads. (default 10)
the --timeout should be configured on the external-snapshotter container; the one you shared is snapshot-controller, and that one indeed does not support --timeout as an argument.
So how do I use the external-snapshotter? Do I need to deploy it separately? Thank you very much for answering me!
if you are using helm and running the latest released version v4.4.0, then it should be enabled by default. This is the configuration knob
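For reference, a minimal sketch of enabling it via the chart (the value name externalSnapshotter.enabled and the repo alias are taken from the upstream csi-driver-nfs chart and may differ between chart versions):
helm repo add csi-driver-nfs https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts
helm upgrade --install csi-driver-nfs csi-driver-nfs/csi-driver-nfs --namespace kube-system --version v4.4.0 --set externalSnapshotter.enabled=true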
oh, my apologies, I navigated you poorly. It should actually be snapshot-controller; I think you were setting it correctly, let me take a closer look :)
Yes, please. Thank you so much.
it should be this location in the csi-snapshotter; the flag is implemented there
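As a sketch, on the csi-nfs-controller deployment (the image tag and the existing args here are assumed from a typical v4.4.0 manifest; the only relevant change is the added --timeout line):
kubectl -n kube-system edit deployment csi-nfs-controller
# then, in the csi-snapshotter container of that deployment:
      - name: csi-snapshotter
        image: registry.k8s.io/sig-storage/csi-snapshotter:v6.2.2
        args:
          - "--v=2"
          - "--leader-election-namespace=kube-system"
          - "--leader-election"
          - "--timeout=30m"    # raise the CSI CreateSnapshot timeout from the 1m default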
This is correct, thank you. I also wanted to ask whether snapshot restore is supported.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc
spec:
  storageClassName: nfs-csi
  dataSource:
    name: test-nfs-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
It stays pending after it is created
snapshot restore should be supported, and your manifest looks correct to me. Can you please share some csi-nfs-controller logs?
No log output
kubectl logs csi-nfs-controller-6bc96c75d7-8wfq2 -n kube-system -c csi-snapshotter -f
$ kubectl apply -f pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc
spec:
  storageClassName: nfs-csi
  dataSource:
    name: test-nfs-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
kubectl describe pvc restore-pvc
kubectl logs csi-nfs-controller-6bc96c75d7-8wfq2 -n kube-system -c csi-provisioner -f
looks like again context deadline exceeded; can you perhaps try first with a smaller volume, just so we can feel confident it's a matter of size?
When I use 200Mi it's normal.
yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test
spec:
  storageClassName: nfs-csi
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 200Mi
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pvc-test
spec:
  volumeSnapshotClassName: csi-nfs-snapclass
  source:
    persistentVolumeClaimName: pvc-test
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc
spec:
  storageClassName: nfs-csi
  dataSource:
    name: pvc-test
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Mi
volumesnapshot
NAME READYTOUSE SOURCEPVC SOURCESNAPSHOTCONTENT RESTORESIZE SNAPSHOTCLASS SNAPSHOTCONTENT CREATIONTIME AGE
pvc-test true pvc-test 104 csi-nfs-snapclass snapcontent-8d204215-3d13-4d6f-b119-6f3eb151ae77 11s 12s
pvc
pvc-test Bound pvc-85ef92d4-b5ff-4b05-bc09-0d9adf284a04 200Mi RWX nfs-csi 7m30s
restore-pvc Bound pvc-a6efbc8f-2f97-4fa0-9992-8a5bc30d954b 200Mi RWO nfs-csi 4m26s
Is there any way to snapshot a high-volume PVC?
When I use 200Mi it's normal.
good, we at least established it's operational in your setup :)
Is there any way to snapshot a high-volume PVC?
you will need to find a sufficient value for --timeout and possibly bump the memory limits for the csi-nfs-controller. Creating a tarball out of 20Gi of content over NFS will depend on the network throughput, the number of files, and NFS client caching ability. I personally would go as far as removing the memory limit so the driver doesn't get OOM-killed, and then derive a reasonable memory limit from the memory-consumption metrics. It could easily be a few GB of memory and take as much as a few hours if the network connection is not very fast.
NFS doesn't really have a way to create native snapshots, which is why it's implemented here as compressed tarballs. For better performance, you may need to explore more feature-rich CSI drivers and set up different storage.
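If you want to size that limit from data, one quick way is to watch the controller's memory while a snapshot of the large PVC is in progress (this assumes metrics-server is installed; the pod name will differ in your cluster):
kubectl top pod -n kube-system --containers | grep csi-nfs-controller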
I've removed the memory limit for all components and get the same result; it still doesn't work.
per #509 (comment), it was context deadline exceeded, so the --timeout was not sufficient for the setup you have. Perhaps there is too much data, too many files, or too slow an NFS server for the 30m you set earlier. You can try bumping it higher, or possibly benchmark your NFS server with tools like fio to have a better idea about setting the --timeout.
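If you go the fio route, a rough sequential-read test against the share gives a baseline for how long reading 20Gi would take (the mount path /mnt/nfs-test is just a placeholder for wherever the export is mounted on a client):
fio --name=nfs-seqread --directory=/mnt/nfs-test --rw=read --bs=1M --size=2G --numjobs=1 --runtime=60 --time_based --group_reporting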