velero completely ignores `--csi-snapshot-timeout` flag
jmtx1020 opened this issue · 14 comments
What steps did you take and what happened:
I created a volume like this:
velero backup create data-backup \
--include-namespaces app-12345-425 \
--include-resources=persistentvolumeclaims \
--selector='test=velero-backup' \
--csi-snapshot-timeout=20m
This results in the following logs. It actually does create the volume backup, but then deletes it. I've timed how long it takes to create a snapshot manually and it's about 47 seconds, so I think the issue might be that Velero is only waiting for about 50 seconds (a sketch of what such a wait loop could look like follows the log excerpt).
time="2024-12-03T04:09:00Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot app-12345-425/velero-app-12345-425-as-dc1-app-123456pfrq. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:712" pluginName=velero
time="2024-12-03T04:09:05Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:10Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:15Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:20Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:25Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:30Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:35Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:40Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:45Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:50Z" level=info msg="Deleting Volumesnapshot app-12345-425/velero-app-12345-425-as-dc1-app-123456pfrq" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:485" pluginName=velero
time="2024-12-03T04:09:50Z" level=info msg="1 errors encountered backup up item" backup=velero/data-backup-with-hook logSource="pkg/backup/backup.go:720" name=app-12345-425-as-dc1-app-12345-425-as-dc1-0
time="2024-12-03T04:09:50Z" level=error msg="Error backing up item" backup=velero/data-backup-with-hook error="error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=app-12345-425, name=velero-app-12345-425-as-dc1-app-123456pfrq): rpc error: code = Unknown desc = failed to get volumesnapshot app-12345-425/velero-app-12345-425-as-dc1-app-123456pfrq: volumesnapshots.snapshot.storage.k8s.io \"velero-app-12345-425-as-dc1-app-123456pfrq\" not found" logSource="pkg/backup/backup.go:724" name=app-12345-425-as-dc1-app-12345-425-as-dc1-0
time="2024-12-03T04:09:50Z" level=info msg="Excluding item because resource is cluster-scoped and is excluded by cluster filter." backup=velero/data-backup-with-hook logSource="pkg/backup/item_backupper.go:133" name=csi-d3a39800-f02e-4456-83ec-0127bba2ad71 namespace= resource=persistentvolumes
time="2024-12-03T04:09:50Z" level=info msg="Excluding item because resource is excluded" backup=velero/data-backup-with-hook logSource="pkg/backup/item_backupper.go:140" name=app-12345-425-as-dc1-0 namespace=app-12345-425 resource=pods
time="2024-12-03T04:09:50Z" level=info msg="Backed up 2 items out of an estimated total of 2 (estimate will change throughout the backup)" backup=velero/data-backup-with-hook logSource="pkg/backup/backup.go:499" name=app-12345-425-as-dc1-app-12345-425-as-dc1-0 namespace=app-12345-425 progress= resource=persistentvolumeclaims
time="2024-12-03T04:09:50Z" level=info msg="Summary for skipped PVs: []" backup=velero/data-backup-with-hook logSource="pkg/backup/backup.go:542"
time="2024-12-03T04:09:50Z" level=info msg="Backed up a total of 2 items" backup=velero/data-backup-with-hook logSource="pkg/backup/backup.go:546" progress=
What did you expect to happen:
A successfully created backup.
Anything else you would like to add:
Creating a volume snapshot manually like the manifest below works, but it takes about 47 seconds to complete and reach the "Ready" state in Kubernetes. We should be able to raise the wait time to a few minutes so that the backup always succeeds (a small timing sketch follows the manifest).
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-manual-snapshot
  namespace: app-12345-425
spec:
  volumeSnapshotClassName: my-snapclass
  source:
    persistentVolumeClaimName: <claim_name>
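To confirm the ~47 second figure, here is a rough, hypothetical helper (not part of Velero) that polls the snapshot above with the dynamic client and reports how long status.readyToUse takes to become true; the kubeconfig path, namespace, and name are assumptions taken from the example manifest:

// Hypothetical timing helper, not Velero code. It measures how long the
// manually created VolumeSnapshot takes to report status.readyToUse.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes the default kubeconfig location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	gvr := schema.GroupVersionResource{
		Group:    "snapshot.storage.k8s.io",
		Version:  "v1",
		Resource: "volumesnapshots",
	}
	ns, name := "app-12345-425", "test-manual-snapshot"

	start := time.Now()
	deadline := start.Add(20 * time.Minute)
	for time.Now().Before(deadline) {
		vs, err := client.Resource(gvr).Namespace(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err == nil {
			// readyToUse flips to true once the CSI driver finishes the snapshot.
			if ready, found, _ := unstructured.NestedBool(vs.Object, "status", "readyToUse"); found && ready {
				fmt.Printf("snapshot ready after %s\n", time.Since(start).Round(time.Second))
				return
			}
		}
		time.Sleep(5 * time.Second)
	}
	fmt.Println("snapshot not ready within 20m")
}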
Environment:
- Velero version (use velero version):
  Client:
    Version: v1.15.0
    Git commit: -
  Server:
    Version: v1.15.0
- Velero features (use velero client config get features):
  features: <NOT SET>
- Kubernetes version (use kubectl version):
  Client Version: v1.29.1
  Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
  Server Version: v1.29.1
- Cloud provider or hardware configuration: Oracle Cloud Kubernetes
Vote on this issue!
This is an invitation to the Velero community to vote on issues; you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.
- 👍 for "I would like to see this bug fixed as soon as possible"
- 👎 for "There are more important bugs to focus on right now"
Please run kubectl get backup -oyaml and see if the timeout propagated to the backup object Velero would use to run the backup.
Also, please give an example of how you used --csi-snapshot-timeout.
Also, please give an example of how you used --csi-snapshot-timeout.
Sorry about missing that detail. I just updated the original shell command to show it. What's really strange is that I have a very small volume that I can snapshot just fine; it's just the ones with actual data (8 GB) that take longer and have this problem.
My demo volume is 50 GB but probably has less than 1 MB of dummy data, and it snapshots fine. The one with real data always fails.
I guess it is sending that value over? But then why is it timing out before it sees the VolumeSnapshotContents in the attached logs?
For the record, manual backups (outside of Velero) work just fine if I create a volume snapshot using Kubernetes manifests. So the Oracle CSI driver is working as expected, and I've just tried a different backup scheduler and it's not having issues at all.
I would like to keep using Velero though, since it has the neat CLI to go with it.
Please run kubectl get backup -oyaml and see if the timeout propagated to the backup object Velero would use to run the backup.
velero backup create manual-backup-11 \
--include-resources="pvc" \
--include-namespaces app-12345-425 \
--include-resources=persistentvolumeclaims \
--snapshot-volumes=true \
--selector='test=velero-backup' \
--csi-snapshot-timeout=20m
- apiVersion: velero.io/v1
  kind: Backup
  metadata:
    annotations:
      velero.io/resource-timeout: 10m0s
      velero.io/source-cluster-k8s-gitversion: v1.29.1
      velero.io/source-cluster-k8s-major-version: "1"
      velero.io/source-cluster-k8s-minor-version: "29"
    creationTimestamp: "2024-12-03T17:45:36Z"
    generation: 6
    labels:
      velero.io/storage-location: default
    name: manual-backup-11
    namespace: velero
    resourceVersion: "509394137"
    uid: 62b9fd3d-abfe-4a3e-b804-0d2278e21d17
  spec:
    csiSnapshotTimeout: 20m0s
    defaultVolumesToFsBackup: false
    hooks: {}
    includedNamespaces:
    - app-12345-425
    includedResources:
    - persistentvolumeclaims
    itemOperationTimeout: 4h0m0s
    labelSelector:
      matchLabels:
        test: velero-backup
    metadata: {}
    snapshotMoveData: false
    snapshotVolumes: true
    storageLocation: default
    ttl: 720h0m0s
  status:
    completionTimestamp: "2024-12-03T17:46:32Z"
    errors: 1
    expiration: "2025-01-02T17:45:41Z"
    formatVersion: 1.1.0
    hookStatus: {}
    phase: PartiallyFailed
    progress:
      itemsBackedUp: 2
      totalItems: 2
    startTimestamp: "2024-12-03T17:45:41Z"
    version: 1
Yours did not fail due to the timeout. Otherwise you would see the log "Timed out awaiting reconciliation of VolumeSnapshot, VolumeSnapshotContent %s has error: %v":
velero/pkg/util/csi/volume_snapshot.go, lines 754 to 772 in 23ca089
So I changed nothing, and it's working on a schedule now.
app-12345-425-20241203190531 Completed 0 0 2024-12-03 12:05:31 -0700 MST 29d default test=velero-backup
app-12345-425-20241203190031 Completed 0 0 2024-12-03 12:00:31 -0700 MST 29d default test=velero-backup
app-12345-425-20241203185531 Completed 0 0 2024-12-03 11:55:31 -0700 MST 29d default test=velero-backup
app-12345-425-20241203185031 Completed 0 0 2024-12-03 11:50:31 -0700 MST 29d default test=velero-backup
app-12345-425-20241203184531 Completed 0 0 2024-12-03 11:45:31 -0700 MST 29d default test=velero-backup
Was it reliably reproducible (how many out of 10 fail, etc.)? I think it might've been a one-off. We can add some retry in here to make it more reliable if the apiserver is slow.
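For what "add some retry" could look like, here is a purely hypothetical sketch, not an actual Velero patch; the retriable-error checks and backoff values are assumptions for illustration:

// Hypothetical sketch of retrying a flaky read (for example, fetching the
// VolumeSnapshot just after creation) instead of failing the item on the first
// transient error from a slow apiserver. Not an actual Velero change.
package main

import (
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/util/retry"
)

// getWithRetry retries fetch on errors that look transient, with a short
// exponential backoff, before giving up.
func getWithRetry(fetch func() error) error {
	backoff := retry.DefaultBackoff // assumed backoff; tune Steps/Duration as needed
	backoff.Steps = 5
	backoff.Duration = 500 * time.Millisecond

	return retry.OnError(backoff, func(err error) bool {
		// Assumption: "not found yet", server timeouts, and throttling are transient.
		return apierrors.IsNotFound(err) || apierrors.IsServerTimeout(err) || apierrors.IsTooManyRequests(err)
	}, fetch)
}

func main() {
	// Demo: the first two calls fail with NotFound, the third succeeds.
	gr := schema.GroupResource{Group: "snapshot.storage.k8s.io", Resource: "volumesnapshots"}
	attempts := 0
	err := getWithRetry(func() error {
		attempts++
		if attempts < 3 {
			return apierrors.NewNotFound(gr, "velero-demo-snapshot")
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "err:", err)
}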
Was it reliably reproducible (how many out of 10 fail, etc.)? I think it might've been a one-off. We can add some retry in here to make it more reliable if the apiserver is slow.
Yes, it was. For the past 24 hours it ran every 5 minutes and only one succeeded, and now it's succeeding every time.
OK, so not 100% reproducible given enough retries; so the environment flakes.
I can try to PR something. But since your environment is now succeeding every time, I assume you won't be able to test it.
OK, so not 100% reproducible given enough retries; so the environment flakes.
This is what I did just before it started working:
1.) I installed / set up a schedule for backups using Backube instead of Velero.
2.) I scaled down the StatefulSet to 0.
3.) I deleted the PVC.
4.) I restored using the manifest below.
5.) I scaled the StatefulSet back up to 1.
6.) Now the backups work.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-12345-425-as-as-dc1-app-app-12345-425-as-as-dc1-app-0
  namespace: app-12345-425
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 250Gi
  dataSource:
    name: app-12345-425-as-as-dc1-app-app-12345-425-as-as-dc1-app-0-app-12345-425-snapshot-202412031840
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
Then I scaled my StatefulSet back up.
I can try to PR something. But since your environment is now succeeding every time, I assume you won't be able to test it.
I can still test it because I have a lot more deployments that need to be backed up as well.