vmware-tanzu/velero

velero completely ignores `--csi-snapshot-timeout` flag

jmtx1020 opened this issue ยท 14 comments

What steps did you take and what happened:

I created a volume like this:

velero backup create data-backup \
  --include-namespaces app-12345-425 \
  --include-resources=persistentvolumeclaims \
  --selector='test=velero-backup' \
  --csi-snapshot-timeout=20m

Results in the following logs. it actually does create the volume backup but deletes it. I've timed how long it takes to create a backup manually and it takes about 47 seconds - so I think the issue might be in the fact it's only checking for 50 seconds.

time="2024-12-03T04:09:00Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot app-12345-425/velero-app-12345-425-as-dc1-app-123456pfrq. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:712" pluginName=velero
time="2024-12-03T04:09:05Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:10Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:15Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:20Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:25Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:30Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:35Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:40Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:45Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:50Z" level=info msg="Deleting Volumesnapshot app-12345-425/velero-app-12345-425-as-dc1-app-123456pfrq" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:485" pluginName=velero
time="2024-12-03T04:09:50Z" level=info msg="1 errors encountered backup up item" backup=velero/data-backup-with-hook logSource="pkg/backup/backup.go:720" name=app-12345-425-as-dc1-app-12345-425-as-dc1-0
time="2024-12-03T04:09:50Z" level=error msg="Error backing up item" backup=velero/data-backup-with-hook error="error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=app-12345-425, name=velero-app-12345-425-as-dc1-app-123456pfrq): rpc error: code = Unknown desc = failed to get volumesnapshot app-12345-425/velero-app-12345-425-as-dc1-app-123456pfrq: volumesnapshots.snapshot.storage.k8s.io \"velero-app-12345-425-as-dc1-app-123456pfrq\" not found" logSource="pkg/backup/backup.go:724" name=app-12345-425-as-dc1-app-12345-425-as-dc1-0
time="2024-12-03T04:09:50Z" level=info msg="Excluding item because resource is cluster-scoped and is excluded by cluster filter." backup=velero/data-backup-with-hook logSource="pkg/backup/item_backupper.go:133" name=csi-d3a39800-f02e-4456-83ec-0127bba2ad71 namespace= resource=persistentvolumes
time="2024-12-03T04:09:50Z" level=info msg="Excluding item because resource is excluded" backup=velero/data-backup-with-hook logSource="pkg/backup/item_backupper.go:140" name=app-12345-425-as-dc1-0 namespace=app-12345-425 resource=pods
time="2024-12-03T04:09:50Z" level=info msg="Backed up 2 items out of an estimated total of 2 (estimate will change throughout the backup)" backup=velero/data-backup-with-hook logSource="pkg/backup/backup.go:499" name=app-12345-425-as-dc1-app-12345-425-as-dc1-0 namespace=app-12345-425 progress= resource=persistentvolumeclaims
time="2024-12-03T04:09:50Z" level=info msg="Summary for skipped PVs: []" backup=velero/data-backup-with-hook logSource="pkg/backup/backup.go:542"
time="2024-12-03T04:09:50Z" level=info msg="Backed up a total of 2 items" backup=velero/data-backup-with-hook logSource="pkg/backup/backup.go:546" progress=

What did you expect to happen:
A successfully created backup.

Anything else you would like to add:

Creating a volume snapshot manually like this, but takes about 47 seconds to complete and be in the "Ready" state in kubernetes. We should be able to toggle the wait time to maybe a few minutes so that we can always succeed.

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-manual-snapshot
  namespace: app-12345-425
spec:
  volumeSnapshotClassName: my-snapclass
  source:
    persistentVolumeClaimName: <claim_name>

Environment:

  • Velero version (use velero version):
Client:
        Version: v1.15.0
        Git commit: -
Server:
        Version: v1.15.0
  • Velero features (use velero client config get features):
velero client config get features
features: <NOT SET>
  • Kubernetes version (use kubectl version):
Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.1
  • Cloud provider or hardware configuration: Oracle Cloud Kubernetes

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • ๐Ÿ‘ for "I would like to see this bug fixed as soon as possible"
  • ๐Ÿ‘Ž for "There are more important bugs to focus on right now"

kubectl get backup -oyaml
and see if timeout propagated to backup object velero would use to run the backup.

Also please give an example how you used --csi-snapshot-timeout

Also please give an example how you used --csi-snapshot-timeout

Sorry, about missing that detail. I just updated the original shell command to show it. What's really strange is I have a very small volume that I can snapshot just fine, it's just these that have actual data (8gbs) that take longer and have this problem.

My demo volume is 50gigs but probably has less than 1mb of dummy data and it works fine to snapshot. The one with real data always fails.

I guess it is sending that value over? but then why is it timing out before it sees the volume contents in the attached logs ๐Ÿค”

For the record manual backups (outside of velero) work just fine, if I create a volume snapshot using kubernetes manifests. So Oracle CSI is working as expected and I've just tried a different backup scheduler and its not having issues at all.

I would like to keep using velero though since it has the neat CLI to go with it.

kubectl get backup -oyaml and see if timeout propagated to backup object velero would use to run the backup.

velero backup create manual-backup-11 \
  --include-resources="pvc" \
  --include-namespaces app-12345-425 \
  --include-resources=persistentvolumeclaims \
  --snapshot-volumes=true \
  --selector='test=velero-backup' \
  --csi-snapshot-timeout=20m
- apiVersion: velero.io/v1
  kind: Backup
  metadata:
    annotations:
      velero.io/resource-timeout: 10m0s
      velero.io/source-cluster-k8s-gitversion: v1.29.1
      velero.io/source-cluster-k8s-major-version: "1"
      velero.io/source-cluster-k8s-minor-version: "29"
    creationTimestamp: "2024-12-03T17:45:36Z"
    generation: 6
    labels:
      velero.io/storage-location: default
    name: manual-backup-11
    namespace: velero
    resourceVersion: "509394137"
    uid: 62b9fd3d-abfe-4a3e-b804-0d2278e21d17
  spec:
    csiSnapshotTimeout: 20m0s
    defaultVolumesToFsBackup: false
    hooks: {}
    includedNamespaces:
    - app-12345-425
    includedResources:
    - persistentvolumeclaims
    itemOperationTimeout: 4h0m0s
    labelSelector:
      matchLabels:
        test: velero-backup
    metadata: {}
    snapshotMoveData: false
    snapshotVolumes: true
    storageLocation: default
    ttl: 720h0m0s
  status:
    completionTimestamp: "2024-12-03T17:46:32Z"
    errors: 1
    expiration: "2025-01-02T17:45:41Z"
    formatVersion: 1.1.0
    hookStatus: {}
    phase: PartiallyFailed
    progress:
      itemsBackedUp: 2
      totalItems: 2
    startTimestamp: "2024-12-03T17:45:41Z"
    version: 1

Yours did not fail due to timeout. Otherwise you will see log "Timed out awaiting reconciliation of VolumeSnapshot, VolumeSnapshotContent %s has error: %v",

if err != nil {
if wait.Interrupted(err) {
if vsc != nil &&
vsc.Status != nil &&
vsc.Status.Error != nil {
log.Errorf(
"Timed out awaiting reconciliation of VolumeSnapshot, VolumeSnapshotContent %s has error: %v",
vsc.Name, *vsc.Status.Error.Message)
return nil,
errors.Errorf("CSI got timed out with error: %v",
*vsc.Status.Error.Message)
} else {
log.Errorf(
"Timed out awaiting reconciliation of volumesnapshot %s/%s",
volSnap.Namespace, volSnap.Name)
}
}
return nil, err
}

So I changed, nothing. and it's working now on a schedule now.

app-12345-425-20241203190531   Completed         0        0          2024-12-03 12:05:31 -0700 MST   29d       default            test=velero-backup
app-12345-425-20241203190031   Completed         0        0          2024-12-03 12:00:31 -0700 MST   29d       default            test=velero-backup
app-12345-425-20241203185531   Completed         0        0          2024-12-03 11:55:31 -0700 MST   29d       default            test=velero-backup
app-12345-425-20241203185031   Completed         0        0          2024-12-03 11:50:31 -0700 MST   29d       default            test=velero-backup
app-12345-425-20241203184531   Completed         0        0          2024-12-03 11:45:31 -0700 MST   29d       default            test=velero-backup

Was it reliably reproducible (how many out of 10 fails etc.)? I think it might've been a oneoff.. we can add some retry in here to make it more reliable if apiserver is slow.

Was it reliably reproducible (how many out of 10 fails etc.)? I think it might've been a oneoff.. we can add some retry in here to make it more reliable if apiserver is slow.

Yes it was, for the past 24 hours it ran every 5 minutes and only one succeeded and now its succeeding everytime.

ok.. so not 100% reproducible given enough retries.. so env flakes.

I can try PR something. But since your env is now succeeding every time, I assume you won't be able to test it.

ok.. so not 100% reproducible given enough retries.. so env flakes.

this is what I did just before it started working:
1.) I installed / setup a schedule for backups using Backube instead of Velero
2.) I scaled down the statefulset to 0
3.) I deleted the PVC
4.) I restored using the below manifest
6.) I scaled back up the statefulset to 1
7.) Now the backups work.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-12345-425-as-as-dc1-app-app-12345-425-as-as-dc1-app-0
  namespace: app-12345-425
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 250Gi
  dataSource:
    name: app-12345-425-as-as-dc1-app-app-12345-425-as-as-dc1-app-0-app-12345-425-snapshot-202412031840 
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io

Then I scaled backup my statefulset.

I can try PR something. But since your env is now succeeding every time, I assume you won't be able to test it.

I can still test it because I have a lot more deployments that need to be backed up as well.

@jmtx1020 do you have argo by any chance?

If so, you could be hitting #7905

ghcr.io/kaovilai/velero:verboseCleanup-r1.15-b26183826
I made an image for you from #8480 btw.