kubernetes-csi/external-provisioner

Veritas InfoScale CSI provisioner integration with velero-csi and aws plugins: restoration from snapshot fails

Shreyashirwadkar opened this issue · 5 comments

We are using the Velero CSI plugin with CSI snapshots enabled to create backups.
Below is the command we used to install Velero:

velero install --provider aws --features=EnableCSI --plugins=velero/velero-plugin-for-csi:v0.4.0,velero/velero-plugin-for-aws:v1.6.0 --bucket mybkt --secret-file ./credentials-velero --use-volume-snapshots=True --backup-location-config region=minio,s3ForcePathStyle=True,s3Url=http://xx.xx.xx.xx:9000,publicUrl=http://xx.xx.xx.xx:9000 --snapshot-location-config region=default,profile=default

We used the velero backup command to create a namespace backup:

velero backup create postgres-backup-test --include-namespaces=postgres --wait

velero backup get

NAME                   STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
postgres-backup-test   Completed   0        0          2022-12-22 18:02:19 +0530 IST   29d       default            <none>

This backup creates the VolumeSnapshot and VolumeSnapshotContent correctly.

But when we delete the namespace and try to restore it from the backup, the snapshot is not created correctly, so the underlying PVC and pod stay in a Pending state. We see the errors below in the csi-snapshotter:

I1222 13:47:54.947439       1 connection.go:183] GRPC call: /csi.v1.Controller/ControllerGetCapabilities
I1222 13:47:54.947443       1 connection.go:184] GRPC request: {}
I1222 13:47:54.948341       1 connection.go:186] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":5}}},{"Type":{"Rpc":{"type":6}}},{"Type":{"Rpc":{"type":7}}}]}
I1222 13:47:54.948429       1 connection.go:187] GRPC error: <nil>
I1222 13:47:54.948434       1 connection.go:183] GRPC call: /csi.v1.Controller/ListSnapshots
I1222 13:47:54.948437       1 connection.go:184] GRPC request: {"snapshot_id":"snap_6k4f479bv8axq9hnu44n"}
I1222 13:47:55.310029       1 connection.go:186] GRPC response: {}
I1222 13:47:55.310089       1 connection.go:187] GRPC error: rpc error: code = Internal desc = parsing time "01:47:55 PM, +0000, UTC" as "2006-01-02 15:04": cannot parse "7:55 PM, +0000, UTC" as "2006"
E1222 13:47:55.310111       1 snapshot_controller.go:267] checkandUpdateContentStatusOperation: failed to call get snapshot status to check whether snapshot is ready to use "failed to list snapshot for content velero-velero-data-pvc-klw7f-rqzxh: \"rpc error: code = Internal desc = parsing time \\\"01:47:55 PM, +0000, UTC\\\" as \\\"2006-01-02 15:04\\\": cannot parse \\\"7:55 PM, +0000, UTC\\\" as \\\"2006\\\"\""
I1222 13:47:55.310121       1 snapshot_controller.go:143] updateContentStatusWithEvent[velero-velero-data-pvc-klw7f-rqzxh]
I1222 13:47:55.313204       1 snapshot_controller.go:189] updating VolumeSnapshotContent[velero-velero-data-pvc-klw7f-rqzxh] error status failed volumesnapshotcontents.snapshot.storage.k8s.io "velero-velero-data-pvc-klw7f-rqzxh" is forbidden: User "system:serviceaccount:infoscale-vtas:infoscale-csi-controller-17189" cannot patch resource "volumesnapshotcontents/status" in API group "snapshot.storage.k8s.io" at the cluster scope
I1222 13:47:55.313332       1 event.go:285] Event(v1.ObjectReference{Kind:"VolumeSnapshotContent", Namespace:"", Name:"velero-velero-data-pvc-klw7f-rqzxh", UID:"f7b313d3-c1c8-4991-aa4a-c2d1e5a142f8", APIVersion:"snapshot.storage.k8s.io/v1", ResourceVersion:"562474614", FieldPath:""}): type: 'Warning' reason: 'SnapshotContentCheckandUpdateFailed' Failed to check and update snapshot content: failed to list snapshot for content velero-velero-data-pvc-klw7f-rqzxh: "rpc error: code = Internal desc = parsing time \"01:47:55 PM, +0000, UTC\" as \"2006-01-02 15:04\": cannot parse \"7:55 PM, +0000, UTC\" as \"2006\""
E1222 13:47:55.313234       1 snapshot_controller.go:124] checkandUpdateContentStatus [velero-velero-data-pvc-klw7f-rqzxh]: error occurred failed to list snapshot for content velero-velero-data-pvc-klw7f-rqzxh: "rpc error: code = Internal desc = parsing time \"01:47:55 PM, +0000, UTC\" as \"2006-01-02 15:04\": cannot parse \"7:55 PM, +0000, UTC\" as \"2006\""
E1222 13:47:55.313401       1 snapshot_controller_base.go:265] could not sync content "velero-velero-data-pvc-klw7f-rqzxh": failed to list snapshot for content velero-velero-data-pvc-klw7f-rqzxh: "rpc error: code = Internal desc = parsing time \"01:47:55 PM, +0000, UTC\" as \"2006-01-02 15:04\": cannot parse \"7:55 PM, +0000, UTC\" as \"2006\""
I1222 13:47:55.313430       1 snapshot_controller_base.go:167] Failed to sync content "velero-velero-data-pvc-klw7f-rqzxh", will retry again: failed to list snapshot for content velero-velero-data-pvc-klw7f-rqzxh: "rpc error: code = Internal desc = parsing time \"01:47:55 PM, +0000, UTC\" as \"2006-01-02 15:04\": cannot parse \"7:55 PM, +0000, UTC\" as \"2006\""
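
Two separate problems are visible in this log. First, the Internal error returned on the ListSnapshots call looks like a Go time.Parse layout mismatch: the snapshot creation time comes back as a 12-hour clock string ("01:47:55 PM, +0000, UTC") but is parsed with the reference layout "2006-01-02 15:04". Below is a minimal sketch of that mismatch, assuming the parsing happens inside the InfoScale driver before it fills the CSI creation_time field (the driver code is not part of this issue), together with a layout that does match the reported string:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Creation-time string exactly as it appears in the csi-snapshotter log above.
	raw := "01:47:55 PM, +0000, UTC"

	// Parsing it with the layout from the error message reproduces the failure.
	if _, err := time.Parse("2006-01-02 15:04", raw); err != nil {
		fmt.Println("mismatched layout:", err)
	}

	// A layout that matches the string's actual shape: 12-hour clock with AM/PM,
	// numeric zone offset, and zone abbreviation. The string carries no date, so
	// the parsed time defaults to January 1 of year 0.
	t, err := time.Parse("03:04:05 PM, -0700, MST", raw)
	if err != nil {
		fmt.Println("unexpected:", err)
		return
	}
	fmt.Println("parsed:", t)
}

Second, and independent of the parse error, patching volumesnapshotcontents/status is forbidden for the infoscale-csi-controller service account. The upstream csi-snapshotter sidecar RBAC normally grants update/patch on the volumesnapshotcontents/status subresource at cluster scope, so the driver's deployment likely needs an equivalent ClusterRole rule.
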
oc get volumesnapshotcontents.snapshot.storage.k8s.io
NAME                                               READYTOUSE   RESTORESIZE   DELETIONPOLICY   DRIVER                  VOLUMESNAPSHOTCLASS       VOLUMESNAPSHOT          VOLUMESNAPSHOTNAMESPACE   AGE
snapcontent-51de52a1-a24e-4bab-b3ae-a5033281606c   true         1073741824    Retain           org.veritas.infoscale   csi-infoscale-snapclass   velero-data-pvc-klw7f   postgres                  75m << snapshotcontent created during backup.
velero-velero-data-pvc-klw7f-rqzxh                                            Retain           org.veritas.infoscale   csi-infoscale-snapclass   velero-data-pvc-klw7f   postgres                  32m << snapshotcontent created while restoring from backup.

oc get volumesnapshotclasses.snapshot.storage.k8s.io
NAME                      DRIVER                   DELETIONPOLICY   AGE
csi-infoscale-snapclass   org.veritas.infoscale    Retain           10d
csi-vsphere-vsc           csi.vsphere.vmware.com   Delete           14d
[root@bastion ~]# oc describe volumesnapshotclasses.snapshot.storage.k8s.io csi-infoscale-snapclass|grep -i label
Labels:           velero.io/csi-volumesnapshot-class=true
        f:labels:
    Manager:      kubectl-label
oc get volumesnapshots.snapshot.storage.k8s.io -n postgres
NAME                    READYTOUSE   SOURCEPVC   SOURCESNAPSHOTCONTENT                RESTORESIZE   SNAPSHOTCLASS             SNAPSHOTCONTENT                      CREATIONTIME   AGE
velero-data-pvc-klw7f   false                    velero-velero-data-pvc-klw7f-rqzxh                 csi-infoscale-snapclass   velero-velero-data-pvc-klw7f-rqzxh                  32m

oc get volumesnapshotcontents.snapshot.storage.k8s.io --all-namespaces
NAME                                               READYTOUSE   RESTORESIZE   DELETIONPOLICY   DRIVER                  VOLUMESNAPSHOTCLASS       VOLUMESNAPSHOT          VOLUMESNAPSHOTNAMESPACE   AGE
snapcontent-51de52a1-a24e-4bab-b3ae-a5033281606c   true         1073741824    Retain           org.veritas.infoscale   csi-infoscale-snapclass   velero-data-pvc-klw7f   postgres                  12d
velero-velero-data-pvc-klw7f-rqzxh                                            Retain           org.veritas.infoscale   csi-infoscale-snapclass   velero-data-pvc-klw7f   postgres                  12d

oc get volumesnapshots --all-namespaces
NAMESPACE   NAME                    READYTOUSE   SOURCEPVC   SOURCESNAPSHOTCONTENT                RESTORESIZE   SNAPSHOTCLASS             SNAPSHOTCONTENT                      CREATIONTIME   AGE
postgres    velero-data-pvc-klw7f   false                    velero-velero-data-pvc-klw7f-rqzxh                 csi-infoscale-snapclass   velero-velero-data-pvc-klw7f-rqzxh                  12d

What did you expect to happen:
Restoring from the Velero backup should complete correctly. We tested this with an earlier release of the CSI plugin (v0.1.0) and it worked well.

Environment:

Velero version (use velero version): 1.10
Velero features (use velero client config get features): Velero CSI, AWS
Kubernetes version (use kubectl version): Kubernetes Version: v1.24.0+dc5a2fd
Kubernetes installer & version:
Cloud provider or hardware configuration: OpenShift 4.11/4.10
OS (e.g. from /etc/os-release): coreos

But when we delete the namespace and try to restore it from the backup, the snapshot is not created correctly, so the underlying PVC and pod stay in a Pending state. We see the errors below in the csi-snapshotter.

Why do you need to create a snapshot when you are doing a restore? Can you provide the restore command you run?
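
(For reference, a restore from the backup above would typically be created with something like the following; the restore name here is hypothetical and the reporter's actual command is not shown in this issue.)

velero restore create postgres-restore-test --from-backup postgres-backup-test --wait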

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.