Unable to trigger a PVC swap based on annotation
MassimoVlacancich opened this issue · 3 comments
What happened?
After installing Gemini on my cluster and following the provided instructions to back up a volume, I was unable to reinstate an older snapshot.
What did you expect to happen?
We let Gemini create a first snapshot.
We then wrote some data by hand at the mount point of the volume being backed-up (details below)
We let Gemini create a second snapshot.
We then ran the commands below to roll back to the first snapshot, where the file should not be present:
kubectl scale all --all --replicas=0
kubectl annotate snapshotgroup/dev-postgres-backup --overwrite "gemini.fairwinds.com/restore=1711982369"
kubectl scale all --all --replicas=1
But despite this, when navigating to the mount point within the postgres pod that mounts the volume claim being backed up, we still see the file.
In short, the restore doesn't seem to be working as expected; the same applies when writing data within the DB, which writes to the pgdata directory at the mount point.
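(One caveat worth noting on our side, sketched here under the assumption that everything lives in the dev namespace: kubectl scale acts on the current context's namespace, so the explicit equivalent of the commands above would be:)
kubectl -n dev scale all --all --replicas=0
# ...annotate the snapshotgroup, then...
kubectl -n dev scale all --all --replicas=1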
How can we reproduce this?
We are using k8s 1.25 and installed the latest version of Gemini with v2 CRDs.
(FYI, I don't think the v1beta1 CRD exists at https://raw.githubusercontent.com/FairwindsOps/gemini/main/pkg/types/snapshotgroup/v1beta1/crd-with-beta1.yaml; we instead installed the one at https://raw.githubusercontent.com/FairwindsOps/gemini/main/pkg/types/snapshotgroup/v1/crd-with-beta1.yaml, i.e. /v1 vs /v1beta1.)
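For anyone reproducing this, the API versions actually served by the installed CRD can be checked with standard kubectl (illustrative; the CRD name assumes a stock Gemini install):
kubectl get crd snapshotgroups.gemini.fairwinds.com -o jsonpath='{.spec.versions[*].name}'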
We are using a PVC to provide a mount point where our postgres-db can write its data, below is the config (simplified for convenience):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dev-postgres
  namespace: dev
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - envFrom:
            - configMapRef:
                name: dev-postgres-config
          image: postgres:latest
          imagePullPolicy: IfNotPresent
          name: postgres
          ports:
            - containerPort: 5432
          volumeMounts:
            - mountPath: /var/lib/postgresql/data
              name: postgresdb-data-volume
      hostname: postgres
      volumes:
        - name: postgresdb-data-volume
          persistentVolumeClaim:
            claimName: dev-postgres-claim
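To double-check that the running pod really mounts this claim, the pod's volumes can be dumped (illustrative; assumes the app=postgres label above):
kubectl -n dev get pod -l app=postgres -o jsonpath='{.items[0].spec.volumes}'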
The persistent volume is only provisioned dynamically by GCP once the volume claim manifest is applied and the DB mounts it:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    app: postgres
  name: dev-postgres-claim
  namespace: dev
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: standard-rwo
We have specifically adjusted this to be standard-rwo (ReadWriteOnce) to ensure that the data on the volume isn't being modified by multiple nodes at the same time when a snapshot is being taken:
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-4a079b5e-91f8-400f-97ca-99ea609b4f4e 2Gi RWO Delete Bound dev/dev-postgres-claim standard-rwo 14d
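Since a successful restore should leave the claim bound to a different underlying volume, the bound PV name seems worth recording before and after annotating (illustrative):
kubectl -n dev get pvc dev-postgres-claim -o jsonpath='{.spec.volumeName}'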
Moreover, we defined our VolumeSnapshotClass as follows before adding the SnapshotGroup config:
apiVersion: snapshot.storage.k8s.io/v1
deletionPolicy: Delete
driver: pd.csi.storage.gke.io
kind: VolumeSnapshotClass
metadata:
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
  name: dev-gcp-csi-snapshotclass
  namespace: dev
---
apiVersion: gemini.fairwinds.com/v1
kind: SnapshotGroup
metadata:
  name: dev-postgres-backup
  namespace: dev
spec:
  persistentVolumeClaim:
    claimName: dev-postgres-claim
  schedule:
    - every: 5 minutes
      keep: 3
  template:
    spec:
      volumeSnapshotClassName: dev-gcp-csi-snapshotclass
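Once applied, the generated snapshots can be listed with standard kubectl; the output below comes from a listing along these lines:
kubectl -n dev get volumesnapshot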
The above works as expected once in place, and we see the snapshots being created and in a Ready state:
NAME READYTOUSE SOURCEPVC SOURCESNAPSHOTCONTENT RESTORESIZE SNAPSHOTCLASS SNAPSHOTCONTENT CREATIONTIME AGE
dev-postgres-backup-1711982369 true dev-postgres-claim 2Gi dev-gcp-csi-snapshotclass snapcontent-b857786d-aa1c-4e82-9b2b-50f801c8c6ef 17m 17m
dev-postgres-backup-1711982669 true dev-postgres-claim 2Gi dev-gcp-csi-snapshotclass snapcontent-16337f17-f2bc-41a8-b630-581b5f06f888 12m 12m
dev-postgres-backup-1711982969 true dev-postgres-claim 2Gi dev-gcp-csi-snapshotclass snapcontent-2c126506-c999-4dba-930c-009034405a4d 7m33s 7m37s
dev-postgres-backup-1711983269 true dev-postgres-claim 2Gi dev-gcp-csi-snapshotclass snapcontent-b656ab5a-ea09-4ba3-ad00-4dc6bb4e110f 2m33s 2m37s
As detailed above, we then wrote some data by hand (added a file at the mount location /var/lib/postgresql/data) between snapshots 1 and 2 above (1711982369 does not have the file, while 1711982669 does).
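(For concreteness, the marker file was written along these lines; the file name here is illustrative:)
kubectl -n dev exec deploy/dev-postgres -- touch /var/lib/postgresql/data/marker.txt
kubectl -n dev exec deploy/dev-postgres -- ls /var/lib/postgresql/data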
We then ran the following commands to restore the first snapshot, which does not contain the file:
kubectl scale all --all --replicas=0
kubectl annotate snapshotgroup/dev-postgres-backup --overwrite "gemini.fairwinds.com/restore=1711982369"
kubectl scale all --all --replicas=1
But despite this, when navigating to /var/lib/postgresql/data within the postgres pod that mounts the volume claim being backed up, we still see the file.
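To rule out the annotation simply not landing, it can be read back after the annotate command (illustrative; prints all annotations on the group):
kubectl -n dev get snapshotgroup dev-postgres-backup -o jsonpath='{.metadata.annotations}'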
What I also find interesting is that the PVC still shows its age as 14d; I would expect it to be a brand-new claim reinstated from the snapshot.
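One way to tell whether the claim was actually recreated is to look at its creationTimestamp and dataSource; if Gemini had swapped the PVC, I'd expect a recent timestamp and a dataSource referencing the chosen VolumeSnapshot (illustrative):
kubectl -n dev get pvc dev-postgres-claim -o yaml | grep -A3 -E 'creationTimestamp|dataSource'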
When investigating the logs of the gemini-controller pod, I don't see any specific errors after the post-annotation restart, nor anything that points to the swap being successful:
I0401 14:25:59.640654 1 controller.go:179] Starting SnapshotGroup controller
I0401 14:25:59.640679 1 controller.go:181] Waiting for informer caches to sync
I0401 14:25:59.641048 1 reflector.go:287] Starting reflector *v1.SnapshotGroup (30s) from pkg/mod/k8s.io/client-go@v0.27.1/tools/cache/reflector.go:231
I0401 14:25:59.641071 1 reflector.go:323] Listing and watching *v1.SnapshotGroup from pkg/mod/k8s.io/client-go@v0.27.1/tools/cache/reflector.go:231
I0401 14:25:59.740942 1 shared_informer.go:341] caches populated
I0401 14:25:59.741060 1 controller.go:186] Starting workers
I0401 14:25:59.741136 1 controller.go:191] Started workers
I0401 14:25:59.741204 1 groups.go:38] dev/dev-postgres-backup: reconciling
I0401 14:25:59.750629 1 pvc.go:48] dev/dev-postgres-claim: PVC found
I0401 14:25:59.750661 1 groups.go:29] dev/dev-postgres-backup: updating PVC spec
W0401 14:25:59.761896 1 warnings.go:70] unknown field "spec.persistentVolumeClaim.spec.volumeMode"
W0401 14:25:59.761922 1 warnings.go:70] unknown field "spec.template.spec.source"
W0401 14:25:59.761928 1 warnings.go:70] unknown field "status"
I0401 14:25:59.770677 1 groups.go:53] dev/dev-postgres-backup: found 2 existing snapshots
I0401 14:25:59.770713 1 scheduler.go:58] Checking snapshot dev/dev-postgres-backup-1711981440
I0401 14:25:59.770723 1 scheduler.go:58] Checking snapshot dev/dev-postgres-backup-1711981140
I0401 14:25:59.770731 1 scheduler.go:91] need creation 5 minutes false
I0401 14:25:59.770739 1 groups.go:59] dev/dev-postgres-backup: going to create 0, delete 0 snapshots
I0401 14:25:59.770746 1 snapshots.go:204] Deleting 0 expired snapshots
I0401 14:25:59.770755 1 groups.go:65] dev/dev-postgres-backup: deleted 0 snapshots
I0401 14:25:59.770762 1 groups.go:71] dev/dev-postgres-backup: created 0 snapshots
I0401 14:25:59.770790 1 controller.go:144] dev/dev-postgres-backup: successfully performed backup
I0401 14:26:29.648307 1 reflector.go:376] pkg/mod/k8s.io/client-go@v0.27.1/tools/cache/reflector.go:231: forcing resync
I0401 14:26:29.648435 1 groups.go:38] dev/dev-postgres-backup: reconciling
I0401 14:26:29.656518 1 pvc.go:48] dev/dev-postgres-claim: PVC found
I0401 14:26:29.656643 1 groups.go:29] dev/dev-postgres-backup: updating PVC spec
W0401 14:26:29.665339 1 warnings.go:70] unknown field "spec.persistentVolumeClaim.spec.volumeMode"
W0401 14:26:29.665368 1 warnings.go:70] unknown field "spec.template.spec.source"
W0401 14:26:29.665373 1 warnings.go:70] unknown field "status"
I0401 14:26:29.673816 1 groups.go:53] dev/dev-postgres-backup: found 2 existing snapshots
I0401 14:26:29.673851 1 scheduler.go:58] Checking snapshot dev/dev-postgres-backup-1711981440
I0401 14:26:29.673861 1 scheduler.go:58] Checking snapshot dev/dev-postgres-backup-1711981140
I0401 14:26:29.673869 1 scheduler.go:91] need creation 5 minutes false
I0401 14:26:29.673876 1 groups.go:59] dev/dev-postgres-backup: going to create 0, delete 0 snapshots
I0401 14:26:29.673884 1 snapshots.go:204] Deleting 0 expired snapshots
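(In case it helps, this is how restore-related lines could be filtered out of the controller logs; the deployment name and namespace placeholder here are assumptions based on a default install:)
kubectl -n <gemini-namespace> logs deploy/gemini-controller | grep -i restore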
We would appreciate your input in resolving this; maybe it has to do with our cluster setup, or maybe a config issue with the PVs. I've requested access to the Slack channel and am waiting on approval :)
Thanks in advance,
Massimo
Version
Version 2.0 - Kubernetes 1.25
Search
- I did search for other open and closed issues before opening this.
Code of Conduct
- I agree to follow this project's Code of Conduct
Additional context
No response
Hi team, could I seek your help on the above please? Happy to provide more details if required :)
Hi all, just chasing again, we are keen to rely on Gemini :)
Hi all, chasing again, would appreciate some help on this one :)