[Feature] Point In Time Recovery (PITR)
Describe the solution you'd like
Ability to restore the state of MariaDB to a particular point in time, minimizing the RPO (data loss) and RTO (time to recover) as much as possible.
To achieve this, we will need to combine:
- Full physical backups: Snapshots of the database state files, kept in an S3-compatible object store. They are taken periodically by a CronJob that mounts the MariaDB PVC.
- Binlog archival: Binary log files containing the database events that happened after the full physical backup was taken. Our sidecar agent will periodically archive the binlog files, in parallel, to an S3-compatible object store.
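To illustrate the parallel archival idea, here is a minimal sketch, not the agent's actual implementation. The `archive_binlogs` function and the `upload` callable (a stand-in for an S3 client) are hypothetical names, and `max_parallel` mirrors the `maxParallel` setting proposed below:

```python
import gzip
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def archive_binlogs(
    binlogs: Dict[str, bytes],
    upload: Callable[[str, bytes], None],
    max_parallel: int = 10,
) -> List[str]:
    """Compress and upload each binlog using at most max_parallel workers.

    `upload` is a hypothetical stand-in for an object store client call.
    """

    def archive_one(item: Tuple[str, bytes]) -> str:
        name, data = item
        key = f"{name}.gz"
        upload(key, gzip.compress(data))
        return key

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() yields results in input order and re-raises upload errors.
        return list(pool.map(archive_one, sorted(binlogs.items())))
```

Bounding the worker pool keeps the number of concurrent uploads predictable, so archival does not compete with the database for network and CPU more than intended.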
To configure PITR in a MariaDB resource, we will provide the following specification:
apiVersion: k8s.mariadb.com/v1alpha1
kind: MariaDB
metadata:
  name: mariadb-pitr-blue
spec:
  ...
  pitr:
    enabled: true
    targetPod: Primary # or Replica
    fullBackup:
      schedule:
        cron: "0 */1 * * *" # every hour, at minute 0
        suspend: false
      maxRetention: 1440h # 60 days
      compression: gzip
      storage:
        s3:
          bucket: full-backup
          endpoint: minio.minio.svc.cluster.local:9000
          accessKeyIdSecretKeyRef:
            name: minio
            key: access-key-id
          secretAccessKeySecretKeyRef:
            name: minio
            key: secret-access-key
          tls:
            enabled: true
            caSecretKeyRef:
              name: minio-ca
              key: ca.crt
    binlog:
      archiveInterval: 5m
      archiveTimeout: 5m
      maxParallel: 10
      maxRetention: 720h # 30 days
      compression: gzip
      storage:
        s3:
          bucket: binlog
          endpoint: minio.minio.svc.cluster.local:9000
          accessKeyIdSecretKeyRef:
            name: minio
            key: access-key-id
          secretAccessKeySecretKeyRef:
            name: minio
            key: secret-access-key
          tls:
            enabled: true
            caSecretKeyRef:
              name: minio-ca
              key: ca.crt
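As a rough sketch of how the `maxRetention` values could be interpreted, the snippet below parses the duration strings and lists the objects that have aged out. It assumes that retention means "delete objects older than now minus `maxRetention`", which the proposal does not spell out. `parse_duration` and `expired` are hypothetical helpers, and only the `h`, `m` and `s` units are handled:

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Dict, List

_UNITS = {"h": "hours", "m": "minutes", "s": "seconds"}


def parse_duration(text: str) -> timedelta:
    """Parse a subset of Go-style durations such as '720h', '5m' or '1h30m'."""
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject anything that is not fully covered by number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum((timedelta(**{_UNITS[u]: int(n)}) for n, u in parts), timedelta())


def expired(objects: Dict[str, datetime], max_retention: str, now: datetime) -> List[str]:
    """Return the keys of objects older than the retention window (assumed semantics)."""
    cutoff = now - parse_duration(max_retention)
    return sorted(key for key, taken_at in objects.items() if taken_at < cutoff)
```

For example, `parse_duration("720h")` equals `timedelta(days=30)` and `parse_duration("1440h")` equals `timedelta(days=60)`.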
With this configuration, the object stores will be populated with both full physical backups and binary logs, which can later be used to bootstrap a new MariaDB instance:
apiVersion: k8s.mariadb.com/v1alpha1
kind: MariaDB
metadata:
  name: mariadb-pitr-green
spec:
  ...
  bootstrapFrom:
    pitr:
      fullBackup:
        storage:
          s3:
            bucket: full-backup
            endpoint: minio.minio.svc.cluster.local:9000
            accessKeyIdSecretKeyRef:
              name: minio
              key: access-key-id
            secretAccessKeySecretKeyRef:
              name: minio
              key: secret-access-key
            tls:
              enabled: true
              caSecretKeyRef:
                name: minio-ca
                key: ca.crt
      binlog:
        storage:
          s3:
            bucket: binlog
            endpoint: minio.minio.svc.cluster.local:9000
            accessKeyIdSecretKeyRef:
              name: minio
              key: access-key-id
            secretAccessKeySecretKeyRef:
              name: minio
              key: secret-access-key
            tls:
              enabled: true
              caSecretKeyRef:
                name: minio-ca
                key: ca.crt
      targetRecoveryTime: 2023-12-19T09:00:00Z
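The selection logic implied by `targetRecoveryTime` could look roughly like the sketch below: pick the most recent full backup taken at or before the target, then replay the binlogs that cover the gap. The `FullBackup` and `Binlog` types and the `plan_recovery` function are hypothetical, and matching on timestamps is a simplifying assumption. A real implementation would more likely rely on the binlog position or GTID recorded with the backup.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Tuple


@dataclass(frozen=True)
class FullBackup:
    key: str
    taken_at: datetime


@dataclass(frozen=True)
class Binlog:
    key: str
    first_event: datetime
    last_event: datetime


def plan_recovery(
    backups: List[FullBackup], binlogs: List[Binlog], target: datetime
) -> Tuple[FullBackup, List[Binlog]]:
    """Choose the base backup and the ordered binlogs needed to reach target."""
    candidates = [b for b in backups if b.taken_at <= target]
    if not candidates:
        raise ValueError("no full backup at or before the target recovery time")
    base = max(candidates, key=lambda b: b.taken_at)
    # Keep binlogs that overlap the window (base.taken_at, target].
    replay = sorted(
        (
            log
            for log in binlogs
            if log.last_event > base.taken_at and log.first_event <= target
        ),
        key=lambda log: log.first_event,
    )
    return base, replay
```

Events in the last selected binlog that fall after the target still have to be skipped at replay time, so this only narrows down which objects need to be downloaded.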
It is important to note that the initial restoration/bootstrapping process happens offline, before the MariaDB instance is provisioned, so it has no impact on database performance or availability. This will be done by either an init Job (Galera) or an init container on each instance (replication).
Initially, the plan is to use this disaster recovery strategy for Galera, as we are already making use of the agent, which is required for log archival. However, the replication architecture could catch up with Galera and adopt the agent so that PITR can be performed there as well. This would be beneficial not only for backing up and restoring replication clusters, but also for spinning up new replicas or restoring existing ones. We currently have multiple open issues where either the replica is in a bad state and needs to be restored, or the master has purged binary logs and full replication is no longer possible:
Regarding physical backups with mariabackup, this has already been attempted and there is a WIP PR:
Also, this PITR feature would supersede the initial strategy we considered for physical backups with mariabackup:
Describe alternatives you've considered
Our current Backup and Restore CRs are based on logical backups (SQL dumps), which work well with small databases but are not ideal for business-critical scenarios:
- Large logical backups can take a lot of time to be restored, leading to increased RTO.
- They are not taken continuously, leading to increased RPO.
- The restoration process is done by a Kubernetes Job pointing to a running database, which might imply a performance overhead or even unavailability in the worst-case scenario.
Additional context
- https://aws.amazon.com/blogs/mt/establishing-rpo-and-rto-targets-for-cloud-applications/#:~:text=As%20a%20quick%20refresher%2C%20RTO,loss%20your%20application%20can%20tolerate.
- https://mariadb.com/kb/en/mariabackup/
- https://mariadb.com/kb/en/overview-of-mariadb-logs/#overview-of-the-binary-logbinary-log
- https://mariadb.com/kb/en/mariadb-binlog/
- https://minervadb.xyz/step-by-step-guide-to-point-in-time-recovery-in-mariadb-using-mariabackup/