mariadb-operator/mariadb-operator

[Feature] Point In Time Recovery (PITR)


Describe the solution you'd like
Ability to restore the state of MariaDB to a particular point in time, minimizing the RPO (data loss) and RTO (time to recover) as much as possible.

To achieve this, we will need to combine:

  • Full physical backups: A snapshot of the database's data files at a given moment, kept in an S3-compatible object store. Taken periodically by a CronJob that mounts the MariaDB PVC.
  • Binlog archival: Binary log files that contain the database events that happened after the full physical backup was taken. Our sidecar agent will periodically archive the binlog files, in parallel, to an S3-compatible object store.
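
The parallel archival step could look roughly like the sketch below, written in Python for brevity. It is not the agent's actual implementation: `archive_binlogs`, the `upload` callable and the binlog file names are hypothetical, and `max_parallel` only mirrors the `maxParallel` field of the proposed specification. A bounded thread pool caps the number of concurrent uploads, and failures are collected so that the next archival run can retry them. The `archiveTimeout` setting is not modeled here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def archive_binlogs(
    binlogs: Iterable[str],
    upload: Callable[[str], None],
    max_parallel: int = 10,
) -> tuple[list[str], dict[str, Exception]]:
    """Upload binlog files concurrently, at most `max_parallel` at a time.

    Returns (archived, failed) so that a later run can retry the failures.
    """
    archived: list[str] = []
    failed: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # Map each pending upload back to the file it belongs to.
        futures = {pool.submit(upload, name): name for name in binlogs}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()
                archived.append(name)
            except Exception as exc:
                # One failed upload must not block the remaining files.
                failed[name] = exc
    return sorted(archived), failed


# Usage with a stand-in uploader; a real agent would call an S3 client here.
uploaded: list[str] = []
archived, failed = archive_binlogs(
    ["mariadb-bin.000001", "mariadb-bin.000002"],  # illustrative file names
    upload=uploaded.append,
    max_parallel=2,
)
```

Returning the failures instead of raising keeps a single unreachable object from delaying the archival of the other files, which matters for keeping the RPO low.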

To configure PITR for a MariaDB resource, we will provide the following specification:

apiVersion: k8s.mariadb.com/v1alpha1
kind: MariaDB
metadata:
  name: mariadb-pitr-blue
spec:
  ...
  pitr:
    enabled: true
    targetPod: Primary # or Replica
    fullBackup:
      schedule:
        cron: "0 */1 * * *"
        suspend: false
      maxRetention: 1440h # 60 days
      compression: gzip
      storage:
        s3:
          bucket: full-backup
          endpoint: minio.minio.svc.cluster.local:9000
          accessKeyIdSecretKeyRef:
            name: minio
            key: access-key-id
          secretAccessKeySecretKeyRef:
            name: minio
            key: secret-access-key
          tls:
            enabled: true
            caSecretKeyRef:
              name: minio-ca
              key: ca.crt
    binlog:
      archiveInterval: 5m
      archiveTimeout: 5m
      maxParallel: 10
      maxRetention: 720h # 30 days
      compression: gzip
      storage:
        s3:
          bucket: binlog
          endpoint: minio.minio.svc.cluster.local:9000
          accessKeyIdSecretKeyRef:
            name: minio
            key: access-key-id
          secretAccessKeySecretKeyRef:
            name: minio
            key: secret-access-key
          tls:
            enabled: true
            caSecretKeyRef:
              name: minio-ca
              key: ca.crt  

This will populate the object stores with both full physical backups and binary logs, which can be used later on to bootstrap a new MariaDB instance:

apiVersion: k8s.mariadb.com/v1alpha1
kind: MariaDB
metadata:
  name: mariadb-pitr-green
spec:
  ...
  bootstrapFrom:
    pitr:
      fullBackup:
        storage:
          s3:
            bucket: full-backup
            endpoint: minio.minio.svc.cluster.local:9000
            accessKeyIdSecretKeyRef:
              name: minio
              key: access-key-id
            secretAccessKeySecretKeyRef:
              name: minio
              key: secret-access-key
            tls:
              enabled: true
              caSecretKeyRef:
                name: minio-ca
                key: ca.crt
      binlog:
        storage:
          s3:
            bucket: binlog
            endpoint: minio.minio.svc.cluster.local:9000
            accessKeyIdSecretKeyRef:
              name: minio
              key: access-key-id
            secretAccessKeySecretKeyRef:
              name: minio
              key: secret-access-key
            tls:
              enabled: true
              caSecretKeyRef:
                name: minio-ca
                key: ca.crt 
    targetRecoveryTime: 2023-12-19T09:00:00Z 

It is important to note that the initial restoration/bootstrapping process happens offline, before the MariaDB instance is provisioned, so it has no impact on database performance or availability. It will be carried out by either an init Job (Galera) or an init container on each instance (replication).
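
As a rough illustration of what that offline step has to decide, the sketch below picks the newest full backup taken at or before `targetRecoveryTime` and the archived binlogs that need to be replayed on top of it. Everything here is an assumption made for the example: `BinlogArchive`, `plan_restore` and the object names are hypothetical, and selecting by timestamp is a simplification. A real implementation would more likely rely on the binary log coordinates recorded alongside the physical backup.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class BinlogArchive:
    name: str
    first_event: datetime  # timestamp of the first event in the file
    last_event: datetime   # timestamp of the last event in the file


def plan_restore(full_backups, binlogs, target):
    """Return (backup_time, binlogs_to_replay) for recovering to `target`."""
    candidates = [t for t in full_backups if t <= target]
    if not candidates:
        raise ValueError("no full backup taken at or before the target recovery time")
    base = max(candidates)
    # Keep binlogs that may hold events in (base, target]; the replay itself
    # must still stop at `target` inside the last selected file.
    replay = [b for b in binlogs if b.last_event > base and b.first_event <= target]
    return base, sorted(replay, key=lambda b: b.first_event)


def at(hour, minute=0):
    # Helper for the example: a UTC timestamp on 2023-12-19.
    return datetime(2023, 12, 19, hour, minute, tzinfo=timezone.utc)


# Usage: recover to 09:00 with hourly-ish backups and four archived binlogs.
base, replay = plan_restore(
    full_backups=[at(6), at(8), at(10)],
    binlogs=[
        BinlogArchive("binlog-1", at(7), at(7, 59)),
        BinlogArchive("binlog-2", at(7, 59), at(8, 30)),
        BinlogArchive("binlog-3", at(8, 30), at(9, 30)),
        BinlogArchive("binlog-4", at(9, 30), at(10, 30)),
    ],
    target=at(9),
)
```

In this example the 08:00 backup is chosen as the base and only `binlog-2` and `binlog-3` are selected, so the amount of data to download and replay, and therefore the RTO, shrinks as full backups become more frequent.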

Initially, the plan is to use this disaster recovery strategy for Galera, as it already runs the agent required for log archival. However, the replication architecture could catch up with Galera and adopt the agent so that PITR can be performed there as well. Not only would this be beneficial for backing up and restoring replication clusters, but also for spinning up new replicas or restoring existing ones. We currently have multiple open issues where either the replica is in a bad state and needs to be restored, or the master has purged binary logs and it is no longer possible to perform the full replication:

Regarding physical backups with mariadb-backup, this has already been attempted and there is a WIP PR:

Also, this PITR feature would supersede the initial strategy we considered for physical backups with mariadb-backup:

Describe alternatives you've considered
Our current Backup and Restore CRs are based on logical backups (SQL dumps), which work well with small databases but are not ideal for business-critical scenarios:

  • Large logical backups can take a lot of time to be restored, leading to increased RTO.
  • They are not taken continuously, leading to increased RPO.
  • The restoration process is done by a Kubernetes Job pointing to a running database, which might imply a performance overhead or even unavailability in the worst-case scenario.

Additional context

Backup compression can be tackled as part of the following issue:

That issue should cover compression for both logical backups and PITR backups.