This is a bash script that can be used for backing up GCP Cloud SQL instances. It works by restoring an existing GCP managed backup to a new ephemeral db instance, and exporting an SQL dump from there into a GCS bucket.
The cloud_sql_backup.sh
script can be used on its own, or in a Docker image we've prepared.
-
An automated or on-demand GCP managed backup. Go here for help enabling.
-
The host running the script must have these tools installed:
- cut
- date
- gcloud
- head
- sed
- tr
The script requires some environment variables in order to function correctly:
Name | Description | Example |
---|---|---|
DB_NAME | The database name that'll be exported to GCS | "my-db" |
INSTANCE_CPU | vCPUs of the ephemeral instance | "4" |
INSTANCE_ENV | Name of environment the backup process is running in. It's used in the ephemeral instance name | "nonprod" |
INSTANCE_MEM | Memory of instance (multiple of 256 MiB, min 3840MiB) | "7680MiB" |
INSTANCE_NAME_PREFIX | Prefix to add to the start of instance name | "my-backup" |
INSTANCE_REGION | Instance region | "europe-west1" |
INSTANCE_STORAGE_TYPE | SSD (default) or HDD | "SSD" |
PROJECT | The GCP project | "my-gcp-project" |
SA_KEY_FILEPATH | The path to the GCP service account's .json key | "/secrets/gcp/backup-key.json" |
SOURCE_BACKUP_INSTANCE | Name of instance that you want backing up | "uat-db" |
TARGET_BACKUP_BUCKET | URI of GCS bucket that the backup file will get written to | "gs://my-gcs-bucket" |
Optionally, you can set some env vars for metrics:
Name | Description | Example |
---|---|---|
DATADOG_API_KEY | The Datadog API key | "57dsfjho34kj3h4oknk4jj55" |
TEAM | Team name, used for tagging metrics | "my-team" |
The following metrics should appear in Datadog (depending on the success of subsequent backup attempts):
cloud.sql.backup.started.count
cloud.sql.backup.success.count
cloud.sql.backup.failure.count
cloud.sql.backup.cleanup.count
All metrics are of type: count
, and a value of 1 is posted when each phase of
the backup process has been reached.
GCP provide their own automated daily backup process, which is very easy to enable, so why is this required?
According to GCP, "All data on an instance, including backups, is permanently lost when that instance is deleted". They recommend exporting your data to cloud storage for longer term persistence. That export process, however, can have a serious impact on the performance of your database while it's taking place.
This script solves these problems by:
- Creating a database instance from the latest automated GCP-managed backup
- Giving the database instance the required permissions to write to a GCS bucket
- Exporting to an SQL dump file in a GCS bucket
The ephemeral database instance and elevated permissions are then deleted/removed.
When this script runs using a GCP service account, it'll need a specific set of permissions in order to elevate the database instance service account's permissions, and check that the SQL dump file has appeared in the GCS bucket:
permissions = [
"cloudsql.backupRuns.get",
"cloudsql.backupRuns.list",
"cloudsql.instances.create",
"cloudsql.instances.delete",
"cloudsql.instances.export",
"cloudsql.instances.get",
"cloudsql.instances.restoreBackup",
"storage.buckets.get",
"storage.buckets.getIamPolicy",
"storage.buckets.setIamPolicy",
"storage.buckets.update",
"storage.objects.get"
]
The service account must have OWNER-role access to the TARGET_BACKUP_BUCKET
.
GCP creates a temporary service account during the export process, which needs
to be able to write to your GCS bucket. The service account you provide needs
enough permission to elevate the temporary service account's permissions on the
bucket (they get revoked again after being used).
The least privileged role for the script to succeed, as detailed here, is permissive. The "cloudsql.instances.delete"
permission alone will allow anyone with the key, for the GCP service account you use, to delete any Cloud SQL instance in its GCP project.
The final step in cloud_sql_backup.sh
is to delete the ephemeral db instance that's been used to create the SQL dump. There's a hard-coded check at this point in the script to only perform the instance deletion when the instance name contains the string "backup"
.
To prevent a race condition, the script creates a name, to be used for both the ephemeral instance and the S3 SQL dump object, that's suffixed by a 5 character random string:
name = <instance_name_prefix>-<instance_env>-<timestamp>-<gcp_managed_backup_id>-<random_string>
timestamp
is obtained by executingdate +%Y%m%d%H%M%S
instance_name_prefix
andinstance_env
are obtained from env varsgcp_managed_backup_id
is the ID of the latest GCP managed backuprandom_string
is the value ofLC_ALL=C tr -dc 'a-z0-9' </dev/urandom | head -c 5
The ID of the latest successful GCP managed backup is obtained using:
BACKUP_DATA=$(gcloud sql backups list \
--instance "$SOURCE_BACKUP_INSTANCE" \
--filter STATUS=SUCCESSFUL \
--limit 1 | sed 1,1d | tr -s ' ')
BACKUP_ID=$(echo "$BACKUP_DATA" | cut -d ' ' -f 1)
Whilst it's recommended to monitor for failed/successful cloud_sql_backup.sh
script executions, this can't be relied upon to report freshness of data. For example, your GCP managed backups may start failing, but the cloud_sql_backup.sh
script will keep succeeding, but with out-of-date data (based on the last successful GCP managed backup).
The penultimate task of the cloud_sql_backup.sh
script is to poll GCS using gsutil
to verify the object (SQL dump) has arrived in GCS as expected. This has to be performed out-of-band of the SQL dump process, as the dump is an operation that's triggered on the ephemeral db instance (using gcloud sql export sql
).
Contributions are very welcome. Please branch or fork, and submit a PR.
PRs from branches will result in an e2e test being run in CircleCI, which can sometimes take around 15mins. Commits from forked PRs should be made onto a branch in this repo, and another PR opened so the e2e test can run.