[Infrastructure] Add Cloud Monitoring
abmarcum opened this issue · 4 comments
Add GCP Cloud Monitoring to the project to alert on service availability.

Use Terraform to create the following:

- Dashboard
- Uptime checks for Endpoints
- Service availability alerting for GKE, Redis, Spanner, and Endpoints
- An enable (true/false) flag so that monitoring is optional
- A variable for the alert notification email address, surfaced in terraform.tfvars.sample

Additional monitoring checks will be added as they are suggested. A rough sketch of the variable and uptime-check wiring is below.
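A minimal sketch of how the optional flag, notification email variable, and an Endpoints uptime check could be wired together. The variable names (`enable_monitoring`, `alert_notification_email`, `project_id`), resource names, host, and health-check path are all placeholders/assumptions, not decisions:

```hcl
# variables.tf - assumed variable names; surface both in terraform.tfvars.sample
variable "enable_monitoring" {
  description = "Enable Cloud Monitoring resources (dashboard, uptime checks, alerts)."
  type        = bool
  default     = true
}

variable "alert_notification_email" {
  description = "Email address that receives monitoring alerts."
  type        = string
}

# monitoring.tf - resources are only created when enable_monitoring is true
resource "google_monitoring_notification_channel" "email" {
  count        = var.enable_monitoring ? 1 : 0
  display_name = "Alert email"
  type         = "email"
  labels = {
    email_address = var.alert_notification_email
  }
}

resource "google_monitoring_uptime_check_config" "endpoints" {
  count        = var.enable_monitoring ? 1 : 0
  display_name = "endpoints-uptime"
  timeout      = "10s"
  period       = "60s"

  http_check {
    path    = "/healthz" # placeholder path
    port    = 443
    use_ssl = true
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = var.project_id    # assumes an existing project_id variable
      host       = "api.example.com" # placeholder Endpoints host
    }
  }
}
```

Putting `count` on each monitoring resource keeps everything behind the single flag, and terraform.tfvars.sample only needs the two new variables.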
Fun question: Cloud Monitoring or managed Prometheus, or both???
Cloud Monitoring can handle all GCP resources and most of the standard GKE metrics are available in Cloud Monitoring.
But for game-specific/GKE workloads, Prometheus might be a better choice.
The question then becomes: do you want to manage both?
I would suggest a two-phase approach: get critical systems into Cloud Monitoring so that core systems alert on any issues. This is straightforward, and all we need to determine is what we alert on. Then, as game monitoring requirements arise, we look at whether they can work in Cloud Monitoring or whether Prometheus is a better approach.
My 2 cents.
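For phase one, a hedged example of the kind of alert policy that might cover a core managed service - here Cloud Spanner CPU utilization, reusing the notification channel from the sketch above. The metric filter, threshold, and duration are illustrative assumptions, not recommendations:

```hcl
# Illustrative alert policy: fire when Spanner instance CPU stays above 80%
# for 5 minutes. Metric, threshold, and duration are assumptions for the sketch.
resource "google_monitoring_alert_policy" "spanner_cpu" {
  count        = var.enable_monitoring ? 1 : 0
  display_name = "Spanner high CPU utilization"
  combiner     = "OR"

  conditions {
    display_name = "Spanner CPU > 80%"

    condition_threshold {
      filter          = "resource.type = \"spanner_instance\" AND metric.type = \"spanner.googleapis.com/instance/cpu/utilization\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0.8
      duration        = "300s"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }

  # Reuses the email channel sketched earlier in the thread.
  notification_channels = google_monitoring_notification_channel.email[*].name
}
```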
> The question then becomes: do you want to manage both?

My thought was more that some would like Cloud Monitoring and some would like managed Prometheus. I've seen both in the wild.
I'm not sure how helpful this is for you, but the gcloud command shown below is a 'fully loaded' one that I often use. It has absolutely all the bells and whistles turned on for GKE in the monitoring, logging, resource monitoring (aka 'cost monitoring'), and notifications areas, including turning on monitoring for the Google-managed k8s control plane components.
The names of the gcloud feature flags (and their corresponding values) are essentially a 1:1 mapping to the key/value pairs that the GKE Terraform module uses, so hopefully this helps save some time stubbing something out here.
```shell
gcloud beta container --project ${PROJECT_ID} clusters create ${CLUSTER_NAME} \
  --region ${REGION} \
  --no-enable-basic-auth \
  --release-channel "rapid" \
  --machine-type "e2-highcpu-4" \
  --image-type "COS_CONTAINERD" \
  --disk-type "pd-standard" \
  --disk-size "100" \
  --metadata disable-legacy-endpoints=true \
  --num-nodes "2" \
  --enable-autoscaling --min-nodes "0" --max-nodes "3" \
  --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
  --max-pods-per-node "110" \
  --enable-private-nodes \
  --master-ipv4-cidr "172.16.0.0/28" \
  --enable-ip-alias \
  --network "projects/${PROJECT_ID}/global/networks/config-admin-vpc" \
  --subnetwork "projects/${PROJECT_ID}/regions/${REGION}/subnetworks/config-admin-vpc" \
  --cluster-ipv4-cidr "192.168.0.0/16" \
  --services-ipv4-cidr "192.169.0.0/16" \
  --no-enable-intra-node-visibility \
  --default-max-pods-per-node "110" \
  --enable-dataplane-v2 \
  --enable-master-authorized-networks --master-authorized-networks 0.0.0.0/0 \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver,ConfigConnector,BackupRestore \
  --enable-autoupgrade --enable-autorepair \
  --max-surge-upgrade 1 --max-unavailable-upgrade 0 \
  --labels mesh_id=proj-${PROJECT_NUMBER} \
  --resource-usage-bigquery-dataset ${BQ_DATASET_NAME} \
  --enable-resource-consumption-metering \
  --workload-pool "${PROJECT_ID}.svc.id.goog" \
  --enable-shielded-nodes \
  --security-group "gke-security-groups@${GOOGLE_ADMIN_DOMAIN}" \
  --notification-config=pubsub=ENABLED,pubsub-topic=projects/${PROJECT_ID}/topics/${ARGOCD_PUBSUB_TOPIC} \
  --enable-image-streaming \
  --logging=SYSTEM,WORKLOAD \
  --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER \
  --enable-managed-prometheus \
  --enable-workload-config-audit
```
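For comparison, a rough sketch of how the observability-related flags above could map onto the underlying `google_container_cluster` resource in Terraform (the GKE Terraform module exposes corresponding keys, per the comment above). Resource and variable names are assumptions, most cluster arguments are omitted, and attribute names should be checked against the provider version in use:

```hcl
# Observability-related settings only; networking, node pools, and the rest
# of the cluster arguments are omitted for brevity.
resource "google_container_cluster" "game_services" {
  name               = var.cluster_name # assumed variable
  location           = var.region       # assumed variable
  initial_node_count = 2

  # Mirrors --logging=SYSTEM,WORKLOAD
  logging_config {
    enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
  }

  # Mirrors --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER
  # and --enable-managed-prometheus
  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS", "APISERVER", "SCHEDULER", "CONTROLLER_MANAGER"]

    managed_prometheus {
      enabled = true
    }
  }

  # Mirrors --resource-usage-bigquery-dataset / --enable-resource-consumption-metering
  resource_usage_export_config {
    enable_resource_consumption_metering = true

    bigquery_destination {
      dataset_id = var.bq_dataset_name # assumed variable
    }
  }
}
```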