This document is a template for creating a custom CockroachDB runbook, also known as a CockroachDB operations manual.
A runbook is a reference document that describes a CockroachDB deployment in a specific application environment, together with the related tasks, checklists, and operational procedures.
This template provides an overall structure and implementation outlines for common CockroachDB operating procedures, which expedites the creation of a custom runbook. A runbook is an important deliverable of the overall IT system because it helps ensure the required state of operational preparedness.
Customers who already have a CockroachDB runbook can use this template to check their existing manual for completeness.
In practice, CockroachDB operators strive to automate most checks and procedures. This template, however, focuses on documenting the detailed checklists and steps that make up individual operational procedures. Automating these procedures is outside the scope of this document.
A CockroachDB Node is an instance of a cockroach server process. To underscore this point: a node is not a server (physical or virtual), an instance of an OS, or a container. Cockroach Labs strongly recommends running one CockroachDB node per OS instance or per container.
A CockroachDB Cluster is a set of connected CockroachDB Nodes that form a single system and work together on all tasks.
A Platform is a set of compatible hardware, virtualized or containerized hardware, and related structures on which CockroachDB can run. Examples of platforms are bare metal x86_64, AWS EC2, Google Cloud Platform, Microsoft Azure, VMware vSphere, Docker, and Kubernetes.
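The recommendation of one CockroachDB node per OS instance or container can be checked against a deployment inventory. The following is a minimal sketch, not part of the template itself: the function name `hosts_with_multiple_nodes` and the inventory format (node ID mapped to host or container name) are assumptions made for illustration.

```python
from collections import defaultdict


def hosts_with_multiple_nodes(node_to_host):
    """Return {host: [node_ids]} for every host that runs more than one node.

    node_to_host is a hypothetical inventory mapping a CockroachDB node ID
    to the name of the OS instance or container it runs on.
    """
    nodes_by_host = defaultdict(list)
    for node_id, host in node_to_host.items():
        nodes_by_host[host].append(node_id)
    # Keep only hosts that violate the one-node-per-host recommendation.
    return {
        host: sorted(ids)
        for host, ids in nodes_by_host.items()
        if len(ids) > 1
    }


if __name__ == "__main__":
    # Illustrative inventory only; the host names are made up.
    inventory = {1: "vm-a", 2: "vm-b", 3: "vm-b", 4: "vm-c"}
    print(hosts_with_multiple_nodes(inventory))
```

An empty result means that every node in the inventory has its own OS instance or container.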
- Service or System Overview
- Business Overview
- Technical Overview
- Hardware Platform
- Virtualization or Containerization
- VM Configuration
- Operating System
- Clock Management
- Network Design
- Data Volumes
- Growth Rate
- Planned Capacity
- Cluster Right-Sizing, Expansion Strategy
- Cluster Topology and Configuration
- Auto-Scaling
- Application Connection Management
- Application Transactions Management
- Upstream Dependent Systems
- Downstream Dependent Systems
- Ecosystem Tools
- Deployment and Configuration Management Tools
- Routine Maintenance Procedures
- Open / Close database "gates"
- Force closing application connections
- Node Start
- Node Shutdown (Stop)
- Add a Node
- Remove (Decommission) Node
- Cluster Region Migration
- Cluster Resizing
- Server / VM Replacement
- Backup / Restore
- Change --max-offset
- Snapshot Rebalancing Rate
- Change Cluster Settings
- CockroachDB Version Upgrade
- The Most Common Problems experienced by CockroachDB users
- Monitoring and Alerting
- Monitoring Tools
- Monitoring Metrics (metrics to watch, alert rules, corrective actions)
- Node CPU
- Node CPU Anomaly
- Node Memory
- Node Storage Capacity
- Node Storage Performance
- Node Liveness (inconsistent liveness check)
- Node LSM Storage Health
- Live Node Count Change
- High Query Latency
- Intent Buildup
- Changefeed Falling Behind
- Changefeed Frequent Restarts
- Changefeed Stopped
- Non-Incrementing Uptime Counter
- Version Mismatch
- CA Expiry
- Alert Response Procedures
- Diagnostic and Support
- Emergency Procedures / Operation Continuity
- Monitoring Alerts deployed in the Cockroach Cloud managed service (common, dedicated, host)
- Including the 6 alerts delivered to users of Cockroach Cloud Dedicated
- Available Monitoring Metrics
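
The Monitoring Metrics entries in the outline above each pair a metric to watch with an alert rule and a corrective action. One way a runbook author might record such an entry in a machine-checkable form is sketched below. This is an illustration under stated assumptions: the `AlertRule` class, the metric keys, and the thresholds are hypothetical placeholders, not CockroachDB metric names or recommended values.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    """One runbook entry: metric to watch, alert rule, corrective action."""
    name: str
    metric: str              # hypothetical metric key, not a real CockroachDB metric name
    threshold: float         # placeholder value; choose per deployment
    corrective_action: str

    def fires(self, metrics):
        # The rule fires when the metric is present and exceeds the threshold.
        value = metrics.get(self.metric)
        return value is not None and value > self.threshold


def triggered(rules, metrics):
    """Return the names of all rules that fire for the given metric sample."""
    return [rule.name for rule in rules if rule.fires(metrics)]


# Illustrative rules only; thresholds are assumptions, not recommendations.
RULES = [
    AlertRule("Node CPU", "cpu_utilization", 0.80,
              "Review workload; consider Cluster Resizing."),
    AlertRule("Node Storage Capacity", "storage_used_fraction", 0.60,
              "Add storage or Add a Node."),
]

if __name__ == "__main__":
    sample = {"cpu_utilization": 0.91, "storage_used_fraction": 0.42}
    for name in triggered(RULES, sample):
        print("ALERT:", name)
```

Keeping the corrective action next to the rule mirrors the structure of the outline, where each alert maps to a documented response procedure.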