Operations Runbook
This runbook is the primary operational reference for CBDDC monitoring, incident triage, escalation, and recovery.
Ownership and Escalation
- Service owner: CBDDC Core Maintainers.
- First response: local platform/application on-call team for the affected deployment.
- Product issue escalation: open an incident issue in the CBDDC repository with logs and health payload.
Alert Triage
- Identify severity based on impact:
  - Sev 1: Data integrity risk, sustained outage, or broad replication failure.
  - Sev 2: Partial sync degradation or prolonged peer lag.
  - Sev 3: Isolated node issue with a workaround.
- Confirm the current `cbddc` health check status and payload.
- Identify affected peers, collections, and the first observed time.
- Apply the relevant recovery play below.
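The triage ladder above can be sketched as a small helper. The payload fields match the health response listed under Core Diagnostics; the thresholds (half of tracked peers unconfirmed, more than one lagging peer, 60 s of lag) are illustrative defaults, not CBDDC-defined values.

```python
# Illustrative triage helper: maps a CBDDC health payload to the
# severity ladder above. Thresholds are assumptions, tune per SLO.

def classify_severity(payload: dict) -> int:
    """Return 1, 2, or 3 per the triage ladder."""
    tracked = payload.get("trackedPeerCount", 0)
    lagging = payload.get("laggingPeers", [])
    no_confirmation = payload.get("peersWithNoConfirmation", [])

    # Sev 1: broad replication failure -- most peers unconfirmed.
    if tracked and len(no_confirmation) >= tracked / 2:
        return 1
    # Sev 2: partial degradation -- several lagging peers or large lag.
    if len(lagging) > 1 or payload.get("maxLagMs", 0) > 60_000:
        return 2
    # Sev 3: isolated node issue.
    return 3

print(classify_severity({"trackedPeerCount": 4,
                         "laggingPeers": ["peer-a"],
                         "peersWithNoConfirmation": [],
                         "maxLagMs": 1200}))  # isolated issue -> 3
```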
Core Diagnostics
Capture these artifacts before remediation:
- Health response payload (`trackedPeerCount`, `laggingPeers`, `peersWithNoConfirmation`, `maxLagMs`).
- Application logs for sync, persistence, and network components.
- Current runtime configuration (excluding secrets).
- Most recent deployment identifier and change window.
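One way to keep these artifacts together is a single snapshot taken before remediation. The sketch below assumes the health payload and runtime config are already available as dicts; the secret-key names and bundle layout are illustrative, not part of CBDDC.

```python
# Sketch of a pre-remediation diagnostics bundle. SECRET_KEYS lists
# assumed secret field names; adjust to your deployment's config.
import time

SECRET_KEYS = {"authToken", "password", "privateKey"}  # assumed names

def build_bundle(health: dict, config: dict, deploy_id: str) -> dict:
    """Combine health payload, redacted config, and deploy id."""
    redacted = {k: ("<redacted>" if k in SECRET_KEYS else v)
                for k, v in config.items()}
    return {
        "capturedAt": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "health": health,
        "config": redacted,           # secrets excluded, per the list above
        "deploymentId": deploy_id,
    }
```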
Multi-Dataset Gates
Before enabling telemetry datasets in production:
- Enable `primary` only and record the baseline primary sync lag.
- Enable `logs`; confirm primary lag remains within SLO.
- Enable `timeseries`; confirm primary lag remains within SLO.
- If the primary SLO regresses, disable the telemetry datasets before attempting a broader rollback.
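The gate above amounts to a one-dataset-at-a-time rollout with a lag check after each step. In this sketch, `get_primary_lag_ms` and `set_dataset_enabled` are stand-ins for your deployment's actual controls; only the dataset names come from the list above.

```python
# Gated telemetry rollout: enable one dataset at a time, and if the
# primary SLO regresses, disable telemetry datasets first (newest
# first) rather than doing a broader rollback.
ROLLOUT_ORDER = ["logs", "timeseries"]

def gated_rollout(get_primary_lag_ms, set_dataset_enabled, slo_ms):
    """Return the list of telemetry datasets that stayed enabled."""
    enabled = []
    for dataset in ROLLOUT_ORDER:
        set_dataset_enabled(dataset, True)
        if get_primary_lag_ms() > slo_ms:
            # SLO regressed: back out telemetry datasets, newest first.
            for d in reversed(enabled + [dataset]):
                set_dataset_enabled(d, False)
            return []
        enabled.append(dataset)
    return enabled
```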
Recovery Plays
Peer unreachable or lagging
- Verify network path and auth token consistency.
- Validate peer is still expected in topology.
- If peer is retired, follow Peer Deprecation and Removal Runbook.
- Recheck health status after remediation.
Persistence failure
- Verify storage path and permissions.
- Run integrity checks.
- Restore from latest valid backup if corruption is confirmed.
- Validate replication behavior after restore.
Configuration drift
- Compare deployed config to approved baseline.
- Reapply canonical settings.
- Restart affected service safely.
- Verify recovery with health and smoke checks.
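Comparing the deployed config to the approved baseline can be as simple as a key-by-key diff. The helper below is a minimal sketch assuming both configs are flat dicts; CBDDC does not ship this function.

```python
# Minimal drift check for the "Configuration drift" play above:
# report every key whose deployed value differs from the baseline.
def config_drift(deployed: dict, baseline: dict) -> dict:
    """Map each drifted key to (deployed_value, baseline_value)."""
    keys = set(deployed) | set(baseline)
    return {k: (deployed.get(k), baseline.get(k))
            for k in keys
            if deployed.get(k) != baseline.get(k)}
```

An empty result means the deployed config matches the baseline and drift can be ruled out before restarting anything.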
Post-Incident Actions
- Record root cause and timeline.
- Add follow-up work items (tests, alerts, docs updates).
- Update affected feature docs and troubleshooting guidance.
- Confirm rollback and recovery instructions remain accurate.