Operations Runbook
This runbook is the primary operational reference for CBDDC monitoring, incident triage, escalation, and recovery.
Ownership and Escalation
- Service owner: CBDDC Core Maintainers.
- First response: local platform/application on-call team for the affected deployment.
- Product issue escalation: open an incident issue in the CBDDC repository with logs and health payload.
Alert Triage
- Identify severity based on impact:
  - Sev 1: Data integrity risk, sustained outage, or broad replication failure.
  - Sev 2: Partial sync degradation or prolonged peer lag.
  - Sev 3: Isolated node issue with workaround.
- Confirm current cbddc health check status and payload.
- Identify affected peers, collections, and first observed time.
- Apply the relevant recovery play below.
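The severity levels above can be encoded so that alerts are classified consistently. The following is a minimal sketch, not part of CBDDC: the `Impact` class, its field names, and `classify_severity` are hypothetical and only mirror the wording of the severity list.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    """Observed impact flags. Field names are hypothetical, taken from the severity list."""
    data_integrity_risk: bool = False
    sustained_outage: bool = False
    broad_replication_failure: bool = False
    partial_sync_degradation: bool = False
    prolonged_peer_lag: bool = False


def classify_severity(impact: Impact) -> int:
    """Map impact flags to a severity level (1 is most severe)."""
    if (impact.data_integrity_risk or impact.sustained_outage
            or impact.broad_replication_failure):
        return 1
    if impact.partial_sync_degradation or impact.prolonged_peer_lag:
        return 2
    # Anything else is treated as an isolated node issue with a workaround.
    return 3
```

When several flags are set, the most severe level wins, so an outage combined with peer lag is still Sev 1.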
Core Diagnostics
Capture these artifacts before remediation:
- Health response payload (trackedPeerCount, laggingPeers, peersWithNoConfirmation, maxLagMs).
- Application logs for sync, persistence, and network components.
- Current runtime configuration (excluding secrets).
- Most recent deployment identifier and change window.
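When attaching the health payload to an incident issue, it can help to pull out the triage fields first. The sketch below assumes the payload is a JSON object with the four field names listed above as top-level keys; the JSON shape, the field types, and the `summarize_health` helper are assumptions, not a documented CBDDC API.

```python
import json
from typing import Any

# Field names come from this runbook; their JSON types are assumptions.
HEALTH_FIELDS = ("trackedPeerCount", "laggingPeers", "peersWithNoConfirmation", "maxLagMs")


def summarize_health(raw: str) -> dict[str, Any]:
    """Extract the triage fields from a health payload and note any that are absent."""
    payload = json.loads(raw)
    summary = {name: payload.get(name) for name in HEALTH_FIELDS}
    summary["missingFields"] = [name for name in HEALTH_FIELDS if name not in payload]
    return summary


if __name__ == "__main__":
    # Illustrative sample only; real payloads may differ.
    sample = '{"trackedPeerCount": 3, "laggingPeers": ["peer-b"], "maxLagMs": 4200}'
    print(json.dumps(summarize_health(sample), indent=2))
```

Listing missing fields explicitly makes it obvious when a payload was truncated or came from a node that reports a different shape.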
Recovery Plays
Peer unreachable or lagging
- Verify network path and auth token consistency.
- Validate that the peer is still expected in the topology.
- If the peer is retired, follow the Peer Deprecation and Removal Runbook.
- Recheck health status after remediation.
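For the network path step, a basic TCP reachability probe is often enough to separate a network problem from an application problem. This is a generic sketch that assumes the peer listens on a TCP port; `peer_reachable` is a hypothetical helper, and it does not validate auth tokens or speak the CBDDC sync protocol.

```python
import socket


def peer_reachable(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to the peer can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

A successful connection only shows that the port is open; token mismatches still have to be checked in the application logs.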
Persistence failure
- Verify storage path and permissions.
- Run integrity checks.
- Restore from latest valid backup if corruption is confirmed.
- Validate replication behavior after restore.
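The storage path and permission check can be scripted so that it is run the same way on every node. The sketch below assumes the storage location is a directory and that the script runs as the same user as the service; `check_storage_path` is a hypothetical helper and does not perform the integrity checks themselves.

```python
import os
from pathlib import Path


def check_storage_path(path: str) -> list[str]:
    """Return a list of problems found with the storage directory (empty if none)."""
    p = Path(path)
    if not p.exists():
        return [f"{path} does not exist"]
    if not p.is_dir():
        return [f"{path} is not a directory"]
    problems = []
    # os.access checks permissions for the user running this process.
    if not os.access(p, os.R_OK):
        problems.append(f"{path} is not readable")
    if not os.access(p, os.W_OK):
        problems.append(f"{path} is not writable")
    if not os.access(p, os.X_OK):
        problems.append(f"{path} is not traversable")
    return problems
```

An empty result means the path and permissions look sound, which points the investigation toward corruption or disk capacity instead.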
Configuration drift
- Compare deployed config to approved baseline.
- Reapply canonical settings.
- Restart affected service safely.
- Verify recovery with health and smoke checks.
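Comparing the deployed config to the baseline is easier to review as an explicit list of differing keys. The sketch below assumes both configs have already been loaded into flat dictionaries with string keys and with secrets removed; `config_drift` and the `ABSENT` marker are hypothetical, and nested settings would need to be flattened first.

```python
from typing import Any

ABSENT = "<absent>"  # marker for a key that is missing on one side


def config_drift(baseline: dict[str, Any], deployed: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (baseline_value, deployed_value)} for every key that differs."""
    drift = {}
    for key in sorted(baseline.keys() | deployed.keys()):
        expected = baseline.get(key, ABSENT)
        actual = deployed.get(key, ABSENT)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift
```

An empty result indicates no drift for the keys compared; any entry shows the approved value first and the deployed value second.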
Post-Incident Actions
- Record root cause and timeline.
- Add follow-up work items (tests, alerts, docs updates).
- Update affected feature docs and troubleshooting guidance.
- Confirm rollback and recovery instructions remain accurate.