# Operations Runbook

This runbook is the primary operational reference for CBDDC monitoring, incident triage, escalation, and recovery.

## Ownership and Escalation

- Service owner: CBDDC Core Maintainers.
- First response: local platform/application on-call team for the affected deployment.
- Product issue escalation: open an incident issue in the CBDDC repository with logs and health payload.

## Alert Triage

1. Identify severity based on impact:
   - Sev 1: Data integrity risk, sustained outage, or broad replication failure.
   - Sev 2: Partial sync degradation or prolonged peer lag.
   - Sev 3: Isolated node issue with workaround.
2. Confirm current `cbddc` health check status and payload.
3. Identify affected peers, collections, and first observed time.
4. Apply the relevant recovery play below.

## Core Diagnostics

Capture these artifacts before remediation:

- Health response payload (`trackedPeerCount`, `laggingPeers`, `peersWithNoConfirmation`, `maxLagMs`).
- Application logs for sync, persistence, and network components.
- Current runtime configuration (excluding secrets).
- Most recent deployment identifier and change window.

## Recovery Plays

### Peer unreachable or lagging

1. Verify network path and auth token consistency.
2. Validate peer is still expected in topology.
3. If peer is retired, follow [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md).
4. Recheck health status after remediation.

### Persistence failure

1. Verify storage path and permissions.
2. Run integrity checks.
3. Restore from latest valid backup if corruption is confirmed.
4. Validate replication behavior after restore.

### Configuration drift

1. Compare deployed config to approved baseline.
2. Reapply canonical settings.
3. Restart affected service safely.
4. Verify recovery with health and smoke checks.

## Post-Incident Actions

1. Record root cause and timeline.
2. Add follow-up work items (tests, alerts, docs updates).
3. Update affected feature docs and troubleshooting guidance.
4. Confirm rollback and recovery instructions remain accurate.
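
The artifact capture described under Core Diagnostics can be scripted so that the same bundle is attached to every incident issue. The following is a minimal sketch, not part of CBDDC: the function names, the `SECRET_MARKERS` list, the bundle layout, and the sample values are all hypothetical, and it assumes the health payload and runtime configuration have already been loaded as plain dictionaries. Only the four health field names come from this runbook.

```python
import json
from datetime import datetime, timezone

# Health fields named in Core Diagnostics.
HEALTH_FIELDS = (
    "trackedPeerCount",
    "laggingPeers",
    "peersWithNoConfirmation",
    "maxLagMs",
)

# Hypothetical substrings that mark a config key as secret.
# Matching is deliberately broad: over-redaction is safer than leaking.
SECRET_MARKERS = ("token", "secret", "password", "key")


def redact_secrets(config):
    """Return a copy of config with secret-looking values masked.

    Nested dictionaries are walked recursively; lists are copied as-is.
    """
    redacted = {}
    for name, value in config.items():
        if isinstance(value, dict):
            redacted[name] = redact_secrets(value)
        elif any(marker in name.lower() for marker in SECRET_MARKERS):
            redacted[name] = "<redacted>"
        else:
            redacted[name] = value
    return redacted


def build_incident_bundle(health_payload, config, deployment_id):
    """Collect pre-remediation artifacts into one JSON-serializable dict."""
    missing = [f for f in HEALTH_FIELDS if f not in health_payload]
    return {
        "capturedAtUtc": datetime.now(timezone.utc).isoformat(),
        "deploymentId": deployment_id,
        "health": {f: health_payload.get(f) for f in HEALTH_FIELDS},
        # Record absent fields explicitly so reviewers do not mistake
        # a missing value for a healthy one.
        "missingHealthFields": missing,
        "config": redact_secrets(config),
    }


if __name__ == "__main__":
    # Illustrative values only; real payload shapes may differ.
    bundle = build_incident_bundle(
        {"trackedPeerCount": 3, "laggingPeers": ["peer-b"], "maxLagMs": 1200},
        {"authToken": "abc", "storage": {"path": "/data"}},
        "deploy-42",
    )
    print(json.dumps(bundle, indent=2))
```

Application logs are left out of the sketch because their location depends on the deployment; attach them alongside the bundle when opening the incident issue.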