# Operations Runbook
This runbook is the primary operational reference for CBDDC monitoring, incident triage, escalation, and recovery.
## Ownership and Escalation
- Service owner: CBDDC Core Maintainers.
- First response: local platform/application on-call team for the affected deployment.
- Product issue escalation: open an incident issue in the CBDDC repository, attaching logs and the health payload.
## Alert Triage
1. Identify severity based on impact (see the classification sketch after this list):
   - Sev 1: Data integrity risk, sustained outage, or broad replication failure.
   - Sev 2: Partial sync degradation or prolonged peer lag.
   - Sev 3: Isolated node issue with a workaround.
2. Confirm current `cbddc` health check status and payload.
3. Identify the affected peers and collections, and when the issue was first observed.
4. Apply the relevant recovery play below.
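A minimal classification sketch, assuming the health payload fields captured under Core Diagnostics; the `HealthPayload` shape mirrors those fields, and every threshold is an illustrative assumption rather than a CBDDC default:
```typescript
// Illustrative severity mapping. Field names come from the cbddc health
// payload; thresholds are assumed example values, not product defaults.
interface HealthPayload {
  trackedPeerCount: number;
  laggingPeers: string[];
  peersWithNoConfirmation: string[];
  maxLagMs: number;
}

type Severity = "Sev 1" | "Sev 2" | "Sev 3";

function classifySeverity(h: HealthPayload): Severity {
  const unconfirmedRatio =
    h.trackedPeerCount > 0 ? h.peersWithNoConfirmation.length / h.trackedPeerCount : 0;
  // Broad replication failure: most tracked peers have never confirmed.
  if (unconfirmedRatio > 0.5) return "Sev 1";
  // Partial sync degradation or prolonged peer lag (assumed 5 minute bound).
  if (h.laggingPeers.length > 1 || h.maxLagMs > 5 * 60_000) return "Sev 2";
  // Isolated node issue with a workaround.
  return "Sev 3";
}
```
Data integrity risk and sustained outage still require operator judgment; the sketch only covers what the payload itself can show.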
## Core Diagnostics
Capture these artifacts before remediation (one possible bundle shape follows this list):
- Health response payload (`trackedPeerCount`, `laggingPeers`, `peersWithNoConfirmation`, `maxLagMs`).
- Application logs for sync, persistence, and network components.
- Current runtime configuration (excluding secrets).
- Most recent deployment identifier and change window.
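A sketch of how these artifacts could be bundled for an incident issue; only the health fields come from the payload above, and the surrounding structure and names are assumptions, not a CBDDC schema:
```typescript
// Assumed shape for a pre-remediation diagnostics bundle; only the health
// fields are taken from the cbddc health payload, the rest is illustrative.
interface DiagnosticsBundle {
  capturedAt: string; // ISO 8601 timestamp of capture
  health: {
    trackedPeerCount: number;
    laggingPeers: string[];
    peersWithNoConfirmation: string[];
    maxLagMs: number;
  };
  logs: {
    sync: string[];        // sync component log excerpts
    persistence: string[]; // persistence component log excerpts
    network: string[];     // network component log excerpts
  };
  runtimeConfig: Record<string, unknown>; // secrets redacted before capture
  deploymentId: string;                   // most recent deployment identifier
  changeWindow: string;                   // e.g. an ISO 8601 interval
}
```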
## Recovery Plays
### Peer unreachable or lagging
1. Verify the network path and auth token consistency.
2. Validate that the peer is still expected in the topology (see the sketch after this list).
3. If peer is retired, follow [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md).
4. Recheck health status after remediation.
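Steps 2 and 3 reduce to one decision, sketched below; `expectedPeers` stands in for an operator-maintained topology list and is an assumption, not a CBDDC API:
```typescript
// Hedged sketch for steps 2-3. `expectedPeers` is an assumed operator-
// maintained list of peer ids; problem peer ids come from the health
// payload's laggingPeers / peersWithNoConfirmation fields.
function nextActionForPeer(
  peerId: string,
  expectedPeers: Set<string>,
): "check-connectivity-and-auth" | "follow-peer-deprecation-runbook" {
  return expectedPeers.has(peerId)
    ? "check-connectivity-and-auth"       // still expected: step 1, then recheck health
    : "follow-peer-deprecation-runbook";  // retired: remove it per the linked runbook
}
```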
### Persistence failure
1. Verify the storage path and permissions (see the check sketched after this list).
2. Run integrity checks against the persisted data.
3. Restore from latest valid backup if corruption is confirmed.
4. Validate replication behavior after restore.
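A minimal Node-based check for step 1; the directory argument comes from wherever the deployment configures storage, and the environment variable in the usage comment is a hypothetical name, not a documented CBDDC setting:
```typescript
// Hedged sketch for step 1: confirm the storage path is a directory the
// service user can read and write.
import { accessSync, constants, statSync } from "node:fs";

function checkStoragePath(storagePath: string): string[] {
  const problems: string[] = [];
  try {
    if (!statSync(storagePath).isDirectory()) {
      problems.push(`${storagePath} exists but is not a directory`);
    }
    accessSync(storagePath, constants.R_OK | constants.W_OK);
  } catch (err) {
    problems.push(`cannot access ${storagePath}: ${(err as Error).message}`);
  }
  return problems;
}

// Usage (hypothetical env var): checkStoragePath(process.env.CBDDC_DATA_DIR ?? "./data");
```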
### Configuration drift
1. Compare the deployed config to the approved baseline (see the diff sketch after this list).
2. Reapply the canonical settings.
3. Restart the affected service safely.
4. Verify recovery with health and smoke checks.
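A hedged sketch for step 1, assuming both configs are available as flat key/value objects with secrets already excluded; it is illustrative, not a CBDDC tool:
```typescript
// Hedged sketch for step 1: list keys whose deployed values differ from the
// approved baseline. Assumes flat key/value configs with secrets excluded.
function diffConfig(
  baseline: Record<string, unknown>,
  deployed: Record<string, unknown>,
): string[] {
  const drift: string[] = [];
  for (const key of new Set([...Object.keys(baseline), ...Object.keys(deployed)])) {
    const expected = JSON.stringify(baseline[key]) ?? "unset";
    const actual = JSON.stringify(deployed[key]) ?? "unset";
    if (expected !== actual) {
      drift.push(`${key}: baseline ${expected}, deployed ${actual}`);
    }
  }
  return drift;
}
```
Any non-empty result should be reconciled with the baseline before the restart in step 3.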
## Post-Incident Actions
1. Record root cause and timeline.
2. Add follow-up work items (tests, alerts, docs updates).
3. Update affected feature docs and troubleshooting guidance.
4. Confirm rollback and recovery instructions remain accurate.