Operations Runbook

This runbook is the primary operational reference for CBDDC monitoring, incident triage, escalation, and recovery.

Ownership and Escalation

  • Service owner: CBDDC Core Maintainers.
  • First response: local platform/application on-call team for the affected deployment.
  • Product issue escalation: open an incident issue in the CBDDC repository with logs and health payload.

Alert Triage

  1. Identify severity based on impact:
    • Sev 1: Data integrity risk, sustained outage, or broad replication failure.
    • Sev 2: Partial sync degradation or prolonged peer lag.
    • Sev 3: Isolated node issue with workaround.
  2. Confirm the current cbddc health check status and payload (a severity triage sketch follows this list).
  3. Identify the affected peers and collections, and note when the issue was first observed.
  4. Apply the relevant recovery play below.
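
The severity bands above can often be rough-cut from the health payload fields listed under Core Diagnostics. A minimal sketch, assuming the payload shape described in this runbook; the thresholds and the HealthStatus record are illustrative assumptions, not CBDDC defaults:

```csharp
// Minimal severity triage sketch; thresholds are illustrative assumptions,
// not CBDDC defaults. Field names follow the health payload described
// under Core Diagnostics.
using System;

var sample = new HealthStatus(
    TrackedPeerCount: 4,
    LaggingPeers: new[] { "peer-b" },
    PeersWithNoConfirmation: Array.Empty<string>(),
    MaxLagMs: 42_000);

Console.WriteLine($"Suggested severity: Sev {Classify(sample)}");

static int Classify(HealthStatus h)
{
    // Sev 1: broad replication failure (most peers unconfirmed or lagging).
    if (h.TrackedPeerCount > 0 &&
        (h.PeersWithNoConfirmation.Length == h.TrackedPeerCount ||
         h.LaggingPeers.Length > h.TrackedPeerCount / 2))
        return 1;

    // Sev 2: partial sync degradation or prolonged lag (assumed 5-minute bound).
    if (h.LaggingPeers.Length > 0 || h.MaxLagMs > 300_000)
        return 2;

    // Sev 3: isolated or inconclusive; investigate the affected node directly.
    return 3;
}

record HealthStatus(
    int TrackedPeerCount,
    string[] LaggingPeers,
    string[] PeersWithNoConfirmation,
    long MaxLagMs);
```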

Core Diagnostics

Capture these artifacts before remediation:

  • Health response payload (trackedPeerCount, laggingPeers, peersWithNoConfirmation, maxLagMs); a capture sketch follows this list.
  • Application logs for sync, persistence, and network components.
  • Current runtime configuration (excluding secrets).
  • Most recent deployment identifier and change window.
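
A minimal sketch for capturing the health payload as an incident artifact, assuming the node exposes an HTTP health endpoint; the base address and route below are placeholders for the deployment's actual values:

```csharp
// Capture the raw health payload for the incident record. The base address
// and route are placeholders; substitute the deployment's health endpoint.
using System;
using System.IO;
using System.Net.Http;

using var http = new HttpClient { BaseAddress = new Uri("https://cbddc-node.example.internal/") };

string payload = await http.GetStringAsync("health/cbddc"); // hypothetical route

// Store the payload with a timestamp so it can be attached to the incident issue.
string artifact = $"cbddc-health-{DateTime.UtcNow:yyyyMMddTHHmmssZ}.json";
await File.WriteAllTextAsync(artifact, payload);
Console.WriteLine($"Saved health payload to {artifact}");
```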

Multi-Dataset Gates

Before enabling telemetry datasets in production:

  1. Enable the primary dataset only and record the baseline primary sync lag.
  2. Enable the logs dataset; confirm primary lag remains within the SLO.
  3. Enable the timeseries dataset; confirm primary lag remains within the SLO.
  4. If the primary SLO regresses, disable the telemetry datasets before attempting a broader rollback (see the gating sketch after this list).
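
The gating sequence can be expressed as a checklist loop. In this sketch, EnableDataset and ReadPrimaryLagMs are hypothetical stand-ins for however a deployment toggles datasets and reads maxLagMs, and the SLO value is an assumption:

```csharp
// Gate sequence as a checklist loop. EnableDataset and ReadPrimaryLagMs are
// hypothetical placeholders for the deployment's dataset toggle and health
// query; the SLO value is an assumption.
using System;

const long PrimaryLagSloMs = 5_000; // assumed primary sync lag SLO

long baseline = ReadPrimaryLagMs();
Console.WriteLine($"Baseline primary lag: {baseline} ms");

foreach (string dataset in new[] { "logs", "timeseries" })
{
    EnableDataset(dataset);
    long lag = ReadPrimaryLagMs();
    Console.WriteLine($"Primary lag after enabling {dataset}: {lag} ms");

    if (lag > PrimaryLagSloMs)
    {
        // Primary SLO regressed: back out telemetry datasets before a broader rollback.
        Console.WriteLine($"SLO regression; disable {dataset} and recheck before any further rollout.");
        break;
    }
}

// Placeholder hooks; replace with the deployment's real toggle and health query.
static void EnableDataset(string name) => Console.WriteLine($"(placeholder) enabling dataset '{name}'");
static long ReadPrimaryLagMs() => 1_200;
```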

Recovery Plays

Peer unreachable or lagging

  1. Verify the network path and auth token consistency (a reachability probe sketch follows this list).
  2. Validate that the peer is still expected in the topology.
  3. If the peer has been retired, follow the Peer Deprecation and Removal Runbook.
  4. Recheck the health status after remediation.
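
A minimal reachability sketch for step 1; the peer host and port are placeholders, and this verifies only the TCP path, not token consistency:

```csharp
// Quick TCP reachability probe toward a peer's sync endpoint. Host and port
// are placeholders; this checks the network path only, not auth tokens.
using System;
using System.Net.Sockets;
using System.Threading;

string peerHost = "peer-b.example.internal"; // placeholder peer address
int peerPort = 5001;                         // placeholder sync port

using var client = new TcpClient();
using var timeout = new CancellationTokenSource(TimeSpan.FromSeconds(5));

try
{
    await client.ConnectAsync(peerHost, peerPort, timeout.Token);
    Console.WriteLine($"TCP path to {peerHost}:{peerPort} is open; check auth token consistency next.");
}
catch (Exception ex) when (ex is OperationCanceledException or SocketException)
{
    Console.WriteLine($"Cannot reach {peerHost}:{peerPort}: {ex.Message}");
}
```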

Persistence failure

  1. Verify the storage path exists and has the expected permissions (see the write-probe sketch after this list).
  2. Run integrity checks against the persisted data.
  3. Restore from the latest valid backup if corruption is confirmed.
  4. Validate replication behavior after restore.
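
A minimal write-probe sketch for step 1; the storage path is a placeholder for the configured CBDDC persistence directory:

```csharp
// Storage path and permission probe. The path is a placeholder for the
// configured CBDDC persistence directory.
using System;
using System.IO;

string storagePath = "/var/lib/cbddc/data"; // placeholder; use the configured path

if (!Directory.Exists(storagePath))
{
    Console.WriteLine($"Storage path missing: {storagePath}");
}
else
{
    // Write probe: confirms the service account can create and delete files here.
    string probe = Path.Combine(storagePath, $".probe-{Guid.NewGuid():N}");
    try
    {
        File.WriteAllText(probe, "probe");
        File.Delete(probe);
        Console.WriteLine("Storage path exists and is writable; continue with integrity checks.");
    }
    catch (Exception ex) when (ex is IOException or UnauthorizedAccessException)
    {
        Console.WriteLine($"Storage path is not writable: {ex.Message}");
    }
}
```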

Configuration drift

  1. Compare the deployed configuration to the approved baseline (a diff sketch follows this list).
  2. Reapply canonical settings.
  3. Restart affected service safely.
  4. Verify recovery with health and smoke checks.
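
A minimal drift-check sketch for step 1; the file names are placeholders, and only top-level JSON settings are compared:

```csharp
// Flag top-level differences between the approved baseline and the deployed
// configuration. File names are placeholders, and only flat JSON properties
// are compared; nested sections would need a recursive walk.
using System;
using System.IO;
using System.Text.Json;

using var baseline = JsonDocument.Parse(File.ReadAllText("baseline.appsettings.json"));
using var deployed = JsonDocument.Parse(File.ReadAllText("deployed.appsettings.json"));

foreach (var expected in baseline.RootElement.EnumerateObject())
{
    if (!deployed.RootElement.TryGetProperty(expected.Name, out var actual))
        Console.WriteLine($"Missing in deployed config: {expected.Name}");
    else if (actual.GetRawText() != expected.Value.GetRawText())
        Console.WriteLine($"Drift in {expected.Name}: baseline={expected.Value}, deployed={actual}");
}

foreach (var extra in deployed.RootElement.EnumerateObject())
{
    if (!baseline.RootElement.TryGetProperty(extra.Name, out _))
        Console.WriteLine($"Unexpected setting in deployed config: {extra.Name}");
}
```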

Post-Incident Actions

  1. Record root cause and timeline.
  2. Add follow-up work items (tests, alerts, docs updates).
  3. Update affected feature docs and troubleshooting guidance.
  4. Confirm rollback and recovery instructions remain accurate.