docs: align internal docs to enterprise standards
All checks were successful
CI / verify (push) Successful in 2m33s

Add canonical operations/security/access/feature docs and fix path integrity to improve onboarding and incident readiness.
Joseph Doherty
2026-02-20 13:23:55 -05:00
parent e6d81f6350
commit ce727eb30d
18 changed files with 783 additions and 186 deletions

docs/runbook.md Normal file (+58 lines)

@@ -0,0 +1,58 @@
# Operations Runbook
This runbook is the primary operational reference for CBDDC monitoring, incident triage, escalation, and recovery.
## Ownership and Escalation
- Service owner: CBDDC Core Maintainers.
- First response: local platform/application on-call team for the affected deployment.
- Product issue escalation: open an incident issue in the CBDDC repository with logs and health payload.
## Alert Triage
1. Identify severity based on impact:
- Sev 1: Data integrity risk, sustained outage, or broad replication failure.
- Sev 2: Partial sync degradation or prolonged peer lag.
- Sev 3: Isolated node issue with workaround.
2. Confirm the current `cbddc` health check status and payload (a query sketch follows this list).
3. Identify affected peers, collections, and first observed time.
4. Apply the relevant recovery play below.
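
As one way to perform step 2, the sketch below queries a health endpoint and prints the payload fields referenced in Core Diagnostics. The endpoint URL, environment variable names, and bearer-token scheme are assumptions, not a documented CBDDC API; substitute whatever your deployment actually exposes.

```python
"""Minimal triage sketch: fetch the cbddc health payload and summarize it.

The endpoint URL, env var names, and auth scheme are assumptions; adjust
them to match your deployment.
"""
import json
import os
import urllib.request

HEALTH_URL = os.environ.get("CBDDC_HEALTH_URL", "http://localhost:8080/health")  # assumed endpoint
TOKEN = os.environ.get("CBDDC_AUTH_TOKEN", "")  # assumed auth token source


def fetch_health(url: str, token: str) -> dict:
    """Return the parsed health payload as a dict."""
    req = urllib.request.Request(url)
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def summarize(payload: dict) -> None:
    """Print the fields used for severity triage."""
    print(f"trackedPeerCount:        {payload.get('trackedPeerCount')}")
    print(f"laggingPeers:            {payload.get('laggingPeers')}")
    print(f"peersWithNoConfirmation: {payload.get('peersWithNoConfirmation')}")
    print(f"maxLagMs:                {payload.get('maxLagMs')}")


if __name__ == "__main__":
    summarize(fetch_health(HEALTH_URL, TOKEN))
```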
## Core Diagnostics
Capture these artifacts before remediation (a collection sketch follows the list):
- Health response payload (`trackedPeerCount`, `laggingPeers`, `peersWithNoConfirmation`, `maxLagMs`).
- Application logs for sync, persistence, and network components.
- Current runtime configuration (excluding secrets).
- Most recent deployment identifier and change window.
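
A minimal collection sketch follows, assuming a JSON config at `/etc/cbddc/config.json`, logs under `/var/log/cbddc`, and the secret key names shown; all of these are illustrative placeholders, not documented paths. It bundles the artifacts above into a timestamped directory for attachment to the incident issue.

```python
"""Collect incident artifacts into a timestamped bundle directory.

All paths, env vars, and secret key names are assumptions; adapt them to
your deployment layout.
"""
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

LOG_DIR = Path("/var/log/cbddc")              # assumed log location
CONFIG_PATH = Path("/etc/cbddc/config.json")  # assumed config location
SECRET_KEYS = {"authToken", "password", "apiKey"}  # assumed secret-bearing fields


def redact(config: dict) -> dict:
    """Replace secret-bearing values so the bundle excludes credentials."""
    return {k: ("<redacted>" if k in SECRET_KEYS else v) for k, v in config.items()}


def collect(health_payload: dict, deploy_id: str) -> Path:
    """Write health payload, deployment id, redacted config, and logs to a bundle."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    bundle = Path(f"incident-{stamp}")
    bundle.mkdir()
    (bundle / "health.json").write_text(json.dumps(health_payload, indent=2))
    (bundle / "deployment.txt").write_text(deploy_id + "\n")
    if CONFIG_PATH.exists():
        config = json.loads(CONFIG_PATH.read_text())
        (bundle / "config.redacted.json").write_text(json.dumps(redact(config), indent=2))
    for log_file in LOG_DIR.glob("*.log"):
        shutil.copy2(log_file, bundle / log_file.name)
    return bundle
```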
## Recovery Plays
### Peer unreachable or lagging
1. Verify the network path and auth token consistency (see the reachability sketch after these steps).
2. Validate peer is still expected in topology.
3. If peer is retired, follow [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md).
4. Recheck health status after remediation.
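
A minimal reachability sketch, assuming the peer hostnames, port, and token environment variable shown (all hypothetical): it attempts a TCP connection to each expected peer and prints a local token fingerprint so operators can compare tokens across nodes without sharing the raw secret.

```python
"""Check reachability of each expected peer and report a token fingerprint.

Peer addresses, the port, and the env var name are assumptions for
illustration only.
"""
import hashlib
import os
import socket

EXPECTED_PEERS = {                      # assumed topology: name -> (host, port)
    "peer-a": ("peer-a.internal", 7420),
    "peer-b": ("peer-b.internal", 7420),
}
LOCAL_TOKEN = os.environ.get("CBDDC_AUTH_TOKEN", "")  # assumed env var


def reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the peer can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def token_fingerprint(token: str) -> str:
    """Hash the token so fingerprints can be compared instead of raw secrets."""
    return hashlib.sha256(token.encode()).hexdigest()[:12]


if __name__ == "__main__":
    for name, (host, port) in EXPECTED_PEERS.items():
        status = "reachable" if reachable(host, port) else "UNREACHABLE"
        print(f"{name} ({host}:{port}): {status}")
    print(f"local token fingerprint: {token_fingerprint(LOCAL_TOKEN)} (compare across nodes)")
```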
### Persistence failure
1. Verify the storage path and permissions (see the sketch after these steps).
2. Run integrity checks.
3. Restore from latest valid backup if corruption is confirmed.
4. Validate replication behavior after restore.
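
A minimal persistence-check sketch for step 1; the data directory and free-space threshold are assumptions and should be replaced with your deployment's actual values.

```python
"""Verify the persistence path exists, is writable, and has free space.

The data directory and minimum free-space threshold are assumptions.
"""
import os
import shutil
import sys
from pathlib import Path

DATA_DIR = Path("/var/lib/cbddc/data")  # assumed persistence path
MIN_FREE_BYTES = 1 * 1024**3            # assumed 1 GiB headroom


def check_persistence(path: Path) -> list[str]:
    """Return a list of problems found; an empty list means checks passed."""
    problems = []
    if not path.is_dir():
        problems.append(f"missing directory: {path}")
        return problems
    if not os.access(path, os.R_OK | os.W_OK):
        problems.append(f"insufficient permissions on {path}")
    free = shutil.disk_usage(path).free
    if free < MIN_FREE_BYTES:
        problems.append(f"low disk space: {free} bytes free")
    return problems


if __name__ == "__main__":
    issues = check_persistence(DATA_DIR)
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)
```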
### Configuration drift
1. Compare the deployed config to the approved baseline (see the drift sketch after these steps).
2. Reapply canonical settings.
3. Restart affected service safely.
4. Verify recovery with health and smoke checks.
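
A minimal drift-check sketch, assuming JSON configs at the paths shown (both paths are illustrative): it reports every top-level key whose deployed value differs from the approved baseline.

```python
"""Diff the deployed config against the approved baseline.

File locations and the JSON format are assumptions; adapt the loader if
your deployment uses another format.
"""
import json
from pathlib import Path

BASELINE_PATH = Path("config/baseline.json")    # assumed baseline in repo
DEPLOYED_PATH = Path("/etc/cbddc/config.json")  # assumed deployed config


def drift(baseline: dict, deployed: dict) -> dict:
    """Return keys whose values differ, plus keys missing on either side."""
    keys = baseline.keys() | deployed.keys()
    return {
        k: {"baseline": baseline.get(k, "<absent>"), "deployed": deployed.get(k, "<absent>")}
        for k in keys
        if baseline.get(k) != deployed.get(k)
    }


if __name__ == "__main__":
    baseline = json.loads(BASELINE_PATH.read_text())
    deployed = json.loads(DEPLOYED_PATH.read_text())
    for key, values in sorted(drift(baseline, deployed).items()):
        print(f"{key}: baseline={values['baseline']!r} deployed={values['deployed']!r}")
```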
## Post-Incident Actions
1. Record root cause and timeline.
2. Add follow-up work items (tests, alerts, docs updates).
3. Update affected feature docs and troubleshooting guidance.
4. Confirm rollback and recovery instructions remain accurate.