docs: align internal docs to enterprise standards
All checks were successful
CI / verify (push) Successful in 2m33s
Add canonical operations/security/access/feature docs and fix path integrity to improve onboarding and incident readiness.
58
docs/runbook.md
Normal file
@@ -0,0 +1,58 @@
# Operations Runbook

This runbook is the primary operational reference for CBDDC monitoring, incident triage, escalation, and recovery.

## Ownership and Escalation

- Service owner: CBDDC Core Maintainers.
- First response: local platform/application on-call team for the affected deployment.
- Product issue escalation: open an incident issue in the CBDDC repository with logs and the health payload.

## Alert Triage
1. Identify severity based on impact:
   - Sev 1: Data integrity risk, sustained outage, or broad replication failure.
   - Sev 2: Partial sync degradation or prolonged peer lag.
   - Sev 3: Isolated node issue with a workaround.
2. Confirm the current `cbddc` health check status and payload; a triage sketch follows this list.
3. Identify affected peers, collections, and the first observed time.
4. Apply the relevant recovery play below.
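
As a starting point for step 2, here is a minimal TypeScript sketch that maps a previously captured health payload onto the severity levels above. The field names (`trackedPeerCount`, `laggingPeers`, `peersWithNoConfirmation`, `maxLagMs`) come from this runbook, but the payload shapes (arrays of peer IDs), the thresholds, and the `health.json` input file are illustrative assumptions, not part of `cbddc`:

```ts
// triage-severity.ts: suggest a severity from a captured health payload.
import { readFileSync } from "node:fs";

interface HealthPayload {
  trackedPeerCount: number;
  laggingPeers: string[];            // assumed shape: peer IDs
  peersWithNoConfirmation: string[]; // assumed shape: peer IDs
  maxLagMs: number;
}

type Severity = "Sev 1" | "Sev 2" | "Sev 3";

// Illustrative thresholds only; tune them to your own SLOs.
function classify(h: HealthPayload): Severity {
  const lagging = h.laggingPeers.length;
  const unconfirmed = h.peersWithNoConfirmation.length;
  const affectedRatio =
    h.trackedPeerCount > 0 ? (lagging + unconfirmed) / h.trackedPeerCount : 0;

  if (affectedRatio >= 0.5) return "Sev 1";                // broad replication failure
  if (lagging > 0 || h.maxLagMs > 60_000) return "Sev 2";  // degradation or prolonged lag
  return "Sev 3";                                          // isolated or residual issue; confirm manually
}

const payload: HealthPayload = JSON.parse(
  readFileSync(process.argv[2] ?? "health.json", "utf8"),
);
console.log(`suggested severity: ${classify(payload)}`);
```

Treat the output as a hint only; the on-call engineer still confirms severity against the impact criteria above.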
## Core Diagnostics

Capture these artifacts before remediation:

- Health response payload (`trackedPeerCount`, `laggingPeers`, `peersWithNoConfirmation`, `maxLagMs`); see the capture sketch after this list.
- Application logs for sync, persistence, and network components.
- Current runtime configuration (excluding secrets).
- Most recent deployment identifier and change window.
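
A minimal capture sketch (Node 18+ for the global `fetch`), assuming the health payload is served over HTTP and the runtime configuration lives in a JSON file; the `CBDDC_HEALTH_URL` and `CBDDC_CONFIG_PATH` variables and the secret key names are hypothetical placeholders, not documented settings:

```ts
// capture-diagnostics.ts: collect the artifacts listed above into one directory.
import { mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical locations; substitute whatever your deployment actually exposes.
const HEALTH_URL = process.env.CBDDC_HEALTH_URL ?? "http://localhost:8080/health";
const CONFIG_PATH = process.env.CBDDC_CONFIG_PATH ?? "./config.json";
const SECRET_KEYS = ["authToken", "password", "secret"]; // assumed key names

async function main(): Promise<void> {
  const outDir = join("incident-artifacts", new Date().toISOString().replace(/[:.]/g, "-"));
  mkdirSync(outDir, { recursive: true });

  // Health response payload (trackedPeerCount, laggingPeers, ...).
  const health = await (await fetch(HEALTH_URL)).json();
  writeFileSync(join(outDir, "health.json"), JSON.stringify(health, null, 2));

  // Runtime configuration with secret-looking keys redacted.
  const config = JSON.parse(readFileSync(CONFIG_PATH, "utf8"));
  for (const key of SECRET_KEYS) {
    if (key in config) config[key] = "[redacted]";
  }
  writeFileSync(join(outDir, "config.redacted.json"), JSON.stringify(config, null, 2));

  console.log(`artifacts written to ${outDir}`);
}

main().catch((err) => {
  console.error("diagnostics capture failed:", err);
  process.exit(1);
});
```

Record the deployment identifier and change window alongside the generated files, and attach the whole directory to the incident issue.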
## Recovery Plays

### Peer unreachable or lagging

1. Verify network path and auth token consistency.
2. Validate that the peer is still expected in the topology; the cross-check sketch after this list shows one way to do this.
3. If the peer is retired, follow the [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md).
4. Recheck health status after remediation.
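
One way to approach steps 2 and 3 is to cross-check the unhealthy peers reported by the health payload against the peers the deployment is expected to track. This sketch assumes the payload fields are arrays of peer IDs and that the expected topology has been exported to an `expected-peers.json` file; both are illustrative assumptions:

```ts
// peer-topology-check.ts: separate expected-but-unhealthy peers from retired ones.
import { readFileSync } from "node:fs";

interface HealthPayload {
  laggingPeers: string[];            // assumed shape: peer IDs
  peersWithNoConfirmation: string[]; // assumed shape: peer IDs
}

// health.json: captured health payload; expected-peers.json: ["peer-a", "peer-b", ...]
const health: HealthPayload = JSON.parse(readFileSync("health.json", "utf8"));
const expectedPeers = new Set<string>(JSON.parse(readFileSync("expected-peers.json", "utf8")));

const unhealthy = new Set([...health.laggingPeers, ...health.peersWithNoConfirmation]);

for (const peer of unhealthy) {
  if (expectedPeers.has(peer)) {
    console.log(`${peer}: expected but unhealthy, check network path and auth token`);
  } else {
    console.log(`${peer}: not in topology, candidate for the deprecation runbook`);
  }
}
```

Peers flagged as not in the topology are candidates for the deprecation runbook; expected but unhealthy peers warrant the network and auth checks in step 1.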
### Persistence failure

1. Verify the storage path and permissions; a pre-check sketch follows this list.
2. Run integrity checks.
3. Restore from the latest valid backup if corruption is confirmed.
4. Validate replication behavior after the restore.
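
A pre-check sketch for step 1, assuming the persistence directory is pointed to by a `CBDDC_DATA_DIR` environment variable; that variable name and the default path are illustrative, not documented settings:

```ts
// storage-precheck.ts: confirm the storage path exists, is a directory, and is writable.
import { accessSync, constants, statSync } from "node:fs";

// Hypothetical location of the persistence directory.
const dataDir = process.env.CBDDC_DATA_DIR ?? "./data";

try {
  const stats = statSync(dataDir); // throws if the path does not exist
  if (!stats.isDirectory()) {
    throw new Error(`${dataDir} exists but is not a directory`);
  }
  accessSync(dataDir, constants.R_OK | constants.W_OK); // read/write permission check
  console.log(`ok: ${dataDir} exists and is readable/writable`);
} catch (err) {
  console.error("storage pre-check failed:", err);
  process.exit(1);
}
```

If this check fails, fix the path or permissions before running integrity checks or considering a restore.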
### Configuration drift

1. Compare the deployed config to the approved baseline; a comparison sketch follows this list.
2. Reapply canonical settings.
3. Restart the affected service safely.
4. Verify recovery with health and smoke checks.
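
A comparison sketch for step 1, assuming the deployed and baseline configurations have been exported as flat JSON files; the file names, the flat-object assumption, and the secret key names are illustrative:

```ts
// config-drift.ts: report keys whose deployed values differ from the approved baseline.
import { readFileSync } from "node:fs";

// Assumed key names to skip so secrets never end up in incident notes.
const SECRET_KEYS = new Set(["authToken", "password", "secret"]);

// Assumed exports: flat JSON objects for the approved and deployed configurations.
const baseline = JSON.parse(readFileSync("config.baseline.json", "utf8"));
const deployed = JSON.parse(readFileSync("config.deployed.json", "utf8"));

const keys = new Set([...Object.keys(baseline), ...Object.keys(deployed)]);
let drift = 0;

for (const key of keys) {
  if (SECRET_KEYS.has(key)) continue;
  if (JSON.stringify(baseline[key]) !== JSON.stringify(deployed[key])) {
    drift += 1;
    console.log(`drift in ${key}: baseline=${JSON.stringify(baseline[key])} deployed=${JSON.stringify(deployed[key])}`);
  }
}

console.log(drift === 0 ? "no drift detected" : `${drift} drifted key(s)`);
process.exit(drift === 0 ? 0 : 1);
```

Reconcile any reported drift against the approved baseline before reapplying canonical settings in step 2.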
## Post-Incident Actions

1. Record root cause and timeline.
2. Add follow-up work items (tests, alerts, docs updates).
3. Update affected feature docs and troubleshooting guidance.
4. Confirm rollback and recovery instructions remain accurate.