Add enterprise docs structure and include pending core maintenance updates.

2026-02-20 13:28:29 -05:00
parent b8ed5ec500
commit 52445078a1
23 changed files with 1956 additions and 404 deletions
--- a/docs/runbook.md
+++ b/docs/runbook.md
@@ -0,0 +1,56 @@
+# Runbook
+
+## Purpose
+
+This runbook provides standard operations, incident triage, escalation, and recovery procedures for CBDD maintainers.
+
+## Signals And Entry Points
+
+- CI failures on `main`
+- Failing integration tests in consumer repositories
+- Regression issues labeled `incident`
+- Recovery or corruption reports from consumers
+
+## Alert Triage Procedure
+
+1. Capture incident context: version, environment, failing operation, and first failure timestamp.
+2. Classify severity:
+   - `SEV-1`: data loss risk, persistent startup failure, or transaction correctness risk.
+   - `SEV-2`: feature-level regression without confirmed data loss.
+   - `SEV-3`: non-critical behavior or documentation defects.
+3. Create or update the incident issue with owner and current mitigation status.
+4. Reproduce with targeted tests in `tests/CBDD.Tests` (path relative to the repository root).
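+
+A minimal sketch of step 1, assuming a POSIX-like shell. The function name `incident_context` and the field layout are illustrative assumptions, not part of the CBDD tooling; it prints a block that could be pasted into the incident issue.
+
+```bash
+# Hypothetical helper: print the incident context fields from step 1.
+# Arguments: package version, failing operation, first failure timestamp.
+incident_context() {
+  local version="$1" operation="$2" first_failure="$3"
+  printf 'version: %s\n' "$version"
+  printf 'environment: %s\n' "$(uname -sr)"
+  printf 'failing operation: %s\n' "$operation"
+  printf 'first failure: %s\n' "$first_failure"
+  printf 'captured at (UTC): %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+}
+
+# Example call with placeholder values:
+incident_context "1.2.3" "WAL replay on startup" "2026-02-20T18:05:00Z"
+```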
+
+## Diagnostics
+
+1. Validate build and tests.
+   ```bash
+   dotnet test CBDD.slnx -c Release
+   ```
+2. Run the coverage threshold gate when behavior has changed in core paths.
+   ```bash
+   bash scripts/coverage-check.sh
+   ```
+3. For storage and recovery incidents, prioritize:
+   - `StorageEngine.Recovery`
+   - `WriteAheadLog`
+   - transaction protocol tests
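+
+The prioritized areas in step 3 can be narrowed with the `--filter` option of `dotnet test`. The sketch below builds a filter expression from name fragments; the helper name `build_filter` is hypothetical, and the fragments passed to it are assumptions that may not match the actual test class names in `CBDD.Tests`.
+
+```bash
+# Hypothetical helper: join name fragments into a `dotnet test --filter`
+# expression, where `~` means "contains" and `|` means "or".
+build_filter() {
+  local expr="" name
+  for name in "$@"; do
+    expr+="${expr:+|}FullyQualifiedName~${name}"
+  done
+  printf '%s\n' "$expr"
+}
+
+# Example invocation (fragments are assumptions; adjust to real test names):
+# dotnet test CBDD.slnx -c Release --filter "$(build_filter Recovery WriteAheadLog)"
+```
+
+Running the filtered set first keeps the feedback loop short; the full suite from step 1 still applies before closing the incident.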
+
+## Escalation Path
+
+1. Initial owner: the maintainer assigned to the incident issue.
+2. Escalate to release maintainer when severity is `SEV-1` or rollback is required.
+3. Communicate status updates at each milestone: triage complete, mitigation active, fix merged, validation complete.
+
+## Recovery Actions
+
+1. Contain impact by pinning consumers to the last known-good package version.
+2. Apply rollback steps from [`deployment.md`](deployment.md#rollback-procedure).
+3. Validate the repaired build with targeted and full regression suites.
+4. Publish the fixed package and confirm consumer recovery.
+
+## Post-Incident Expectations
+
+1. Document root cause, blast radius, and timeline.
+2. Add regression tests to prevent recurrence.
+3. Record follow-up actions in the issue tracker with owners and due dates.