Add enterprise docs structure and include pending core maintenance updates.

Joseph Doherty
2026-02-20 13:28:29 -05:00
parent b8ed5ec500
commit 52445078a1
23 changed files with 1956 additions and 404 deletions

docs/runbook.md Normal file

@@ -0,0 +1,56 @@
# Runbook
## Purpose
This runbook provides standard operations, incident triage, escalation, and recovery procedures for CBDD maintainers.
## Signals And Entry Points
- CI failures on `main`
- Failing integration tests in consumer repositories
- Regression issues labeled `incident`
- Recovery or corruption reports from consumers
## Alert Triage Procedure
1. Capture incident context: version, environment, failing operation, and first failure timestamp.
2. Classify severity:
   - `SEV-1`: data loss risk, persistent startup failure, or transaction correctness risk.
   - `SEV-2`: feature-level regression without confirmed data loss.
   - `SEV-3`: non-critical behavior or documentation defects.
3. Create or update the incident issue with owner and current mitigation status.
4. Reproduce with targeted tests in `tests/CBDD.Tests` (path relative to the repository root).
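
The severity rules in step 2 can be sketched as a small shell helper, for example to label incident issues consistently. This is an illustration only: the function name `classify_severity` and the impact keywords are hypothetical and are not defined anywhere in the CBDD repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an impact keyword to the severity levels above.
# The keywords are illustrative assumptions, not project-defined terms.
classify_severity() {
  case "$1" in
    data-loss|startup-failure|transaction-correctness)
      echo "SEV-1" ;;
    feature-regression)
      echo "SEV-2" ;;
    minor-behavior|docs)
      echo "SEV-3" ;;
    *)
      # Unknown impact: report on stderr and signal failure to the caller.
      echo "unknown impact: $1" >&2
      return 1 ;;
  esac
}
```

For instance, `classify_severity data-loss` would print `SEV-1`, while an unrecognized keyword returns a non-zero status so that a calling script can stop and ask for manual classification.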
## Diagnostics
1. Validate build and tests.
```bash
dotnet test CBDD.slnx -c Release
```
2. Run the coverage threshold gate when behavior has changed in core paths.
```bash
bash scripts/coverage-check.sh
```
3. For storage and recovery incidents, prioritize:
   - `StorageEngine.Recovery`
   - `WriteAheadLog`
   - transaction protocol tests
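
To run only the prioritized areas instead of the full suite, one option is the `--filter` option of `dotnet test`. The sketch below assumes that the relevant test names contain the fragments `StorageEngine.Recovery` and `WriteAheadLog`, which should be checked against the actual test project. The helper names `build_test_filter` and `run_targeted_tests` are hypothetical.

```bash
#!/usr/bin/env bash
# Build a `dotnet test --filter` expression matching any of the given fragments.
# In the filter syntax, `~` means "contains" and `|` means OR.
build_test_filter() {
  local filter="" name
  for name in "$@"; do
    # Prefix a `|` separator only when the filter already has a term.
    filter+="${filter:+|}FullyQualifiedName~${name}"
  done
  printf '%s\n' "$filter"
}

# Run only tests whose fully qualified names contain one of the fragments.
# Assumption: the fragments correspond to real test namespaces or classes.
run_targeted_tests() {
  dotnet test CBDD.slnx -c Release --filter "$(build_test_filter "$@")"
}
```

A call such as `run_targeted_tests StorageEngine.Recovery WriteAheadLog` would then invoke `dotnet test` with a filter that matches either fragment. Follow it with the full `dotnet test CBDD.slnx -c Release` run before closing the incident.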
## Escalation Path
1. Initial owner: the maintainer assigned to the incident issue.
2. Escalate to release maintainer when severity is `SEV-1` or rollback is required.
3. Communicate status updates at each milestone: triage complete, mitigation active, fix merged, validation complete.
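
The escalation rule in step 2 is simple enough to encode as a check, for example in an incident script. The following is a minimal sketch: the function name `needs_escalation` and the `yes`/`no` convention for the rollback flag are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical check: succeed (exit status 0) when the incident must be
# escalated to the release maintainer, i.e. on SEV-1 or a required rollback.
needs_escalation() {
  local severity="$1"
  local rollback_required="${2:-no}"   # assumed convention: "yes" or "no"
  [ "$severity" = "SEV-1" ] || [ "$rollback_required" = "yes" ]
}
```

Used as `if needs_escalation "SEV-2" yes; then ...; fi`, the condition holds because a rollback is required even though the severity is below `SEV-1`.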
## Recovery Actions
1. Contain impact by pinning consumers to the last known-good package version.
2. Apply rollback steps from [`deployment.md`](deployment.md#rollback-procedure).
3. Validate the repaired build with targeted and full regression suites.
4. Publish the fixed package and confirm consumer recovery.
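
Pinning a consumer in step 1 can be done with `dotnet add package` and an explicit `--version`. The sketch below only prints the command so that it can be reviewed before it is run. The helper name `pin_command`, the package ID `Example.Package`, and the version `1.2.3` are placeholders, not actual CBDD package names or releases.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the command that pins a consumer project
# to a specific package version. It does not modify any project file.
pin_command() {
  local package="$1" version="$2"
  # Accept MAJOR.MINOR.PATCH with an optional prerelease suffix, e.g. 1.2.3-rc.1.
  local re='^[0-9]+\.[0-9]+\.[0-9]+(-[0-9A-Za-z.]+)?$'
  if ! [[ "$version" =~ $re ]]; then
    echo "invalid version: $version" >&2
    return 1
  fi
  printf 'dotnet add package %s --version %s\n' "$package" "$version"
}
```

Running `pin_command Example.Package 1.2.3` inside a consumer repository prints the pin command for review. The real package ID and the last known-good version need to come from the incident issue.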
## Post-Incident Expectations
1. Document root cause, blast radius, and timeline.
2. Add regression tests to prevent recurrence.
3. Record follow-up actions in the issue tracker with owners and due dates.