Add enterprise docs structure and include pending core maintenance updates.
This commit is contained in:
56
docs/runbook.md
Normal file
56
docs/runbook.md
Normal file
@@ -0,0 +1,56 @@
|
||||
# Runbook
|
||||
|
||||
## Purpose
|
||||
|
||||
This runbook provides standard operations, incident triage, escalation, and recovery procedures for CBDD maintainers.
|
||||
|
||||
## Signals And Entry Points
|
||||
|
||||
- CI failures on `main`
|
||||
- Failing integration tests in consumer repositories
|
||||
- Regression issues labeled `incident`
|
||||
- Recovery or corruption reports from consumers
|
||||
|
||||
## Alert Triage Procedure
|
||||
|
||||
1. Capture incident context: version, environment, failing operation, and first failure timestamp.
|
||||
2. Classify severity:
|
||||
- `SEV-1`: data loss risk, persistent startup failure, or transaction correctness risk.
|
||||
- `SEV-2`: feature-level regression without confirmed data loss.
|
||||
- `SEV-3`: non-critical behavior or documentation defects.
|
||||
3. Create or update the incident issue with owner and current mitigation status.
|
||||
4. Reproduce with targeted tests in `/Users/dohertj2/Desktop/CBDD/tests/CBDD.Tests`.
|
||||
|
||||
## Diagnostics
|
||||
|
||||
1. Validate build and tests.
|
||||
```bash
|
||||
dotnet test CBDD.slnx -c Release
|
||||
```
|
||||
2. Run coverage threshold gate when behavior changed in core paths.
|
||||
```bash
|
||||
bash scripts/coverage-check.sh
|
||||
```
|
||||
3. For storage and recovery incidents, prioritize:
|
||||
- `StorageEngine.Recovery`
|
||||
- `WriteAheadLog`
|
||||
- transaction protocol tests
|
||||
|
||||
## Escalation Path
|
||||
|
||||
1. Initial owner: maintainer on incident issue.
|
||||
2. Escalate to release maintainer when severity is `SEV-1` or rollback is required.
|
||||
3. Communicate status updates on each milestone: triage complete, mitigation active, fix merged, validation complete.
|
||||
|
||||
## Recovery Actions
|
||||
|
||||
1. Contain impact by pinning consumers to last known-good package version.
|
||||
2. Apply rollback steps from [`deployment.md`](deployment.md#rollback-procedure).
|
||||
3. Validate repaired build with targeted and full regression suites.
|
||||
4. Publish fixed package and confirm consumer recovery.
|
||||
|
||||
## Post-Incident Expectations
|
||||
|
||||
1. Document root cause, blast radius, and timeline.
|
||||
2. Add regression tests to prevent recurrence.
|
||||
3. Record follow-up actions in issue tracker with owners and due dates.
|
||||
Reference in New Issue
Block a user