# Runbook

## Purpose

This runbook provides standard operations, incident triage, escalation, and recovery procedures for CBDD maintainers.

## Signals And Entry Points

- CI failures on `main`
- Failing integration tests in consumer repositories
- Regression issues labeled `incident`
- Recovery or corruption reports from consumers

## Alert Triage Procedure

1. Capture incident context: version, environment, failing operation, and first failure timestamp.
2. Classify severity:
   - `SEV-1`: data loss risk, persistent startup failure, or transaction correctness risk.
   - `SEV-2`: feature-level regression without confirmed data loss.
   - `SEV-3`: non-critical behavior or documentation defects.
3. Create or update the incident issue with owner and current mitigation status.
4. Reproduce with targeted tests in `tests/CBDD.Tests` (path relative to the repository root).
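
The severity rules in step 2 can be sketched as a small shell helper. This is only an illustration: the function name `classify_severity` and its yes/no arguments are hypothetical and are not part of the CBDD repository.

```bash
# classify_severity <critical_risk> <feature_regression>
# Hypothetical helper; each argument is "yes" or "no".
#   critical_risk:      data loss risk, persistent startup failure,
#                       or transaction correctness risk
#   feature_regression: feature-level regression without confirmed data loss
classify_severity() {
  local critical_risk="$1" feature_regression="$2"
  if [ "$critical_risk" = "yes" ]; then
    echo "SEV-1"
  elif [ "$feature_regression" = "yes" ]; then
    echo "SEV-2"
  else
    # Non-critical behavior or documentation defects
    echo "SEV-3"
  fi
}
```

For example, `classify_severity no yes` prints `SEV-2`. Checking the critical condition first means an incident that matches several rules is assigned the highest applicable severity.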
## Diagnostics

1. Validate build and tests.
```bash
dotnet test CBDD.slnx -c Release
```
2. Run the coverage threshold gate when behavior has changed in core paths.
```bash
bash scripts/coverage-check.sh
```
3. For storage and recovery incidents, prioritize:
   - `StorageEngine.Recovery`
   - `WriteAheadLog`
   - transaction protocol tests
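
One way to run only the prioritized areas is the `--filter` option of `dotnet test`. The sketch below assumes the relevant test names contain substrings such as `Recovery` and `WriteAheadLog`. Those substrings and both helper function names are assumptions, so adjust them to the actual test names.

```bash
# build_test_filter <name>...
# Hypothetical helper: joins each name into a `dotnet test --filter`
# expression of FullyQualifiedName~<name> terms combined with | (logical OR).
build_test_filter() {
  local filter="" name
  for name in "$@"; do
    # Prepend "<existing>|" only when the filter already has a term.
    filter="${filter:+$filter|}FullyQualifiedName~$name"
  done
  printf '%s\n' "$filter"
}

# run_targeted_tests <name>...
# Runs only the tests whose fully qualified name contains one of the names.
run_targeted_tests() {
  dotnet test CBDD.slnx -c Release --filter "$(build_test_filter "$@")"
}
```

For example, `run_targeted_tests Recovery WriteAheadLog` would pass the filter `FullyQualifiedName~Recovery|FullyQualifiedName~WriteAheadLog`, which keeps the feedback loop short before the full suite is rerun.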
## Escalation Path

1. Initial owner: the maintainer assigned to the incident issue.
2. Escalate to release maintainer when severity is `SEV-1` or rollback is required.
3. Communicate status updates at each milestone: triage complete, mitigation active, fix merged, validation complete.

## Recovery Actions

1. Contain impact by pinning consumers to the last known-good package version.
2. Apply rollback steps from [`deployment.md`](deployment.md#rollback-procedure).
3. Validate repaired build with targeted and full regression suites.
4. Publish fixed package and confirm consumer recovery.
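
Step 1 can be done on the consumer side with `dotnet add package` and an explicit version. The following is a minimal sketch, assuming a hypothetical package ID `CBDD` and version `1.2.3`, neither of which is confirmed by this runbook. `pin_package` and the `DRY_RUN` variable are illustrative names only.

```bash
# pin_package <package_id> <version>
# Hypothetical helper: pins the project in the current directory to an
# exact package version. With DRY_RUN=1 it only prints the command.
pin_package() {
  local package_id="$1" version="$2"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "dotnet add package $package_id --version $version"
  else
    dotnet add package "$package_id" --version "$version"
  fi
}
```

Running `DRY_RUN=1 pin_package CBDD 1.2.3` from a consumer project directory prints the command without changing the project file, which is useful for reviewing the pin before applying it.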
## Post-Incident Expectations

1. Document root cause, blast radius, and timeline.
2. Add regression tests to prevent recurrence.
3. Record follow-up actions in the issue tracker with owners and due dates.