docs: align internal docs to enterprise standards

Add canonical operations/security/access/feature docs and fix path integrity to improve onboarding and incident readiness.
Joseph Doherty
2026-02-20 13:23:55 -05:00
parent e6d81f6350
commit ce727eb30d
18 changed files with 783 additions and 186 deletions

docs/troubleshooting.md Normal file

@@ -0,0 +1,86 @@
# Troubleshooting
This guide lists recurring CBDDC failure modes, likely causes, and remediation steps.
## Peer Cannot Connect
Symptoms:
- Node remains disconnected from expected peers.
- Health check reports lagging or unconfirmed peers.
Likely causes:
- Network path blocked (port/firewall mismatch).
- `AuthToken` mismatch.
- Peer configuration drift.
Resolution:
1. Verify TCP/UDP port configuration on both peers.
2. Confirm shared token and node identity settings.
3. Restart peer service and monitor logs.
4. Recheck `cbddc` health payload.
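The snippet below is a minimal sketch of what rechecking the health payload can look like in code: it polls a node's health endpoint and flags peers that are disconnected or lagging. The base address, endpoint path (`/health/cbddc`), JSON field names (`peers`, `nodeId`, `state`, `lagMs`), and the 5-second lag threshold are all assumptions; substitute whatever shape and thresholds your deployment actually exposes.

```csharp
// Minimal sketch: poll the node's health endpoint and list peers that are not
// connected or are lagging. The endpoint path and JSON field names are
// assumptions -- adapt them to the payload your nodes actually return.
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class PeerHealthCheck
{
    static async Task Main()
    {
        using var http = new HttpClient { BaseAddress = new Uri("http://localhost:5000") };

        var json = await http.GetStringAsync("/health/cbddc"); // assumed endpoint
        using var doc = JsonDocument.Parse(json);

        foreach (var peer in doc.RootElement.GetProperty("peers").EnumerateArray())
        {
            var id = peer.GetProperty("nodeId").GetString();
            var state = peer.GetProperty("state").GetString();
            var lagMs = peer.GetProperty("lagMs").GetInt64();

            // Flag anything that is not connected or exceeds the (assumed) lag budget.
            if (state != "Connected" || lagMs > 5000)
                Console.WriteLine($"Peer {id}: state={state}, lag={lagMs} ms");
        }
    }
}
```

If a peer shows up here as disconnected or unconfirmed, work back through ports, the shared `AuthToken`, and peer configuration before restarting services.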
## Replication Delay or Missing Updates
Symptoms:
- Writes are visible locally but not on remote peers.
- `maxLagMs` grows continuously.
Likely causes:
- Retired peer still tracked and gating pruning.
- High load or transient network instability.
- Invalid collection watch configuration.
Resolution:
1. Confirm affected collections are registered with `WatchCollection()` (see the registration sketch after this list).
2. Inspect peer confirmation metrics.
3. If needed, remove retired peers from tracking by following the [Runbook](runbook.md).
4. Re-run smoke sync validation after changes.
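For step 1, the sketch below shows the kind of startup-time check that catches an unregistered collection before it surfaces as replication lag. `WatchCollection()` is the call named in this guide, but the `CbddcReplicator` type, its constructor, the string-name overload, and the `IsWatching` helper are hypothetical stand-ins included so the example is self-contained; adapt it to the real CBDDC API.

```csharp
// Minimal sketch: register every replicated collection at startup and fail fast
// if one is missing, instead of discovering the gap later as growing lag.
// CbddcReplicator, IsWatching, and the string-name overload of WatchCollection
// are hypothetical stand-ins -- match them to the actual CBDDC API.
using System;
using System.Collections.Generic;

class ReplicationSetupSketch
{
    // Collections that must replicate, e.g. read from appsettings.
    static readonly string[] ExpectedCollections = { "orders", "customers" };

    static void Main()
    {
        var replicator = new CbddcReplicator();             // hypothetical type

        foreach (var name in ExpectedCollections)
            replicator.WatchCollection(name);               // assumed overload

        // Verify registration explicitly so a missed collection fails at startup.
        var missing = new List<string>();
        foreach (var name in ExpectedCollections)
            if (!replicator.IsWatching(name))               // hypothetical check
                missing.Add(name);

        if (missing.Count > 0)
            throw new InvalidOperationException(
                $"Collections not registered for replication: {string.Join(", ", missing)}");
    }
}

// Hypothetical stand-in so the sketch compiles on its own.
class CbddcReplicator
{
    readonly HashSet<string> _watched = new();
    public void WatchCollection(string name) => _watched.Add(name);
    public bool IsWatching(string name) => _watched.Contains(name);
}
```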
## Persistence Errors
Symptoms:
- Startup or write failures from persistence layer.
- Unhealthy health check due to storage exceptions.
Likely causes:
- File/path permission errors.
- Storage corruption.
- Misconfigured provider settings.
Resolution:
1. Validate the storage path and runtime permissions (see the probe sketch after this list).
2. Run integrity checks.
3. Restore from latest good backup if corruption is detected.
4. Validate read/write and replication after restore.
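For step 1, a minimal probe like the one below confirms that the storage path exists and that the service account can create, read, and delete files there before you move on to integrity checks or restores. The path shown is a placeholder; use the value from your persistence provider settings and run the probe as the same account the service runs under.

```csharp
// Minimal sketch: confirm the storage path exists and is writable by creating,
// reading back, and deleting a throwaway probe file. The path is a placeholder.
using System;
using System.IO;

class StoragePermissionProbe
{
    static void Main()
    {
        var storagePath = "/var/lib/cbddc/data";            // placeholder path

        if (!Directory.Exists(storagePath))
            throw new DirectoryNotFoundException($"Storage path missing: {storagePath}");

        var probe = Path.Combine(storagePath, $".probe-{Guid.NewGuid():N}");
        try
        {
            File.WriteAllText(probe, "ok");                  // write check
            var contents = File.ReadAllText(probe);          // read check
            Console.WriteLine($"Read/write OK at {storagePath} (read back '{contents}').");
        }
        finally
        {
            if (File.Exists(probe))
                File.Delete(probe);                          // clean up the probe file
        }
    }
}
```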
## Configuration Regressions After Release
Symptoms:
- Behavior changed immediately after deployment.
- Multiple nodes fail with the same error pattern.
Likely causes:
- Incorrect environment variables or appsettings values.
- Partial rollout with incompatible settings.
Resolution:
1. Compare the deployed configuration to the approved baseline (see the diff sketch after this list).
2. Roll back to the last known-good release if production impact is high.
3. Redeploy with corrected configuration.
4. Document root cause and preventive controls.
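For step 1, the sketch below is one way to diff a deployed `appsettings` file against the approved baseline: it flattens both documents into `Section:Key` paths and prints missing, changed, and extra keys. The file names are placeholders, and the comparison is textual (raw JSON values), so treat it as a quick triage aid rather than a full configuration audit.

```csharp
// Minimal sketch: flatten two appsettings-style JSON files into "Section:Key"
// paths and report missing, changed, and extra keys. File names are placeholders.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

class ConfigDiff
{
    static void Main()
    {
        var baseline = Flatten(JsonDocument.Parse(File.ReadAllText("appsettings.baseline.json")).RootElement);
        var deployed = Flatten(JsonDocument.Parse(File.ReadAllText("appsettings.deployed.json")).RootElement);

        foreach (var key in baseline.Keys)
        {
            if (!deployed.TryGetValue(key, out var value))
                Console.WriteLine($"MISSING  {key} (baseline: {baseline[key]})");
            else if (value != baseline[key])
                Console.WriteLine($"CHANGED  {key}: baseline={baseline[key]} deployed={value}");
        }

        foreach (var key in deployed.Keys)
            if (!baseline.ContainsKey(key))
                Console.WriteLine($"EXTRA    {key} = {deployed[key]}");
    }

    // Flatten nested JSON objects into "Section:Key" -> raw value text.
    static Dictionary<string, string> Flatten(JsonElement element, string prefix = "")
    {
        var result = new Dictionary<string, string>();
        if (element.ValueKind == JsonValueKind.Object)
        {
            foreach (var prop in element.EnumerateObject())
                foreach (var kv in Flatten(prop.Value, prefix == "" ? prop.Name : $"{prefix}:{prop.Name}"))
                    result[kv.Key] = kv.Value;
        }
        else
        {
            result[prefix] = element.GetRawText();
        }
        return result;
    }
}
```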
## Escalation
If a Sev 1/Sev 2 condition cannot be resolved quickly, follow [Runbook](runbook.md) escalation and incident procedures.