docs: align internal docs to enterprise standards

Add canonical operations/security/access/feature docs and fix path integrity to improve onboarding and incident readiness.
Joseph Doherty
2026-02-20 13:23:55 -05:00
parent e6d81f6350
commit ce727eb30d
18 changed files with 783 additions and 186 deletions

docs/troubleshooting.md Normal file

@@ -0,0 +1,86 @@
# Troubleshooting
This guide lists recurring CBDDC failure modes, likely causes, and remediation steps.
## Peer Cannot Connect
Symptoms:
- Node remains disconnected from expected peers.
- Health check reports lagging or unconfirmed peers.
Likely causes:
- Network path blocked (port/firewall mismatch).
- `AuthToken` mismatch.
- Peer configuration drift.
Resolution:
1. Verify TCP/UDP port configuration on both peers.
2. Confirm shared token and node identity settings.
3. Restart peer service and monitor logs.
4. Recheck `cbddc` health payload.
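The snippet below is a minimal sketch of what rechecking the health payload can look like in code: it polls a node's health endpoint and flags peers that are disconnected or lagging. The base address, endpoint path (`/health/cbddc`), JSON field names (`peers`, `nodeId`, `state`, `lagMs`), and the 5-second lag threshold are all assumptions; substitute whatever shape and thresholds your deployment actually exposes.

```csharp
// Minimal sketch: poll the node's health endpoint and list peers that are not
// connected or are lagging. The endpoint path and JSON field names are
// assumptions -- adapt them to the payload your nodes actually return.
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class PeerHealthCheck
{
    static async Task Main()
    {
        using var http = new HttpClient { BaseAddress = new Uri("http://localhost:5000") };

        var json = await http.GetStringAsync("/health/cbddc"); // assumed endpoint
        using var doc = JsonDocument.Parse(json);

        foreach (var peer in doc.RootElement.GetProperty("peers").EnumerateArray())
        {
            var id = peer.GetProperty("nodeId").GetString();
            var state = peer.GetProperty("state").GetString();
            var lagMs = peer.GetProperty("lagMs").GetInt64();

            // Flag anything that is not connected or exceeds the (assumed) lag budget.
            if (state != "Connected" || lagMs > 5000)
                Console.WriteLine($"Peer {id}: state={state}, lag={lagMs} ms");
        }
    }
}
```

If a peer shows up here as disconnected or unconfirmed, work back through ports, the shared `AuthToken`, and peer configuration before restarting services.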
## Replication Delay or Missing Updates
Symptoms:
- Writes are visible locally but not on remote peers.
- `maxLagMs` grows continuously.
Likely causes:
- Retired peer still tracked and gating pruning.
- High load or transient network instability.
- Invalid collection watch configuration.
Resolution:
1. Confirm affected collections are registered with `WatchCollection()` (see the registration sketch after this list).
2. Inspect peer confirmation metrics.
3. If needed, remove retired peers from tracking by following the [Runbook](runbook.md).
4. Re-run smoke sync validation after changes.
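For step 1, the sketch below shows the kind of startup-time check that catches an unregistered collection before it surfaces as replication lag. `WatchCollection()` is the call named in this guide, but the `CbddcReplicator` type, its constructor, the string-name overload, and the `IsWatching` helper are hypothetical stand-ins included so the example is self-contained; adapt it to the real CBDDC API.

```csharp
// Minimal sketch: register every replicated collection at startup and fail fast
// if one is missing, instead of discovering the gap later as growing lag.
// CbddcReplicator, IsWatching, and the string-name overload of WatchCollection
// are hypothetical stand-ins -- match them to the actual CBDDC API.
using System;
using System.Collections.Generic;

class ReplicationSetupSketch
{
    // Collections that must replicate, e.g. read from appsettings.
    static readonly string[] ExpectedCollections = { "orders", "customers" };

    static void Main()
    {
        var replicator = new CbddcReplicator();             // hypothetical type

        foreach (var name in ExpectedCollections)
            replicator.WatchCollection(name);               // assumed overload

        // Verify registration explicitly so a missed collection fails at startup.
        var missing = new List<string>();
        foreach (var name in ExpectedCollections)
            if (!replicator.IsWatching(name))               // hypothetical check
                missing.Add(name);

        if (missing.Count > 0)
            throw new InvalidOperationException(
                $"Collections not registered for replication: {string.Join(", ", missing)}");
    }
}

// Hypothetical stand-in so the sketch compiles on its own.
class CbddcReplicator
{
    readonly HashSet<string> _watched = new();
    public void WatchCollection(string name) => _watched.Add(name);
    public bool IsWatching(string name) => _watched.Contains(name);
}
```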
## Persistence Errors
Symptoms:
- Startup or write failures from persistence layer.
- Unhealthy health check due to storage exceptions.
Likely causes:
- File/path permission errors.
- Storage corruption.
- Misconfigured provider settings.
Resolution:
1. Validate the storage path and runtime permissions (see the probe sketch after this list).
2. Run integrity checks.
3. Restore from latest good backup if corruption is detected.
4. Validate read/write and replication after restore.
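For step 1, a minimal probe like the one below confirms that the storage path exists and that the service account can create, read, and delete files there before you move on to integrity checks or restores. The path shown is a placeholder; use the value from your persistence provider settings and run the probe as the same account the service runs under.

```csharp
// Minimal sketch: confirm the storage path exists and is writable by creating,
// reading back, and deleting a throwaway probe file. The path is a placeholder.
using System;
using System.IO;

class StoragePermissionProbe
{
    static void Main()
    {
        var storagePath = "/var/lib/cbddc/data";            // placeholder path

        if (!Directory.Exists(storagePath))
            throw new DirectoryNotFoundException($"Storage path missing: {storagePath}");

        var probe = Path.Combine(storagePath, $".probe-{Guid.NewGuid():N}");
        try
        {
            File.WriteAllText(probe, "ok");                  // write check
            var contents = File.ReadAllText(probe);          // read check
            Console.WriteLine($"Read/write OK at {storagePath} (read back '{contents}').");
        }
        finally
        {
            if (File.Exists(probe))
                File.Delete(probe);                          // clean up the probe file
        }
    }
}
```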
## Configuration Regressions After Release
Symptoms:
- Behavior changed immediately after deployment.
- Multiple nodes fail with the same error pattern.
Likely causes:
- Incorrect environment variables or appsettings values.
- Partial rollout with incompatible settings.
Resolution:
1. Compare the deployed configuration to the approved baseline (see the diff sketch after this list).
2. Roll back to the last known-good release if production impact is high.
3. Redeploy with corrected configuration.
4. Document root cause and preventive controls.
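For step 1, the sketch below is one way to diff a deployed `appsettings` file against the approved baseline: it flattens both documents into `Section:Key` paths and prints missing, changed, and extra keys. The file names are placeholders, and the comparison is textual (raw JSON values), so treat it as a quick triage aid rather than a full configuration audit.

```csharp
// Minimal sketch: flatten two appsettings-style JSON files into "Section:Key"
// paths and report missing, changed, and extra keys. File names are placeholders.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

class ConfigDiff
{
    static void Main()
    {
        var baseline = Flatten(JsonDocument.Parse(File.ReadAllText("appsettings.baseline.json")).RootElement);
        var deployed = Flatten(JsonDocument.Parse(File.ReadAllText("appsettings.deployed.json")).RootElement);

        foreach (var key in baseline.Keys)
        {
            if (!deployed.TryGetValue(key, out var value))
                Console.WriteLine($"MISSING  {key} (baseline: {baseline[key]})");
            else if (value != baseline[key])
                Console.WriteLine($"CHANGED  {key}: baseline={baseline[key]} deployed={value}");
        }

        foreach (var key in deployed.Keys)
            if (!baseline.ContainsKey(key))
                Console.WriteLine($"EXTRA    {key} = {deployed[key]}");
    }

    // Flatten nested JSON objects into "Section:Key" -> raw value text.
    static Dictionary<string, string> Flatten(JsonElement element, string prefix = "")
    {
        var result = new Dictionary<string, string>();
        if (element.ValueKind == JsonValueKind.Object)
        {
            foreach (var prop in element.EnumerateObject())
                foreach (var kv in Flatten(prop.Value, prefix == "" ? prop.Name : $"{prefix}:{prop.Name}"))
                    result[kv.Key] = kv.Value;
        }
        else
        {
            result[prefix] = element.GetRawText();
        }
        return result;
    }
}
```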
## Escalation
If a Sev 1/Sev 2 condition cannot be resolved quickly, follow [Runbook](runbook.md) escalation and incident procedures.