docs: align internal docs to enterprise standards
All checks were successful
CI / verify (push) Successful in 2m33s
Add canonical operations/security/access/feature docs and fix path integrity to improve onboarding and incident readiness.
86
docs/troubleshooting.md
Normal file
@@ -0,0 +1,86 @@
# Troubleshooting

This guide lists recurring CBDDC failure modes, likely causes, and remediation steps.

## Peer Cannot Connect

Symptoms:

- Node remains disconnected from expected peers.
- Health check reports lagging or unconfirmed peers.

Likely causes:

- Network path blocked (port/firewall mismatch).
- `AuthToken` mismatch.
- Peer configuration drift.

Resolution:

1. Verify TCP/UDP port configuration on both peers.
2. Confirm shared token and node identity settings.
3. Restart peer service and monitor logs.
4. Recheck `cbddc` health payload.
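
The network-path check in step 1 can be approximated from either peer with a short script. This is a minimal sketch, not part of CBDDC: the helper name `tcp_port_reachable` is hypothetical, and it only covers the TCP side, since a UDP path cannot be confirmed with a simple connect.

```python
import socket


def tcp_port_reachable(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling `tcp_port_reachable(peer_host, peer_port)` from each node toward the other helps separate a blocked path (returns `False` in one or both directions) from a token or identity mismatch, where the port is reachable but the peer still does not connect.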

## Replication Delay or Missing Updates

Symptoms:

- Writes are visible locally but not on remote peers.
- `maxLagMs` grows continuously.

Likely causes:

- Retired peer still tracked and gating pruning.
- High load or transient network instability.
- Invalid collection watch configuration.

Resolution:

1. Confirm affected collections are registered with `WatchCollection()`.
2. Inspect peer confirmation metrics.
3. If needed, de-track retired peers using the runbook.
4. Re-run smoke sync validation after changes.
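
To tell a continuously growing `maxLagMs` apart from a transient spike, it can help to record the value at a fixed interval and test the series. The sketch below assumes the samples have already been collected (how they are read from the health payload is not shown here), and the helper name `lag_is_growing` and the five-sample minimum are illustrative choices.

```python
from typing import Sequence


def lag_is_growing(samples_ms: Sequence[float], min_samples: int = 5) -> bool:
    """Return True if every sample is strictly larger than the one before it."""
    if len(samples_ms) < min_samples:
        # Too few samples to distinguish a trend from a transient spike.
        return False
    return all(later > earlier for earlier, later in zip(samples_ms, samples_ms[1:]))
```

A series that dips at any point returns `False`, which points toward load or network instability rather than a tracked peer that never confirms.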

## Persistence Errors

Symptoms:

- Startup or write failures from persistence layer.
- Unhealthy health check due to storage exceptions.

Likely causes:

- File/path permission errors.
- Storage corruption.
- Misconfigured provider settings.

Resolution:

1. Validate storage path and runtime permissions.
2. Run integrity checks.
3. Restore from latest good backup if corruption is detected.
4. Validate read/write and replication after restore.
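
Step 1 can be sketched as a write probe run under the same account as the node process. The helper name `check_storage_path` is hypothetical, and the check covers only existence and writability of the directory, not provider-level integrity.

```python
import os
import tempfile
from typing import List


def check_storage_path(path: str) -> List[str]:
    """Return a list of problems found; an empty list means the basic checks passed."""
    if not os.path.isdir(path):
        return [f"{path} does not exist or is not a directory"]
    problems: List[str] = []
    try:
        # Create and delete a scratch file to prove the current user can write here.
        with tempfile.NamedTemporaryFile(dir=path, prefix=".write-probe-"):
            pass
    except OSError as exc:
        problems.append(f"cannot write to {path}: {exc}")
    return problems
```

An empty result rules out the most common permission problems, so the remaining candidates are corruption or provider settings.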

## Configuration Regressions After Release

Symptoms:

- Behavior changed immediately after deployment.
- Multiple nodes fail with same error pattern.

Likely causes:

- Incorrect environment variables or appsettings values.
- Partial rollout with incompatible settings.

Resolution:

1. Compare deployed configuration to approved baseline.
2. Roll back to last known-good release if production impact is high.
3. Redeploy with corrected configuration.
4. Document root cause and preventive controls.
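
The baseline comparison in step 1 can be sketched as a key-by-key diff of two configuration objects that have already been loaded (for example with `json.load`). The function names and the `<missing>` marker are assumptions made for illustration; nested sections are flattened into colon-delimited paths so each difference is reported on one line.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # Marker for a key present on only one side.


def flatten(config: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into 'Section:Key' paths."""
    flat: Dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}:{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def config_drift(
    baseline: Dict[str, Any], deployed: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing key path to (baseline value, deployed value)."""
    base, live = flatten(baseline), flatten(deployed)
    return {
        path: (base.get(path, MISSING), live.get(path, MISSING))
        for path in sorted(base.keys() | live.keys())
        if base.get(path, MISSING) != live.get(path, MISSING)
    }
```

Running the same comparison against every node in a rollout also helps expose a partial deployment, since nodes on the old settings report a different drift set than nodes on the new ones.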

## Escalation

If a Sev 1/Sev 2 condition cannot be resolved quickly, follow the escalation and incident procedures in the [Runbook](runbook.md).