CBDDC/docs/troubleshooting.md
Joseph Doherty ce727eb30d
docs: align internal docs to enterprise standards
Add canonical operations/security/access/feature docs and fix path integrity to improve onboarding and incident readiness.
2026-02-20 13:23:55 -05:00


# Troubleshooting
This guide lists recurring CBDDC failure modes, likely causes, and remediation steps.
## Peer Cannot Connect
Symptoms:
- Node remains disconnected from expected peers.
- Health check reports lagging or unconfirmed peers.
Likely causes:
- Network path blocked (port/firewall mismatch).
- `AuthToken` mismatch.
- Peer configuration drift.
Resolution:
1. Verify TCP/UDP port configuration on both peers.
2. Confirm shared token and node identity settings.
3. Restart peer service and monitor logs.
4. Recheck the `cbddc` health payload and confirm the peer is no longer reported as lagging or unconfirmed.
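
Step 1 can be partly automated. The following is a minimal sketch, not part of CBDDC, that checks whether a TCP port on a peer accepts connections from the current host. The function name `tcp_port_reachable` is hypothetical, and the host and port you pass in must come from your own peer configuration. It covers TCP only; a UDP path cannot be confirmed this way because UDP has no connection handshake.

```python
import socket


def tcp_port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it does not speak the CBDDC
    protocol, it only proves the network path and listener exist.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run it from each peer toward the other. A `False` result in only one direction usually points at a firewall rule or port mismatch rather than a token problem.
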
## Replication Delay or Missing Updates
Symptoms:
- Writes are visible locally but not on remote peers.
- `maxLagMs` grows continuously.
Likely causes:
- Retired peer still tracked and gating pruning.
- High load or transient network instability.
- Invalid collection watch configuration.
Resolution:
1. Confirm affected collections are registered with `WatchCollection()`.
2. Inspect peer confirmation metrics.
3. If needed, de-track retired peers using the runbook.
4. Re-run smoke sync validation after changes.
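
To tell a transient spike from the "grows continuously" symptom, it can help to sample `maxLagMs` several times and look at the trend. The sketch below assumes you have already collected the values (for example, by polling the health payload at a fixed interval); how you read them is up to your environment. The function name `lag_is_growing` and the `min_samples` threshold are hypothetical.

```python
from typing import Sequence


def lag_is_growing(samples_ms: Sequence[float], min_samples: int = 3) -> bool:
    """Return True if every sample is strictly larger than the one before it.

    samples_ms: maxLagMs readings in the order they were taken.
    min_samples: fewer readings than this are treated as inconclusive.
    """
    if len(samples_ms) < min_samples:
        return False
    # Compare each reading with its successor.
    return all(later > earlier for earlier, later in zip(samples_ms, samples_ms[1:]))
```

For example, `lag_is_growing([100, 250, 900])` returns `True`, while `lag_is_growing([100, 900, 200])` returns `False` because the lag recovered. A strictly increasing series is a deliberately strict criterion; loosen it if your readings are noisy.
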
## Persistence Errors
Symptoms:
- Startup or write failures from persistence layer.
- Health check reports unhealthy because of storage exceptions.
Likely causes:
- File/path permission errors.
- Storage corruption.
- Misconfigured provider settings.
Resolution:
1. Validate storage path and runtime permissions.
2. Run integrity checks.
3. Restore from latest good backup if corruption is detected.
4. Validate read/write and replication after restore.
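
Step 1 can be sketched as a small probe run under the same account as the service. This is an illustrative helper, not a CBDDC tool: the name `check_storage_path` is hypothetical, and it only tests that the directory exists and that a temporary file can be written and read back. It does not check provider-specific integrity.

```python
import os
import tempfile
from typing import List


def check_storage_path(path: str) -> List[str]:
    """Return a list of problems found with the storage directory (empty if none)."""
    problems: List[str] = []
    if not os.path.isdir(path):
        problems.append(f"not a directory or missing: {path}")
        return problems
    try:
        # Write and read back a probe file; it is deleted when the block exits.
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"probe")
            probe.flush()
            probe.seek(0)
            if probe.read() != b"probe":
                problems.append("read-back mismatch on probe file")
    except OSError as exc:
        problems.append(f"cannot write to {path}: {exc}")
    return problems
```

An empty list means the basic read/write path works for the current user; any entry is a reason to fix permissions or the configured path before looking at corruption.
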
## Configuration Regressions After Release
Symptoms:
- Behavior changed immediately after deployment.
- Multiple nodes fail with same error pattern.
Likely causes:
- Incorrect environment variables or appsettings values.
- Partial rollout with incompatible settings.
Resolution:
1. Compare deployed configuration to approved baseline.
2. Roll back to last known-good release if production impact is high.
3. Redeploy with corrected configuration.
4. Document root cause and preventive controls.
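
The comparison in step 1 can be sketched as a diff over two configuration dictionaries, for example the parsed baseline and deployed appsettings JSON. The helper names `flatten` and `config_drift` are hypothetical, and the keys in the usage example are placeholders rather than actual CBDDC settings. Nested keys are joined with `:`, following the .NET configuration key convention.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # Marker for a key present on only one side.


def flatten(cfg: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into 'Section:Key' paths."""
    flat: Dict[str, Any] = {}
    for key, value in cfg.items():
        path = f"{prefix}:{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def config_drift(baseline: Dict[str, Any], deployed: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing key to a (baseline value, deployed value) pair."""
    base, live = flatten(baseline), flatten(deployed)
    drift: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(base.keys() | live.keys()):
        before, after = base.get(key, MISSING), live.get(key, MISSING)
        if before != after:
            drift[key] = (before, after)
    return drift
```

Usage, with placeholder settings:

```python
baseline = {"Node": {"AuthToken": "a", "Port": 5000}}
deployed = {"Node": {"AuthToken": "b", "Port": 5000}, "Extra": 1}
print(config_drift(baseline, deployed))
```

Avoid printing real secrets such as `AuthToken` values into shared logs; comparing hashes of the values is one alternative.
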
## Escalation
If a Sev 1 or Sev 2 condition cannot be resolved quickly, follow the escalation and incident procedures in the [Runbook](runbook.md).